Add OpenLineage extension for Kafka Connect#22050
Draft
rahul-madaan wants to merge 1 commit into apache:trunk from
Conversation
Member
Thanks for the PR. This introduces new public APIs, so this change requires a Kafka Improvement Proposal (KIP). See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals for the process.
rahul-madaan force-pushed from 23498f1 to 802f4a0
Add a new connect/openlineage-extension module that emits OpenLineage
lineage events from Kafka Connect connectors. This enables automatic
data lineage tracking for Connect pipelines, following the same pattern
used by the OpenLineage integrations for Apache Spark and Apache Flink.
The extension implements ConnectRestExtension (KIP-285) and runs inside
the Connect worker JVM. A background monitor polls ConnectClusterState
to detect connector lifecycle changes (start, pause, fail, delete) and
emits OpenLineage RunEvents with input/output dataset information.
Event lifecycle (matches the Flink streaming pattern):
- START — emitted when a connector is first observed RUNNING
- RUNNING — periodic heartbeat (default every 5 minutes, configurable) confirming the connector is still active, with current lineage
- COMPLETE — emitted when a connector is paused or deleted
- FAIL — emitted when a connector enters the FAILED state
Supported connectors (17 visitors):
- JDBC Source/Sink (PostgreSQL, MySQL, SQL Server, Oracle, DB2, Redshift)
- Debezium CDC (all variants)
- S3, GCS, Azure Blob, HDFS Sink
- MongoDB Source/Sink
- Elasticsearch Sink
- BigQuery, Snowflake, Redshift Sink
- MirrorMaker 2, HTTP Sink
- Generic fallback for unknown connectors
Events follow the OpenLineage spec (https://openlineage.io/docs/spec/naming/)
and include:
- processing_engine run facet (name, version, adapter version)
- jobType job facet (processingType: STREAMING, integration: KAFKA_CONNECT)
- errorMessage run facet on FAIL events
- Input/output datasets on all event types (START, RUNNING, COMPLETE, FAIL)
- Lineage cached per-connector for reliable COMPLETE events on deletion
- Lineage refreshed on RUNNING events to detect config changes
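As an illustration of the shape of these events, here is a hypothetical START RunEvent for a JDBC source connector. The field names follow the OpenLineage spec; the connector name, host names, versions, and run ID are made up for this sketch and are not values emitted by the extension:

```json
{
  "eventType": "START",
  "eventTime": "2026-02-10T12:00:00Z",
  "run": {
    "runId": "01234567-89ab-cdef-0123-456789abcdef",
    "facets": {
      "processing_engine": {
        "name": "kafka-connect",
        "version": "4.1.0",
        "openlineageAdapterVersion": "1.0.0"
      }
    }
  },
  "job": {
    "namespace": "kafka-connect",
    "name": "my-jdbc-source",
    "facets": {
      "jobType": {
        "processingType": "STREAMING",
        "integration": "KAFKA_CONNECT"
      }
    }
  },
  "inputs": [
    { "namespace": "postgres://db.example.com:5432", "name": "public.orders" }
  ],
  "outputs": [
    { "namespace": "kafka://broker1:9092", "name": "orders-topic" }
  ]
}
```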
Dataset namespaces follow OL naming conventions:
postgres://, mysql://, kafka://, s3://, gs://, wasbs://,
hdfs://, mongodb://, elasticsearch://, bigquery, snowflake://,
redshift://, cassandra://
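For example (hypothetical hosts, buckets, and tables), namespace/name pairs under these conventions look like:

```
postgres://db.example.com:5432   ->  public.orders
s3://my-bucket                   ->  topics/orders
kafka://broker1:9092             ->  orders-topic
```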
Transport options: HTTP (with Bearer auth), File (NDJSON), Console (SLF4J).
Configuration via OPENLINEAGE_CONFIG env var or openlineage.* worker properties.
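When configured via OPENLINEAGE_CONFIG, the referenced file uses the standard OpenLineage client YAML format; a minimal HTTP-transport example (URL and token are placeholders) might look like:

```yaml
transport:
  type: http
  url: https://lineage.example.com
  auth:
    type: api_key
    apiKey: my-token
```

The equivalent worker-property keys would presumably mirror this structure (e.g. openlineage.transport.type=http), though the exact property names are defined by the new module.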
Signed-off-by: Rahul Madan <madan.rahul9@gmail.com>
rahul-madaan force-pushed from 802f4a0 to b02a7d5