mgorsk1 98850ab5cc
feat: OpenLineage integration (#15317)
Co-authored-by: dechoma <dominik.choma@gmail.com>
Co-authored-by: Mayur Singal <39544459+ulixius9@users.noreply.github.com>
2024-03-12 08:39:25 +01:00


title: OpenLineage
slug: /connectors/pipeline/openlineage

OpenLineage

This section provides guides and references for using the OpenLineage connector.

What is OpenLineage?

According to the OpenLineage documentation:

OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.

OpenLineage, apart from being a specification, is also a set of integrations collecting lineage from various systems such as Apache Airflow and Spark.

OpenMetadata OpenLineage connector

The OpenMetadata OpenLineage connector consumes OpenLineage events from a Kafka broker and translates them into OpenMetadata lineage information.

{% image src="/images/v1.3/connectors/pipeline/openlineage/connector-flow.svg" alt="OpenLineage Connector" /%}
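To make the translation step more concrete, here is a minimal sketch of how a single OpenLineage run event could be turned into lineage edges. It is not the connector's actual implementation: the helper names `dataset_id` and `lineage_edges` and the sample event are hypothetical, and the sketch only assumes the `inputs`, `outputs`, `namespace`, and `name` fields of the OpenLineage event format.

```python
import json
from itertools import product
from typing import List, Tuple

# Hypothetical sample event, trimmed to the fields this sketch reads.
SAMPLE_EVENT = json.dumps(
    {
        "eventType": "COMPLETE",
        "job": {"namespace": "mynamespace", "name": "hello-world"},
        "inputs": [{"namespace": "postgres://db:5432", "name": "shop.public.orders"}],
        "outputs": [{"namespace": "postgres://db:5432", "name": "shop.public.daily_orders"}],
    }
)


def dataset_id(dataset: dict) -> str:
    """Build a readable identifier from a dataset's namespace and name."""
    return f"{dataset['namespace']}/{dataset['name']}"


def lineage_edges(raw_event: str) -> List[Tuple[str, str]]:
    """Return one (source, target) pair for every input/output combination."""
    event = json.loads(raw_event)
    inputs = [dataset_id(d) for d in event.get("inputs", [])]
    outputs = [dataset_id(d) for d in event.get("outputs", [])]
    return list(product(inputs, outputs))


for source, target in lineage_edges(SAMPLE_EVENT):
    print(f"{source} -> {target}")
```

An event without inputs or without outputs yields no edges, so job-only events pass through this sketch without producing lineage.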

Airflow OpenLineage events

Configure your Airflow instance

  1. Install the appropriate provider in Airflow: apache-airflow-providers-openlineage
  2. Configure the OpenLineage provider in Airflow (see the provider documentation).
    1. Use the Kafka transport mode, because this connector assumes that OpenLineage events are collected from a Kafka topic.
    2. A detailed list of OpenLineage configuration options is available in the OpenLineage documentation.
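As a sketch of step 2, the provider's transport can be supplied as a JSON document, for example through the `AIRFLOW__OPENLINEAGE__TRANSPORT` environment variable. The helper name `kafka_transport`, the broker addresses, and the topic below are placeholders, and the keys inside the transport document (`type`, `topic`, `config`, `flush`) are assumed to follow the OpenLineage Python client's Kafka transport; verify them against the provider documentation for your version.

```python
import json


def kafka_transport(bootstrap_servers: str, topic: str) -> str:
    """Serialize an assumed Kafka transport definition for the OpenLineage provider."""
    transport = {
        "type": "kafka",
        "topic": topic,
        # Passed to the Kafka producer; only the broker list is set here.
        "config": {"bootstrap.servers": bootstrap_servers},
        # Assumption: flush after each event so short-lived tasks do not drop events.
        "flush": True,
    }
    return json.dumps(transport)


# Print a line that can be added to the environment of the Airflow scheduler and workers.
value = kafka_transport("broker1:9092,broker2:9092", "openlineage-events")
print(f"export AIRFLOW__OPENLINEAGE__TRANSPORT='{value}'")
```

The topic chosen here has to match the `topicName` configured for the OpenMetadata connector.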

Spark OpenLineage events

Configure the Spark session to produce OpenLineage events compatible with the OpenLineage connector available in OpenMetadata.

@todo complete kafka config

from pyspark.sql import SparkSession
from uuid import uuid4

spark = SparkSession.builder \
    .config('spark.openlineage.namespace', 'mynamespace') \
    .config('spark.openlineage.parentJobName', 'hello-world') \
    .config('spark.openlineage.parentRunId', str(uuid4())) \
    .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.7.0') \
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener') \
    .config('spark.openlineage.transport.type', 'kafka') \
    .getOrCreate()
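The snippet above selects the Kafka transport but leaves its remaining settings open. The following sketch shows one way those settings might be assembled. The helper name `kafka_transport_conf`, the topic, and the broker addresses are hypothetical, and the property names (`topicName` and the `properties.` prefix for Kafka producer options) are assumed from the OpenLineage Spark integration's Kafka transport options, so check them against the OpenLineage release you run.

```python
from typing import Dict


def kafka_transport_conf(topic: str, bootstrap_servers: str) -> Dict[str, str]:
    """Return assumed Spark properties for the OpenLineage Kafka transport."""
    prefix = "spark.openlineage.transport"
    serializer = "org.apache.kafka.common.serialization.StringSerializer"
    return {
        f"{prefix}.type": "kafka",
        f"{prefix}.topicName": topic,
        # Everything under `properties.` is assumed to be handed to the Kafka producer.
        f"{prefix}.properties.bootstrap.servers": bootstrap_servers,
        f"{prefix}.properties.key.serializer": serializer,
        f"{prefix}.properties.value.serializer": serializer,
    }


conf = kafka_transport_conf("openlineage-events", "broker1:9092,broker2:9092")
for key, value in conf.items():
    print(f"{key}={value}")

# With PySpark available, the same dictionary can be applied to a builder:
#     builder = SparkSession.builder
#     for key, value in conf.items():
#         builder = builder.config(key, value)
```

Keeping the settings in a plain dictionary makes them easy to reuse with `spark-submit --conf` as well as with the session builder.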

Requirements

{% note %} The connector supports events produced by OpenLineage 1.7.0 and later. {% /note %}

Metadata Ingestion

Connection Details

Providing connection details via UI

{% partial file="/v1.3/connectors/metadata-ingestion-ui.md" variables={ connector: "Openlineage", selectServicePath: "/images/v1.3/connectors/openlineage/select-service.png", addNewServicePath: "/images/v1.3/connectors/openlineage/add-new-service.png", serviceConnectionPath: "/images/v1.3/connectors/openlineage/service-connection.png", } /%}

{% stepsContainer %} {% extraContent parentTagNme="stepsContainer" %}

Providing connection details programmatically via API
1. Preparing the Client
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
    OpenMetadataConnection,
)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import (
    OpenMetadataJWTClientConfig,
)
from metadata.ingestion.ometa.ometa_api import OpenMetadata

server_config = OpenMetadataConnection(
    hostPort="http://localhost:8585/api",
    authProvider="openmetadata",
    securityConfig=OpenMetadataJWTClientConfig(
        jwtToken="<token>"
    ),
)
metadata = OpenMetadata(server_config)

assert metadata.health_check()  # Will fail if we cannot reach the server
2. Creating the OpenLineage Pipeline service

from metadata.generated.schema.api.services.createPipelineService import CreatePipelineServiceRequest
from metadata.generated.schema.entity.services.pipelineService import (
    PipelineServiceType,
    PipelineConnection,
)
from metadata.generated.schema.entity.services.connections.pipeline.openLineageConnection import (
    OpenLineageConnection,
    SecurityProtocol as KafkaSecurityProtocol,
    ConsumerOffsets
)


openlineage_service_request = CreatePipelineServiceRequest(
    name='openlineage-service',
    displayName='OpenLineage Service',
    serviceType=PipelineServiceType.OpenLineage,
    connection=PipelineConnection(
        config=OpenLineageConnection(
            brokersUrl='broker1:9092,broker2:9092',
            topicName='openlineage-events',
            consumerGroupName='openmetadata-consumer',
            consumerOffsets=ConsumerOffsets.earliest,
            poolTimeout=3.0,
            sessionTimeout=60,
            securityProtocol=KafkaSecurityProtocol.SSL,
            # The SSL settings below are optional and are used only when securityProtocol=KafkaSecurityProtocol.SSL
            SSLCertificateLocation='/path/to/kafka/certs/Certificate.pem',
            SSLKeyLocation='/path/to/kafka/certs/Key.pem',
            SSLCALocation='/path/to/kafka/certs/RootCA.pem',
        )
    ),
)

metadata.create_or_update(openlineage_service_request)


{% /extraContent %} {% partial file="/v1.3/connectors/test-connection.md" /%} {% partial file="/v1.3/connectors/pipeline/configure-ingestion.md" /%} {% partial file="/v1.3/connectors/ingestion-schedule-and-deploy.md" /%} {% /stepsContainer %} {% partial file="/v1.3/connectors/troubleshooting.md" /%}