Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

151 lines
5.3 KiB
Markdown
Raw Normal View History

---
title: OpenLineage
slug: /connectors/pipeline/openlineage
---
# OpenLineage
In this section, we provide guides and references to use the OpenLineage connector.
## What is OpenLineage?
According to [documentation](https://openlineage.io/docs/):
```
OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.
```
OpenLineage, apart from being a specification, is also a set of integrations collecting lineage from various systems such as Apache Airflow and Spark.
## OpenMetadata Openlineage connector
OpenMetadata OpenLineage connector consumes open lineage events from kafka broker and translates it to OpenMetadata Lineage information.
{% image
src="/images/v1.3/connectors/pipeline/openlineage/connector-flow.svg"
alt="OpenLineage Connector" /%}
### Airflow OpenLineage events
Configure your Airflow instance
1. Install appropriate provider in Airflow: [apache-airflow-providers-openlineage](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html)
2. Configure OpenLineage Provider in Airflow - [documentation](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html#using-openlineage-integration)
1. remember to use `kafka` transport mode as this connector works under assumption OL events are collected from kafka topic
2. detailed list of configuration options for OpenLineage can be found [here](https://openlineage.io/docs/client/python/#configuration)
### Spark OpenLineage events
Configure Spark Session to produce OpenLineage events compatible with OpenLineage connector available in OpenMetadata.
# @todo complete kafka config
```shell
from pyspark.sql import SparkSession
from uuid import uuid4
spark = SparkSession.builder\
.config('spark.openlineage.namespace', 'mynamespace')\
.config('spark.openlineage.parentJobName', 'hello-world')\
.config('spark.openlineage.parentRunId', str(uuid4()))\
.config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.7.0')\
.config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')\
.config('spark.openlineage.transport.type', 'kafka')\
.getOrCreate()
```
## Requirements
{% note %}
We support OpenLineage events created by OpenLineage versions starting from OpenLineage 1.7.0
{% /note %}
## Metadata Ingestion
#### Connection Details
##### Providing connection details via UI
{% partial
file="/v1.3/connectors/metadata-ingestion-ui.md"
variables={
connector: "Openlineage",
selectServicePath: "/images/v1.3/connectors/openlineage/select-service.png",
addNewServicePath: "/images/v1.3/connectors/openlineage/add-new-service.png",
serviceConnectionPath: "/images/v1.3/connectors/openlineage/service-connection.png",
}
/%}
{% stepsContainer %}
{% extraContent parentTagNme="stepsContainer" %}
##### Providing connection details programmatically via API
###### 1. Preparing the Client
```python
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
OpenMetadataConnection,
)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import (
OpenMetadataJWTClientConfig,
)
from metadata.ingestion.ometa.ometa_api import OpenMetadata
server_config = OpenMetadataConnection(
hostPort="http://localhost:8585/api",
authProvider="openmetadata",
securityConfig=OpenMetadataJWTClientConfig(
jwtToken="<token>"
),
)
metadata = OpenMetadata(server_config)
assert metadata.health_check() # Will fail if we cannot reach the server
```
###### 2. Creating the OpenLineage Pipeline service
```python
from metadata.generated.schema.api.services.createPipelineService import CreatePipelineServiceRequest
from metadata.generated.schema.entity.services.pipelineService import (
PipelineServiceType,
PipelineConnection,
)
from metadata.generated.schema.entity.services.connections.pipeline.openLineageConnection import (
OpenLineageConnection,
SecurityProtocol as KafkaSecurityProtocol,
ConsumerOffsets
)
openlineage_service_request = CreatePipelineServiceRequest(
name='openlineage-service',
displayName='OpenLineage Service',
serviceType=PipelineServiceType.OpenLineage,
connection=PipelineConnection(
config=OpenLineageConnection(
brokersUrl='broker1:9092,broker2:9092',
topicName='openlineage-events',
consumerGroupName='openmetadata-consumer',
consumerOffsets=ConsumerOffsets.earliest,
poolTimeout=3.0,
sessionTimeout=60,
securityProtocol=KafkaSecurityProtocol.SSL,
# below ssl confing in optional and used only when securityProtocol=KafkaSecurityProtocol.SSL
SSLCertificateLocation='/path/to/kafka/certs/Certificate.pem',
SSLKeyLocation='/path/to/kafka/certs/Key.pem',
SSLCALocation='/path/to/kafka/certs/RootCA.pem',
)
),
)
metadata.create_or_update(openlineage_service_request)
```
{% /extraContent %}
{% partial file="/v1.3/connectors/test-connection.md" /%}
{% partial file="/v1.3/connectors/pipeline/configure-ingestion.md" /%}
{% partial file="/v1.3/connectors/ingestion-schedule-and-deploy.md" /%}
{% /stepsContainer %}
{% partial file="/v1.3/connectors/troubleshooting.md" /%}