mirror of
https://github.com/datahub-project/datahub.git
synced 2025-07-06 16:49:03 +00:00
98 lines
4.1 KiB
Markdown
98 lines
4.1 KiB
Markdown
# OpenLineage
|
|
|
|
DataHub, now supports [OpenLineage](https://openlineage.io/) integration. With this support, DataHub can ingest and display lineage information from various data processing frameworks, providing users with a comprehensive understanding of their data pipelines.
|
|
|
|
## Features
|
|
|
|
- **REST Endpoint Support**: DataHub now includes a REST endpoint that can understand OpenLineage events. This allows users to send lineage information directly to DataHub, enabling easy integration with various data processing frameworks.
|
|
|
|
- **[Spark Event Listener Plugin](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)**: DataHub provides a Spark Event Listener plugin that seamlessly integrates OpenLineage's Spark plugin. This plugin enhances DataHub's OpenLineage support by offering additional features such as PathSpec support, column-level lineage, patch support and more.
|
|
|
|
## OpenLineage Support with DataHub
|
|
|
|
### 1. REST Endpoint Support
|
|
|
|
DataHub's REST endpoint allows users to send OpenLineage events directly to DataHub. This enables easy integration with various data processing frameworks, providing users with a centralized location for viewing and managing data lineage information.
|
|
|
|
With Spark and Airflow we recommend using the Spark Lineage or DataHub's Airflow plugin for tighter integration with DataHub.
|
|
|
|
#### How to Use
|
|
|
|
To send OpenLineage messages to DataHub using the REST endpoint, simply make a POST request to the following endpoint:
|
|
|
|
```
|
|
POST GMS_SERVER_HOST:GMS_PORT/api/v2/lineage
|
|
```
|
|
|
|
Include the OpenLineage message in the request body in JSON format.
|
|
|
|
Example:
|
|
|
|
```json
|
|
{
|
|
"eventType": "START",
|
|
"eventTime": "2020-12-28T19:52:00.001+10:00",
|
|
"run": {
|
|
"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
|
|
},
|
|
"job": {
|
|
"namespace": "workshop",
|
|
"name": "process_taxes"
|
|
},
|
|
"inputs": [
|
|
{
|
|
"namespace": "postgres://workshop-db:None",
|
|
"name": "workshop.public.taxes",
|
|
"facets": {
|
|
"dataSource": {
|
|
"_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow",
|
|
"_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DataSourceDatasetFacet",
|
|
"name": "postgres://workshop-db:None",
|
|
"uri": "workshop-db"
|
|
}
|
|
}
|
|
}
|
|
],
|
|
"producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
|
|
}
|
|
```
|
|
|
|
##### How to set up Airflow
|
|
|
|
Follow the Airflow guide to setup the Airflow DAGs to send lineage information to DataHub. The guide can be found [here](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html).
|
|
The transport should look like this:
|
|
|
|
```json
|
|
{
|
|
"type": "http",
|
|
"url": "https://GMS_SERVER_HOST:GMS_PORT/openapi/openlineage/",
|
|
"endpoint": "api/v1/lineage",
|
|
"auth": {
|
|
"type": "api_key",
|
|
"api_key": "your-datahub-api-key"
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Known Limitations
|
|
|
|
With Spark and Airflow we recommend using the Spark Lineage or DataHub's Airflow plugin for tighter integration with DataHub.
|
|
|
|
- **[PathSpec](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage/#configuring-hdfs-based-dataset-urns) Support**: While the REST endpoint supports OpenLineage messages, full [PathSpec](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage/#configuring-hdfs-based-dataset-urns)) support is not yet available in the OpenLineage endpoint but it is available in the DataHub Cloud Spark Plugin.
|
|
|
|
etc...
|
|
|
|
### 2. Spark Event Listener Plugin
|
|
|
|
DataHub's Spark Event Listener plugin enhances OpenLineage support by providing additional features such as PathSpec support, column-level lineage, and more.
|
|
|
|
#### How to Use
|
|
|
|
Follow the guides of the Spark Lineage plugin page for more information on how to set up the Spark Lineage plugin. The guide can be found [here](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)
|
|
|
|
## References
|
|
|
|
- [OpenLineage](https://openlineage.io/)
|
|
- [DataHub OpenAPI Guide](../api/openapi/openapi-usage-guide.md)
|
|
- [DataHub Spark Lineage Plugin](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)
|