# OpenLineage

DataHub now supports [OpenLineage](https://openlineage.io/) integration. With this support, DataHub can ingest and display lineage information from various data processing frameworks, giving users a comprehensive view of their data pipelines.

## Features

- **REST Endpoint Support**: DataHub includes a REST endpoint that accepts OpenLineage events. Users can send lineage information directly to DataHub, which makes it easy to integrate with various data processing frameworks.
- **[Spark Event Listener Plugin](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)**: DataHub provides a Spark Event Listener plugin that integrates OpenLineage's Spark plugin. It extends DataHub's OpenLineage support with additional features such as PathSpec support, column-level lineage, patch support, and more.

## OpenLineage Support with DataHub

### 1. REST Endpoint Support

DataHub's REST endpoint allows users to send OpenLineage events directly to DataHub. This enables integration with various data processing frameworks and gives users a central place to view and manage data lineage information.

For Spark and Airflow, we recommend using the Spark Lineage plugin or DataHub's Airflow plugin for tighter integration with DataHub.

#### How to Use

To send OpenLineage messages to DataHub using the REST endpoint, make a POST request to the following endpoint:

```
POST GMS_SERVER_HOST:GMS_PORT/openapi/openlineage/api/v1/lineage
```

Include the OpenLineage message in the request body in JSON format.

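A minimal Python sketch of that request, using only the standard library. The base URL `http://localhost:8080`, the helper names `build_lineage_request` and `send_lineage_event`, and the Bearer-token header are illustrative assumptions rather than a documented DataHub client API; adjust them to your deployment.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with your GMS_SERVER_HOST:GMS_PORT.
GMS_BASE_URL = "http://localhost:8080"
LINEAGE_PATH = "/openapi/openlineage/api/v1/lineage"


def build_lineage_request(event, base_url=GMS_BASE_URL, token=None):
    """Build (but do not send) a POST request carrying one OpenLineage event."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the endpoint accepts a DataHub access token as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + LINEAGE_PATH,
        data=json.dumps(event).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_lineage_event(event, **kwargs):
    """Send the event to DataHub and return the HTTP status code."""
    with urllib.request.urlopen(build_lineage_request(event, **kwargs)) as resp:
        return resp.status


# An abridged event using the same job and run values as the example in this section.
event = {
    "eventType": "START",
    "eventTime": "2020-12-28T19:52:00.001+10:00",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "workshop", "name": "process_taxes"},
    "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
}

request = build_lineage_request(event)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to inspect the URL and payload before anything reaches the server; call `send_lineage_event(event)` once a DataHub instance is reachable.
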
Example OpenLineage event:

```json
{
  "eventType": "START",
  "eventTime": "2020-12-28T19:52:00.001+10:00",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
  },
  "job": {
    "namespace": "workshop",
    "name": "process_taxes"
  },
  "inputs": [
    {
      "namespace": "postgres://workshop-db:None",
      "name": "workshop.public.taxes",
      "facets": {
        "dataSource": {
          "_producer": "https://github.com/OpenLineage/OpenLineage/tree/0.10.0/integration/airflow",
          "_schemaURL": "https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json#/definitions/DataSourceDatasetFacet",
          "name": "postgres://workshop-db:None",
          "uri": "workshop-db"
        }
      }
    }
  ],
  "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
}
```

##### How to set up Airflow

Follow the [Airflow OpenLineage provider guide](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html) to set up your Airflow DAGs to send lineage information to DataHub.

The transport should look like this:

```json
{
  "type": "http",
  "url": "https://GMS_SERVER_HOST:GMS_PORT/openapi/openlineage/",
  "endpoint": "api/v1/lineage",
  "auth": {
    "type": "api_key",
    "api_key": "your-datahub-api-key"
  }
}
```

#### How to modify configurations

The OpenLineage REST endpoint can be configured through environment variables. The following configurations are available:

##### DataHub OpenLineage Configuration

This section describes all available configuration options for the DataHub OpenLineage integration, including environment variables, application properties, and their usage.

##### Configuration Overview

The DataHub OpenLineage integration can be configured using environment variables, application properties files (`application.yml` or `application.properties`), or JVM system properties. All configuration options are prefixed with `datahub.openlineage`.

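Judging from the names listed in this document, each environment variable appears to be its property name upper-cased, with dots and hyphens replaced by underscores. The sketch below illustrates that naming convention only; the helper `property_to_env_var` is hypothetical and not part of DataHub.

```python
import re


def property_to_env_var(property_name: str) -> str:
    """Upper-case a property name and replace dots and hyphens with underscores."""
    return re.sub(r"[.\-]", "_", property_name).upper()


# Property names taken from the configuration table in this document.
properties = [
    "datahub.openlineage.platform-instance",
    "datahub.openlineage.materialize-dataset",
    "datahub.openlineage.use-patch",
]

for name in properties:
    print(f"{name} -> {property_to_env_var(name)}")
```
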
##### Environment Variables

| Environment Variable                                   | Property                                               | Type    | Default | Description                                                     |
| ------------------------------------------------------ | ------------------------------------------------------ | ------- | ------- | --------------------------------------------------------------- |
| `DATAHUB_OPENLINEAGE_PLATFORM_INSTANCE`                | `datahub.openlineage.platform-instance`                | String  | `null`  | Specific platform instance identifier                           |
| `DATAHUB_OPENLINEAGE_COMMON_DATASET_PLATFORM_INSTANCE` | `datahub.openlineage.common-dataset-platform-instance` | String  | `null`  | Common platform instance for datasets                           |
| `DATAHUB_OPENLINEAGE_MATERIALIZE_DATASET`              | `datahub.openlineage.materialize-dataset`              | Boolean | `true`  | Whether to materialize dataset entities                         |
| `DATAHUB_OPENLINEAGE_INCLUDE_SCHEMA_METADATA`          | `datahub.openlineage.include-schema-metadata`          | Boolean | `true`  | Whether to include schema metadata in lineage                   |
| `DATAHUB_OPENLINEAGE_CAPTURE_COLUMN_LEVEL_LINEAGE`     | `datahub.openlineage.capture-column-level-lineage`     | Boolean | `true`  | Whether to capture column-level lineage information             |
| `DATAHUB_OPENLINEAGE_FILE_PARTITION_REGEXP_PATTERN`    | `datahub.openlineage.file-partition-regexp-pattern`    | String  | `null`  | Regular expression pattern for file partition detection         |
| `DATAHUB_OPENLINEAGE_USE_PATCH`                        | `datahub.openlineage.use-patch`                        | Boolean | `false` | Whether to use patch operations for lineage/incremental lineage |

#### Known Limitations

For Spark and Airflow, we recommend using the Spark Lineage plugin or DataHub's Airflow plugin for tighter integration with DataHub.

- **[PathSpec](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage/#configuring-hdfs-based-dataset-urns) Support**: Although the REST endpoint accepts OpenLineage messages, full [PathSpec](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage/#configuring-hdfs-based-dataset-urns) support is not yet available in the OpenLineage endpoint. It is available in the DataHub Cloud Spark Plugin.

### 2. Spark Event Listener Plugin

DataHub's Spark Event Listener plugin enhances OpenLineage support by providing additional features such as PathSpec support, column-level lineage, and more.

#### How to Use

See the [Spark Lineage plugin guide](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage) for instructions on setting up the plugin.

## References

- [OpenLineage](https://openlineage.io/)
- [DataHub OpenAPI Guide](../api/openapi/openapi-usage-guide.md)
- [DataHub Spark Lineage Plugin](https://docs.datahub.com/docs/metadata-integration/java/acryl-spark-lineage)