# Lineage with Airflow
There are a couple of ways to get lineage information from Airflow into DataHub.
## Using DataHub's Airflow lineage backend (recommended)
:::caution
The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.
:::
### Running on Docker locally
If you are looking to run Airflow and DataHub using Docker locally, follow the guide here. Otherwise, proceed with the instructions below.
### Setting up Airflow to use DataHub as Lineage Backend
1. You need to install the required dependency in your Airflow environment. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend

   ```shell
   pip install acryl-datahub[airflow]
   ```
2. You must configure an Airflow hook for DataHub. We support both a DataHub REST hook and a Kafka-based hook, but you only need one.

   ```shell
   # For REST-based:
   airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'

   # For Kafka-based (standard Kafka sink config can be passed via extras):
   airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
   ```
3. Add the following lines to your `airflow.cfg` file.

   ```ini
   [lineage]
   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
   datahub_kwargs = {
       "datahub_conn_id": "datahub_rest_default",
       "cluster": "prod",
       "capture_ownership_info": true,
       "capture_tags_info": true,
       "graceful_exceptions": true }
   # The above indentation is important!
   ```
   Configuration options:

   - `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 2.
   - `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
   - `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
   - `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
   - `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in `lineage_backend_demo.py`, or reference `lineage_backend_taskflow_demo.py` if you're using the TaskFlow API. A minimal sketch is also shown after this list.
5. [optional] Learn more about Airflow lineage, including shorthand notation and some automation.
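For illustration, here is a minimal sketch of what step 4 can look like, assuming Airflow 2.x and using the `Dataset` entity from `datahub_provider.entities`. The platform and table names are hypothetical placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    "datahub_lineage_backend_demo",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    # The lineage backend reads inlets/outlets off the operator and emits
    # them to DataHub when the task runs; no extra emit call is needed.
    run_data_task = BashOperator(
        task_id="run_data_task",
        bash_command="echo 'This is where you might run your data tooling.'",
        inlets=[
            Dataset("snowflake", "mydb.schema.tableA"),  # hypothetical upstream tables
            Dataset("snowflake", "mydb.schema.tableB"),
        ],
        outlets=[Dataset("snowflake", "mydb.schema.tableC")],  # hypothetical downstream table
    )
```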
## Emitting lineage via a separate operator
Take a look at this sample DAG: `lineage_emission_dag.py`, which emits lineage using the `DatahubEmitterOperator`.
In order to use this example, you must first configure the DataHub hook. As with ingestion, we support both a DataHub REST hook and a Kafka-based hook. See step 2 above for details.
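As a rough sketch of what such a task can look like, the snippet below constructs a lineage MCE with the `make_lineage_mce` and `make_dataset_urn` helpers from `datahub.emitter.mce_builder` and hands it to the operator. The dataset names are hypothetical placeholders:

```python
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

# A standalone task that pushes a single lineage edge to DataHub through
# the REST connection configured in step 2. Table names are placeholders.
emit_lineage_task = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",
    mces=[
        builder.make_lineage_mce(
            # Upstream dataset URNs:
            [
                builder.make_dataset_urn("snowflake", "mydb.schema.tableA"),
                builder.make_dataset_urn("snowflake", "mydb.schema.tableB"),
            ],
            # Downstream dataset URN:
            builder.make_dataset_urn("snowflake", "mydb.schema.tableC"),
        )
    ],
)
```

Add this task to a DAG like any other operator; it emits the lineage edge when it runs, independently of the lineage backend described above.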