Pere Miquel Brull 7fcdf08ca4
#11626 & #14131 - Lineage with other Entities & attr-based xlets (#14191)
2023-12-01 06:29:44 +01:00


title: Configuring DAG Lineage
slug: /connectors/pipeline/airflow/configuring-lineage

Configuring DAG Lineage

Regardless of the Airflow ingestion process you follow (Workflow, Lineage Backend or Lineage Operator), OpenMetadata will try to extract the lineage information based on the tasks' inlets and outlets.

The important point here is that when we ingest Airflow lineage, we are actually building a graph:

Table A (node) -> DAG (edge) -> Table B (node)

Here, tables are nodes and DAGs (Pipelines) are edges. This means that the correct way of setting these parameters is to specify both inlets and outlets, so that we have nodes on both ends of the relationship.
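To make the graph idea concrete, here is a minimal sketch in plain Python. It is not the OpenMetadata implementation: `LineageEdge` and `build_edges` are hypothetical names used only for illustration. It shows why both sides are needed: with no outlets there is no node to connect to, so no edge can be built.

```python
from itertools import product
from typing import List, NamedTuple


class LineageEdge(NamedTuple):
    """Hypothetical record for one edge of the lineage graph."""
    from_node: str  # upstream asset (inlet)
    pipeline: str   # the DAG that acts as the edge
    to_node: str    # downstream asset (outlet)


def build_edges(pipeline: str, inlets: List[str], outlets: List[str]) -> List[LineageEdge]:
    """Connect every inlet to every outlet through the given pipeline."""
    return [LineageEdge(i, pipeline, o) for i, o in product(inlets, outlets)]


# Both sides specified: one edge, Table A -> DAG -> Table B.
edges = build_edges("my_dag", ["Table A"], ["Table B"])

# Only inlets specified: there is no target node, so nothing is built.
no_edges = build_edges("my_dag", ["Table A"], [])
```
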

Configuring Lineage

{% note %}

This lineage configuration method is available for OpenMetadata release 1.2.3 or higher.

{% /note %}

Let's take a look at the following example:

from datetime import timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

from metadata.generated.schema.entity.data.container import Container
from metadata.generated.schema.entity.data.table import Table
from metadata.ingestion.source.pipeline.airflow.lineage_parser import OMEntity


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=1),
}


with DAG(
    "test-lineage",
    default_args=default_args,
    description="An example DAG which runs a lineage test",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    catchup=False,
) as dag:
    

    t0 = DummyOperator(
        task_id='task0',
        inlets=[
            OMEntity(entity=Container, fqn="Container A", key="group_A"),
            OMEntity(entity=Table, fqn="Table X", key="group_B"),
        ]
    )
    
    t1 = DummyOperator(
        task_id='task10',
        outlets=[
            OMEntity(entity=Table, fqn="Table B", key="group_A"),
            OMEntity(entity=Table, fqn="Table Y", key="group_B"),
        ]
    )

    t0 >> t1

We pass inlets and outlets as lists of OMEntity instances, which let us specify:

  1. The type of the asset we are using (Table, Container, ...), following our SDK.
  2. The FQN of the asset, which is the unique name of each asset in OpenMetadata, e.g., serviceName.databaseName.schemaName.tableName.
  3. The key to group the lineage if needed.

This OMEntity class is defined following the example of Airflow's internal lineage models.

Keys

We can define lineage dependencies among different groups of assets. In the example above, we are not building the lineage from all inlets to all outlets, but rather grouping the assets by key (group_A and group_B). This means that after this lineage is processed, the relationships will be:

Container A (node) -> DAG (edge) -> Table B (node)

and

Table X (node) -> DAG (edge) -> Table Y (node)

It does not matter in which task of the DAG this inlet/outlet information is specified. During the ingestion process we group all these details at the DAG level.
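The grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the actual ingestion code: `Xlet` and `Task` are hypothetical stand-ins for OMEntity and an Airflow task, and the sketch assumes that a key present only on the inlet side or only on the outlet side produces no lineage.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Xlet:
    """Hypothetical stand-in for OMEntity, reduced to plain strings."""
    entity: str  # asset type name, e.g. "Table" or "Container"
    fqn: str     # unique asset name
    key: str     # lineage group


@dataclass
class Task:
    """Hypothetical stand-in for an Airflow task carrying lineage attributes."""
    task_id: str
    inlets: List[Xlet] = field(default_factory=list)
    outlets: List[Xlet] = field(default_factory=list)


def lineage_pairs(tasks: List[Task]) -> List[Tuple[str, str]]:
    """Collect xlets from every task, then pair inlets and outlets per key."""
    inlets: Dict[str, List[str]] = defaultdict(list)
    outlets: Dict[str, List[str]] = defaultdict(list)
    for task in tasks:  # the task an xlet came from is deliberately ignored
        for xlet in task.inlets:
            inlets[xlet.key].append(xlet.fqn)
        for xlet in task.outlets:
            outlets[xlet.key].append(xlet.fqn)

    pairs: List[Tuple[str, str]] = []
    # Assumption: only keys seen on both sides yield lineage.
    for key in sorted(inlets.keys() & outlets.keys()):
        pairs.extend((src, dst) for src in inlets[key] for dst in outlets[key])
    return pairs


tasks = [
    Task("task0", inlets=[
        Xlet("Container", "Container A", "group_A"),
        Xlet("Table", "Table X", "group_B"),
    ]),
    Task("task10", outlets=[
        Xlet("Table", "Table B", "group_A"),
        Xlet("Table", "Table Y", "group_B"),
    ]),
]
pairs = lineage_pairs(tasks)
```

With the example DAG, `pairs` holds one (source, target) pair per group, and no pair links Container A to Table Y or Table X to Table B.
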

Configuring Lineage between Tables

{% note %}

Note that this method only allows lineage between Tables.

We will deprecate it in OpenMetadata 1.4.

{% /note %}

Let's take a look at the following example:

from datetime import timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=1),
}


with DAG(
    "test-multiple-inlet-keys",
    default_args=default_args,
    description="An example DAG which runs a lineage test",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    catchup=False,
) as dag:
    

    t0 = DummyOperator(
        task_id='task0',
        inlets={
            "group_A": ["Table A"],
            "group_B": ["Table X"]
        }
    )
    
    t1 = DummyOperator(
        task_id='task10',
        outlets={
            "group_A": ["Table B"],
            "group_B": ["Table Y"]
        }
    )

    t0 >> t1

{% note %}

Make sure to add the table Fully Qualified Name (FQN), which is the unique name of the table in OpenMetadata.

This name is composed as serviceName.databaseName.schemaName.tableName.

{% /note %}
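As a small illustration of the naming scheme in the note above, the following sketch joins the four parts into an FQN string. `table_fqn` is a hypothetical helper, not part of the OpenMetadata SDK, and it only covers the simple case where no part contains a dot; such names are rejected here rather than handled.

```python
def table_fqn(service: str, database: str, schema: str, table: str) -> str:
    """Join the four name parts as serviceName.databaseName.schemaName.tableName."""
    parts = [service, database, schema, table]
    for part in parts:
        # Assumption: this sketch does not handle empty parts or parts with dots.
        if not part or "." in part:
            raise ValueError(f"unsupported name part: {part!r}")
    return ".".join(parts)


# Hypothetical service, database, schema and table names.
fqn = table_fqn("mysql_prod", "shop", "public", "orders")
```

The resulting string is what would go in the inlets or outlets of the DAG in place of a placeholder such as "Table A".
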