yujunjun/datahub

Fork 0

mirror of https://github.com/datahub-project/datahub.git synced 2025-11-02 11:49:23 +00:00

Harshal Sheth 20bf794ec4

docs: hosted documentation website (#2174 )

2021-03-05 00:12:12 -08:00

11 KiB

Raw Blame History

Start Date: 2020-08-28
RFC PR: #1841
Discussion Issue: #1731
Implementation PR(s):

RFC - Field Level Lineage

Summary

DataHub supports dataset level lineage. UpStreamLineage is an aspect of dataset that powers the dataset level lineage (a.k.a., coarse-grained lineage). However, there is a need to understand the lineage at the field level (a.k.a., fine-grained lineage)

In this RFC, we will discuss below and get consensus on the modelling involved.

Representation of a field in a dataset
Representation of the field level lineage
Process of creating dataset fields and its relations to other entities.
Transformation function involved in the field level lineage is out of scope of the current RFC.

Basic example

DatasetFieldURN

A unique identifier for a field in a dataset will be introduced in the form of DatasetFieldUrn. And this urn will be the key for DatasetField entity. A sample is as below.

urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/restaurant)

Aspect representing the field level lineage.

       {
          "DatasetUpstreamLineage": {
            "fieldMappings": [
              {
                "created": null,
                "transformationFunction": {
                  "string": "com.linkedin.finegrainedmetadata.transformation.Identity"
                },
                "sourceFields": [
                  {
                    "string": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/header/customerId)"
                  }
                ],
                "destinationField": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,/demo/fact_orders,PROD),/customerId)"
              },
              {
                "created": null,
                "transformationFunction": {
                  "string": "com.linkedin.finegrainedmetadata.transformation.Identity"
                },
                "sourceFields": [
                  {
                    "string": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/header/timeStamp)"
                  }
                ],
                "destinationField": "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,/demo/fact_orders,PROD),/timeStamp)"
              }
          ]
        }
     }

Motivation

There is a lot of interest in the field level lineage for datasets. Related issues/rfcs are

when does Fine grain lineage feature come out? #1649
dataset field level Lineage support? #1519
add lineage workflow schedule support #1615
Design Review: column level lineage feature #1731
Alternate proposal for field level lineage #1784

There are alternate proposals for field level lineage (refer #1731 and #1784). However, I believe a there is a need to uniquely idenity a dataset field and represent as URN for the following reasons.

It provides a natural path forward to make dataset field a first class entity. Producers and Consumers of this dataset field can naturally provide more metadata for a field which doesn't come/can't be expressed as part of the schema definition.
Search and discovery of datasets based on the field and its metadata will be natural extension of this.

Detailed design

Models

DatasetField

We propose a standard identifier for the dataset field in the below format.

urn:li:datasetField:(\<datasetUrn>,\<fieldPath>)

It contains two parts

Dataset Urn -> Standard Identifier of the dataset. This URN is already part of DataHub models
Field Path -> Represents the field of a dataset

FieldPath in most typical cases is the fieldName or column name of the dataset. Where the fields are nested in nature this will be a path to reach the leaf node. To standardize the field paths for different formats, there is a need to build standardized schema normalizers.

{
   "type":"record",
   "name":"Record1",
   "fields":[
      {
         "name":"foo1",
         "type":"int"
      },
      {
         "name":"foo2",
         "type":{
            "type":"record",
            "name":"Record2",
            "fields":[
               {
                  "name":"bar1",
                  "type":"string"
               },
               {
                  "name":"bar2",
                  "type":[
                     "null",
                     "int"
                  ]
               }
            ]
         }
      }
   ]
}

If this is the schema of the dataset, then the dataset fields that emanate from this schema are

urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo1)

urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo2/bar1)

urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/foo2/bar2/int)

Aspect

/**
* ASPECT :: Fine Grained upstream lineage for fields in a dataset
*/
record DatasetUpstreamLineage {
 /**
  * Upstream to downstream field level lineage mappings
  */
 fieldMappings: array[DatasetFieldMapping]
}

/**
* Representation of mapping between fields in sourcer dataset to the field in destination dataset
*/
record DatasetFieldMapping{
 /**
  * Source fields from which the fine grained lineage is derived
  */
 sourceFields: array[ typeref : union[DatasetFieldUrn]]

 /**
  * Destination field which is derived from source fields
  */
 destinationField: DatasetFieldUrn
 /**
  * A UDF mentioning how the source got transformed to destination. 
  * UDF also annotates how some metadata can be carried over from source fields to destination fields.
  * BlackBox UDF implies the transformation function is unknwon from source to destination.
  * Identity UDF implies pure copy of the data from source to destination. 
  */
 transformationFunction: string = "com.linkedin.finegrainedmetadata.transformation.BlackBox"
}

An aspect with name DatasetUpstreamLineage will be introduced to capture fine grained lineage. Technically coarse grained is already captured with fine-grained lineage.
One can also provide a transformation function on how the data got transformed from source fields to destination field. The exact syntax of such function is out of scope of this document.
- BlackBox UDF means destination field is derived from source fields, but the transformation function is not knwon.
- Identity UDF means destination field is a pure copy from source field and the transformation is Identity.
Upstream sources in the field level relations are dataset field urns and is extensible to support other types in future. Think of rest api as a possible upstream in producing a field in dataset.

DataFlow in DataHub for Field Level Lineage

As part of the POC we did, we used the below workflow. Essentially, DatasetFieldUrn is introduced paving the path for that being the first class entity.

GraphBuilder on receiving MAE for SchemaMetadata aspect, will do below
1. Create Dataset Entity in graph db.
2. Use schema normalizers and extract field paths. This schema and hence forth the fields are the source of truth for dataset fields.
3. Creates dataset field entities in graph db.
4. A new relationship builder Data Derived From Relation Builder will create HasField relation between Dataset entity and DatasetFields entities
GraphBuilder on receiving MAE for DatasetUpstreamLienage aspect will create the field level lineages (relationship DataDerivedFrom) between Source DatasetFields and Destination DatasetField

Models representing the lineage

@pairings = [ {
  "destination" : "com.linkedin.common.urn.DatasetFieldUrn",
  "source" : "com.linkedin.common.urn.DatasetUrn"
}  ]
record HasField includes BaseRelationship {
}

@pairings = [ {
  "destination" : "com.linkedin.common.urn.DatasetFieldUrn",
  "source" : "com.linkedin.common.urn.DatasetFieldUrn"
} ]
record DataDerivedFrom includes BaseRelationship {
}

Two new relationships can be introduced to represent the relationships in graph.

HasField relationship represents the relation from dataset to dataset field.
DataDerivedFrom relationship model represents the data in destination dataset field derived from source dataset fields.

DataFlow When DatasetField is First Class Entity

Once we decide to make dataset field as a first class entity, producers can start emitting MCEs for dataset fields. Below represents the end to end flow of dataset field entity will look like in the larger picture.

Schema Normalizers as a utility will be developed.
DatasetField entity will be introduced with aspect FieldInfo
Producers can use Schema Normalizers and send emit DatasetField MCEs for every field in the schema.
Producers will still emit the SchemaMetadata as an aspect of Dataset entity. This aspect serves as the metadata for the relationship HasField between Dataset and DatasetField entities.
An aspect with name DatasetUpstreamLineage will be introduced to capture field level lineage. Technically coarse grained is already captured with fine-grained lineage.

How we teach this

We are introducing the capability of field level lineage in DataHub. As part of this, below are the salient features one should know

Schema Normalizers will be defined to standardize the field paths in a schema. Once this is done, field level lineage will be relation between two standardized field paths of source and destination paths.
Dataset Field URN will be introduced and DatasetField will be a first class entity in DataHub.
HasField relations will be populated in graph db between Dataset and DatasetField
DataDerived relations will be populated at the field level.
SchemaMetadata will still serve the schema information of a dataset. But, it is uses as a SOT for presence of dataset field entities.

This is an extension to the current support of coarse grained lineage by DataHub.
Relationships tab in DataHub UI can also be enhanced to show field level lineage.

Drawbacks

Haven't thought about any potential drawbacks.

Alternatives

In the alternate design, we wouldn't need to consider defining a dataset field urn. There is an extensive RFC and discussion on this at ( #1784 )

Rollout / Adoption Strategy

This introduces a new aspect DatasetUpstreamLineage which is capable of defining lineage at field level. Hence, the existing customers shouldn't be impacted with this change.

Unresolved questions

The syntax of transformation function representing how the source fields got transformed to destination fields is not thought through.
How to automatically get the field level lineage by parsing the higher level languages or query plans of different execution environments.

For the above two, we need to have more detailed RFCs.

11 KiB Raw Blame History