# File Based Lineage

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

Works with `acryl-datahub` out of the box.

## Capabilities

This plugin pulls lineage metadata from a YAML-formatted file. An example of such a file is located in the examples
directory [here](../examples/bootstrap_data/file_lineage.yml).

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration
options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: datahub-lineage-file
  config:
    # Coordinates
    file: /path/to/file_lineage.yml
    # Whether we want to query datahub-gms for upstream data
    preserve_upstream: False

sink:
  # sink configs
```

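As a rough illustration of a filled-in recipe, the sketch below sets `preserve_upstream` to `True` so that existing upstreams are retained, and assumes a `datahub-rest` sink pointing at a local DataHub instance. The file path and the server URL are placeholders, not values from this guide; adjust them for your deployment.

```yml
source:
  type: datahub-lineage-file
  config:
    # Placeholder path; point this at your own lineage file
    file: ./file_lineage.yml
    # Keep upstreams already stored in DataHub and add the ones from the file
    preserve_upstream: True

sink:
  # Assumed sink: the DataHub REST sink on a local deployment
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Assuming the recipe is saved as `recipe.yml` (a hypothetical filename), it would typically be run with `datahub ingest -c recipe.yml`.
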
## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field               | Required | Default | Description |
|---------------------|----------|---------|-------------|
| `file`              | ✅       |         | Path to lineage file to ingest. |
| `preserve_upstream` |          | `True`  | Whether to query datahub-gms for upstream data. `False` means the run hard-replaces upstream data for a given entity. `True` means it queries the backend for existing upstreams and includes them in the ingestion run. |

### Lineage File Format

The lineage source file should be a `.yml` file with the following top-level keys:

**version**: the version of the lineage file format that the file conforms to. Currently, the only released version
is `1`.

**lineage**: the top-level key of the lineage file, containing a list of **EntityNodeConfig** objects.

**EntityNodeConfig**:

- **entity**: **EntityConfig** object
- **upstream**: (optional) list of child **EntityNodeConfig** objects

**EntityConfig**:

- **name**: name of the entity
- **type**: type of the entity (only `dataset` is supported as of now)
- **env**: the environment of this entity. Should match the values in the
  table [here](https://datahubproject.io/docs/graphql/enums/#fabrictype)
- **platform**: a valid platform like kafka, snowflake, etc.
- **platform_instance**: optional string specifying the platform instance of this entity

You can also view an example lineage file checked in [here](../examples/bootstrap_data/file_lineage.yml).

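To make the structure described above concrete, here is a minimal sketch of a lineage file. The dataset names (`orders_enriched`, `orders_raw`, `customers`) and the platform instance name are hypothetical, and the sketch assumes `PROD` is the appropriate environment value for your entities.

```yml
version: 1
lineage:
  # Downstream dataset whose upstreams are being declared
  - entity:
      name: orders_enriched
      type: dataset
      env: PROD
      platform: snowflake
    upstream:
      # Each upstream is itself an EntityNodeConfig
      - entity:
          name: orders_raw
          type: dataset
          env: PROD
          platform: kafka
      - entity:
          name: customers
          type: dataset
          env: PROD
          platform: snowflake
          # Optional; hypothetical instance name
          platform_instance: analytics_warehouse
```

In this sketch, `orders_enriched` is declared as downstream of both `orders_raw` and `customers`.
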
## Compatibility

Compatible with version 1 of the lineage file format. The source will evolve as we publish newer versions of this
format.

## Questions

If you've got any questions on configuring this source, feel free to ping us
on [our Slack](https://slack.datahubproject.io/)!