# File Based Lineage
For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).
## Setup
Works with `acryl-datahub` out of the box.
## Capabilities
This plugin pulls lineage metadata from a YAML-formatted file. An example of such a file is located in the examples
directory [here](../examples/bootstrap_data/file_lineage.yml).
## Quickstart recipe
Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration
options.
For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).
```yml
source:
  type: datahub-lineage-file
  config:
    # Coordinates
    file: /path/to/file_lineage.yml
    # Whether we want to query datahub-gms for upstream data
    preserve_upstream: False

sink:
  # sink configs
```
## Config details
Note that a `.` is used to denote nested fields in the YAML recipe.
| Field | Required | Default | Description |
|---------------------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `file` | ✅ | | Path to lineage file to ingest. |
| `preserve_upstream` | | `True` | Whether to query datahub-gms for existing upstream data. `False` means the ingestion run hard-replaces the upstream data for a given entity. `True` means it queries the backend for existing upstreams and includes them in the ingestion run. |
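
To illustrate the `preserve_upstream` option, here is a minimal sketch of a complete recipe that keeps the default (`True`) and writes to a `datahub-rest` sink. The file path and the server address `http://localhost:8080` are assumptions for illustration; substitute the values for your own deployment.

```yml
source:
  type: datahub-lineage-file
  config:
    file: /path/to/file_lineage.yml
    # Default behavior: existing upstreams are fetched from the backend
    # and included alongside the ones declared in the file
    preserve_upstream: True

sink:
  type: datahub-rest
  config:
    # Assumed local GMS endpoint; replace with your own
    server: "http://localhost:8080"
```

Setting `preserve_upstream: False` in the same recipe would instead replace any upstreams already recorded for the listed entities.
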
### Lineage File Format
The lineage source file should be a `.yml` file with the following top-level keys:

- **version**: the version of the lineage file format that the file conforms to. Currently, the only released version is `1`.
- **lineage**: a list of **EntityNodeConfig** objects

**EntityNodeConfig**:
- **entity**: **EntityConfig** object
- **upstream**: (optional) list of child **EntityNodeConfig** objects

**EntityConfig**:
- **name**: name of the entity
- **type**: type of the entity (only `dataset` is supported as of now)
- **env**: the environment of this entity. Should match one of the values in the
  table [here](https://datahubproject.io/docs/graphql/enums/#fabrictype)
- **platform**: a valid platform such as kafka, snowflake, etc.
- **platform_instance**: (optional) string specifying the platform instance of this entity
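
The structure described above can be sketched as a small lineage file. The dataset names (`topic1`, `topic2`, `topic3`), the `DEV` environment, and the `kafka` platform are illustrative assumptions, not values required by the source.

```yml
version: 1
lineage:
  # EntityNodeConfig: one entity plus its optional upstreams
  - entity:
      name: topic3
      type: dataset
      env: DEV
      platform: kafka
    upstream:
      # Each upstream item is itself an EntityNodeConfig
      - entity:
          name: topic2
          type: dataset
          env: DEV
          platform: kafka
      - entity:
          name: topic1
          type: dataset
          env: DEV
          platform: kafka
```

Here `topic3` is declared with two upstream datasets, `topic2` and `topic1`.
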
You can also view an example lineage file checked in [here](../examples/bootstrap_data/file_lineage.yml).
## Compatibility
Compatible with version 1 of the lineage file format. The source will evolve as we publish newer versions of this
format.
## Questions
If you've got any questions on configuring this source, feel free to ping us
on [our Slack](https://slack.datahubproject.io/)!