docs(scheduling): re-arrange docs related to scheduling, lineage, CLI (#3669)
parent d3081f4807, commit b3ef5ee489

docker/airflow/local_airflow.md

@@ -1,7 +1,7 @@
# Running Airflow locally with DataHub

## Introduction

This document describes how you can run Airflow side-by-side with DataHub's docker images to test out Airflow lineage with DataHub.
This document describes how you can run Airflow side-by-side with DataHub's quickstart docker images to test out Airflow lineage with DataHub.

This offers a much easier way to try out Airflow with DataHub, compared to configuring the containers by hand and setting up the configuration and network connectivity between the two systems.

## Pre-requisites

@@ -11,6 +11,7 @@ docker info | grep Memory

> Total Memory: 7.775GiB
```
- Quickstart: Ensure that you followed [quickstart](../../docs/quickstart.md) to get DataHub up and running.

## Step 1: Set up your Airflow area

- Create an area to host your airflow installation

@@ -20,7 +21,7 @@ docker info | grep Memory
```
mkdir -p airflow_install
cd airflow_install
# Download docker-compose
# Download docker-compose file
curl -L 'https://raw.githubusercontent.com/linkedin/datahub/master/docker/airflow/docker-compose.yaml' -o docker-compose.yaml
# Create dags directory
mkdir -p dags

@@ -94,7 +95,7 @@ Default username and password is:
airflow:airflow
```

## Step 4: Register DataHub connection (hook) to Airflow
## Step 3: Register DataHub connection (hook) to Airflow

```
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080'

@@ -111,7 +112,7 @@ Successfully added `conn_id`=datahub_rest_default : datahub_rest://:@http://data
- Note: This requires Airflow to be able to reach the `datahub-gms` host (the container running the datahub-gms image), which is why we connected the Airflow containers to the `datahub_network` in our custom docker-compose file.
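
For illustration, here is one way to sanity-check that the hook was registered, by listing connections from inside the webserver container. This is a sketch assuming the Airflow 2.x CLI; the container lookup mirrors the command above.

```shell
# List registered connections and confirm datahub_rest_default appears
docker exec -it `docker ps | grep webserver | cut -d " " -f 1` airflow connections list
```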

## Step 3: Find the DAGs and run it
## Step 4: Find the DAGs and run it

Navigate to the Airflow UI to find the sample Airflow DAG we just brought in.

![]()

docs-website/sidebars.js

@@ -85,6 +85,21 @@ module.exports = {
      {
        Sinks: list_ids_in_directory("metadata-ingestion/sink_docs"),
      },
      {
        Scheduling: [
          "metadata-ingestion/schedule_docs/intro",
          "metadata-ingestion/schedule_docs/cron",
          "metadata-ingestion/schedule_docs/airflow",
        ],
      },
      {
        Lineage: [
          "docs/lineage/intro",
          "docs/lineage/airflow",
          "docker/airflow/local_airflow",
          "docs/lineage/sample_code",
        ],
      },
    ],
    "Metadata Modeling": [
      "docs/modeling/metadata-model",

@@ -191,7 +206,6 @@ module.exports = {
      "docs/how/delete-metadata",
      "datahub-web-react/src/app/analytics/README",
      "metadata-ingestion/developing",
      "docker/airflow/local_airflow",
      "docs/how/add-custom-data-platform",
      "docs/how/add-custom-ingestion-source",
      {

docs/cli.md

@@ -2,9 +2,7 @@
DataHub comes with a friendly CLI called `datahub` that allows you to perform a lot of common operations using just the command line.

## Install

### Using pip
## Using pip

We recommend Python virtual environments (venvs) to namespace pip modules. Here's an example setup:
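
The example setup itself is not shown in this diff excerpt; as a rough sketch of the kind of setup the docs describe (the environment directory name is arbitrary):

```shell
python3 -m venv datahub-env              # create a virtual environment
source datahub-env/bin/activate          # activate it
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version                          # sanity-check the install
```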

@@ -27,6 +25,10 @@ datahub version

If you run into an error, try checking the [_common setup issues_](../metadata-ingestion/developing.md#Common-setup-issues).

### Using docker

You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). In case you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image, log onto a shell on the pod, and you should have access to the datahub CLI in your Kubernetes cluster.

## User Guide

The `datahub` CLI allows you to do many things, such as quickstarting a DataHub docker instance locally, ingesting metadata from your sources, and retrieving and modifying metadata.
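
To make that concrete, a few representative invocations (a sketch; the recipe path is a made-up example):

```shell
# Spin up a local DataHub instance using the quickstart docker images
datahub docker quickstart

# Ingest metadata using a recipe file (path is an example)
datahub ingest -c ./recipe.yml

# Check which version of the CLI you are running
datahub version
```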

docs/how/delete-metadata.md

@@ -4,15 +4,12 @@ There are two ways to delete metadata from DataHub.
- Delete metadata attached to entities by providing a specific urn or a filter that identifies a set of entities
- Delete metadata affected by a single ingestion run

To follow this guide you need to use the [DataHub CLI](../cli.md).

Read on to find out how to perform these kinds of deletes.

_Note: Deleting metadata should only be done with care. Always use `--dry-run` to understand what will be deleted before proceeding. Prefer soft-deletes (`--soft`) unless you really want to nuke metadata rows. Hard deletes will actually delete rows in the primary store, and recovering them will require using backups of the primary metadata store. Make sure you understand the implications of issuing soft-deletes versus hard-deletes before proceeding._

## The `datahub` CLI

To use the datahub CLI you follow the installation and configuration guide at [DataHub CLI](../cli.md), or you can use the `datahub-ingestion` docker image as explained in [Docker Images](../../docker/README.md). In case you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image, log onto a shell on the pod, and you should have access to the datahub CLI in your Kubernetes cluster.

## Delete By Urn

To delete all the data related to a single entity, run
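
The exact command is not shown in this diff excerpt; as a sketch of the pattern the note above describes (the urn is a made-up example), a delete by urn looks roughly like:

```shell
# Preview what would be deleted, then soft-delete a single dataset by urn
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,dbname.my_table,PROD)" --dry-run
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,dbname.my_table,PROD)" --soft
```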

docs/lineage/airflow.md (new file)

@@ -0,0 +1,62 @@
# Lineage with Airflow

There are a couple of ways to get lineage information from Airflow into DataHub.

## Using Datahub's Airflow lineage backend (recommended)

:::caution

The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.

:::

## Running on Docker locally

If you are looking to run Airflow and DataHub using docker locally, follow the guide [here](../../docker/airflow/local_airflow.md). Otherwise, proceed to follow the instructions below.

## Setting up Airflow to use DataHub as Lineage Backend

1. You need to install the required dependency in your Airflow environment. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend

```shell
pip install acryl-datahub[airflow]
```

2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.

```shell
# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
```

3. Add the following lines to your `airflow.cfg` file.

```ini
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "prod",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "graceful_exceptions": true }
# The above indentation is important!
```

**Configuration options:**

- `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 2.
- `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
- `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
- `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.

4. Configure `inlets` and `outlets` for your Airflow operators (see the sketch just after this list). For reference, look at the sample DAG in [`lineage_backend_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
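
For orientation, here is a minimal sketch of what step 4 looks like in practice, loosely modeled on the referenced `lineage_backend_demo.py` (Airflow 2.x operator syntax; the DAG id and dataset names are made up):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    dag_id="datahub_lineage_backend_sketch",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    transform = BashOperator(
        task_id="transform_tables",
        bash_command="echo 'run your data tooling here'",
        # The lineage backend reads these and emits the edges to DataHub.
        inlets=[
            Dataset("snowflake", "mydb.schema.tableA"),
            Dataset("snowflake", "mydb.schema.tableB"),
        ],
        outlets=[Dataset("snowflake", "mydb.schema.tableC")],
    )
```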

## Emitting lineage via a separate operator

Take a look at this sample DAG:

- [`lineage_emission_dag.py`](../../metadata-ingestion/src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 2 above for details.
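
As a rough sketch of that pattern (the urns are illustrative only; see `lineage_emission_dag.py` for the authoritative example):

```python
import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

# (inside a DAG definition) Emits a dataset-to-dataset lineage edge using the
# datahub_rest_default hook configured in step 2. Dataset urns are examples.
emit_lineage_task = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",
    mces=[
        builder.make_lineage_mce(
            upstream_urns=[builder.make_dataset_urn("snowflake", "mydb.schema.tableA")],
            downstream_urn=builder.make_dataset_urn("snowflake", "mydb.schema.tableC"),
        )
    ],
)
```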

docs/lineage/intro.md (new file)

@@ -0,0 +1,3 @@
# Introduction to Lineage

See [this video](https://www.youtube.com/watch?v=rONGpsndzRw&ab_channel=DataHub) for Lineage 101 in DataHub.

docs/lineage/sample_code.md (new file)

@@ -0,0 +1,19 @@
# Lineage sample code

The following samples cover emitting dataset-to-dataset, dataset-to-job-to-dataset, dataset-to-chart, chart-to-dashboard and job-to-dataflow lineages.
- [lineage_emitter_mcpw_rest.py](../../metadata-ingestion/examples/library/lineage_emitter_mcpw_rest.py) - emits simple bigquery table-to-table (dataset-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
- [lineage_dataset_job_dataset.py](../../metadata-ingestion/examples/library/lineage_dataset_job_dataset.py) - emits mysql-to-airflow-to-kafka (dataset-to-job-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
- [lineage_dataset_chart.py](../../metadata-ingestion/examples/library/lineage_dataset_chart.py) - emits the dataset-to-chart lineage via REST as MetadataChangeProposalWrapper.
- [lineage_chart_dashboard.py](../../metadata-ingestion/examples/library/lineage_chart_dashboard.py) - emits the chart-to-dashboard lineage via REST as MetadataChangeProposalWrapper.
- [lineage_job_dataflow.py](../../metadata-ingestion/examples/library/lineage_job_dataflow.py) - emits the job-to-dataflow lineage via REST as MetadataChangeProposalWrapper.
- [lineage_emitter_rest.py](../../metadata-ingestion/examples/library/lineage_emitter_rest.py) - emits simple dataset-to-dataset lineage via REST as MetadataChangeEvent.
- [lineage_emitter_kafka.py](../../metadata-ingestion/examples/library/lineage_emitter_kafka.py) - emits simple dataset-to-dataset lineage via Kafka as MetadataChangeEvent.
- [Datahub Snowflake Lineage](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/snowflake.py#L249) - emits Datahub's Snowflake lineage as MetadataChangeProposalWrapper.
- [Datahub Bigquery Lineage](https://github.com/linkedin/datahub/blob/a1bf95307b040074c8d65ebb86b5eb177fdcd591/metadata-ingestion/src/datahub/ingestion/source/sql/bigquery.py#L229) - emits Datahub's Bigquery lineage as MetadataChangeProposalWrapper.
- [Datahub Dbt Lineage](https://github.com/linkedin/datahub/blob/a9754ebe83b6b73bc2bfbf49d9ebf5dbd2ca5a8f/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L625,L630) - emits Datahub's DBT lineage as MetadataChangeEvent.

NOTE:
- Emitting aspects as MetadataChangeProposalWrapper is recommended over emitting aspects via MetadataChangeEvent.
- Emitting any aspect associated with an entity completely overwrites the previous value of the aspect associated with the entity. This means that emitting a lineage aspect associated with a dataset will overwrite lineage edges that already exist.
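
For a quick feel of the simplest case, here is a minimal sketch of emitting a dataset-to-dataset edge over REST. It uses the convenience MCE helper rather than the recommended MetadataChangeProposalWrapper pattern shown in the linked samples; the table names and GMS address are examples.

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MCE stating that downstream_table is derived from upstream_table.
lineage_mce = builder.make_lineage_mce(
    upstream_urns=[builder.make_dataset_urn("bigquery", "myproject.mydataset.upstream_table")],
    downstream_urn=builder.make_dataset_urn("bigquery", "myproject.mydataset.downstream_table"),
)

# Emit it to DataHub's GMS over REST.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)
```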

metadata-ingestion/README.md

@@ -186,101 +186,12 @@ In some cases, you might want to construct the MetadataChangeEvents yourself but

- [DataHub emitter via REST](./src/datahub/emitter/rest_emitter.py) (same requirements as `datahub-rest`).
- [DataHub emitter via Kafka](./src/datahub/emitter/kafka_emitter.py) (same requirements as `datahub-kafka`).

### Sample code
#### Lineage
The following samples cover emitting dataset-to-dataset, dataset-to-job-to-dataset, dataset-to-chart, chart-to-dashboard and job-to-dataflow lineages.
- [lineage_emitter_mcpw_rest.py](./examples/library/lineage_emitter_mcpw_rest.py) - emits simple bigquery table-to-table (dataset-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
- [lineage_dataset_job_dataset.py](./examples/library/lineage_dataset_job_dataset.py) - emits mysql-to-airflow-to-kafka (dataset-to-job-to-dataset) lineage via REST as MetadataChangeProposalWrapper.
- [lineage_dataset_chart.py](./examples/library/lineage_dataset_chart.py) - emits the dataset-to-chart lineage via REST as MetadataChangeProposalWrapper.
- [lineage_chart_dashboard.py](./examples/library/lineage_chart_dashboard.py) - emits the chart-to-dashboard lineage via REST as MetadataChangeProposalWrapper.
- [lineage_job_dataflow.py](./examples/library/lineage_job_dataflow.py) - emits the job-to-dataflow lineage via REST as MetadataChangeProposalWrapper.
- [lineage_emitter_rest.py](./examples/library/lineage_emitter_rest.py) - emits simple dataset-to-dataset lineage via REST as MetadataChangeEvent.
- [lineage_emitter_kafka.py](./examples/library/lineage_emitter_kafka.py) - emits simple dataset-to-dataset lineage via Kafka as MetadataChangeEvent.
- [Datahub Snowflake Lineage](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/snowflake.py#L249) - emits Datahub's Snowflake lineage as MetadataChangeProposalWrapper.
- [Datahub Bigquery Lineage](https://github.com/linkedin/datahub/blob/a1bf95307b040074c8d65ebb86b5eb177fdcd591/metadata-ingestion/src/datahub/ingestion/source/sql/bigquery.py#L229) - emits Datahub's Bigquery lineage as MetadataChangeProposalWrapper.
- [Datahub Dbt Lineage](https://github.com/linkedin/datahub/blob/a9754ebe83b6b73bc2bfbf49d9ebf5dbd2ca5a8f/metadata-ingestion/src/datahub/ingestion/source/dbt.py#L625,L630) - emits Datahub's DBT lineage as MetadataChangeEvent.

NOTE:
- Emitting aspects as MetadataChangeProposalWrapper is recommended over emitting aspects via MetadataChangeEvent.
- Emitting any aspect associated with an entity completely overwrites the previous value of the aspect associated with the entity. This means that emitting a lineage aspect associated with a dataset will overwrite lineage edges that already exist.

#### Programmatic Pipeline
### Programmatic Pipeline
In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it.
- [programmatic_pipeline.py](./examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline.
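
As a rough sketch of that pattern (the recipe dict mirrors the mysql recipe used elsewhere in these docs; hosts and credentials are examples):

```python
from datahub.ingestion.run.pipeline import Pipeline

# Configure a mysql -> datahub-rest pipeline entirely in code.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "database": "dbname",
                "username": "root",
                "password": "example",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)

pipeline.run()
pipeline.raise_from_status()
```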

## Lineage with Airflow

There are a couple of ways to get lineage information from Airflow into DataHub.

:::note

If you're simply looking to run ingestion on a schedule, take a look at these sample DAGs:

- [`generic_recipe_sample_dag.py`](./src/datahub_provider/example_dags/generic_recipe_sample_dag.py) - reads a DataHub ingestion recipe file and runs it
- [`mysql_sample_dag.py`](./src/datahub_provider/example_dags/mysql_sample_dag.py) - runs a MySQL metadata ingestion pipeline using an inlined configuration.

:::

### Using Datahub's Airflow lineage backend (recommended)

:::caution

The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.

:::

### Running on Docker locally

If you are looking to run Airflow and DataHub using docker locally, follow the guide [here](../docker/airflow/local_airflow.md). Otherwise, proceed to follow the instructions below.

### Setting up Airflow to use DataHub as Lineage Backend

1. You need to install the required dependency in your Airflow environment. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend

```shell
pip install acryl-datahub[airflow]
```

2. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.

```shell
# For REST-based:
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
# For Kafka-based (standard Kafka sink config can be passed via extras):
airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
```

3. Add the following lines to your `airflow.cfg` file.

```ini
[lineage]
backend = datahub_provider.lineage.datahub.DatahubLineageBackend
datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "prod",
    "capture_ownership_info": true,
    "capture_tags_info": true,
    "graceful_exceptions": true }
# The above indentation is important!
```

**Configuration options:**

- `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 2.
- `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
- `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
- `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.

4. Configure `inlets` and `outlets` for your Airflow operators. For reference, look at the sample DAG in [`lineage_backend_demo.py`](./src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](./src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.

### Emitting lineage via a separate operator

Take a look at this sample DAG:

- [`lineage_emission_dag.py`](./src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the Datahub hook. Like in ingestion, we support a Datahub REST hook and a Kafka-based hook. See step 2 above for details.

## Developing

See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./transformers.md).

metadata-ingestion/schedule_docs/airflow.md (new file)

@@ -0,0 +1,12 @@
# Using Airflow

If you are using Apache Airflow for your scheduling, you might want to also use it for scheduling your ingestion recipes. For Airflow-specific questions, see the [Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/) for more details.

To schedule your recipe through Airflow, follow these steps (a minimal DAG sketch appears at the end of this section):
- Create a recipe file, e.g. `recipe.yml`
- Ensure the recipe file is in a folder accessible to your Airflow workers. You can either specify an absolute path on the machines where Airflow is installed or a path relative to `AIRFLOW_HOME`.
- Ensure the [DataHub CLI](../../docs/cli.md) is installed in your Airflow environment
- Create a sample DAG file like [`generic_recipe_sample_dag.py`](../src/datahub_provider/example_dags/generic_recipe_sample_dag.py). This will read your DataHub ingestion recipe file and run it.
- Deploy the DAG file into Airflow for scheduling. Typically this involves checking the DAG file into your dags folder, which is accessible to your Airflow instance.

Alternatively, you can use an inline recipe as given in [`mysql_sample_dag.py`](../src/datahub_provider/example_dags/mysql_sample_dag.py). This runs a MySQL metadata ingestion pipeline using an inlined configuration.
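
For orientation, a minimal sketch of such a DAG that simply shells out to the CLI (Airflow 2.x; the DAG id, schedule and recipe path are made-up examples):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="datahub_recipe_ingest",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    # Runs the DataHub CLI against a recipe file that the Airflow workers can read.
    ingest = BashOperator(
        task_id="run_recipe",
        bash_command="datahub ingest -c /opt/airflow/dags/mysql_to_datahub.yml",
    )
```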

metadata-ingestion/schedule_docs/cron.md (new file)

@@ -0,0 +1,28 @@
# Using Cron

Assume you have a recipe file `/home/ubuntu/datahub_ingest/mysql_to_datahub.yml` on your machine:
```
source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:3306
    database: dbname

    # Credentials
    username: root
    password: example

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

We can use crontab to schedule ingestion to run five minutes after midnight, every day, using the [DataHub CLI](../../docs/cli.md).

```
5 0 * * * datahub ingest -c /home/ubuntu/datahub_ingest/mysql_to_datahub.yml
```

Read through the [crontab docs](https://man7.org/linux/man-pages/man5/crontab.5.html) for more options related to scheduling.
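
As a small optional addition to the entry above, you can redirect the run's output to a log file so failures are easy to diagnose later, for example:

```
5 0 * * * datahub ingest -c /home/ubuntu/datahub_ingest/mysql_to_datahub.yml >> /home/ubuntu/datahub_ingest/ingestion.log 2>&1
```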

metadata-ingestion/schedule_docs/intro.md (new file)

@@ -0,0 +1,30 @@
# Introduction to Scheduling Metadata Ingestion

Given a recipe file `/home/ubuntu/datahub_ingest/mysql_to_datahub.yml`:
```
source:
  type: mysql
  config:
    # Coordinates
    host_port: localhost:3306
    database: dbname

    # Credentials
    username: root
    password: example

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

We can ingest our metadata using the [DataHub CLI](../../docs/cli.md) as follows:

```
datahub ingest -c /home/ubuntu/datahub_ingest/mysql_to_datahub.yml
```

This will ingest metadata from the `mysql` source configured in the recipe file, but it only runs once. As the source system changes, we would like those changes reflected in DataHub, which means someone needs to re-run the ingestion command with the recipe file.

As an alternative to running the command manually, we can schedule ingestion to run on a regular basis. In this section we give some examples of how to schedule metadata ingestion into DataHub.