# Metadata Ingestion

This module hosts an extensible Python-based metadata ingestion system for DataHub.
This supports sending data to DataHub using Kafka or through the REST API.
It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.
## Getting Started
### Prerequisites
Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through the [quickstart Docker images](../docker).
### Install from PyPI
The folks over at [Acryl Data](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion.
```shell
# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
```
If you run into an error, try checking the [_common setup issues_](./developing.md#common-setup-issues).
#### Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!

Sources:

| Plugin Name | Install Command | Provides |
| --- | --- | --- |
| [file](./source_docs/file.md) | _included by default_ | File source and sink |
| [athena](./source_docs/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| [bigquery-usage](./source_docs/bigquery.md) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| [datahub-business-glossary](./source_docs/business_glossary.md) | _no additional dependencies_ | Business Glossary File source |
| [dbt](./source_docs/dbt.md) | _no additional dependencies_ | dbt source |
| [druid](./source_docs/druid.md) | `pip install 'acryl-datahub[druid]'` | Druid source |
| [feast](./source_docs/feast.md) | `pip install 'acryl-datahub[feast]'` | Feast source |
| [glue](./source_docs/glue.md) | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
| [hive](./source_docs/hive.md) | `pip install 'acryl-datahub[hive]'` | Hive source |
| [kafka](./source_docs/kafka.md) | `pip install 'acryl-datahub[kafka]'` | Kafka source |
| [kafka-connect](./source_docs/kafka-connect.md) | `pip install 'acryl-datahub[kafka-connect]'` | Kafka Connect source |
| [ldap](./source_docs/ldap.md) | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
| [looker](./source_docs/looker.md) | `pip install 'acryl-datahub[looker]'` | Looker source |
| [lookml](./source_docs/lookml.md) | `pip install 'acryl-datahub[lookml]'` | LookML source, requires Python 3.7+ |
| [mongodb](./source_docs/mongodb.md) | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
| [mssql](./source_docs/mssql.md) | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
| [mysql](./source_docs/mysql.md) | `pip install 'acryl-datahub[mysql]'` | MySQL source |
| [mariadb](./source_docs/mariadb.md) | `pip install 'acryl-datahub[mariadb]'` | MariaDB source |
| [openapi](./source_docs/openapi.md) | `pip install 'acryl-datahub[openapi]'` | OpenAPI source |
| [oracle](./source_docs/oracle.md) | `pip install 'acryl-datahub[oracle]'` | Oracle source |
| [postgres](./source_docs/postgres.md) | `pip install 'acryl-datahub[postgres]'` | Postgres source |
| [redash](./source_docs/redash.md) | `pip install 'acryl-datahub[redash]'` | Redash source |
| [redshift](./source_docs/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| [sagemaker](./source_docs/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| [snowflake](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| [snowflake-usage](./source_docs/snowflake.md) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| [sql-profiles](./source_docs/sql_profiles.md) | `pip install 'acryl-datahub[sql-profiles]'` | Data profiles for SQL-based systems |
| [sqlalchemy](./source_docs/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| [superset](./source_docs/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
| [trino](./source_docs/trino.md) | `pip install 'acryl-datahub[trino]'` | Trino source |
| [starburst-trino-usage](./source_docs/trino.md) | `pip install 'acryl-datahub[starburst-trino-usage]'` | Starburst Trino usage statistics source |

Sinks:

| Plugin Name | Install Command | Provides |
| --- | --- | --- |
| [file](./sink_docs/file.md) | _included by default_ | File source and sink |
| [console](./sink_docs/console.md) | _included by default_ | Console sink |
| [datahub-rest](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
| [datahub-kafka](./sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |

These plugins can be mixed and matched as desired. For example:

```shell
pip install 'acryl-datahub[bigquery,datahub-rest]'
```

You can check the active plugins:

```shell
datahub check plugins
```

[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites
#### Basic Usage
```shell
pip install 'acryl-datahub[datahub-rest]' # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to ensure that the ingestion recipe is producing the desired workunits before ingesting them into DataHub.
```shell
# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n
```
The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
This option helps with quick end-to-end smoke testing of the ingestion recipe.
```shell
# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n --preview
```
### Install using Docker
[Docker Hub](https://hub.docker.com/r/linkedin/datahub-ingestion) [Build status](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml)

If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker Hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.

_Limitation: the `datahub_docker.sh` convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly._
```shell
# Assumes the DataHub repo is cloned locally.
./metadata-ingestion/scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
### Install from source
If you'd like to install from source, see the [developer guide](./developing.md).
## Recipes
A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink).
Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.
```yaml
# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the Rest API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
We automatically expand environment variables in the config, similar to variable substitution in GNU bash or in docker-compose files. For details, see https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution.

Running a recipe is quite easy.

```shell
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
```

A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](#installing-plugins).
## Transformations
If you'd like to modify data before it reaches the ingestion sinks (for instance, adding additional owners or tags), you can write a transformer module and integrate it with DataHub.

Check out the [transformers guide](./transformers.md) for more info!
## Using as a library
In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.

- [DataHub emitter via REST](./src/datahub/emitter/rest_emitter.py) (same requirements as `datahub-rest`). Basic usage [example](./examples/library/lineage_emitter_rest.py).
- [DataHub emitter via Kafka](./src/datahub/emitter/kafka_emitter.py) (same requirements as `datahub-kafka`). Basic usage [example](./examples/library/lineage_emitter_kafka.py).
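
As a rough sketch, emitting a lineage MCE over the REST API might look like the snippet below. It assumes the local quickstart GMS address (`http://localhost:8080`), and the dataset names are placeholders; the linked examples above are the authoritative reference.

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MetadataChangeEvent. Both dataset names are placeholders;
# substitute your own platform and dataset identifiers.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("bigquery", "upstream_project.dataset.table")],  # upstream(s)
    builder.make_dataset_urn("bigquery", "downstream_project.dataset.table"),  # downstream
)

# Emit the MCE over REST; requires the datahub-rest plugin to be installed.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)
```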
## Lineage with Airflow
There are a couple of ways to get lineage information from Airflow into DataHub.

:::note
If you're simply looking to run ingestion on a schedule, take a look at these sample DAGs:
- [`generic_recipe_sample_dag.py`](./src/datahub_provider/example_dags/generic_recipe_sample_dag.py) - reads a DataHub ingestion recipe file and runs it.
- [`mysql_sample_dag.py`](./src/datahub_provider/example_dags/mysql_sample_dag.py) - runs a MySQL metadata ingestion pipeline using an inlined configuration.

:::
### Using DataHub's Airflow lineage backend (recommended)
:::caution
The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.
:::
### Running on Docker locally
If you are looking to run Airflow and DataHub locally using Docker, follow the guide [here](../docker/airflow/local_airflow.md). Otherwise, proceed with the instructions below.
### Setting up Airflow to use DataHub as Lineage Backend
1. You need to install the required dependency in your Airflow environment. See https://registry.astronomer.io/providers/datahub/modules/datahublineagebackend

   ```shell
   pip install 'acryl-datahub[airflow]'
   ```

2. You must configure an Airflow hook for DataHub. We support both a DataHub REST hook and a Kafka-based hook, but you only need one.

   ```shell
   # For REST-based:
   airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
   # For Kafka-based (standard Kafka sink config can be passed via extras):
   airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
   ```

3. Add the following lines to your `airflow.cfg` file.

   ```ini
   [lineage]
   backend = datahub_provider.lineage.datahub.DatahubLineageBackend
   datahub_kwargs = {
       "datahub_conn_id": "datahub_rest_default",
       "cluster": "prod",
       "capture_ownership_info": true,
       "capture_tags_info": true,
       "graceful_exceptions": true }
   # The above indentation is important!
   ```

**Configuration options:**
- `datahub_conn_id` (required): Usually `datahub_rest_default` or `datahub_kafka_default`, depending on what you named the connection in step 2.
- `cluster` (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
- `capture_ownership_info` (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
- `capture_tags_info` (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
- `graceful_exceptions` (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.

4. Configure `inlets` and `outlets` for your Airflow operators, as in the sketch after this list. For reference, look at the sample DAG in [`lineage_backend_demo.py`](./src/datahub_provider/example_dags/lineage_backend_demo.py), or reference [`lineage_backend_taskflow_demo.py`](./src/datahub_provider/example_dags/lineage_backend_taskflow_demo.py) if you're using the [TaskFlow API](https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html).
5. [optional] Learn more about [Airflow lineage](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html), including shorthand notation and some automation.
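
A rough sketch of step 4, assuming Airflow 2.x and using placeholder platform/table names and a hypothetical DAG id (the demo DAGs above are the authoritative reference):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from datahub_provider.entities import Dataset

with DAG(
    dag_id="datahub_lineage_backend_sketch",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # The lineage backend reads the operator's inlets/outlets when the task
    # runs and emits the corresponding lineage to DataHub.
    transform = BashOperator(
        task_id="run_transform",
        bash_command="echo 'transforming data'",
        inlets=[Dataset("snowflake", "mydb.schema.tableA")],
        outlets=[Dataset("snowflake", "mydb.schema.tableC")],
    )
```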
### Emitting lineage via a separate operator
Take a look at this sample DAG:
- [`lineage_emission_dag.py`](./src/datahub_provider/example_dags/lineage_emission_dag.py) - emits lineage using the DatahubEmitterOperator.

In order to use this example, you must first configure the DataHub hook. Like in ingestion, we support a DataHub REST hook and a Kafka-based hook. See step 2 above for details.
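
A condensed sketch of that pattern, with placeholder dataset names, a hypothetical DAG id, and assuming the `datahub_rest_default` connection from step 2 (the sample DAG above is the authoritative reference):

```python
from datetime import datetime

from airflow import DAG

import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

with DAG(
    dag_id="datahub_lineage_emission_sketch",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Explicitly emit a lineage MCE through the configured DataHub REST hook.
    emit_lineage = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[
            builder.make_lineage_mce(
                [builder.make_dataset_urn("snowflake", "mydb.schema.tableA")],
                builder.make_dataset_urn("snowflake", "mydb.schema.tableC"),
            )
        ],
    )
```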
## Developing
See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./transformers.md).