# DataHub Metadata Ingestion
![Python version 3.6+](https://img.shields.io/badge/python-3.6%2B-blue)
This module hosts an extensible Python-based metadata ingestion system for DataHub.
This supports sending data to DataHub using Kafka or through the REST API.
It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.
## Getting Started
### Prerequisites
Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through [quickstart Docker images](../docker).
### Install from PyPI
The folks over at [Acryl](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion.
```sh
# Requires Python 3.6+
python3 -m pip install --upgrade pip==20.2.4 wheel setuptools
python3 -m pip uninstall datahub acryl-datahub || true # sanity check - ok if it fails
python3 -m pip install acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version
```
If you run into an error, try checking the [_common setup issues_](./developing.md#Common-setup-issues).
#### Installing Plugins
We use a plugin architecture so that you can install only the dependencies you actually need.
| Plugin Name | Install Command | Provides |
| ------------- | ---------------------------------------------------------- | -------------------------- |
| file | _included by default_ | File source and sink |
| console | _included by default_ | Console sink |
| athena | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| bigquery | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| glue | `pip install 'acryl-datahub[glue]'` | AWS Glue source |
| hive | `pip install 'acryl-datahub[hive]'` | Hive source |
| mssql | `pip install 'acryl-datahub[mssql]'` | SQL Server source |
| mysql | `pip install 'acryl-datahub[mysql]'` | MySQL source |
| postgres | `pip install 'acryl-datahub[postgres]'` | Postgres source |
| oracle | `pip install 'acryl-datahub[oracle]'` | Oracle source |
| snowflake | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| mongodb | `pip install 'acryl-datahub[mongodb]'` | MongoDB source |
| ldap | `pip install 'acryl-datahub[ldap]'` ([extra requirements]) | LDAP source |
| kafka         | `pip install 'acryl-datahub[kafka]'`                        | Kafka source               |
| druid | `pip install 'acryl-datahub[druid]'` | Druid Source |
| dbt | _no additional dependencies_ | DBT source |
| datahub-rest | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
| datahub-kafka | `pip install 'acryl-datahub[datahub-kafka]'` | DataHub sink over Kafka |
These plugins can be mixed and matched as desired. For example:
```sh
pip install 'acryl-datahub[bigquery,datahub-rest]'
```
You can check the active plugins:
```sh
datahub check plugins
```
[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites
#### Basic Usage
```sh
pip install 'acryl-datahub[datahub-rest]' # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
### Install using Docker
[![Docker Hub](https://img.shields.io/docker/pulls/linkedin/datahub-ingestion?style=plastic)](https://hub.docker.com/r/linkedin/datahub-ingestion)
[![datahub-ingestion docker](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/linkedin/datahub/actions/workflows/docker-ingestion.yml)
If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.
_Limitation: the `datahub_docker.sh` convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly._
```sh
./scripts/datahub_docker.sh ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
### Install from source
If you'd like to install from source, see the [developer guide](./developing.md).
### Usage within Airflow
We have also included a couple of [sample DAGs](./examples/airflow) that can be used with [Airflow](https://airflow.apache.org/); a bare-bones sketch of a similar DAG is shown after the list.
- `generic_recipe_sample_dag.py` - a simple Airflow DAG that picks up a DataHub ingestion recipe configuration and runs it.
- `mysql_sample_dag.py` - an Airflow DAG that runs a MySQL metadata ingestion pipeline using an inlined configuration.
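
If you keep the recipe in a file and shell out to the CLI, a DAG can stay very small. The sketch below is an illustration rather than a copy of the bundled examples: it assumes Airflow 2.x, that `acryl-datahub` (with the plugins you need) is installed on the worker, and that the DAG id, schedule, and recipe path are placeholders to adjust for your environment.

```python
# Hypothetical DAG: runs `datahub ingest` against a recipe file on a daily schedule.
# Assumes Airflow 2.x and that the datahub CLI is on the worker's PATH.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="datahub_ingest_recipe",  # illustrative name
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    ingest_task = BashOperator(
        task_id="run_datahub_recipe",
        # Adjust this path to wherever your recipe lives on the worker.
        bash_command="datahub ingest -c /opt/airflow/recipes/example_to_datahub_rest.yml",
    )
```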
## Recipes
A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink).
Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.
```yaml
# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the REST API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
We automatically expand environment variables in the config,
similar to variable substitution in GNU bash or in docker-compose files. For details, see
https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution.
Running a recipe is quite easy.
```sh
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
```
A number of recipes are included in the examples/recipes directory.
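
Recipes can also be run from Python instead of the CLI, which is the pattern used by `mysql_sample_dag.py`. The snippet below is a minimal sketch of that approach: it assumes the `Pipeline` class accepts the same source/sink structure as the YAML recipes, and it reuses the placeholder MSSQL values from the example recipe above.

```python
# Minimal sketch: run an ingestion pipeline programmatically using the same
# source/sink configuration that a YAML recipe would contain.
import os

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mssql",
            "config": {
                "username": "sa",
                # Equivalent of the ${MSSQL_PASSWORD} expansion in a YAML recipe.
                "password": os.environ["MSSQL_PASSWORD"],
                "database": "DemoData",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
```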
## Sources
### Kafka Metadata `kafka`
Extracts:
- List of topics - from the Kafka broker
- Schemas associated with each topic - from the schema registry
```yml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: http://localhost:8081
      consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#deserializingconsumer
```
### MySQL Metadata `mysql`
Extracts:
- List of databases and tables
- Column types and schema associated with each table
```yml
source:
  type: mysql
  config:
    username: root
    password: example
    database: dbname
    host_port: localhost:3306
    table_pattern:
      deny:
        # Note that the deny patterns take precedence over the allow patterns.
        - "performance_schema"
      allow:
        - "schema1.table2"
    # Although 'table_pattern' alone could be used to skip everything from certain schemas,
    # the schema-level allow/deny option is an optimization for when there are many schemas to skip:
    # it avoids needlessly fetching those tables only to filter them out afterwards via the table_pattern.
    schema_pattern:
      deny:
        - "garbage_schema"
      allow:
        - "schema1"
```
### Microsoft SQL Server Metadata `mssql`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
  type: mssql
  config:
    username: user
    password: pass
    host_port: localhost:1433
    database: DemoDatabase
    table_pattern:
      deny:
        - "^.*\\.sys_.*" # deny all tables that start with sys_
      allow:
        - "schema1.table1"
        - "schema1.table2"
    options:
      # Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
      # See https://docs.sqlalchemy.org/en/14/core/engines.html for details.
      charset: "utf8"
```
### Hive `hive`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
  type: hive
  config:
    username: user
    password: pass
    host_port: localhost:10000
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### PostgreSQL `postgres`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
- Also supports PostGIS extensions
```yml
source:
  type: postgres
  config:
    username: user
    password: pass
    host_port: localhost:5432
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### Snowflake `snowflake`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
  type: snowflake
  config:
    username: user
    password: pass
    host_port: account_name
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### Oracle `oracle`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
  type: oracle
  config:
    # For more details on authentication, see the documentation:
    # https://docs.sqlalchemy.org/en/14/dialects/oracle.html#dialect-oracle-cx_oracle-connect and
    # https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#connection-strings.
    username: user
    password: pass
    host_port: localhost:5432
    database: dbname
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### Google BigQuery `bigquery`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
```yml
source:
  type: bigquery
  config:
    project_id: project # optional - can autodetect from environment
    dataset: dataset_name
    options: # options is same as above
      # See https://github.com/mxmzdlv/pybigquery#authentication for details.
      credentials_path: "/path/to/keyfile.json" # optional
    # table_pattern/schema_pattern is same as above
```
### AWS Athena `athena`
Extracts:
- List of databases and tables
- Column types associated with each table
```yml
source:
  type: athena
  config:
    username: aws_access_key_id # Optional. If not specified, credentials are picked up according to boto3 rules.
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    password: aws_secret_access_key # Optional.
    database: database # Optional, defaults to "default"
    aws_region: aws_region_name # e.g. "eu-west-1"
    s3_staging_dir: s3_location # e.g. "s3://<bucket-name>/prefix/"
    # The s3_staging_dir parameter is needed because Athena always writes query results to S3.
    # See https://docs.aws.amazon.com/athena/latest/ug/querying.html
    # However, the Athena driver will transparently fetch these results, as you would expect from any other SQL client.
    work_group: athena_workgroup # e.g. "primary"
    # table_pattern/schema_pattern is same as above
```
### AWS Glue `glue`
Extracts:
- List of tables
- Column types associated with each table
- Table metadata, such as owner, description and parameters
```yml
source:
  type: glue
  config:
    aws_region: aws_region_name # e.g. "eu-west-1"
    env: "PROD" # Optional, defaults to "PROD". Environment used for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP".
    database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
    table_pattern: # Optional, to filter tables scanned, same as table_pattern above.
    aws_access_key_id: # Optional. If not specified, credentials are picked up according to boto3 rules.
    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    aws_secret_access_key: # Optional.
    aws_session_token: # Optional.
```
### Druid `druid`
Extracts:
- List of databases, schema, and tables
- Column types associated with each table
**Note:** If you supply a `schema_pattern`, explicitly include a deny pattern for the internal Druid databases (`lookup` and `sys`); otherwise the crawler may crash before processing the relevant databases. This deny pattern is applied by default, but it is overridden by any user-submitted configuration.
```yml
source:
  type: druid
  config:
    # Point to the broker address
    host_port: localhost:8082
    schema_pattern:
      deny:
        - "^(lookup|sys).*"
    # options is same as above
```
### MongoDB `mongodb`
Extracts:
- List of databases
- List of collections in each database
```yml
source:
  type: "mongodb"
  config:
    # For advanced configurations, see the MongoDB docs.
    # https://pymongo.readthedocs.io/en/stable/examples/authentication.html
    connect_uri: "mongodb://localhost"
    username: admin
    password: password
    authMechanism: "DEFAULT"
    options: {}
    database_pattern: {}
    collection_pattern: {}
    # database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
```
### LDAP `ldap`
Extracts:
- List of people
- Names, emails, titles, and manager information for each person
```yml
source:
  type: "ldap"
  config:
    ldap_server: ldap://localhost
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"
    base_dn: "dc=example,dc=org"
    filter: "(objectClass=*)" # optional field
```
### File `file`
Pulls metadata from a previously generated file. Note that the file sink
can produce such files, and a number of samples are included in the
[examples/mce_files](examples/mce_files) directory.
```yml
source:
  type: file
  filename: ./path/to/mce/file.json
```
### DBT `dbt`
Pull metadata from DBT output files:
- [dbt manifest file](https://docs.getdbt.com/reference/artifacts/manifest-json)
  - This file contains model, source, and lineage data.
- [dbt catalog file](https://docs.getdbt.com/reference/artifacts/catalog-json)
  - This file contains schema data.
  - DBT does not record schema data for ephemeral models, so DataHub will show ephemeral models in the lineage but without an associated schema.
```yml
source:
  type: "dbt"
  config:
    manifest_path: "./path/dbt/manifest_file.json"
    catalog_path: "./path/dbt/catalog_file.json"
```
## Sinks
### DataHub Rest `datahub-rest`
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface is that any errors are reported immediately.
```yml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
### DataHub Kafka `datahub-kafka`
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based interface is that it's asynchronous and can handle higher throughput. This requires the DataHub mce-consumer container to be running.
```yml
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#serializingproducer
```
### Console `console`
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
```yml
sink:
  type: "console"
```
### File `file`
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the file source can read files generated by this sink.
```yml
sink:
  type: file
  config:
    filename: ./path/to/mce/file.json
```
## Using as a library
In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.
- [DataHub emitter via REST](./src/datahub/emitter/rest_emitter.py) (same requirements as `datahub-rest`)
- [DataHub emitter via Kafka](./src/datahub/emitter/kafka_emitter.py) (same requirements as `datahub-kafka`)
For a basic usage example, see [lineage_emitter.py](./examples/library/lineage_emitter.py).
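
As a rough illustration of the pattern (not a substitute for the bundled example), the snippet below constructs a single MetadataChangeEvent with a dataset-properties aspect and sends it over REST. It assumes the generated classes in `datahub.metadata.schema_classes` and the `DatahubRestEmitter` interface; the URN, description, and server address are placeholders.

```python
# Sketch: build one MetadataChangeEvent by hand and emit it via the REST emitter.
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

# Placeholder URN and description, for illustration only.
mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:mysql,dbname.tablename,PROD)",
        aspects=[DatasetPropertiesClass(description="Dataset described from code.")],
    )
)

emitter = DatahubRestEmitter("http://localhost:8080")  # same server as the datahub-rest sink
emitter.emit_mce(mce)
```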
## Developing
See the [developing guide](./developing.md).