# Metadata Ingestion
This module hosts an extensible Python-based metadata ingestion system for DataHub.
It supports sending metadata to DataHub using Kafka or through the REST API.
It can be used through our CLI tool or as a library, e.g. with an orchestrator like Airflow.
### Architecture

The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub.
### Prerequisites
Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through [quickstart Docker images](../docker).
<!-- You can run this ingestion framework by building from source or by running docker images. -->
### Building from source
#### Requirements
1. Python 3.6+ must be installed in your host environment.
2. You also need to build the `mxe-schemas` module as below.
   ```sh
   (cd .. && ./gradlew :metadata-events:mxe-schemas:build)
   ```
   This is needed to generate `MetadataChangeEvent.avsc`, which is the schema for the `MetadataChangeEvent_v4` Kafka topic.
3. On MacOS: `brew install librdkafka`
4. On Debian/Ubuntu: `sudo apt install librdkafka-dev python3-dev python3-venv`
#### Set up your Python environment
```sh
# Create and activate a virtual environment.
python3 -m venv venv
source venv/bin/activate
# Install build tooling and the package itself in editable mode.
pip install --upgrade pip wheel setuptools
pip install -e .
# Generate the metadata model classes.
./scripts/codegen.sh
```
Common issues (click to expand):
<details>
<summary>Wheel issues e.g. "Failed building wheel for avro-python3" or "error: invalid command 'bdist_wheel'"</summary>

This means Python's `wheel` is not installed. Run the following commands and then retry the install.
```sh
pip install --upgrade pip wheel setuptools
pip cache purge
```
</details>
<details>
<summary>Failure to install confluent_kafka: "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1"</summary>

This sometimes happens if there's a version mismatch between the Kafka C client library and the Python wrapper. Try running `pip install confluent_kafka==1.5.0` and then retrying.
</details>
<details>
<summary>Failure to install avro-python3: "distutils.errors.DistutilsOptionError: Version loaded from file: avro/VERSION.txt does not comply with PEP 440"</summary>

The underlying `avro-python3` package is buggy. In particular, it often only installs correctly from a pre-built "wheel" but not from source. Run the following commands and then retry.
```sh
pip uninstall avro-python3 # sanity check, ok if this fails
pip install --upgrade pip wheel setuptools
pip cache purge
pip install avro-python3
```
</details>
### Usage
```sh
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml
```
<!--
## Running Ingestion using Docker:
### Build the image
```sh
source docker/docker_build.sh
```
### Usage - with Docker
We have a simple script provided that supports mounting a local directory for input recipes and an output directory for output data:
```sh
source docker/docker_run.sh examples/recipes/file_to_file.yml
```
-->
We have also included a couple of [sample DAGs](./examples/airflow) that can be used with [Airflow](https://airflow.apache.org/); a rough sketch of what such a DAG can look like follows the list below.
- `generic_recipe_sample_dag.py` - a simple Airflow DAG that picks up a DataHub ingestion recipe configuration and runs it.
- `mysql_sample_dag.py` - an Airflow DAG that runs a MySQL metadata ingestion pipeline using an inlined configuration.
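For orientation, here is a minimal sketch of a recipe-driven DAG. This is not a copy of the bundled samples: it assumes Airflow 1.10-style imports, that the `datahub` CLI is installed in the Airflow worker environment, and uses a hypothetical DAG id and recipe path.

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Hypothetical DAG that shells out to the `datahub` CLI once a day.
with DAG(
    dag_id="datahub_recipe_ingest",  # hypothetical name
    start_date=days_ago(1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_recipe",
        # Point this at a recipe file that is available on the Airflow worker.
        bash_command="datahub ingest -c /path/to/your_recipe.yml",
    )
```

The sample DAGs linked above show more complete versions of this pattern.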
## Recipes
A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink).
Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.
```yaml
# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the REST API.
source:
  type: mssql
  config:
    username: sa
    password: test!Password
    database: DemoData
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
Running a recipe is quite easy.
```sh
datahub ingest -c ./examples/recipes/mssql_to_datahub.yml
```
A number of recipes are included in the [examples/recipes](./examples/recipes) directory.
## Sources
### Kafka Metadata `kafka`
Extracts:
- List of topics - from the Kafka broker
- Schemas associated with each topic - from the schema registry
```yml
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: http://localhost:8081
      consumer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#deserializingconsumer
```
### MySQL Metadata `mysql`
Extracts:
- List of databases and tables
- Column types and schema associated with each table

Extra requirements: `pip install pymysql`
```yml
source:
  type: mysql
  config:
    username: root
    password: example
    database: dbname
    host_port: localhost:3306
    table_pattern:
      allow:
        - "schema1.table2"
      deny:
        - "performance_schema"
    # The 'table_pattern' above can also be used to skip everything in certain schemas,
    # but filtering at the schema level is an optimization for when there are many schemas
    # to skip: it avoids needlessly fetching their tables only to filter them out afterwards.
    schema_pattern:
      allow:
        - "schema1"
      deny:
        - "garbage_schema"
```
### Microsoft SQL Server Metadata `mssql`
Extracts:
- List of databases, schemas, and tables
- Column types associated with each table

Extra requirements: `pip install sqlalchemy-pytds`
```yml
source:
  type: mssql
  config:
    username: user
    password: pass
    host_port: localhost:1433
    database: DemoDatabase
    table_pattern:
      allow:
        - "schema1.table1"
        - "schema1.table2"
      deny:
        - "^.*\\.sys_.*" # deny all tables that start with sys_
    options:
      # Any options specified here will be passed to SQLAlchemy's create_engine as kwargs.
      # See https://docs.sqlalchemy.org/en/14/core/engines.html for details.
      charset: "utf8"
```
### Hive `hive`
Extracts:
- List of databases, schemas, and tables
- Column types associated with each table

Extra requirements: `pip install pyhive[hive]`
```yml
source:
  type: hive
  config:
    username: user
    password: pass
    host_port: localhost:10000
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### PostgreSQL `postgres`
Extracts:
- List of databases, schemas, and tables
- Column types associated with each table

Extra requirements: `pip install psycopg2-binary` or `pip install psycopg2`
If you're using PostGIS extensions for Postgres, also use `pip install GeoAlchemy2`.
```yml
source:
  type: postgres
  config:
    username: user
    password: pass
    host_port: localhost:5432
    database: DemoDatabase
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### Snowflake `snowflake`
Extracts:
- List of databases, schemas, and tables
- Column types associated with each table

Extra requirements: `pip install snowflake-sqlalchemy`
```yml
source:
  type: snowflake
  config:
    username: user
    password: pass
    host_port: account_name
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
### Google BigQuery `bigquery`
Extracts:
- List of databases, schemas, and tables
- Column types associated with each table

Extra requirements: `pip install pybigquery`
```yml
source:
  type: bigquery
  config:
    project_id: project # optional - can autodetect from environment
    dataset: dataset_name
    options: # options is same as above
      # See https://github.com/mxmzdlv/pybigquery#authentication for details.
      credentials_path: "/path/to/keyfile.json" # optional
    # table_pattern/schema_pattern is same as above
```
### LDAP `ldap`
Extracts:
- List of people
- Names, emails, titles, and manager information for each person

Extra requirements: `pip install python-ldap>=2.4`. See that [library's docs](https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#pre-built-binaries) for extra build requirements.
```yml
source:
  type: "ldap"
  config:
    ldap_server: ldap://localhost
    ldap_user: "cn=admin,dc=example,dc=org"
    ldap_password: "admin"
    base_dn: "dc=example,dc=org"
    filter: "(objectClass=*)" # optional field
```
### File `file`
Pulls metadata from a previously generated file. Note that the file sink
can produce such files, and a number of samples are included in the
[examples/mce_files](examples/mce_files) directory.
```yml
source:
  type: file
  filename: ./path/to/mce/file.json
```
## Sinks
### DataHub Rest `datahub-rest`
Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface is that any errors are reported immediately.
```yml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
### DataHub Kafka `datahub-kafka`
Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based interface is that it's asynchronous and can handle higher throughput. This requires the DataHub mce-consumer container to be running.
```yml
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      producer_config: {} # passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/index.html#serializingproducer
```
### Console `console`
Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.
```yml
sink:
  type: "console"
```
### File `file`
Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the file source can read files generated by this sink.
```yml
sink:
  type: file
  filename: ./path/to/mce/file.json
```
## Using as a library
In some cases, you might want to construct MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In that case, take a look at the emitter interfaces, which can easily be imported and called from your own code; a rough sketch follows the list below.
- [DataHub emitter via REST](./src/datahub/emitter/rest_emitter.py)
- [DataHub emitter via Kafka](./src/datahub/emitter/kafka_emitter.py)
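Here is a minimal sketch of the REST flow. The class and method names (`DatahubRestEmitter`, `emit_mce`) and the generated `datahub.metadata.schema_classes` module are assumptions - confirm them against the emitter modules linked above and the code-generated models.

```python
# Sketch only: names below are assumptions, not a documented API surface.
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetSnapshotClass,
    MetadataChangeEventClass,
    StatusClass,
)

# Construct a minimal MetadataChangeEvent: a dataset snapshot with one aspect.
mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:mysql,dbname.table1,PROD)",
        aspects=[StatusClass(removed=False)],
    )
)

# Point the emitter at the GMS endpoint (the same server the datahub-rest sink uses).
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(mce)
```

The Kafka emitter follows the same pattern, but is configured with the kind of connection details used by the `datahub-kafka` sink above.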
## Migrating from the old scripts
If you were previously using the `mce_cli.py` tool to push metadata into DataHub: the new way to do this is to create a recipe with a file source pointing at your JSON file and a DataHub sink.

This [example recipe](./examples/recipes/example_to_datahub_rest.yml) demonstrates how to ingest the [sample data](./examples/mce_files/bootstrap_mce.json) (previously called `bootstrap_mce.dat`) into DataHub over the REST API.

Note that we no longer use the `.dat` format, but instead use JSON. The main differences are that the JSON uses `null` instead of `None` and uses objects/dictionaries instead of tuples when representing unions.

If you were previously using one of the `sql-etl` scripts: the new way to do this is to use the corresponding source. See the [Sources](#sources) section above for configuration details. Note that the source needs to be paired with a sink - likely `datahub-kafka` or `datahub-rest`, depending on your needs.
## Contributing
Contributions welcome!
### Code layout
- The CLI interface is defined in [entrypoints.py](./src/datahub/entrypoints.py).
- The high-level interfaces are defined in the [API directory](./src/datahub/ingestion/api).
- The actual [source](./src/datahub/ingestion/source) and [sink](./src/datahub/ingestion/sink) implementations have their own directories - the `__init__.py` files in those directories are used to register the short codes for use in recipes.
- The metadata models are created using code generation, and eventually live in the `./src/datahub/metadata` directory. However, these files are not checked in and instead are generated at build time. See the [codegen](./scripts/codegen.sh) script for details.
### Testing
```sh
# Follow standard install procedure - see above.
# Install requirements.
pip install -r test_requirements.txt
# Run unit tests.
pytest tests/unit
# Run integration tests.
# Note: the integration tests require docker.
pytest tests/integration
```
### Sanity check code before committing
```sh
# Requires test_requirements.txt to have been installed.
black --exclude 'datahub/metadata' -S -t py36 src tests  # format code
isort src tests   # sort imports
flake8 src tests  # lint
mypy -p datahub   # type check
pytest            # run tests
```