From 43f1aeb538db66d0e5831ce3ba5882f751b1db22 Mon Sep 17 00:00:00 2001
From: David Schmidt
Date: Tue, 26 Oct 2021 07:23:48 +0200
Subject: [PATCH] docs: add guide for using custom sources (#3324)

---
 docs/how/add-custom-ingestion-source.md | 35 +++++++++++++++++++++++
 metadata-ingestion/adding-source.md     | 38 +++++++++++++++++++------
 2 files changed, 64 insertions(+), 9 deletions(-)
 create mode 100644 docs/how/add-custom-ingestion-source.md

diff --git a/docs/how/add-custom-ingestion-source.md b/docs/how/add-custom-ingestion-source.md
new file mode 100644
index 0000000000..806af937b4
--- /dev/null
+++ b/docs/how/add-custom-ingestion-source.md
@@ -0,0 +1,35 @@
+# How to use a custom ingestion source without forking Datahub?
+
+Adding a custom ingestion source is the easiest way to extend Datahubs ingestion framework to support source systems
+which are not yet officially supported by Datahub.
+
+## What you need to do
+
+First thing to do is building a custom source like it is described in
+the [metadata-ingestion source guide](../../metadata-ingestion/adding-source.md) in your own project.
+
+### How to use this source?
+
+To be able to use this source you just need to do a few things.
+
+1. Build a python package out of your project including the custom source class.
+2. Install this package in your working environment where you are using the Datahub CLI to ingest metadata.
+
+Now you are able to just reference your ingestion source class as a type in the YAML recipe by using the fully qualified
+package name. For example if your project structure looks like this `/src/my-source/custom_ingestion_source.py`
+with the custom source class named `MySourceClass` your YAML recipe would look like the following:
+
+```yaml
+source:
+  type: my-source.custom_ingestion_source.MySourceClass
+  config:
+    # place for your custom config defined in the configModel
+```
+
+If you now execute the ingestion the datahub client will pick up your code and call the `get_workunits` method and do
+the rest for you. That's it.
+
+### Example code?
+
+For examples how this setup could look like and a good starting point for building your first custom source visit
+our [meta-world](https://github.com/acryldata/meta-world) example repository.
\ No newline at end of file
diff --git a/metadata-ingestion/adding-source.md b/metadata-ingestion/adding-source.md
index 8988cea541..bc22ffe304 100644
--- a/metadata-ingestion/adding-source.md
+++ b/metadata-ingestion/adding-source.md
@@ -1,24 +1,40 @@
 # Adding a Metadata Ingestion Source
 
+There are two ways of adding a metadata ingestion source.
+
+1. You are going to contribute the custom source directly to the Datahub project.
+2. You are writing the custom source for yourself and are not going to contribute back (yet).
+
+If you are going for case (1) just follow the steps 1 to 9 below. In case you are building it for yourself you can skip
+steps 4-9 (but maybe write tests and docs for yourself as well) and follow the documentation
+on [how to use custom ingestion sources](../docs/how/add-custom-ingestion-source.md)
+without forking Datahub.
+
 :::note
 
-This guide assumes that you've already followed the metadata ingestion [developing guide](./developing.md) to set up your local environment.
+This guide assumes that you've already followed the metadata ingestion [developing guide](./developing.md) to set up
+your local environment.
 
 :::
 
 ### 1. Set up the configuration model
 
-We use [pydantic](https://pydantic-docs.helpmanual.io/) for configuration, and all models must inherit from `ConfigModel`. The [file source](./src/datahub/ingestion/source/file.py) is a good example.
+We use [pydantic](https://pydantic-docs.helpmanual.io/) for configuration, and all models must inherit
+from `ConfigModel`. The [file source](./src/datahub/ingestion/source/file.py) is a good example.
 
 ### 2. Set up the reporter
 
-The reporter interface enables the source to report statistics, warnings, failures, and other information about the run. Some sources use the default `SourceReport` class, but others inherit and extend that class.
+The reporter interface enables the source to report statistics, warnings, failures, and other information about the run.
+Some sources use the default `SourceReport` class, but others inherit and extend that class.
 
 ### 3. Implement the source itself
 
-The core for the source is the `get_workunits` method, which produces a stream of MCE objects. The [file source](./src/datahub/ingestion/source/file.py) is a good and simple example.
+The core for the source is the `get_workunits` method, which produces a stream of MCE objects.
+The [file source](./src/datahub/ingestion/source/file.py) is a good and simple example.
 
-The MetadataChangeEventClass is defined in the metadata models which are generated under `metadata-ingestion/src/datahub/metadata/schema_classes.py`. There are also some [convenience methods](./src/datahub/emitter/mce_builder.py) for commonly used operations.
+The MetadataChangeEventClass is defined in the metadata models which are generated
+under `metadata-ingestion/src/datahub/metadata/schema_classes.py`. There are also
+some [convenience methods](./src/datahub/emitter/mce_builder.py) for commonly used operations.
 
 ### 4. Set up the dependencies
 
@@ -26,7 +42,8 @@ Declare the source's pip dependencies in the `plugins` variable of the [setup sc
 
 ### 5. Enable discoverability
 
-Declare the source under the `entry_points` variable of the [setup script](./setup.py). This enables the source to be listed when running `datahub check plugins`, and sets up the source's shortened alias for use in recipes.
+Declare the source under the `entry_points` variable of the [setup script](./setup.py). This enables the source to be
+listed when running `datahub check plugins`, and sets up the source's shortened alias for use in recipes.
 
 ### 6. Write tests
 
@@ -34,12 +51,15 @@ Tests go in the `tests` directory. We use the [pytest framework](https://pytest.
 
 ### 7. Write docs
 
-Add the plugin to the table at the top of the README file, and add the source's documentation underneath the sources header.
+Add the plugin to the table at the top of the README file, and add the source's documentation underneath the sources
+header.
 
 ### 8. Add SQL Alchemy mapping (if applicable)
 
-Add the source in `get_platform_from_sqlalchemy_uri` function in [sql_common.py](./src/datahub/ingestion/source/sql/sql_common.py) if the source has an sqlalchemy source
+Add the source in `get_platform_from_sqlalchemy_uri` function
+in [sql_common.py](./src/datahub/ingestion/source/sql/sql_common.py) if the source has an sqlalchemy source
 
 ### 9. Add logo
 
-Add logo image in [images folder](../datahub-web-react/src/images) and add it to be ingested in [boot](../metadata-service/war/src/main/resources/boot/data_platforms.json)
\ No newline at end of file
+Add logo image in [images folder](../datahub-web-react/src/images) and add it to be ingested
+in [boot](../metadata-service/war/src/main/resources/boot/data_platforms.json)
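
The new guide in this patch says a recipe can reference a custom source by its fully qualified name in the `type` field. As a rough illustration of that idea, the sketch below shows one common way a dotted path can be turned into a class with the standard library. It is not taken from the DataHub code base: `resolve_class` is a hypothetical helper, and a standard-library class stands in for a class like `MySourceClass`.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import and return the class named by 'package.module.ClassName'.

    Hypothetical helper for illustration only; DataHub's own lookup
    may work differently.
    """
    # Everything before the last dot is the module, the rest is the class.
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# A standard-library class stands in for a custom source class here.
source_cls = resolve_class("collections.OrderedDict")
print(source_cls.__name__)
```

A lookup of this kind only succeeds if the package is installed in the same environment as the CLI, which matches step 2 of the guide. Note that a name such as `my-source` cannot appear in a regular `import` statement because of the hyphen, so an underscore-based package name may be the safer choice.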