docs: add guide for using custom sources (#3324)

David Schmidt 2021-10-26 07:23:48 +02:00 committed by GitHub
parent bd1a5505d7
commit 43f1aeb538
2 changed files with 64 additions and 9 deletions

# How to use a custom ingestion source without forking DataHub?
Adding a custom ingestion source is the easiest way to extend DataHub's ingestion framework to support source systems
which are not yet officially supported by DataHub.
## What you need to do
First, build a custom source in your own project, as described in
the [metadata-ingestion source guide](../../metadata-ingestion/adding-source.md).
### How to use this source?
To use this source, you just need to do two things:
1. Build a python package out of your project including the custom source class.
2. Install this package in your working environment where you are using the Datahub CLI to ingest metadata.
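Step 1 can be as small as a one-file `setup.py`; a minimal sketch (all names here are placeholders for your own project, not a prescribed layout):

```python
# setup.py -- minimal packaging sketch for a custom source project
# (package name, version, and layout are placeholders)
from setuptools import find_packages, setup

setup(
    name="my-source",
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    install_requires=[
        # the DataHub CLI package you run `datahub ingest` with
        "acryl-datahub",
    ],
)
```

With that in place, `pip install .` (or `pip install -e .` during development) in the environment that runs the DataHub CLI completes step 2.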
Now you can reference your ingestion source class as the type in your YAML recipe, using its fully qualified
package name. For example, if your project structure looks like this `<project>/src/my-source/custom_ingestion_source.py`
with the custom source class named `MySourceClass`, your YAML recipe would look like the following:
```yaml
source:
  type: my-source.custom_ingestion_source.MySourceClass
  config:
    # place for your custom config defined in the configModel
```
When you now execute the ingestion, the DataHub client picks up your code, calls its `get_workunits` method, and does
the rest for you. That's it.
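Under the hood, this amounts to a dynamic import of the class named in `type`. A rough sketch of that lookup, using a stdlib class as a stand-in for a custom source (`resolve_source_class` is a hypothetical helper, not DataHub's actual code):

```python
import importlib

def resolve_source_class(type_string: str) -> type:
    """Split a fully qualified 'package.module.ClassName' string and import it."""
    module_path, class_name = type_string.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# a stdlib class stands in for your custom source class here
decoder_cls = resolve_source_class("json.JSONDecoder")
print(decoder_cls.__name__)  # JSONDecoder
```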
### Example code?
For an example of what this setup can look like, and a good starting point for building your first custom source, visit
our [meta-world](https://github.com/acryldata/meta-world) example repository.

# Adding a Metadata Ingestion Source
There are two ways to add a metadata ingestion source:

1. You are going to contribute the custom source directly to the DataHub project.
2. You are writing the custom source for yourself and are not going to contribute back (yet).

If you are going for case (1), just follow steps 1 to 9 below. If you are building the source for yourself, you can skip
steps 4-9 (though writing tests and docs is a good idea even then) and follow the documentation
on [how to use custom ingestion sources](../docs/how/add-custom-ingestion-source.md)
without forking DataHub.
:::note
This guide assumes that you've already followed the metadata ingestion [developing guide](./developing.md) to set up
your local environment.
:::
### 1. Set up the configuration model
We use [pydantic](https://pydantic-docs.helpmanual.io/) for configuration, and all models must inherit
from `ConfigModel`. The [file source](./src/datahub/ingestion/source/file.py) is a good example.
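The shape of such a config model can be sketched with a stdlib stand-in (the real class inherits from `ConfigModel` and gets its validation from pydantic; `MySourceConfig` and its fields are made up for illustration):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MySourceConfig:
    # connection settings for a hypothetical source system
    base_url: str = "http://localhost:8080"
    env: str = "PROD"
    # simple allow-list filtering, a common pattern in source configs
    table_allow: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # pydantic would perform this kind of validation automatically
        if not self.base_url.startswith(("http://", "https://")):
            raise ValueError(f"base_url must be an http(s) URL, got {self.base_url!r}")
```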
### 2. Set up the reporter
The reporter interface enables the source to report statistics, warnings, failures, and other information about the run.
Some sources use the default `SourceReport` class, but others inherit and extend that class.
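Conceptually, a report is just a structured bag of counters and messages that grows during the run; a simplified stand-in (not the real `SourceReport` class) might look like:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SimpleSourceReport:
    # counters and messages collected over the course of a run
    workunits_produced: int = 0
    warnings: Dict[str, List[str]] = field(default_factory=dict)
    failures: Dict[str, List[str]] = field(default_factory=dict)

    def report_workunit(self) -> None:
        self.workunits_produced += 1

    def report_warning(self, key: str, reason: str) -> None:
        self.warnings.setdefault(key, []).append(reason)

    def report_failure(self, key: str, reason: str) -> None:
        self.failures.setdefault(key, []).append(reason)

report = SimpleSourceReport()
report.report_workunit()
report.report_warning("table_foo", "missing schema")
```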
### 3. Implement the source itself
The core of a source is the `get_workunits` method, which produces a stream of MCE objects.
The [file source](./src/datahub/ingestion/source/file.py) is a good and simple example.
The `MetadataChangeEventClass` is defined in the metadata models, which are generated
under `metadata-ingestion/src/datahub/metadata/schema_classes.py`. There are also
some [convenience methods](./src/datahub/emitter/mce_builder.py) for commonly used operations.
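Stripped of the real MCE classes, `get_workunits` is a generator that walks the source system and yields one work unit per entity. A schematic stand-in (plain dicts instead of `MetadataChangeEventClass`, a hard-coded table list instead of a live connection):

```python
from typing import Dict, Iterable

def get_workunits() -> Iterable[Dict[str, str]]:
    # in a real source, this list would come from the external system's API
    tables = ["db.orders", "db.customers"]
    for table in tables:
        # a real source yields work unit objects wrapping an MCE;
        # a plain dict stands in for that structure here
        yield {
            "id": table,
            "urn": f"urn:li:dataset:(urn:li:dataPlatform:my-source,{table},PROD)",
        }

units = list(get_workunits())
```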
### 4. Set up the dependencies
Declare the source's pip dependencies in the `plugins` variable of the [setup script](./setup.py).
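The `plugins` variable is essentially a mapping from plugin name to the extra pip requirements that plugin pulls in; schematically (the entry below is illustrative, not taken from the actual setup script):

```python
# schematic shape of the `plugins` mapping in the setup script:
# plugin name -> requirements installed via `pip install 'acryl-datahub[my-source]'`
plugins = {
    "my-source": {"requests>=2.25"},
}
```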
### 5. Enable discoverability
Declare the source under the `entry_points` variable of the [setup script](./setup.py). This enables the source to be
listed when running `datahub check plugins`, and sets up the source's shortened alias for use in recipes.
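Such a registration maps the shortened alias to the module path and class of the source; schematically (the group name and paths below are illustrative, not copied from the actual setup script):

```python
# schematic entry_points registration: alias -> "module.path:ClassName"
entry_points = {
    "datahub.ingestion.source.plugins": [
        "my-source = my_package.my_module:MySourceClass",
    ],
}
```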
### 6. Write tests
Tests go in the `tests` directory. We use the [pytest framework](https://docs.pytest.org/).
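A unit test for a source typically compares the produced metadata against known expected values; a minimal pytest-style sketch (the helper and the sample data are made up for illustration):

```python
def extract_table_names(workunits):
    # hypothetical helper: pull the ids out of produced work units
    return sorted(wu["id"] for wu in workunits)

def test_table_names_match_expected():
    produced = [{"id": "db.orders"}, {"id": "db.customers"}]
    expected = ["db.customers", "db.orders"]
    assert extract_table_names(produced) == expected

test_table_names_match_expected()
```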
### 7. Write docs
Add the plugin to the table at the top of the README file, and add the source's documentation underneath the sources
header.
### 8. Add SQL Alchemy mapping (if applicable)
Add the source to the `get_platform_from_sqlalchemy_uri` function
in [sql_common.py](./src/datahub/ingestion/source/sql/sql_common.py) if the source is SQLAlchemy-based.
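The mapping itself is essentially a prefix check on the SQLAlchemy connection URI; a simplified stand-in for `get_platform_from_sqlalchemy_uri` (not the actual implementation):

```python
def platform_from_uri(sqlalchemy_uri: str) -> str:
    # map the URI scheme prefix to a DataHub platform name
    prefixes = {
        "postgresql": "postgres",
        "mysql": "mysql",
        "mssql": "mssql",
    }
    for prefix, platform in prefixes.items():
        if sqlalchemy_uri.startswith(prefix):
            return platform
    # fall back to a generic platform for unknown schemes
    return "external"
```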
### 9. Add logo
Add a logo image to the [images folder](../datahub-web-react/src/images) and add it to be ingested
in [boot](../metadata-service/war/src/main/resources/boot/data_platforms.json).