mirror of
https://github.com/datahub-project/datahub.git
synced 2025-12-12 18:47:45 +00:00
docs: add guide for using custom sources (#3324)
parent
bd1a5505d7
commit
43f1aeb538
35
docs/how/add-custom-ingestion-source.md
Normal file
@@ -0,0 +1,35 @@
# How to use a custom ingestion source without forking DataHub?

Adding a custom ingestion source is the easiest way to extend DataHub's ingestion framework to support source systems
that are not yet officially supported by DataHub.

## What you need to do

First, build a custom source in your own project, as described in
the [metadata-ingestion source guide](../../metadata-ingestion/adding-source.md).

### How to use this source?

To use this source, you need to do two things:

1. Build a Python package from your project that includes the custom source class.
2. Install this package in the working environment where you use the DataHub CLI to ingest metadata.

You can now reference your ingestion source class as a type in the YAML recipe by using its fully qualified
name (package, module, and class). For example, if your project structure looks like
`<project>/src/my-source/custom_ingestion_source.py` and the custom source class is named `MySourceClass`,
your YAML recipe would look like the following:

```yaml
source:
  type: my-source.custom_ingestion_source.MySourceClass
  config:
    # place for your custom config defined in the configModel
```

When you now run the ingestion, the DataHub client picks up your code, calls the `get_workunits` method, and does
the rest for you. That's it.
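
To make the idea of a fully qualified type more concrete, here is a minimal sketch of how a dotted `module.ClassName`
string can be turned into a class object with the standard library. This is an illustration only, not DataHub's
actual loading code; the helper name `resolve_class` is hypothetical.

```python
import importlib


def resolve_class(qualified_name: str) -> type:
    """Import `pkg.module.ClassName` and return the class object."""
    # Everything before the last dot is the module path, the rest is the class.
    module_path, _, class_name = qualified_name.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {qualified_name!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
ordered_dict_cls = resolve_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

A lookup like this only succeeds if the package is importable, which is why the package has to be installed in the
same environment as the CLI.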

### Example code?

For an example of how this setup could look, and a good starting point for building your first custom source, visit
our [meta-world](https://github.com/acryldata/meta-world) example repository.
@@ -1,24 +1,40 @@
# Adding a Metadata Ingestion Source

There are two ways of adding a metadata ingestion source.

1. You are going to contribute the custom source directly to the DataHub project.
2. You are writing the custom source for yourself and are not going to contribute back (yet).

If you are going for case (1), follow steps 1 to 9 below. If you are building the source for yourself, you can skip
steps 4 to 9 (though writing tests and docs for yourself is still worthwhile) and follow the documentation
on [how to use custom ingestion sources](../docs/how/add-custom-ingestion-source.md)
without forking DataHub.

:::note

This guide assumes that you've already followed the metadata ingestion [developing guide](./developing.md) to set up
your local environment.

:::

### 1. Set up the configuration model

We use [pydantic](https://pydantic-docs.helpmanual.io/) for configuration, and all models must inherit
from `ConfigModel`. The [file source](./src/datahub/ingestion/source/file.py) is a good example.

### 2. Set up the reporter

The reporter interface enables the source to report statistics, warnings, failures, and other information about the run.
Some sources use the default `SourceReport` class, but others inherit and extend that class.
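
The "inherit and extend" pattern can be sketched as follows. To keep the snippet runnable without DataHub installed,
`BaseReportSketch` is a hypothetical stand-in for the default report class, and its fields and method names are
assumptions; a real source would subclass `SourceReport` instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BaseReportSketch:
    """Hypothetical stand-in for the default report class."""

    workunits_produced: int = 0
    warnings: Dict[str, List[str]] = field(default_factory=dict)
    failures: Dict[str, List[str]] = field(default_factory=dict)

    def report_warning(self, key: str, reason: str) -> None:
        self.warnings.setdefault(key, []).append(reason)

    def report_failure(self, key: str, reason: str) -> None:
        self.failures.setdefault(key, []).append(reason)


@dataclass
class MySourceReport(BaseReportSketch):
    """Adds source-specific statistics on top of the common ones."""

    tables_scanned: int = 0
    filtered: List[str] = field(default_factory=list)

    def report_dropped(self, name: str) -> None:
        self.filtered.append(name)


report = MySourceReport()
report.tables_scanned += 1
report.report_dropped("tmp_scratch")
report.report_warning("orders", "missing column comments")
```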

### 3. Implement the source itself

The core of the source is the `get_workunits` method, which produces a stream of MCE objects.
The [file source](./src/datahub/ingestion/source/file.py) is a good and simple example.

The `MetadataChangeEventClass` is defined in the metadata models, which are generated
under `metadata-ingestion/src/datahub/metadata/schema_classes.py`. There are also
some [convenience methods](./src/datahub/emitter/mce_builder.py) for commonly used operations.
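
The streaming shape of `get_workunits` can be sketched with a plain generator. Everything below is a simplified
stand-in so the snippet runs on its own: `FakeWorkUnit` and the dictionary payload are hypothetical, and a real source
would extend DataHub's source base class and emit actual `MetadataChangeEventClass` objects.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class FakeWorkUnit:
    """Hypothetical stand-in for a work unit that wraps one MCE."""

    id: str
    mce: Dict[str, str]


class MySourceClass:
    """Hypothetical source; not a real DataHub base class implementation."""

    def __init__(self, table_names: List[str]) -> None:
        self.table_names = table_names

    def get_workunits(self) -> Iterator[FakeWorkUnit]:
        # Yield lazily so large systems are streamed rather than loaded at once.
        for name in self.table_names:
            mce = {"entity": f"dataset:{name}"}  # placeholder payload, not a real MCE
            yield FakeWorkUnit(id=f"my-source-{name}", mce=mce)


workunits = list(MySourceClass(["orders", "users"]).get_workunits())
print([wu.id for wu in workunits])
```

Using a generator means the caller can start processing the first work unit before the source has finished scanning
the upstream system.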

### 4. Set up the dependencies

@@ -26,7 +42,8 @@ Declare the source's pip dependencies in the `plugins` variable of the [setup sc

### 5. Enable discoverability

Declare the source under the `entry_points` variable of the [setup script](./setup.py). This enables the source to be
listed when running `datahub check plugins`, and sets up the source's shortened alias for use in recipes.
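
As a rough illustration, an entry-point declaration maps a short alias to a `module:Class` path. The group name,
alias, and module path below are assumptions made for this sketch (check the setup script for the actual values), and
`parse_entry_point` is a hypothetical helper that only shows how such a line is structured.

```python
# Assumed shape of the `entry_points` mapping in a setuptools setup script.
entry_points = {
    # Group name is an assumption; verify it against the setup script.
    "datahub.ingestion.source.plugins": [
        # "<alias> = <module path>:<class name>" (hypothetical source)
        "my-source = datahub.ingestion.source.my_source:MySourceClass",
    ],
}


def parse_entry_point(spec: str) -> tuple:
    """Split 'alias = module:Class' into (alias, module, class)."""
    alias, _, target = (part.strip() for part in spec.partition("="))
    module, _, class_name = target.partition(":")
    return alias, module, class_name


alias, module, class_name = parse_entry_point(
    entry_points["datahub.ingestion.source.plugins"][0]
)
```

With a declaration like this, a recipe could refer to the source by the alias (`type: my-source`) instead of the full
module path.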

### 6. Write tests

@@ -34,12 +51,15 @@ Tests go in the `tests` directory. We use the [pytest framework](https://pytest.

### 7. Write docs

Add the plugin to the table at the top of the README file, and add the source's documentation underneath the sources
header.

### 8. Add SQLAlchemy mapping (if applicable)

If the source is based on SQLAlchemy, add it to the `get_platform_from_sqlalchemy_uri` function
in [sql_common.py](./src/datahub/ingestion/source/sql/sql_common.py).
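
The kind of mapping this step refers to can be sketched as a lookup from the URI scheme to a platform name. This is
not the body of `get_platform_from_sqlalchemy_uri`; the function name, the scheme table, the `mydb` entry, and the
`"external"` fallback are all assumptions for illustration.

```python
def platform_from_uri_sketch(sqlalchemy_uri: str) -> str:
    """Map a SQLAlchemy connection URI to a platform name by its scheme."""
    # Scheme prefixes and platform names here are illustrative assumptions.
    prefix_to_platform = {
        "postgresql": "postgres",
        "mysql": "mysql",
        "mydb": "mydb",  # hypothetical new source being added
    }
    scheme = sqlalchemy_uri.split("://", 1)[0]
    dialect = scheme.split("+", 1)[0]  # drop the driver part, e.g. "+psycopg2"
    return prefix_to_platform.get(dialect, "external")


print(platform_from_uri_sketch("mydb://host:1234/catalog"))
```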

### 9. Add logo

Add the logo image to the [images folder](../datahub-web-react/src/images) and add it to the
[boot](../metadata-service/war/src/main/resources/boot/data_platforms.json) data platforms file so that it gets ingested.