Oftentimes we want to modify metadata before it reaches the ingestion sink – for instance, we might want to add custom tags, ownership, or properties, or patch certain fields. A transformer allows us to do exactly these things.
Moreover, a transformer gives you fine-grained control over the ingested metadata without having to modify the ingestion framework's code yourself. Instead, you can write your own module that transforms MCEs however you like. To hook it into a recipe, all that's needed is the module name and any arguments it takes.
Aside from the option of writing your own transformer (see below), we provide some simple transformers for the use cases of adding: dataset tags, dataset glossary terms, dataset properties and ownership information.
### Adding a set of tags

Let’s suppose we’d like to add a set of dataset tags. To do so, we can use the `simple_add_dataset_tags` module that’s included in the ingestion framework.
The config, which we’d append to our ingestion recipe YAML, would look like this:
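The tag URNs below are examples; substitute your own:

```yaml
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
```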
### Adding tags by dataset urn pattern

Let’s suppose we’d like to append a series of tags to specific datasets. To do so, we can use the `pattern_add_dataset_tags` module that’s included in the ingestion framework. This matches each regex pattern against the `urn` of the dataset and assigns the respective tag URNs given in the array.
The config, which we’d append to our ingestion recipe YAML, would look like this:
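The regex rules and tag URNs below are examples:

```yaml
transformers:
  - type: "pattern_add_dataset_tags"
    config:
      tag_pattern:
        rules:
          ".*example1.*": ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"]
          ".*example2.*": ["urn:li:tag:NeedsDocumentation"]
```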
If you'd like to add more complex logic for assigning tags, you can use the more generic `add_dataset_tags` transformer, which calls a user-provided function to determine the tags for each dataset.
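Its config takes a reference to that function; `<your_module>.<your_function>` below is a placeholder for your own import path:

```yaml
transformers:
  - type: "add_dataset_tags"
    config:
      get_tags_to_add: "<your_module>.<your_function>"
```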
Finally, you can install and use your custom transformer as [shown here](#installing-the-package).
### Adding a set of glossary terms
We can use a similar convention to associate [Glossary Terms](https://datahubproject.io/docs/metadata-ingestion/source_docs/business_glossary) to datasets. We can use the `simple_add_dataset_terms` module that’s included in the ingestion framework.
The config, which we’d append to our ingestion recipe YAML, would look like this:
```yaml
transformers:
  - type: "simple_add_dataset_terms"
    config:
      term_urns:
        - "urn:li:glossaryTerm:Email"
        - "urn:li:glossaryTerm:Address"
```
### Adding glossary terms by dataset urn pattern
Similar to the above example with tags, we can add glossary terms to datasets based on a regex filter, using the `pattern_add_dataset_terms` module.
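The config, which we’d append to our ingestion recipe YAML, would look like this (the rules and term URNs are examples):

```yaml
transformers:
  - type: "pattern_add_dataset_terms"
    config:
      term_pattern:
        rules:
          ".*example1.*": ["urn:li:glossaryTerm:Email", "urn:li:glossaryTerm:Address"]
          ".*example2.*": ["urn:li:glossaryTerm:PostalCode"]
```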
### Removing dataset ownership

If we want to clear the existing owners sent by the ingestion source, we can use the `simple_remove_dataset_ownership` module, which removes all owners sent by the ingestion source.
```yaml
transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}
```
The main use case of `simple_remove_dataset_ownership` is to remove incorrect owners present in the source. You can use it in combination with `simple_add_dataset_ownership` (below) to remove the wrong owners and add the correct ones.
### Adding a set of owners

Let’s suppose we’d like to append a series of users who we know own a dataset but who aren't detected during normal ingestion. To do so, we can use the `simple_add_dataset_ownership` module that’s included in the ingestion framework.
The config, which we’d append to our ingestion recipe YAML, would look like this:
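The owner URNs below are examples; `ownership_type` is optional and defaults to `DATAOWNER`:

```yaml
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
        - "urn:li:corpGroup:groupname"
      ownership_type: "PRODUCER"
```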
### Adding owners by dataset urn pattern

Again, let’s suppose we’d like to append a series of users who we know own a different dataset from a data source but who aren't detected during normal ingestion. To do so, we can use the `pattern_add_dataset_ownership` module that’s included in the ingestion framework. This matches the pattern against the `urn` of the dataset and assigns the respective owners.
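The config, which we’d append to our ingestion recipe YAML, would look like this (rules and owner URNs are examples):

```yaml
transformers:
  - type: "pattern_add_dataset_ownership"
    config:
      owner_pattern:
        rules:
          ".*example1.*": ["urn:li:corpuser:username1"]
          ".*example2.*": ["urn:li:corpuser:username2"]
```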
If you'd like to add more complex logic for assigning ownership, you can use the more generic `add_dataset_ownership` transformer, which calls a user-provided function to determine the ownership of each dataset.
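As with tags, the config points at your own function; `<your_module>.<your_function>` is a placeholder:

```yaml
transformers:
  - type: "add_dataset_ownership"
    config:
      get_owners_to_add: "<your_module>.<your_function>"
```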
### Marking the dataset status

If you would like to stop a dataset from appearing in the UI, you need to mark its status as removed; the `mark_dataset_status` module does exactly this. You can use this transformer after filtering for the specific datasets that you want to mark as removed.
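The config takes a single flag:

```yaml
transformers:
  - type: "mark_dataset_status"
    config:
      removed: true
```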
### Adding the dataset browse path

If you would like to add to the browse paths of a dataset, you can use the `set_dataset_browse_path` transformer. There are 3 optional variables that you can use to pull information from the dataset `urn` (see the example config after this list):
- `ENV`: the environment the dataset was ingested with (default: `prod`)
- `PLATFORM`: `mysql`, `postgres`, or any other platform supported by DataHub
- `DATASET_PARTS`: slash-separated parts of the dataset name, e.g. `database_name/schema_name/[table_name]` for postgres
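For example, consider a config along these lines (the literal `marketing_db` and `data_warehouse` segments are illustrative):

```yaml
transformers:
  - type: "set_dataset_browse_path"
    config:
      path_templates:
        - /PLATFORM/marketing_db/DATASET_PARTS
        - /data_warehouse/DATASET_PARTS
```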
This would add 2 browse paths, `/mysql/marketing_db/sales/orders` and `/data_warehouse/sales/orders`, for a table `sales.orders` in a `mysql` database instance.
### Adding dataset properties

If you'd like to add more complex logic for assigning properties, you can use the `add_dataset_properties` transformer, which calls a user-provided class (that extends from the `AddDatasetPropertiesResolverBase` class) to determine the properties for each dataset.
The config, which we’d append to our ingestion recipe YAML, would look like this:
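Here, `<your_module>.<your_class>` is a placeholder for your resolver class's import path:

```yaml
transformers:
  - type: "add_dataset_properties"
    config:
      add_properties_resolver_class: "<your_module>.<your_class>"
```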
### Writing a custom transformer from scratch

In the above examples, we used classes that have already been implemented in the ingestion framework. However, more advanced cases often pop up where custom code is required, for instance if you'd like to apply conditional logic or rewrite properties. In such cases, we can write our own module and define the arguments it takes as a custom transformer.
As an example, suppose we want to append a set of ownership fields to our metadata that are dependent upon an external source – for instance, an API endpoint or file – rather than a preset list like above. In this case, we can set a JSON file as an argument to our custom config, and our transformer will read this file and append the included ownership classes to all our MCEs (if you'd like, you could also include filtering logic for specific MCEs).
Our JSON file might look like the following:
```json
[
  "urn:li:corpuser:athos",
  "urn:li:corpuser:porthos",
  "urn:li:corpuser:aramis",
  "urn:li:corpGroup:the_three_musketeers"
]
```
### Defining a config
To get started, we’ll define an `AddCustomOwnershipConfig` class that inherits from [`datahub.configuration.common.ConfigModel`](./src/datahub/configuration/common.py). Its sole parameter is `owners_json`, which expects a path to a JSON file containing a list of owner URNs. This will go in a file called `custom_transform_example.py`.
```python
from datahub.configuration.common import ConfigModel


class AddCustomOwnershipConfig(ConfigModel):
    owners_json: str
```
### Defining the transformer
Next, we’ll define the transformer itself, which must inherit from [`datahub.ingestion.api.transform.Transformer`](./src/datahub/ingestion/api/transform.py). First, let's get all our imports in:
```python
# append these to the start of custom_transform_example.py
import json
from typing import Iterable

# for constructing URNs
import datahub.emitter.mce_builder as builder

# for typing the config model
from datahub.configuration.common import ConfigModel

# for typing context and records
from datahub.ingestion.api.common import PipelineContext, RecordEnvelope

# base transformer class
from datahub.ingestion.api.transform import Transformer

# MCE-related classes
from datahub.metadata.schema_classes import (
    DatasetSnapshotClass,
    MetadataChangeEventClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)
```
Next, let's define the base scaffolding for the class:
```python
# append this to the end of custom_transform_example.py
class AddCustomOwnership(Transformer):
    """Transformer that adds owners to datasets according to a callback function."""

    # context param to generate run metadata such as a run ID
    def __init__(self, config: AddCustomOwnershipConfig, ctx: PipelineContext):
        self.ctx = ctx
        self.config = config

        # read the list of owner URNs from the configured JSON file
        with open(config.owners_json, "r") as f:
            raw_owner_urns = json.load(f)

        # wrap each URN in an OwnerClass
        self.owners = [
            OwnerClass(owner=owner, type=OwnershipTypeClass.DATAOWNER)
            for owner in raw_owner_urns
        ]

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "AddCustomOwnership":
        config = AddCustomOwnershipConfig.parse_obj(config_dict)
        return cls(config, ctx)
```
Now we need to add a `transform()` method that does the work of adding our custom ownership classes. This method takes an MCE as input and outputs the transformed MCE. Let's offload the processing of each MCE to a separate `transform_one()` method.
```python
# add this as a method of AddCustomOwnership
def transform(
    self, record_envelopes: Iterable[RecordEnvelope]
) -> Iterable[RecordEnvelope]:
    # loop over envelopes
    for envelope in record_envelopes:
        # if envelope is an MCE, add the ownership classes
        if isinstance(envelope.record, MetadataChangeEventClass):
            envelope.record = self.transform_one(envelope.record)
        yield envelope
```
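A minimal `transform_one()`, assuming we simply want to append the configured owners to every dataset snapshot, might look like this (it uses the `get_or_add_aspect` helper from `mce_builder`):

```python
# add this as a method of AddCustomOwnership
def transform_one(self, mce: MetadataChangeEventClass) -> MetadataChangeEventClass:
    # only dataset snapshots carry the ownership aspect we care about
    if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass):
        return mce
    if self.owners:
        # fetch the existing ownership aspect, or attach an empty one
        ownership = builder.get_or_add_aspect(mce, OwnershipClass(owners=[]))
        ownership.owners.extend(self.owners)
    return mce
```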
### More Sophistication: Making calls to DataHub during Transformation
In some advanced cases, you might want to check with DataHub before performing a transformation. A good example is retrieving the current set of owners of a dataset before deciding on the new set of owners to emit during ingestion. To allow transformers to query the graph, the framework provides them access to it through the context object `ctx`. Connectivity to the graph is automatically instantiated anytime the pipeline uses a REST sink. If you are using the Kafka sink, you can additionally provide access to the graph by configuring it in your pipeline.
Here is an example of a recipe that uses Kafka as the sink, but provides access to the graph by explicitly configuring the `datahub_api`.
```yaml
source:
  type: mysql
  config:
    # ..source configs

sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: localhost:9092
      schema_registry_url: "http://localhost:8081"

datahub_api:
  server: http://localhost:8080
  # standard configs accepted by datahub rest client ...
```
#### Advanced Use-Case: Patching Owners
With the above capability, we can now build more powerful transformers that can check with the server-side state before issuing changes in metadata.
For example, here is how the `AddDatasetOwnership` transformer could support PATCH semantics, ensuring that it never deletes any owners that are already stored on the server.
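A sketch of what that could look like, assuming `self.ctx.graph` is populated as described above and using the `get_ownership()` helper of the `DataHubGraph` client:

```python
# a PATCH-style variant of transform_one(): merge with server-side owners
def transform_one(self, mce: MetadataChangeEventClass) -> MetadataChangeEventClass:
    if not isinstance(mce.proposedSnapshot, DatasetSnapshotClass):
        return mce

    # fetch the owners currently stored on the server, if the graph is available
    server_ownership = (
        self.ctx.graph.get_ownership(mce.proposedSnapshot.urn)
        if self.ctx.graph
        else None
    )
    existing = server_ownership.owners if server_ownership else []

    # append only the configured owners the server doesn't already know about
    known = {owner.owner for owner in existing}
    merged = existing + [o for o in self.owners if o.owner not in known]

    ownership = builder.get_or_add_aspect(mce, OwnershipClass(owners=[]))
    ownership.owners = merged
    return mce
```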
### Installing the package

Now that we've defined the transformer, we need to make it visible to DataHub. The easiest way to do this is to just place it in the same directory as your recipe, in which case the module name is the same as the file name – in this case, `custom_transform_example`.
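With that in place, the transformer can be referenced in a recipe by module and class name (the `owners_json` path below is a placeholder):

```yaml
transformers:
  - type: "custom_transform_example.AddCustomOwnership"
    config:
      owners_json: "<path_to_owners_json>"
```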
<details>
<summary>Advanced: installing as a package</summary>
Alternatively, create a `setup.py` in the same directory as our transform script to make it visible globally. After installing this package (e.g. with `python setup.py install` or `pip install -e .`), our module will be installed and importable as `custom_transform_example`.
```python
from setuptools import find_packages, setup

setup(
    name="custom_transform_example",
    version="1.0",
    packages=find_packages(),
    # if you don't already have DataHub installed, add it under install_requires
    # install_requires=["acryl-datahub"]
)
```

</details>