# Developing on Metadata Ingestion

If you just want to use metadata ingestion, check the [user-centric](./README.md) guide.
This document is for developers who want to develop and possibly contribute to the metadata ingestion framework.

Also take a look at the guide to [adding a source](./adding-source.md).

## Getting Started

### Requirements

1. Python 3.8+ must be installed in your host environment.
2. Java 17 must be installed (Gradle won't work with newer or older versions).
3. On Debian/Ubuntu: `sudo apt install python3-dev python3-venv`
4. On Fedora (if using LDAP source integration): `sudo yum install openldap-devel`

### Set up your Python environment

From the repository root:

```shell
cd metadata-ingestion
../gradlew :metadata-ingestion:installDev
source venv/bin/activate
datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
```

### (Optional) Set up your Python environment for developing on Airflow Plugin

From the repository root:

```shell
cd metadata-ingestion-modules/airflow-plugin
../../gradlew :metadata-ingestion-modules:airflow-plugin:installDev
source venv/bin/activate
datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"

# start the airflow web server
export AIRFLOW_HOME=~/airflow
airflow webserver --port 8090 -d

# start the airflow scheduler
airflow scheduler

# access the airflow service and run any of the DAGs
# open http://localhost:8090/
# select any DAG and click on the `play arrow` button to start the DAG

# add debug lines to the codebase (Python, not shell), e.g. in ./src/datahub_airflow_plugin/datahub_listener.py:
#   logger.debug("this is the sample debug line")

# run the DAG again; the debug lines appear in the task_run log:
# 1. click on the `timestamp` in the `Last Run` column
# 2. select the task
# 3. click on the `log` option
```

> **Note: if the log lines do not appear, restart the `airflow scheduler` and rerun the DAG.**

### (Optional) Set up your Python environment for developing on Dagster Plugin

From the repository root:

```shell
cd metadata-ingestion-modules/dagster-plugin
../../gradlew :metadata-ingestion-modules:dagster-plugin:installDev
source venv/bin/activate
datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
```

### (Optional) Set up your Python environment for developing on Prefect Plugin

From the repository root:

```shell
cd metadata-ingestion-modules/prefect-plugin
../../gradlew :metadata-ingestion-modules:prefect-plugin:installDev
source venv/bin/activate
datahub version   # should print "DataHub CLI version: unavailable (installed in develop mode)"
```

### (Optional) Set up your Python environment for developing on GX Plugin

From the repository root:

```shell
cd metadata-ingestion-modules/gx-plugin
../../gradlew :metadata-ingestion-modules:gx-plugin:installDev
source venv/bin/activate
datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
```

### Common setup issues

Common issues (click to expand):

<details>
  <summary>datahub command not found with PyPI install</summary>

If you've already run the pip install, but running `datahub` in your command line doesn't work, then there is likely an issue with your PATH setup and Python.

The easiest way to circumvent this is to install and run via Python, and use `python3 -m datahub` in place of `datahub`.

```shell
python3 -m pip install --upgrade acryl-datahub
python3 -m datahub --help
```

</details>

<details>
  <summary>Wheel issues e.g. "Failed building wheel for avro-python3" or "error: invalid command 'bdist_wheel'"</summary>

This means Python's `wheel` is not installed. Try running the following commands and then retry.

```shell
pip install --upgrade pip wheel setuptools
pip cache purge
```

</details>

<details>
  <summary>Failure to install confluent_kafka: "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1"</summary>

This sometimes happens if there's a version mismatch between Kafka's C library and the Python wrapper library. Try running `pip install confluent_kafka==1.5.0` and then retrying.

</details>

<details>
  <summary>Conflict: acryl-datahub requires pydantic 1.10</summary>

The base `acryl-datahub` package supports both Pydantic 1.x and 2.x. However, some of our specific sources require Pydantic 1.x because of transitive dependencies.

If you're primarily using `acryl-datahub` for the SDKs, you can install `acryl-datahub` and some extras, like `acryl-datahub[sql-parser]`, without getting conflicts related to Pydantic versioning.

We recommend not installing full ingestion sources into your main environment (e.g. avoid having a dependency on `acryl-datahub[snowflake]` or other ingestion sources).
Instead, we recommend using UI-based ingestion or isolating the ingestion pipelines using [virtual environments](https://docs.python.org/3/library/venv.html). If you're using an orchestrator, they often have first-class support for virtual environments - here's an [example for Airflow](./schedule_docs/airflow.md).

</details>

### Using Plugins in Development

The syntax for installing plugins is slightly different in development. For example:

```diff
- uv pip install 'acryl-datahub[bigquery,datahub-rest]'
+ uv pip install -e '.[bigquery,datahub-rest]'
```

## Architecture

<p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/datahub-metadata-ingestion-framework.png"/>
</p>

The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub.
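
To make the source/sink split concrete, here is a minimal, self-contained sketch. The names `Event`, `Source`, `Sink`, `ListSource`, `CollectingSink`, and `run_pipeline` are hypothetical stand-ins chosen for illustration, not the framework's actual classes; the real interfaces live in the API directory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a standardized metadata event."""

    urn: str
    description: str


class Source(Protocol):
    """Anything that can produce events from some data system."""

    def get_events(self) -> Iterable[Event]: ...


class Sink(Protocol):
    """Anything that can consume events, e.g. by sending them onward."""

    def write(self, event: Event) -> None: ...


class ListSource:
    """Toy source that emits one event per table name."""

    def __init__(self, table_names: List[str]) -> None:
        self.table_names = table_names

    def get_events(self) -> Iterable[Event]:
        for name in self.table_names:
            yield Event(urn=f"urn:example:table:{name}", description=f"table {name}")


class CollectingSink:
    """Toy sink that stores events in memory instead of sending them anywhere."""

    def __init__(self) -> None:
        self.received: List[Event] = []

    def write(self, event: Event) -> None:
        self.received.append(event)


def run_pipeline(source: Source, sink: Sink) -> int:
    """Move every event from the source to the sink; return the event count."""
    count = 0
    for event in source.get_events():
        sink.write(event)
        count += 1
    return count
```

Because the source and sink only share the event format, either side can be swapped out independently.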

## Code layout

- The CLI interface is defined in [entrypoints.py](./src/datahub/entrypoints.py) and in the [cli](./src/datahub/cli) directory.
- The high level interfaces are defined in the [API directory](./src/datahub/ingestion/api).
- The actual [sources](./src/datahub/ingestion/source) and [sinks](./src/datahub/ingestion/sink) have their own directories. The registry files in those directories import the implementations.
- The metadata models are created using code generation, and eventually live in the `./src/datahub/metadata` directory. However, these files are not checked in and instead are generated at build time. See the [codegen](./scripts/codegen.sh) script for details.
- Tests live in the [`tests`](./tests) directory. They're split between smaller unit tests and larger integration tests.

## Code style

We use ruff and mypy to ensure consistent code style and quality.

```shell
# Assumes: ../gradlew :metadata-ingestion:installDev and venv is activated
ruff check src/ tests/
mypy src/ tests/
```

Alternatively, run the checks from the root of the repository:

```shell
./gradlew :metadata-ingestion:lint

# This will auto-fix some linting issues.
./gradlew :metadata-ingestion:lintFix
```

Some other notes:

- Prefer mixin classes over tall inheritance hierarchies.
- Write type annotations wherever possible.
- Use `typing.Protocol` to make implicit interfaces explicit.
- If you ever find yourself copying and pasting large chunks of code, there's probably a better way to do it.
- Prefer a standalone helper method over a `@staticmethod`.
- You probably should not be defining a `__hash__` method yourself. Using `@dataclass(frozen=True)` is a good way to get a hashable class.
- Avoid global state. In sources, this includes instance variables that effectively function as "global" state for the source.
- Avoid defining functions within other functions. This makes it harder to read and test the code.
- When interacting with external APIs, parse the responses into a dataclass rather than operating directly on the response object.
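
Several of the notes above can be combined in one small sketch. Everything here (`TableInfo`, `TableClient`, `parse_table`, `fetch_tables`, `FakeClient`, and the `rowCount` payload key) is hypothetical and only meant to illustrate the style, not any real source.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class TableInfo:
    """Typed view of one table entry; frozen=True also makes it hashable."""

    name: str
    row_count: int


class TableClient(Protocol):
    """Makes explicit the interface that any client is expected to provide."""

    def list_tables(self) -> List[Dict[str, Any]]: ...


def parse_table(raw: Dict[str, Any]) -> TableInfo:
    """Standalone helper (rather than a @staticmethod) that parses a raw payload."""
    return TableInfo(name=str(raw["name"]), row_count=int(raw.get("rowCount", 0)))


def fetch_tables(client: TableClient) -> List[TableInfo]:
    """Parse responses into dataclasses right away; callers never see raw dicts."""
    return [parse_table(raw) for raw in client.list_tables()]


class FakeClient:
    """In-memory client for demonstration; it satisfies TableClient structurally."""

    def list_tables(self) -> List[Dict[str, Any]]:
        return [{"name": "orders", "rowCount": "42"}, {"name": "users"}]
```

Since `parse_table` is a plain function, it can be unit tested without constructing a client or a source.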

## Dependency Management

The vast majority of our dependencies are not required by the "core" package but instead can be optionally installed using Python "extras". This allows us to keep the core package lightweight. We should be deliberate about adding new dependencies to the core framework.

Where possible, we should avoid pinning dependency versions. The `acryl-datahub` package is frequently used as a library and hence installed alongside other tools. If you need to restrict the version of a dependency, use a range like `>=1.2.3,<2.0.0` or a negative constraint like `>=1.2.3, !=1.2.7` instead. Every upper bound and negative constraint should be accompanied by a comment explaining why it's necessary.

Caveat: Some packages like Great Expectations and Airflow frequently make breaking changes. For such packages, it's ok to add a "defensive" upper bound with the current latest version, accompanied by a comment. It's critical that we revisit these upper bounds at least once a month and broaden them if possible.
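
As an illustration of these rules, the following sketch shows how such constraints might look in a `setup.py`-style extras mapping. The package names, version numbers, and the `find_exact_pins` helper are all made up for this example.

```python
from typing import Dict, Set

# Hypothetical core dependencies: keep this set small.
core_dependencies: Set[str] = {
    # Lower bound only: leaves room for other tools in the same environment.
    "example-core-lib>=1.2.3",
}

# Hypothetical extras: heavier dependencies are opt-in.
extras_require: Dict[str, Set[str]] = {
    "example-source": {
        # Upper bound because 2.x removed an API we rely on.
        "example-client>=1.2.3,<2.0.0",
        # 1.2.7 shipped a regression, so exclude only that release.
        "example-parser>=1.2.3,!=1.2.7",
    },
    "example-orchestrator": {
        # Defensive upper bound at the current latest release; revisit regularly.
        "example-scheduler>=2.0.0,<2.9.0",
    },
}


def find_exact_pins(requirements: Set[str]) -> Set[str]:
    """Return requirements that pin an exact version with `==`."""
    return {req for req in requirements if "==" in req}
```

A check like `find_exact_pins` could be used to flag accidental exact pins during review; it is only a sketch, not an existing tool in the repository.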

## Guidelines for Ingestion Configs

We use [pydantic](https://pydantic-docs.helpmanual.io/) to define the ingestion configs.
In order to ensure that the configs are consistent and easy to use, we have a few guidelines:

#### Naming

- Most important point: we should **match the terminology of the source system**. For example, snowflake shouldn’t have a `host_port`, it should have an `account_id`.
- We should prefer slightly more verbose names when the alternative isn’t descriptive enough. For example `client_id` or `tenant_id` over a bare `id` and `access_secret` over a bare `secret`.
- AllowDenyPatterns should be used whenever we need to filter a list. The pattern should always apply to the fully qualified name of the entity. These configs should be named `*_pattern`, for example `table_pattern`.
- Avoid `*_only` configs like `profile_table_level_only` in favor of `profile_table_level` and `profile_column_level`. `include_tables` and `include_views` are a good example.
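
The filtering guideline can be sketched with the standard library alone. `SimpleAllowDeny` below is a simplified, hypothetical stand-in and not the framework's own allow/deny class; in this sketch, deny patterns take precedence and patterns are matched from the start of the fully qualified name.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleAllowDeny:
    """Simplified, hypothetical stand-in for an allow/deny pattern config."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, fully_qualified_name: str) -> bool:
        # In this sketch, deny patterns win over allow patterns.
        if any(re.match(p, fully_qualified_name) for p in self.deny):
            return False
        return any(re.match(p, fully_qualified_name) for p in self.allow)


# The config field is named `table_pattern`, following the `*_pattern` convention.
table_pattern = SimpleAllowDeny(allow=[r"analytics\..*"], deny=[r".*\.tmp_.*"])

# Patterns are applied to fully qualified names, not bare table names.
names = [
    "analytics.public.orders",
    "analytics.public.tmp_scratch",
    "raw.public.events",
]
kept = [n for n in names if table_pattern.allowed(n)]
```

Here only `analytics.public.orders` is kept: the second name hits the deny pattern and the third never matches an allow pattern.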

#### Content

- All configs should have a description.
- When using inheritance or mixin classes, make sure that the fields and documentation are applicable in the base class. The `bigquery_temp_table_schema` field definitely shouldn’t be showing up in every single source’s profiling config!
- Set reasonable defaults!
  - The configs should not contain a default that you’d reasonably expect to be built in. As a **bad** example, the Postgres source’s `schema_pattern` has a default deny pattern containing `information_schema`. This means that if the user overrides the `schema_pattern`, they’ll need to manually add `information_schema` to their deny patterns. This is bad: the filtering should’ve been handled automatically by the source’s implementation, not added at runtime by its config.
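
One way to keep such built-in filtering out of the config is sketched below. The `select_schemas` function and `_SYSTEM_SCHEMAS` constant are hypothetical, and the deny list uses exact names rather than patterns to keep the example short.

```python
from typing import Iterable, List, Optional

# System schemas are an implementation detail of the (hypothetical) source,
# so they are filtered in code rather than through a config default.
_SYSTEM_SCHEMAS = frozenset({"information_schema"})


def select_schemas(
    all_schemas: Iterable[str], user_deny: Optional[List[str]] = None
) -> List[str]:
    """Drop system schemas unconditionally, then apply the user's deny list."""
    denied = set(user_deny or [])
    return [s for s in all_schemas if s not in _SYSTEM_SCHEMAS and s not in denied]
```

With this structure, a user who overrides the deny list does not have to remember to re-add `information_schema`.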

#### Coding

- Use a single pydantic validator per thing to validate - we shouldn’t have validation methods that are 50 lines long.
- Use `SecretStr` for passwords, auth tokens, etc.
- When doing simple field renames, use the `pydantic_renamed_field` helper.
- When doing field deprecations, use the `pydantic_removed_field` helper.
- Validator methods must only throw ValueError, TypeError, or AssertionError. Do not throw ConfigurationError from validators.
- Set `hidden_from_docs` for internal-only config flags. However, needing this often indicates a larger problem with the code structure. The hidden field should probably be a class attribute or an instance variable on the corresponding source.
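
A minimal sketch of a config that follows several of these points is shown below. `ExampleSourceConfig` and its fields are hypothetical. The sketch uses the `validator` decorator, which works on Pydantic 1.x and is still available (though deprecated) on 2.x; the project-specific helpers mentioned above are not shown.

```python
from pydantic import BaseModel, Field, SecretStr, ValidationError, validator


class ExampleSourceConfig(BaseModel):
    """Hypothetical config; field names are for illustration only."""

    account_id: str = Field(description="Account identifier, named as in the source system.")
    access_secret: SecretStr = Field(description="Secret used to authenticate.")
    max_workers: int = Field(default=4, description="Number of parallel workers.")

    # One small validator per concern; raise ValueError, not ConfigurationError.
    @validator("account_id")
    def account_id_not_blank(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("account_id must not be blank")
        return v

    @validator("max_workers")
    def max_workers_positive(cls, v: int) -> int:
        if v < 1:
            raise ValueError("max_workers must be at least 1")
        return v


config = ExampleSourceConfig(account_id="abc123", access_secret="s3cret")

# The secret is masked when the config is printed or logged;
# it must be read explicitly via get_secret_value().
masked = repr(config)
secret = config.access_secret.get_secret_value()
```

Pydantic wraps the `ValueError` raised by a validator in a `ValidationError`, so callers get a single, structured error for invalid configs.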

## Testing

```shell
# Follow standard install from source procedure - see above.

# Install all dev and test requirements.
../gradlew :metadata-ingestion:installDevTest

# Run the full testing suite
pytest -vv

# Run unit tests.
pytest -m 'not integration'

# Run Docker-based integration tests.
pytest -m 'integration'

# You can also run these steps via the gradle build:
../gradlew :metadata-ingestion:lint
../gradlew :metadata-ingestion:lintFix
../gradlew :metadata-ingestion:testQuick
../gradlew :metadata-ingestion:testFull
../gradlew :metadata-ingestion:check
# Run all tests in a single file
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_bigquery_source.py
# Run all tests under tests/unit
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit
```

### Updating golden test files

If you made some changes that require generating new "golden" data files for use in testing a specific ingestion source, you can run the following to re-generate them:

```shell
pytest tests/integration/<source>/test_<source>.py --update-golden-files
```

For example,

```shell
pytest tests/integration/dbt/test_dbt.py --update-golden-files
```
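
Conceptually, a golden-file test compares fresh output against a stored file and only rewrites that file when asked to. The sketch below illustrates the idea with a hypothetical `check_golden` helper; it is not the actual implementation behind `--update-golden-files`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def check_golden(output: Any, golden_path: Path, update: bool = False) -> None:
    """Compare output to the golden file, or rewrite the file when update=True."""
    if update:
        golden_path.write_text(json.dumps(output, indent=2, sort_keys=True))
        return
    expected = json.loads(golden_path.read_text())
    assert output == expected, (
        f"output differs from {golden_path}; regenerate the file if the change is intended"
    )


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "example_golden.json"
    records = [{"urn": "urn:example:1"}]
    check_golden(records, golden, update=True)  # regenerate the golden file
    check_golden(records, golden)  # subsequent runs compare against it
```

Regenerated golden files should be reviewed like any other code change, since they define what the test accepts as correct.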

### Testing the Airflow plugin

For the Airflow plugin, we use `tox` to test across multiple sets of dependencies.

```sh
cd metadata-ingestion-modules/airflow-plugin

# Run all tests.
tox

# Run a specific environment.
# These are defined in the `tox.ini` file
tox -e py310-airflow26

# Run a specific test.
tox -e py310-airflow26 -- tests/integration/test_plugin.py

# Update all golden files.
tox -- --update-golden-files

# Update golden files for a specific environment.
tox -e py310-airflow26 -- --update-golden-files
```
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								# Developing on Metadata Ingestion
 								If you just want to use metadata ingestion, check the [user-centric](./README.md) guide.
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								This document is for developers who want to develop and possibly contribute to the metadata ingestion framework.
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								Also take a look at the guide to [adding a source](./adding-source.md).
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
 								## Getting Started
 								### Requirements
-												chore(cli): drop support for python 3.7 (#9731)


											
										
										
											2024-01-29 10:50:47 -08:00
+. Python 3.8+ must be installed in your host environment.
-												fix(ingestion/metabase): Fetch Dashboards through Collections (#9631)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2024-01-26 02:51:41 +01:00
+. Java 17 (gradle won't work with newer or older versions)
-												chore(cli): drop support for python 3.7 (#9731)


											
										
										
											2024-01-29 10:50:47 -08:00
+. On Debian/Ubuntu: `sudo apt install python3-dev python3-venv`
 . On Fedora (if using LDAP source integration): `sudo yum install openldap-devel`
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
 								### Set up your Python environment
-												chore(ingest): drop python 3.6 support (#5521)


											
										
										
											2022-08-10 22:00:31 +00:00
+								From the repository root:
-												fix(docs): Update developing.md to mention directory context (#4899)


											
										
										
											2022-05-11 13:00:37 -04:00
-												docs: enable better syntax highlighting (#2529)


											
										
										
											2021-05-11 15:16:12 -07:00
+								```shell
-												fix(docs): Update developing.md to mention directory context (#4899)


											
										
										
											2022-05-11 13:00:37 -04:00
+								cd metadata-ingestion
-												build(ingest): use gradle in commands + docs (#2531)


											
										
										
											2021-05-11 19:03:20 -07:00
+								../gradlew :metadata-ingestion:installDev
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								source venv/bin/activate
-												docs(ingestion): Improve developer docs (#5644)


											
										
										
											2022-08-15 21:41:23 +02:00
+								datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								```
-												ci: separate airflow build and test (#8688)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2023-08-31 02:38:42 +05:30
+								### (Optional) Set up your Python environment for developing on Airflow Plugin
 								From the repository root:
 								```shell
 								cd metadata-ingestion-modules/airflow-plugin
 								../../gradlew :metadata-ingestion-modules:airflow-plugin:installDev
 								source venv/bin/activate
 								datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
-												doc(ingestion/airflow-plugin): update for developers (#10633)


											
										
										
											2024-06-06 11:01:39 +02:00
 								# start the airflow web server
 								export AIRFLOW_HOME=~/airflow
 								airflow webserver --port 8090 -d
 								# start the airflow scheduler
 								airflow scheduler
 								# access the airflow service and run any of the DAG
 								# open http://localhost:8090/
 								# select any DAG and click on the `play arrow` button to start the DAG
 								# add the debug lines in the codebase, i.e. in ./src/datahub_airflow_plugin/datahub_listener.py
 								logger.debug("this is the sample debug line")
 								# run the DAG again and you can see the debug lines in the task_run log at,
 								#1. click on the `timestamp` in the `Last Run` column
 								#2. select the task
 								#3. click on the `log` option
-												ci: separate airflow build and test (#8688)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2023-08-31 02:38:42 +05:30
+								```
-												doc(ingestion/airflow-plugin): update for developers (#10633)


											
										
										
											2024-06-06 11:01:39 +02:00
 								> **P.S. if you are not able to see the log lines, then restart the `airflow scheduler` and rerun the DAG**
-												feat(ingest/dagster): Dagster source (#10071)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
											
										
										
											2024-03-25 13:28:35 +01:00
+								### (Optional) Set up your Python environment for developing on Dagster Plugin
-												fix(ingest): refactor test markers + fix disk space issues in CI (#8938)


											
										
										
											2023-10-03 23:17:49 -04:00
-												feat(ingest/dagster): Dagster source (#10071)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
											
										
										
											2024-03-25 13:28:35 +01:00
+								From the repository root:
 								```shell
 								cd metadata-ingestion-modules/dagster-plugin
 								../../gradlew :metadata-ingestion-modules:dagster-plugin:installDev
 								source venv/bin/activate
 								datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
 								```
-												feat: separate great-expectations action package (#11096)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2024-08-21 21:43:36 +05:30
-												fix(ingestion/prefect-plugin): Prefect plugin (#10643)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
											
										
										
											2024-08-29 15:40:10 +02:00
+								### (Optional) Set up your Python environment for developing on Prefect Plugin
-												docs(ingest): add docs on pydantic compatibility (#11423)


											
										
										
											2024-09-20 13:22:15 -07:00
-												fix(ingestion/prefect-plugin): Prefect plugin (#10643)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
											
										
										
											2024-08-29 15:40:10 +02:00
+								From the repository root:
 								```shell
 								cd metadata-ingestion-modules/prefect-plugin
 								../../gradlew :metadata-ingestion-modules:prefect-plugin:installDev
 								source venv/bin/activate
 								datahub version   # should print "DataHub CLI version: unavailable (installed in develop mode)"
 								```
-												feat: separate great-expectations action package (#11096)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2024-08-21 21:43:36 +05:30
+								### (Optional) Set up your Python environment for developing on GX Plugin
 								From the repository root:
 								```shell
 								cd metadata-ingestion-modules/gx-plugin
 								../../gradlew :metadata-ingestion-modules:gx-plugin:installDev
 								source venv/bin/activate
 								datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
 								```
-												chore(ingest): speed up lintFix command (#12346)


											
										
										
											2025-01-15 10:59:20 -08:00
-												feat(ingestion/dagster): Dagster assetless ingestion (#11262)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
											
										
										
											2024-10-24 16:06:00 +02:00
+								### (Optional) Set up your Python environment for developing on Dagster Plugin
 								From the repository root:
-												feat: separate great-expectations action package (#11096)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2024-08-21 21:43:36 +05:30
-												feat(ingestion/dagster): Dagster assetless ingestion (#11262)

Co-authored-by: shubhamjagtap639 <shubham.jagtap@gslab.com>
											
										
										
											2024-10-24 16:06:00 +02:00
+								```shell
 								cd metadata-ingestion-modules/dagster-plugin
 								../../gradlew :metadata-ingestion-modules:dagster-plugin:installDev
 								source venv/bin/activate
 								datahub version  # should print "DataHub CLI version: unavailable (installed in develop mode)"
 								```
-												chore(ingest): speed up lintFix command (#12346)


											
										
										
											2025-01-15 10:59:20 -08:00
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								### Common setup issues
 								Common issues (click to expand):
-												fix(ingest): support `python3 -m datahub` (#2359)


											
										
										
											2021-04-07 14:58:58 -07:00
+								<details>
 								  <summary>datahub command not found with PyPI install</summary>
 								If you've already run the pip install, but running `datahub` in your command line doesn't work, then there is likely an issue with your PATH setup and Python.
 								The easiest way to circumvent this is to install and run via Python, and use `python3 -m datahub` in place of `datahub`.
-												docs: enable better syntax highlighting (#2529)


											
										
										
											2021-05-11 15:16:12 -07:00
+								```shell
-												feat(ingest): add Airflow lineage backend (#2368)


											
										
										
											2021-04-12 17:40:15 -07:00
+								python3 -m pip install --upgrade acryl-datahub
-												fix(ingest): support `python3 -m datahub` (#2359)


											
										
										
											2021-04-07 14:58:58 -07:00
+								python3 -m datahub --help
 								```
 								</details>
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								<details>
 								  <summary>Wheel issues e.g. "Failed building wheel for avro-python3" or "error: invalid command 'bdist_wheel'"</summary>
 								This means Python's `wheel` is not installed. Try running the following commands and then retry.
-												docs: enable better syntax highlighting (#2529)


											
										
										
											2021-05-11 15:16:12 -07:00
+								```shell
-												feat(ingest): add Airflow lineage backend (#2368)


											
										
										
											2021-04-12 17:40:15 -07:00
+								pip install --upgrade pip wheel setuptools
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								pip cache purge
 								```
 								</details>
 								<details>
 								  <summary>Failure to install confluent_kafka: "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1"</summary>
 								This sometimes happens if there's a version mismatch between the Kafka's C library and the Python wrapper library. Try running `pip install confluent_kafka==1.5.0` and then retrying.
 								</details>
-												docs(ingest): add docs on pydantic compatibility (#11423)


											
										
										
											2024-09-20 13:22:15 -07:00
+								<details>
 								  <summary>Conflict: acryl-datahub requires pydantic 1.10</summary>
 								The base `acryl-datahub` package supports both Pydantic 1.x and 2.x. However, some of our specific sources require Pydantic 1.x because of transitive dependencies.
 								If you're primarily using `acryl-datahub` for the SDKs, you can install `acryl-datahub` and some extras, like `acryl-datahub[sql-parser]`, without getting conflicts related to Pydantic versioning.
 								We recommend not installing full ingestion sources into your main environment (e.g. avoid having a dependency on `acryl-datahub[snowflake]` or other ingestion sources).
 								Instead, we recommend using UI-based ingestion or isolating the ingestion pipelines using [virtual environments](https://docs.python.org/3/library/venv.html). If you're using an orchestrator, they often have first-class support for virtual environments - here's an [example for Airflow](./schedule_docs/airflow.md).
 								</details>
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								### Using Plugins in Development
 								The syntax for installing plugins is slightly different in development. For example:
 								```diff
-												docs(ingest): update metadata-ingestion dev guide (#12779)


											
										
										
											2025-03-07 11:22:17 -08:00
+								- uv pip install 'acryl-datahub[bigquery,datahub-rest]'
 								+ uv pip install -e '.[bigquery,datahub-rest]'
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								```
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								## Architecture
-												docs(docs): add native versioning (#8714)


											
										
										
											2023-08-26 06:10:13 +09:00
+								<p align="center">
 								  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/datahub-metadata-ingestion-framework.png"/>
 								</p>
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
 								The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub.
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								## Code layout
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								- The CLI interface is defined in [entrypoints.py](./src/datahub/entrypoints.py) and in the [cli](./src/datahub/cli) directory.
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								- The high level interfaces are defined in the [API directory](./src/datahub/ingestion/api).
 								- The actual [sources](./src/datahub/ingestion/source) and [sinks](./src/datahub/ingestion/sink) have their own directories. The registry files in those directories import the implementations.
 								- The metadata models are created using code generation, and eventually live in the `./src/datahub/metadata` directory. However, these files are not checked in and instead are generated at build time. See the [codegen](./scripts/codegen.sh) script for details.
 								- Tests live in the [`tests`](./tests) directory. They're split between smaller unit tests and larger integration tests.
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								## Code style
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
-												dev: remove black in favor of ruff for formatting (#12378)


											
										
										
											2025-01-18 15:06:20 +05:30
+								We use ruff, and mypy to ensure consistent code style and quality.
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								```shell
-												docs(ingest): update metadata-ingestion dev guide (#12779)


											
										
										
											2025-03-07 11:22:17 -08:00
+								# Assumes: ../gradlew :metadata-ingestion:installDev and venv is activated
-												dev(ingest): use ruff instead of flake8 (#12359)


											
										
										
											2025-01-16 08:19:07 +05:30
+								ruff check src/ tests/
-												docs(ingest): add ingestion configs guide (#7438)


											
										
										
											2023-02-27 05:34:23 +05:30
+								mypy src/ tests/
 								```
-												chore(ci): fix CI failing due to lint (#7863)


											
										
										
											2023-04-20 16:53:36 +05:30
+								or you can run from root of the repository
-												fix(ingest): refactor test markers + fix disk space issues in CI (#8938)


											
										
										
											2023-10-03 23:17:49 -04:00
-												chore(ci): fix CI failing due to lint (#7863)


											
										
										
											2023-04-20 16:53:36 +05:30
```shell
./gradlew :metadata-ingestion:lint

# This will auto-fix some linting issues.
./gradlew :metadata-ingestion:lintFix
```

Some other notes:

- Prefer mixin classes over tall inheritance hierarchies.
- Write type annotations wherever possible.
- Use `typing.Protocol` to make implicit interfaces explicit.
- If you ever find yourself copying and pasting large chunks of code, there's probably a better way to do it.
- Prefer a standalone helper method over a `@staticmethod`.
- You probably should not be defining a `__hash__` method yourself. Using `@dataclass(frozen=True)` is a good way to get a hashable class.
- Avoid global state. In sources, this includes instance variables that effectively function as "global" state for the source.
- Avoid defining functions within other functions. This makes it harder to read and test the code.
- When interacting with external APIs, parse the responses into a dataclass rather than operating directly on the response object.

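As a rough illustration of a few of these notes, the sketch below combines `typing.Protocol`, a frozen dataclass, a standalone helper, and a parsed API response. Every name in it (`TableInfo`, `SupportsTableListing`, `parse_table`, `fetch_tables`, `FakeClient`) is hypothetical and not part of DataHub.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


@dataclass(frozen=True)
class TableInfo:
    """Parsed form of one table entry; frozen, so instances are hashable."""

    schema: str
    name: str

    @property
    def qualified_name(self) -> str:
        return f"{self.schema}.{self.name}"


class SupportsTableListing(Protocol):
    """Makes the implicit 'anything with list_tables()' interface explicit."""

    def list_tables(self) -> List[Dict[str, Any]]: ...


def parse_table(raw: Dict[str, Any]) -> TableInfo:
    # Standalone helper rather than a @staticmethod on some class.
    return TableInfo(schema=raw["schema"], name=raw["name"])


def fetch_tables(client: SupportsTableListing) -> List[TableInfo]:
    # Work with parsed dataclasses instead of raw response dicts.
    return [parse_table(raw) for raw in client.list_tables()]


class FakeClient:
    # Satisfies SupportsTableListing structurally, without inheriting from it.
    def list_tables(self) -> List[Dict[str, Any]]:
        return [{"schema": "public", "name": "orders"}]
```

Because `TableInfo` is frozen, it can be used in sets and as a dictionary key without a hand-written `__hash__`.
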
## Dependency Management

The vast majority of our dependencies are not required by the "core" package but instead can be optionally installed using Python "extras". This allows us to keep the core package lightweight. We should be deliberate about adding new dependencies to the core framework.

Where possible, we should avoid pinning dependency versions. The `acryl-datahub` package is frequently used as a library and hence installed alongside other tools. If you need to restrict the version of a dependency, use a range like `>=1.2.3,<2.0.0` or a negative constraint like `>=1.2.3, !=1.2.7` instead. Every upper bound and negative constraint should be accompanied by a comment explaining why it's necessary.

Caveat: Some packages like Great Expectations and Airflow frequently make breaking changes. For such packages, it's ok to add a "defensive" upper bound with the current latest version, accompanied by a comment. It's critical that we revisit these upper bounds at least once a month and broaden them if possible.

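A minimal sketch of what such constraints might look like in a `setup.py`-style extras mapping. The `plugins` name, the package names, the version numbers, and the stated reasons are all made up for illustration.

```python
from typing import Dict, Set

# Hypothetical extras mapping in the style of a setup.py `extras_require`.
plugins: Dict[str, Set[str]] = {
    "example-source": {
        # Lower bound only; the guideline asks for comments on upper bounds
        # and exclusions, not on minimum versions.
        "example-client>=1.2.3",
        # 1.2.7 is excluded because (hypothetically) it shipped a broken release.
        "example-parser>=1.2.3,!=1.2.7",
        # Defensive upper bound: this (hypothetical) package often makes
        # breaking changes. Revisit regularly and broaden if possible.
        "example-orchestrator>=2.0.0,<2.5.0",
    },
}
```
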
## Guidelines for Ingestion Configs

We use [pydantic](https://pydantic-docs.helpmanual.io/) to define the ingestion configs.
In order to ensure that the configs are consistent and easy to use, we have a few guidelines:

#### Naming

- Most important point: we should **match the terminology of the source system**. For example, Snowflake shouldn’t have a `host_port`, it should have an `account_id`.
- We should prefer slightly more verbose names when the alternative isn’t descriptive enough. For example, prefer `client_id` or `tenant_id` over a bare `id`, and `access_secret` over a bare `secret`.
- `AllowDenyPattern`s should be used whenever we need to filter a list. The pattern should always apply to the fully qualified name of the entity. These configs should be named `*_pattern`, for example `table_pattern`.
- Avoid `*_only` configs like `profile_table_level_only` in favor of `profile_table_level` and `profile_column_level`. `include_tables` and `include_views` are a good example.

#### Content

- All configs should have a description.
- When using inheritance or mixin classes, make sure that the fields and documentation are applicable in the base class. The `bigquery_temp_table_schema` field definitely shouldn’t be showing up in every single source’s profiling config!
- Set reasonable defaults!
  - The configs should not contain a default that you’d reasonably expect to be built in. As a **bad** example, the Postgres source’s `schema_pattern` has a default deny pattern containing `information_schema`. This means that if the user overrides the `schema_pattern`, they’ll need to manually add `information_schema` to their deny patterns. This is bad: the filtering should’ve been handled automatically by the source’s implementation, not added at runtime by its config.

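One way to keep built-in exclusions out of the config defaults is sketched below. It uses a plain dataclass in place of a pydantic model to stay dependency-free, and all names (`BUILT_IN_DENIED_SCHEMAS`, `SchemaFilterConfig`, `is_schema_allowed`) are hypothetical rather than taken from the Postgres source.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Always skipped by the (hypothetical) source itself, regardless of user config.
BUILT_IN_DENIED_SCHEMAS = frozenset({"information_schema"})


@dataclass
class SchemaFilterConfig:
    """User-facing filter config; its defaults carry no built-in exclusions."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)


def is_schema_allowed(config: SchemaFilterConfig, schema: str) -> bool:
    # Built-in exclusions are applied first, so overriding `deny`
    # cannot accidentally re-enable them.
    if schema.lower() in BUILT_IN_DENIED_SCHEMAS:
        return False
    if any(re.fullmatch(p, schema) for p in config.deny):
        return False
    return any(re.fullmatch(p, schema) for p in config.allow)
```

With this split, a user who sets their own `deny` list does not need to remember to re-add `information_schema`.
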
#### Coding

- Use a single pydantic validator per thing to validate. We shouldn’t have validation methods that are 50 lines long.
- Use `SecretStr` for passwords, auth tokens, etc.
- When doing simple field renames, use the `pydantic_renamed_field` helper.
- When doing field deprecations, use the `pydantic_removed_field` helper.
- Validator methods must only raise `ValueError`, `TypeError`, or `AssertionError`. Do not raise `ConfigurationError` from validators.
- Set `hidden_from_docs` for internal-only config flags. However, needing this often indicates a larger problem with the code structure. The hidden field should probably be a class attribute or an instance variable on the corresponding source.

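A minimal sketch of the `SecretStr` and validator guidelines, assuming only pydantic itself. `ExampleConnectionConfig` and its fields are hypothetical, and the import shim is one possible way to run the same code under pydantic v1 and v2, not necessarily how DataHub handles it.

```python
from pydantic import BaseModel, Field, SecretStr, ValidationError

try:
    from pydantic import field_validator  # pydantic v2
except ImportError:  # pydantic v1 fallback
    from pydantic import validator as field_validator


class ExampleConnectionConfig(BaseModel):
    username: str = Field(description="Login user.")
    password: SecretStr = Field(description="Login password; masked when printed.")
    port: int = Field(default=5432, description="TCP port of the server.")

    @field_validator("port")
    @classmethod
    def port_in_range(cls, value: int) -> int:
        # One validator, one concern; raise ValueError rather than a custom error.
        if not 0 < value < 65536:
            raise ValueError("port must be between 1 and 65535")
        return value

    @field_validator("username")
    @classmethod
    def username_not_blank(cls, value: str) -> str:
        if not value.strip():
            raise ValueError("username must not be blank")
        return value
```

Printing or logging the config does not reveal the password; call `password.get_secret_value()` only at the point where the raw value is actually needed.
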
## Testing

```shell
# Follow standard install from source procedure - see above.

# Install all dev and test requirements.
../gradlew :metadata-ingestion:installDevTest

# Run the full testing suite
pytest -vv

# Run unit tests.
pytest -m 'not integration'

# Run Docker-based integration tests.
pytest -m 'integration'

# You can also run these steps via the gradle build:
../gradlew :metadata-ingestion:lint
../gradlew :metadata-ingestion:lintFix
../gradlew :metadata-ingestion:testQuick
../gradlew :metadata-ingestion:testFull
../gradlew :metadata-ingestion:check

# Run all tests in a single file
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_bigquery_source.py

# Run all tests under tests/unit
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit
```

### Updating golden test files

If you made some changes that require generating new "golden" data files for use in testing a specific ingestion source, you can run the following to re-generate them:

```shell
pytest tests/integration/<source>/<source>.py --update-golden-files
```

For example,

```shell
pytest tests/integration/dbt/test_dbt.py --update-golden-files
```

### Testing the Airflow plugin

For the Airflow plugin, we use `tox` to test across multiple sets of dependencies.

```sh
cd metadata-ingestion-modules/airflow-plugin

# Run all tests.
tox

# Run a specific environment.
# These are defined in the `tox.ini` file
tox -e py310-airflow26

# Run a specific test.
tox -e py310-airflow26 -- tests/integration/test_plugin.py

# Update all golden files.
tox -- --update-golden-files

# Update golden files for a specific environment.
tox -e py310-airflow26 -- --update-golden-files
```