---
title: "Ingestion Framework"
---

# Metadata Ingestion Architecture

DataHub's ingestion architecture is flexible: it supports push and pull models, both asynchronous and synchronous.
The figure below shows the available options for connecting a system to DataHub.

<p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-architecture.png"/>
</p>

## Metadata Change Proposal: The Centerpiece

The centerpiece of ingestion is the [Metadata Change Proposal], which represents a request to make a metadata change to an organization's Metadata Graph.
Metadata Change Proposals can be sent over Kafka, for highly scalable async publishing from source systems. They can also be sent directly to the HTTP endpoint exposed by the DataHub service tier to get synchronous success / failure responses.
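
To make the shape of a proposal concrete, here is a minimal sketch that builds one as a plain Python dictionary using only the standard library. The field names (`entityType`, `entityUrn`, `changeType`, `aspectName`, `aspect`) are an assumption based on the MCP schema at the time of writing, and the `build_mcp` helper, the dataset URN, and the description are hypothetical. Check the [Metadata Change Proposal] reference for the schema your version uses.

```python
import json


def build_mcp(entity_urn, aspect_name, aspect,
              entity_type="dataset", change_type="UPSERT"):
    """Build a dict shaped like a Metadata Change Proposal (assumed field names)."""
    return {
        "entityType": entity_type,
        "entityUrn": entity_urn,
        "changeType": change_type,
        "aspectName": aspect_name,
        # The aspect payload travels as a serialized JSON string.
        "aspect": {
            "contentType": "application/json",
            "value": json.dumps(aspect),
        },
    }


# Hypothetical example: propose a description for a MySQL table.
mcp = build_mcp(
    "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.orders,PROD)",
    "datasetProperties",
    {"description": "Orders placed in the web shop"},
)
print(json.dumps(mcp, indent=2))
```

The same logical proposal can travel over either transport: published to Kafka for asynchronous processing, or posted over HTTP when the caller needs an immediate success or failure response.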

## Pull-based Integration

DataHub ships with a Python-based [metadata-ingestion system](../../metadata-ingestion/README.md) that can connect to different sources to pull metadata from them. This metadata is then pushed via Kafka or HTTP to the DataHub storage tier. Metadata ingestion pipelines can be [integrated with Airflow](../../metadata-ingestion/README.md#lineage-with-airflow) to set up scheduled ingestion or capture lineage. If the source you need is not already supported, you can [write your own](../../metadata-ingestion/README.md#contributing).
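
A pull-based pipeline is typically described by a recipe that names a source to pull from and a sink to push to. The sketch below expresses such a recipe as a Python dictionary and checks that both halves are present. The `mysql` source, the `datahub-rest` sink, and all connection values are illustrative assumptions, and `describe_recipe` is a hypothetical helper, not part of DataHub.

```python
# Illustrative recipe: every connection value below is a placeholder.
recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "database": "shop",
            "username": "datahub",
            "password": "example",
        },
    },
    "sink": {
        # "datahub-rest" pushes over HTTP; a Kafka sink would push asynchronously.
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def describe_recipe(recipe):
    """Check that a recipe has a source and a sink, then summarize it."""
    missing = [key for key in ("source", "sink") if key not in recipe]
    if missing:
        raise ValueError(f"recipe is missing: {', '.join(missing)}")
    return f"pull from {recipe['source']['type']}, push via {recipe['sink']['type']}"


print(describe_recipe(recipe))
```

Swapping the sink type is what moves a pipeline between the HTTP and Kafka paths; the source section stays the same.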

## Push-based Integration

As long as you can emit a [Metadata Change Proposal (MCP)] event to Kafka or make a REST call over HTTP, you can integrate any system with DataHub. For convenience, DataHub also provides simple [Python emitters] that you can integrate into your systems to emit metadata changes (MCPs) at the point of origin.
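
As a rough sketch of the REST path, the snippet below prepares (but does not send) an HTTP request carrying one proposal, using only the Python standard library. The endpoint path `/aspects?action=ingestProposal`, the `X-RestLi-Protocol-Version` header, and the `{"proposal": ...}` envelope are assumptions about the service's Rest.li API at the time of writing. `build_ingest_request`, the server URL, and the sample proposal are hypothetical.

```python
import json
import urllib.request


def build_ingest_request(gms_url, proposal, token=None):
    """Prepare a POST request that carries one proposal (assumed endpoint)."""
    body = json.dumps({"proposal": proposal}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-RestLi-Protocol-Version": "2.0.0",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{gms_url.rstrip('/')}/aspects?action=ingestProposal",
        data=body,
        headers=headers,
        method="POST",
    )


# Hypothetical proposal, using the same field names as the MCP schema.
proposal = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.orders,PROD)",
    "changeType": "UPSERT",
    "aspectName": "datasetProperties",
    "aspect": {
        "contentType": "application/json",
        "value": json.dumps({"description": "Orders placed in the web shop"}),
    },
}
request = build_ingest_request("http://localhost:8080/", proposal)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it, and the HTTP status of the response gives the synchronous success or failure signal. In practice the [Python emitters] are the more convenient route, since they handle serialization and transport for you.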

## Internal Components

### Applying Metadata Change Proposals to DataHub Metadata Service (mce-consumer-job)

DataHub comes with a Spring job, [mce-consumer-job], which consumes the Metadata Change Proposals and writes them into the DataHub Metadata Service (datahub-gms) using the `/ingest` endpoint.
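
The consume-and-apply pattern can be illustrated with a toy model. This is not the actual job, which is a Spring application reading from Kafka. Here an in-memory queue stands in for the Kafka topic, a dictionary stands in for the Metadata Service, and `apply_proposals` is a hypothetical name.

```python
import json
from collections import deque


def apply_proposals(pending, store):
    """Drain queued proposals in order and upsert each aspect into the store."""
    applied = 0
    while pending:
        mcp = pending.popleft()
        # One stored value per (entity, aspect) pair; later proposals overwrite.
        key = (mcp["entityUrn"], mcp["aspectName"])
        store[key] = json.loads(mcp["aspect"]["value"])
        applied += 1
    return applied


def make_mcp(urn, description):
    """Build a minimal proposal for the toy queue (assumed field names)."""
    return {
        "entityUrn": urn,
        "aspectName": "datasetProperties",
        "aspect": {"value": json.dumps({"description": description})},
    }


urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.orders,PROD)"
pending = deque([make_mcp(urn, "first draft"), make_mcp(urn, "final text")])
store = {}
count = apply_proposals(pending, store)
print(count, store[(urn, "datasetProperties")])
```

Because proposals are applied in arrival order, the later proposal for the same entity and aspect replaces the earlier one in this toy store.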

[Metadata Change Proposal (MCP)]: ../what/mxe.md#metadata-change-proposal-mcp
[Metadata Change Proposal]: ../what/mxe.md#metadata-change-proposal-mcp
[Metadata Change Log (MCL)]: ../what/mxe.md#metadata-change-log-mcl
[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[mce-consumer-job]: ../../metadata-jobs/mce-consumer-job
[Python emitters]: ../../metadata-ingestion/README.md#using-as-a-library