import FeatureAvailability from '@site/src/components/FeatureAvailability';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Profiling ingestions

<FeatureAvailability/>

**🤝 Version compatibility**

> DataHub Core (Open Source): **0.11.1** | DataHub Cloud: **0.2.12**

This page documents how to capture memory profiles of ingestion runs.
Profiling is useful when sizing the resources needed to ingest a source, or when developing new features or sources.

## How to use

<Tabs>
<TabItem value="ui" label="UI" default>

Create an ingestion source as described in the [Ingestion guide](../../../docs/ui-ingestion.md).

Add a flag to your ingestion recipe to generate a [memray](https://github.com/bloomberg/memray) memory dump of the run:

```yaml
source: ...

sink: ...

flags:
  generate_memory_profiles: "<path to folder where dumps will be written to>"
```

In the final panel, expand the advanced section and add the `debug` DataHub package under **Extra DataHub Plugins**, as shown below:

<p align="center">
  <img width="70%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-advanced-extra-datahub-plugin.png"/>
</p>

Finally, save and run the ingestion process.

</TabItem>
<TabItem value="cli" label="CLI" default>

Install the `debug` plugin for DataHub's CLI wherever the ingestion runs:

```bash
pip install 'acryl-datahub[debug]'
```

This installs [memray](https://github.com/bloomberg/memray) in your Python environment.

Add a flag to your ingestion recipe to generate a memray memory dump of the run:

```yaml
source: ...

sink: ...

flags:
  generate_memory_profiles: "<path to folder where dumps will be written to>"
```

Finally, run the ingestion recipe:

```bash
$ datahub ingest -c recipe.yaml
```

</TabItem>
</Tabs>

Once the ingestion run starts, a binary file is created and appended to for the duration of the run.

Each file is named following the pattern `file-<ingestion-run-urn>.bin`, so every run can be uniquely identified.
Once the ingestion has finished, you can use `memray` to render the memory dump as a flamegraph:

`$ memray flamegraph file-None-file-2023_09_18-21_38_43.bin`

This will generate an interactive HTML file for analysis:

<p align="center">
    <img width="70%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/metadata-ingestion/memray-example.png?raw=true"/>
</p>

`memray` has an extensive set of features for memory investigation. See the [memray documentation](https://bloomberg.github.io/memray/overview.html) for the full feature set.
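
When a profile folder accumulates dumps from several runs, it can help to script the rendering step. The following is a minimal sketch and not part of DataHub: the `latest_dump` helper and the `PROFILE_DIR` variable (defaulting to a hypothetical `./memory_profiles` folder) are assumptions made for illustration. It relies only on the `file-*.bin` naming described above and on the `-o` option of `memray flamegraph` for choosing the output file.

```shell
#!/usr/bin/env bash
# Sketch: render a flamegraph for the most recent memray dump in a folder.

# Print the most recently modified file-*.bin in the given directory,
# or nothing if the directory contains no dumps.
latest_dump() {
  ls -t "$1"/file-*.bin 2>/dev/null | head -n 1
}

# Assumption: PROFILE_DIR is the folder configured in `generate_memory_profiles`.
PROFILE_DIR="${PROFILE_DIR:-./memory_profiles}"
dump="$(latest_dump "$PROFILE_DIR")"

if [ -z "$dump" ]; then
  echo "No memray dumps found in $PROFILE_DIR"
elif command -v memray >/dev/null 2>&1; then
  # Write the HTML report next to the dump, reusing its base name.
  memray flamegraph -o "${dump%.bin}.html" "$dump"
else
  echo "memray is not installed; would have rendered $dump"
fi
```

Saved under a name of your choice (for example, a hypothetical `render_latest.sh`), it could be run as `PROFILE_DIR=/path/to/dumps bash render_latest.sh`. Other memray reporters, such as `memray summary` and `memray stats`, accept the same dump file as input.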