import FeatureAvailability from '@site/src/components/FeatureAvailability';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Profiling ingestions
<FeatureAvailability />
**🤝 Version compatibility**
> DataHub Core (Open Source): **0.11.1** | DataHub Cloud: **0.2.12**
This page documents how to profile the memory usage of ingestion runs.
Memory profiles are useful when sizing the resources needed to ingest a given source, and when developing new features or sources.
## How to use
<Tabs>
<TabItem value="ui" label="UI" default>
Create an ingestion as specified in the [Ingestion guide](../../../docs/ui-ingestion.md).
Add a flag to your ingestion recipe to generate a memray memory dump of your ingestion:
```yaml
source: ...
sink: ...
flags:
  generate_memory_profiles: "<path to folder where dumps will be written to>"
```
In the final panel, under the advanced section, add the `debug` DataHub package under the **Extra DataHub Plugins** section, as seen below:
<p align="center">
  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-advanced-extra-datahub-plugin.png" />
</p>
Finally, save and run the ingestion process.
</TabItem>
<TabItem value="cli" label="CLI" default>
Install the `debug` plugin for DataHub's CLI wherever the ingestion runs:
```bash
pip install 'acryl-datahub[debug]'
```
This will install [memray](https://github.com/bloomberg/memray) in your Python environment.
Add a flag to your ingestion recipe to generate a memray memory dump of your ingestion:
```yaml
source: ...
sink: ...
flags:
  generate_memory_profiles: "<path to folder where dumps will be written to>"
```
Finally, run the ingestion recipe:
```bash
$ datahub ingest -c recipe.yaml
```
</TabItem>
</Tabs>
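If recipes are generated or templated in Python rather than edited by hand, the same flag can be switched on programmatically for debugging runs only. The following is a minimal sketch, assuming the recipe has already been loaded into a plain dict; `with_memory_profiling` is a hypothetical helper (not part of DataHub), and the source and sink types in the usage comment are placeholders.

```python
import copy
from typing import Any, Dict


def with_memory_profiling(recipe: Dict[str, Any], dump_dir: str) -> Dict[str, Any]:
    """Return a copy of `recipe` with flags.generate_memory_profiles set.

    Hypothetical helper: it only edits the dict and leaves the input untouched.
    """
    profiled = copy.deepcopy(recipe)
    # An empty `flags:` key in YAML loads as None, so fall back to an empty dict.
    flags = dict(profiled.get("flags") or {})
    flags["generate_memory_profiles"] = dump_dir
    profiled["flags"] = flags
    return profiled


# Usage (placeholder source and sink types):
# recipe = {"source": {"type": "my-source"}, "sink": {"type": "my-sink"}}
# profiled = with_memory_profiling(recipe, "/tmp/memray-dumps")
```

The resulting dict can be written back out as a YAML recipe and run with `datahub ingest -c` as described above.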
Once the ingestion run starts, a binary file is created and appended to for the duration of the run.
These files follow the pattern `file-<ingestion-run-urn>.bin`, so each dump can be traced back to a single run.
Once the ingestion has finished, you can use `memray` to analyze the memory dump in a flamegraph view using:
`$ memray flamegraph file-None-file-2023_09_18-21_38_43.bin`
This will generate an interactive HTML file for analysis:
<p align="center">
  <img width="70%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/metadata-ingestion/memray-example.png?raw=true" />
</p>
`memray` has an extensive set of features for memory investigation. Take a look at its [documentation](https://bloomberg.github.io/memray/overview.html) to see the full feature set.
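When a dump folder accumulates files from several runs, it can help to pick the most recent dump and assemble the `memray flamegraph` call automatically. The sketch below assumes the dumps follow the `file-*.bin` pattern described above; `latest_dump` and `flamegraph_command` are hypothetical helper names, and the only `memray` option used is `--output`, which sets the path of the HTML report.

```python
from pathlib import Path
from typing import List, Optional


def latest_dump(dump_dir: str) -> Optional[Path]:
    """Return the most recently modified file-*.bin dump in dump_dir, or None."""
    dumps = sorted(
        Path(dump_dir).glob("file-*.bin"),
        key=lambda path: path.stat().st_mtime,
    )
    return dumps[-1] if dumps else None


def flamegraph_command(dump: Path) -> List[str]:
    """Build the argument list for `memray flamegraph`.

    The HTML report is written next to the dump, with the same base name.
    """
    report = dump.with_suffix(".html")
    return ["memray", "flamegraph", "--output", str(report), str(dump)]


# Usage, once the ingestion has finished:
# import subprocess
# dump = latest_dump("/tmp/memray-dumps")
# if dump is not None:
#     subprocess.run(flamegraph_command(dump), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the run URN embedded in the file name contains unusual characters.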