2023-12-13 14:03:08 +01:00 
										
									 
								 
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								---
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								title: Metadata Ingestion Best Practices
							 
						 
					
						
							
								
									
										
										
										
											2025-07-11 10:44:34 +05:30 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
								
									
								 
							
							
								description: Discover OpenMetadata data ingestion best practices to optimize your metadata collection, improve performance, and ensure reliable data connectivity.
							 
						 
					
						
							
								
									
										
										
										
											2023-12-13 14:03:08 +01:00 
										
									 
								 
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								slug: /connectors/ingestion/best-practices
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								---
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								# Best Practices for Metadata Ingestion
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								In this section we are going to present some guidelines that can be useful when preparing metadata ingestion both
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								from the UI or via any custom orchestration system.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2024-03-07 16:37:05 +05:30 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
								
									
								 
							
							
								{% note %}
							 
						 
					
						
							
								
									
										
										
										
											2023-12-13 14:03:08 +01:00 
										
									 
								 
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								We will use the generic terms from Airflow, as the most common used tool, but the underlying ideas can be applied anywhere.
							 
						 
					
						
							
								
									
										
										
										
											2024-03-07 16:37:05 +05:30 
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
								
									
								 
							
							
								{% /note %}
							 
						 
					
						
							
								
									
										
										
										
											2023-12-13 14:03:08 +01:00 
										
									 
								 
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								## Generic Practices
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **DAGs should not have any retry**: If the workflow is marked as failed due to any error (unexpected exception, 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
										 
							
							
								    connectivity issues, individual assets’ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    heavy workflows failing in the middle of processing, it will just incur in extra costs. 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    Note that for internal communications between the Ingestion Workflow and the OpenMetadata APIs, we already have an 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    internal retry in place in case of intermittent networking issues.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **DAGs should not have a catch-up**: Any ingestion will be based on the current state of data and metadata. If old  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
										 
							
							
								    runs were skipped for any reason, there is no point in triggering past executions as they won’ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    Just the single, most recent run will already be providing all the information available.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Be mindful of enabled DEBUG logs**: When configuring the ingestion YAML you have the option to control the logging  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    level. Keeping it as INFO (default) is the usual best bet. Only use DEBUG logs when testing out ingestion for the first time
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Test the ingestions using the CLI if you will be building a DAG**: When preparing the first ingestion processes,  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    it is ok to try different configurations (debug logs, enable views, filtering of assets,...). The fastest and 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    easiest way to test the ingestion process that will end up on a DAG is using the CLI (example). Playing with the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    CLI will help you find the right YAML configuration fast. Note that for OpenMetadata, the process that gets
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    triggered from the CLI, is the same as the one that will eventually run in your DAGs. If you have the possibility to 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    test the CLI first, it will give you fast feedback and will help you isolate your tests.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								## Metadata Ingestion
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Apply the right filters**: For example, there is usually no business-related information on schemas such as  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    `INFORMATION_SCHEMA` . You can use OpenMetadata filtering logic on databases, services and tables to opt in/out specific assets.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								## Profiler Ingestion
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **On filters, scheduling and asset importance**: While OpenMetadata provides sampling and multi-threading, profiling 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								   can be a costly and time-consuming process. Then it is important to know which data assets are business critical.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								  -  **Deploy multiple profiler ingestions for the same service**: For a given service, prepare different ingestion
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								      pipelines, each of them attacking a specific set of assets based on input filters. You can then schedule more 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								      important assets to be profiled more often, while keeping the rest of profiles to be executed either on demand, or with lower cadence.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Apply the right sampling**: Important tables can hold higher sampling, while the rest of assets might be good enough with smaller %. 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								## Usage & Lineage Ingestion
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Schedule and log duration should match**: The Log Duration configuration parameter specifies how many days in the  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    past we are going to look for query history data. If we schedule the workflows to run daily, there is no need to
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
										 
							
							
								    look for the past week, as we will be re-analysing data that won’ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								# OpenMetadata Ingestion Troubleshooting
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								Here we will discuss different errors that you might encounter when running a workflow:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Connection errors**: When deploying ingestions from the OpenMetadata UI you have the possibility to test the  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    connection when configuring the service. This connection test happens at the Airflow host configured with OpenMetadata.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
										 
							
							
								    If instead, you are running your ingestion workflows from any external system, you’ 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    where the ingestion runs has the proper network settings to reach both the source system and OpenMetadata.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Processing Errors**: During the workflow process you might see logs like `Cannot ingest X due to Y`  or similar statements. 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    They appear for specific assets being ingested, and the origin can be different:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								  -  Missing permissions on a specific table or tag (e.g., due to BigQuery policies),
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								  -  Internal errors when processing specific assets or translating them to the OpenMetadata standard. 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								  In these cases, you can reach out to the OpenMetadata team. The workflow itself will continue, and the OpenMetadata
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								  team can help analyse the root cause and provide a fix.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								-  **Workflow breaking exceptions**: In rare circumstances there can be exceptions that break the overall workflow processing. 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    The goal of the Ingestion Framework is to be as robust as possible and continue even for specific assets failures 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    (see point above). If there is a scenario not contemplated by the current code, the OpenMetadata team will apply the 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
								
									
								 
							
							
								    highest priority to fix the issue and allow the workflow to run end to end.