---
title: Metadata Ingestion Best Practices
slug: /connectors/ingestion/best-practices
---

# Best Practices for Metadata Ingestion

In this section we present some guidelines that can be useful when preparing metadata ingestion, either from the UI or via any custom orchestration system.

<Note>

We will use generic terms from Airflow, as it is the most commonly used tool, but the underlying ideas can be applied anywhere.

</Note>

## Generic Practices

- **DAGs should not have any retries**: If the workflow is marked as failed due to an error (an unexpected exception,
    connectivity issues, individual assets' errors, ...), there is usually no point in running automatic retries. For
    heavy workflows failing in the middle of processing, retries will just incur extra costs. In Airflow terms, this
    means setting `retries=0`.

    Note that for internal communication between the Ingestion Workflow and the OpenMetadata APIs, we already have an
    internal retry in place to handle intermittent networking issues.
- **DAGs should not have catch-up**: Any ingestion is based on the current state of data and metadata. If old runs
    were skipped for any reason, there is no point in triggering past executions, as they won't add any value: the
    single most recent run already provides all the information available. In Airflow terms, this means setting
    `catchup=False`.
- **Be mindful of enabling DEBUG logs**: When configuring the ingestion YAML you have the option to control the
    logging level. Keeping it at `INFO` (the default) is usually the best bet. Only use DEBUG logs when testing an
    ingestion for the first time.
- **Test the ingestion using the CLI if you will be building a DAG**: When preparing your first ingestion processes,
    it is OK to try different configurations (debug logs, enabling views, filtering of assets, ...). The fastest and
    easiest way to test the ingestion process that will end up in a DAG is to use the CLI (see the sketch after this
    list). Playing with the CLI will help you find the right YAML configuration quickly. Note that for OpenMetadata,
    the process triggered from the CLI is the same one that will eventually run in your DAGs. Testing with the CLI
    first gives you fast feedback and helps you isolate your tests.

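A minimal sketch of such a CLI test, assuming a MySQL service (the service name, credentials, and exact connection fields below are placeholders and can vary per connector and OpenMetadata version):

```yaml
# Hypothetical minimal workflow config. Test it from the CLI before
# wiring it into a DAG:
#   metadata ingest -c mysql_metadata.yaml
source:
  type: mysql                        # assumption: a MySQL database service
  serviceName: my_mysql              # hypothetical service name
  serviceConnection:
    config:
      type: Mysql
      hostPort: localhost:3306       # placeholder connection details
      username: openmetadata_user    # placeholder credentials
      authType:
        password: secret
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: INFO                  # keep INFO; switch to DEBUG only for first tests
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "<token>"
```

Since the CLI triggers the same process that will run in your DAGs, the same YAML can later be handed to your orchestrator unchanged.
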
## Metadata Ingestion

- **Apply the right filters**: For example, there is usually no business-related information in schemas such as
    `INFORMATION_SCHEMA`. You can use the OpenMetadata filtering logic on databases, schemas, and tables to opt
    specific assets in or out, as in the sketch below.

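As a sketch, the filter block of the `sourceConfig` could look like this (the schema names are just common examples of system schemas; all patterns are regular expressions):

```yaml
# Filter fragment for a database metadata workflow: opt out of system
# schemas and opt in to everything else.
sourceConfig:
  config:
    type: DatabaseMetadata
    schemaFilterPattern:
      excludes:
        - information_schema
        - performance_schema
    tableFilterPattern:
      includes:
        - .*
```
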
## Profiler Ingestion

- **On filters, scheduling, and asset importance**: While OpenMetadata provides sampling and multi-threading, profiling
    can still be a costly and time-consuming process, so it is important to know which data assets are business critical.
  - **Deploy multiple profiler ingestions for the same service**: For a given service, prepare different ingestion
      pipelines, each of them targeting a specific set of assets based on input filters. You can then schedule the more
      important assets to be profiled more often, while keeping the remaining profiles running either on demand or at a
      lower cadence (see the sketch after this list).
- **Apply the right sampling**: Important tables can use a higher sampling percentage, while a smaller percentage might
    be good enough for the rest of the assets.

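As a sketch, the pipeline dedicated to critical assets could carry a config like the one below; a second pipeline for the remaining tables would exclude the same patterns and run with a lower `profileSample` and a slower schedule (table names and percentages are placeholders):

```yaml
# Hypothetical "critical assets" profiler pipeline for one service.
source:
  type: mysql
  serviceName: my_mysql              # hypothetical service name
  sourceConfig:
    config:
      type: Profiler
      profileSample: 75              # higher sampling % for business-critical tables
      tableFilterPattern:
        includes:
          - orders.*
          - customers.*
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
```
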
## Usage & Lineage Ingestion

- **Schedule and log duration should match**: The Log Duration configuration parameter specifies how many days in the
    past we are going to look for query history data. If we schedule the workflow to run daily, there is no need to
    look at the past week, as we would just be re-analysing data that won't change (see the sketch below).

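For example, with a daily schedule a one-day window is enough. A sketch assuming a Snowflake usage connector (the service name is a placeholder; `queryLogDuration` is the Log Duration parameter, in days):

```yaml
# Daily usage workflow: a one-day query-log window matches a daily schedule.
source:
  type: snowflake-usage              # assumption: Snowflake usage connector
  serviceName: my_snowflake          # hypothetical service name
  sourceConfig:
    config:
      type: DatabaseUsage
      queryLogDuration: 1            # days of query history to fetch
```
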
# OpenMetadata Ingestion Troubleshooting

Here we will discuss different errors that you might encounter when running a workflow:

- **Connection errors**: When deploying ingestions from the OpenMetadata UI you have the possibility to test the
    connection while configuring the service. This connection test happens on the Airflow host configured with
    OpenMetadata. If, instead, you are running your ingestion workflows from an external system, you'll need to
    validate that the host where the ingestion runs has the proper network settings to reach both the source system
    and OpenMetadata.
- **Processing errors**: During the workflow you might see logs like `Cannot ingest X due to Y` or similar statements.
    They appear for specific assets being ingested, and the origin can differ:
  - missing permissions on a specific table or tag (e.g., due to BigQuery policies),
  - internal errors when processing specific assets or translating them to the OpenMetadata standard.

  In these cases, you can reach out to the OpenMetadata team. The workflow itself will continue, and the OpenMetadata
  team can help analyse the root cause and provide a fix.
- **Workflow-breaking exceptions**: In rare circumstances an exception can break the overall workflow processing.
    The goal of the Ingestion Framework is to be as robust as possible and to continue even when specific assets fail
    (see the point above). If there is a scenario not covered by the current code, the OpenMetadata team will treat
    fixing the issue with the highest priority so that the workflow can run end to end.