Mirror of https://github.com/open-metadata/OpenMetadata.git (synced 2025-10-26 00:04:52 +00:00)

# OpenMetadata Ingestion Docker Operator

Utilities required to handle metadata ingestion in Airflow using `DockerOperator`.

The idea behind this approach is to avoid installing packages directly on any Airflow host.
Installing them there forces the host to stay aligned with the `openmetadata-ingestion` package,
which adds many unnecessary constraints just to have the Python installation available
as a `virtualenv` within the Airflow host.

The proposed alternative is to use the
[DockerOperator](https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html)
and run the ingestion workflows dynamically.

This requires the following:
- Docker image with the bare `openmetadata-ingestion` requirements,
- `main.py` file to execute the `Workflow`s,
- Handling of environment variables as input parameters for the operator.

Note that Airflow's Docker Operator works as follows (example from [here](https://github.com/apache/airflow/blob/providers-docker/3.0.0/tests/system/providers/docker/example_docker.py)):

```python
from airflow.providers.docker.operators.docker import DockerOperator

# `dag` is assumed to be an existing DAG object defined elsewhere
DockerOperator(
    docker_url='unix://var/run/docker.sock',  # Set your docker URL
    command='/bin/sleep 30',
    image='centos:latest',
    network_mode='bridge',
    task_id='docker_op_tester',
    dag=dag,
)
```

We need to provide two ingredients:
1. The Docker image to execute,
2. The command to run.

This is not a Python-first approach: it does not allow us to set a base image and pass a Python function
as a parameter (which would have been the preferred approach). Instead, we will leverage the `environment` input
parameter of the `DockerOperator` and pass all the necessary information in there.

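As a rough illustration of passing the workflow through `environment`, the sketch below builds the keyword arguments as a plain dictionary. The variable names `WORKFLOW_CONFIG` and `PIPELINE_TYPE`, the helper `build_docker_operator_kwargs`, the task id, the image name and the `python main.py` command are all assumptions made for this example, not the project's actual contract.

```python
import json


def build_docker_operator_kwargs(workflow_config: dict, pipeline_type: str, image: str) -> dict:
    """Build DockerOperator keyword arguments that carry the workflow via env vars."""
    return {
        "task_id": "ingestion_task",  # hypothetical task id
        "image": image,  # image holding the openmetadata-ingestion requirements
        "command": "python main.py",  # assumed entrypoint inside the image
        "environment": {
            # Hypothetical variable names; the values are serialized to strings
            "WORKFLOW_CONFIG": json.dumps(workflow_config),
            "PIPELINE_TYPE": pipeline_type,
        },
    }


kwargs = build_docker_operator_kwargs(
    workflow_config={"source": {"type": "mysql"}},
    pipeline_type="metadata",
    image="my-ingestion-image:local",  # placeholder image name
)
```

With the Docker provider installed, these arguments could then be passed along as `DockerOperator(**kwargs, dag=dag)`. Serializing the configuration to JSON keeps it a single string, which is what an environment variable can hold.
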
Our `main.py` Python file will then be in charge of:
1. Loading the workflow configuration from the environment variables,
2. Getting the required workflow class to run, and finally,
3. Executing the workflow.

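The three steps can be sketched as follows. This is a minimal illustration, not the actual `main.py`: the `MetadataWorkflow` class, the `WORKFLOWS` registry and the environment variable names `WORKFLOW_CONFIG` and `PIPELINE_TYPE` are hypothetical stand-ins for the real workflow classes and inputs.

```python
import json
import os


class MetadataWorkflow:
    """Hypothetical stand-in for a real ingestion workflow class."""

    def __init__(self, config: dict):
        self.config = config

    def execute(self) -> str:
        return f"ran metadata workflow for {self.config['source']['type']}"


# Hypothetical registry mapping a pipeline type to the class that runs it
WORKFLOWS = {"metadata": MetadataWorkflow}


def main(environ=os.environ) -> str:
    # 1. Load the workflow configuration from the environment variables
    config = json.loads(environ["WORKFLOW_CONFIG"])
    # 2. Get the required workflow class to run
    pipeline_type = environ["PIPELINE_TYPE"]
    workflow_class = WORKFLOWS.get(pipeline_type)
    if workflow_class is None:
        raise ValueError(f"Unsupported pipeline type: {pipeline_type}")
    # 3. Execute the workflow
    return workflow_class(config).execute()


# Pass a plain dict instead of os.environ to try it without a container
result = main({"WORKFLOW_CONFIG": '{"source": {"type": "mysql"}}', "PIPELINE_TYPE": "metadata"})
```

Accepting the environment as a parameter makes the entrypoint easy to exercise locally; inside the container it would simply be called as `main()`.
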
To try this locally, you can build the DEV image with `make build-ingestion-base-local` from the project root.

## Further improvements

We have two operators to leverage if we don't want to run the ingestion from Airflow's host environment:

```python
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
```

These can be installed from `apache-airflow[docker]` and `apache-airflow[kubernetes]`, respectively.

If we want to handle both of these directly in `openmetadata-managed-apis`, we need to consider a couple of things:
1. `DockerOperator` will only work with Docker and `KubernetesPodOperator` will only work with a k8s cluster. This means
    that we'll need to dynamically handle the internal logic to use either of them depending on the deployment.
    [Docs](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html).
2. For GKE deployments things get a bit more complicated, as we'll need to use and test yet another operator
    custom-built for GKE: `GKEStartPodOperator`. [Docs](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html#howto-operator-gkestartpodoperator)

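
One possible way to handle this dynamic choice is to map a deployment label to the operator's import path and import it lazily, so that only the provider matching the deployment needs to be installed. The sketch below assumes hypothetical labels (`docker`, `kubernetes`, `gke`) and hypothetical helpers `operator_path` and `load_operator`. The Docker and Kubernetes paths are taken from the imports shown earlier in this section, and the exact module paths may differ across provider versions.

```python
import importlib

# Dotted paths of the operators discussed above, keyed by a hypothetical deployment label
OPERATOR_PATHS = {
    "docker": "airflow.providers.docker.operators.docker.DockerOperator",
    "kubernetes": "airflow.providers.cncf.kubernetes.operators.kubernetes_pod.KubernetesPodOperator",
    "gke": "airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator",
}


def operator_path(deployment: str) -> str:
    """Return the dotted import path of the operator for a deployment label."""
    try:
        return OPERATOR_PATHS[deployment]
    except KeyError:
        raise ValueError(f"Unsupported deployment: {deployment!r}") from None


def load_operator(deployment: str):
    """Import the operator class lazily; requires the matching Airflow provider."""
    module_name, class_name = operator_path(deployment).rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)
```

Calling `load_operator("docker")` would return the `DockerOperator` class when the Docker provider is installed, and raise an `ImportError` otherwise.
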