# OpenMetadata Ingestion Docker Operator

Utilities required to handle metadata ingestion in Airflow using `DockerOperator`.

The whole idea behind this approach is to avoid having to install packages directly
on the Airflow host: doing so adds many (unnecessary) constraints, since the host
environment must stay aligned with the `openmetadata-ingestion` package just to keep
its Python installation as a `virtualenv` within the Airflow host.

The proposed solution - or alternative approach - is to use the
[DockerOperator](https://airflow.apache.org/docs/apache-airflow-providers-docker/stable/_api/airflow/providers/docker/operators/docker/index.html)
and run the ingestion workflows dynamically.

This requires the following:
- A Docker image with the bare `openmetadata-ingestion` requirements,
- A `main.py` file to execute the `Workflow`s,
- Handling of environment variables as input parameters for the operator.

Note that Airflow's Docker Operator works as follows (example from [here](https://github.com/apache/airflow/blob/providers-docker/3.0.0/tests/system/providers/docker/example_docker.py)):

```python
DockerOperator(
    docker_url='unix://var/run/docker.sock',  # Set your docker URL
    command='/bin/sleep 30',
    image='centos:latest',
    network_mode='bridge',
    task_id='docker_op_tester',
    dag=dag,
)
```

We need to provide as ingredients:
1. The Docker image to execute, and
2. The command to run.

This is not a Python-first approach, and therefore it does not allow us to set a base image and pass a Python function
as a parameter (which would have been the preferred approach). Instead, we will leverage the `environment` input
parameter of the `DockerOperator` and pass all the necessary information there.
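
For illustration, a minimal sketch of what such an operator definition could look like. The image tag, the `pipelineType`/`config` variable names, and the config payload are placeholders for this example, not the exact contract expected by the ingestion image:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Placeholder workflow configuration; the real schema comes from the
# openmetadata-ingestion workflow config files.
workflow_config = {"source": {"type": "mysql"}, "sink": {"type": "metadata-rest"}}

with DAG(
    dag_id="openmetadata_docker_ingestion",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    ingest = DockerOperator(
        task_id="ingest",
        image="openmetadata/ingestion-base:local",  # assumed DEV image tag
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
        command="python main.py",
        # Everything main.py needs travels as environment variables.
        environment={
            "pipelineType": "metadata",  # placeholder selector for the Workflow to run
            "config": json.dumps(workflow_config),
        },
    )
```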

Our `main.py` Python file will then be in charge of:
1. Loading the workflow configuration from the environment variables,
2. Getting the required workflow class to run, and finally,
3. Executing the workflow.
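
A sketch of what that entrypoint could look like, assuming the `config` environment variable from the example above and the `Workflow` API exposed by `openmetadata-ingestion` (treat the exact import path and method names as illustrative; they may differ across versions):

```python
import json
import os

# Assumed import path; check your openmetadata-ingestion version.
from metadata.ingestion.api.workflow import Workflow


def main() -> None:
    # 1. Load the workflow configuration from the environment.
    config = json.loads(os.environ["config"])

    # 2. Build the required workflow class from the configuration.
    workflow = Workflow.create(config)

    # 3. Execute it, surfacing failures through the container exit code.
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


if __name__ == "__main__":
    main()
```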

To try this locally, you can build the DEV image with `make build-ingestion-base-local` from the project root.

## Further improvements

We have two operators to leverage if we don't want to run the ingestion from Airflow's host environment:

```python
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
```

These can be installed via the `apache-airflow[docker]` and `apache-airflow[kubernetes]` extras, respectively.

If we want to handle both of these directly within `openmetadata-managed-apis`, we need to consider a couple of things:
1. `DockerOperator` will only work with Docker and `KubernetesPodOperator` will only work with a k8s cluster. This means
    that we'll need to dynamically handle the internal logic to use either of them depending on the deployment (see the
    sketch after this list). [Docs](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html).
2. For GKE deployments, things get a bit more complicated, as we'll need to use and test yet another operator
    custom-built for GKE: `GKEStartPodOperator`. [Docs](https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html#howto-operator-gkestartpodoperator)
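
As a rough illustration of point 1, a helper like the one below could pick the operator based on a deployment flag. The function name, the `deployment` values, and the parameter wiring are all assumptions for this sketch, not an API that `openmetadata-managed-apis` ships:

```python
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)


def build_ingestion_operator(deployment: str, task_id: str, image: str, env: dict):
    """Pick the operator matching the deployment target (hypothetical helper)."""
    if deployment == "docker":
        return DockerOperator(
            task_id=task_id,
            image=image,
            environment=env,
            docker_url="unix://var/run/docker.sock",
            network_mode="bridge",
        )
    if deployment == "kubernetes":
        return KubernetesPodOperator(
            task_id=task_id,
            name=task_id,
            image=image,
            env_vars=env,  # recent provider versions accept a plain dict here
            namespace="default",
        )
    raise ValueError(f"Unsupported deployment target: {deployment}")
```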
