Mirror of https://github.com/open-metadata/OpenMetadata.git (synced 2025-11-04 04:29:13 +00:00)

# Installation and deployment instructions (using Postgres as example)

Below are the instructions for connecting a Postgres server. The installation steps should be the same for every kind of server; only the configuration in the .yaml or DAG files differs. See https://docs.open-metadata.org/integrations/connectors for the configuration of your connector.

# Goal: To run Postgres metadata ingestion and quality tests with OpenMetadata using Airflow scheduler

Note: This procedure does not support Windows, because Windows does not implement `signal.SIGALRM`. **It is highly recommended to use WSL 2 if you are on Windows**.

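To see whether a given machine is affected by the `signal.SIGALRM` limitation noted above, a minimal check can be run before installing anything. This is only a sketch; the function name `supports_sigalrm` is made up for illustration and is not part of OpenMetadata.

```python
import signal


def supports_sigalrm() -> bool:
    """Return True when this platform's signal module exposes SIGALRM."""
    # Hypothetical helper: native Windows builds of Python do not define
    # signal.SIGALRM, while Linux, macOS and WSL 2 do.
    return hasattr(signal, "SIGALRM")


if __name__ == "__main__":
    if supports_sigalrm():
        print("signal.SIGALRM is available; this procedure can run here.")
    else:
        print("signal.SIGALRM is missing; consider using WSL 2.")
```
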
## Requirements:
See the "Requirements" section of https://docs.open-metadata.org/overview/run-openmetadata-with-prefect

## Installation:
1. Clone this GitHub repo:
`git clone https://github.com/open-metadata/OpenMetadata.git`

2. Cd to ~/.../openmetadata/docker/metadata

3. Start the OpenMetadata containers. This will allow you to run OpenMetadata in Docker:
`docker compose up -d`
- To check the status of the services, run `docker compose ps`
- To access the UI: http://localhost:8585

4. Install the OpenMetadata ingestion package.
- (optional but highly recommended): Before installing this package, create and activate a virtual environment. To do this, run:
`python -m venv env` and `source env/bin/activate`

- To install the OpenMetadata ingestion package:
`pip install --upgrade "openmetadata-ingestion[docker]==0.10.3"` (specify the release version to ensure compatibility)

5. Install Airflow:
- 5A: Install Airflow Lineage Backend: `pip3 install "openmetadata-ingestion[airflow-container]==0.10.3"`
- 5B: Install Airflow postgres connector module: `pip3 install "openmetadata-ingestion[postgres]==0.10.3"`
- 5C: Install Airflow APIs: `pip3 install "openmetadata-airflow-managed-apis==0.10.3"`
- 5D: Install necessary Airflow plugins:
    - 1) Download the latest openmetadata-airflow-apis-plugins release from https://github.com/open-metadata/OpenMetadata/releases
    - 2) Untar it under your {AIRFLOW_HOME} directory (usually c/Users/Yourname/airflow). This will create and set up a plugins directory under {AIRFLOW_HOME}.
    - 3) `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`
    - 4) `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`
    - 5) (Re)start the Airflow webserver and scheduler

6. Configure Airflow:
- 6A: Configure airflow.cfg in your AIRFLOW_HOME directory. Check that all folder paths point to the right places. For instance, `dags_folder = YOUR_AIRFLOW_HOME/dags`
- 6B: Configure openmetadata.yaml and update the airflowConfiguration section. See: https://docs.open-metadata.org/integrations/airflow/configure-airflow-in-the-openmetadata-server

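The `dags_folder` check in step 6A can also be scripted. The following is a minimal sketch, assuming airflow.cfg keeps `dags_folder` in its `[core]` section (where a default Airflow installation puts it); the helper name `read_dags_folder` is hypothetical.

```python
import configparser
from pathlib import Path


def read_dags_folder(airflow_home: Path) -> Path:
    """Return the dags_folder configured in <airflow_home>/airflow.cfg."""
    # Interpolation is disabled so that '%' characters elsewhere in the
    # file (for example in log formats) are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    cfg_path = airflow_home / "airflow.cfg"
    # ConfigParser.read() returns the list of files it managed to parse.
    if not parser.read(cfg_path, encoding="utf-8"):
        raise FileNotFoundError(f"no airflow.cfg found at {cfg_path}")
    return Path(parser.get("core", "dags_folder"))
```

Calling `read_dags_folder(Path.home() / "airflow")` and then `.is_dir()` on the result would show whether the configured folder actually exists; adjust the path if your AIRFLOW_HOME is elsewhere.
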
## To run a metadata ingestion workflow with Airflow ingestion DAGs on Postgres data:

1. Prepare the Ingestion DAG:
For a more complete tutorial on the ingestion DAG, see https://docs.open-metadata.org/integrations/connectors/postgres/run-postgres-connector-with-the-airflow-sdk
In brief, below is my own DAG. Copy and paste the following into a Python file (postgres_demo.py):

```
import json
from datetime import timedelta
from airflow import DAG

# PythonOperator moved between Airflow releases; try the newer path first.
try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from metadata.ingestion.api.workflow import Workflow
from airflow.utils.dates import days_ago

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60),
}

# Workflow configuration as a JSON string. JSON has no comment syntax,
# so the notes are kept here:
# - Change username, password, hostPort and database to your own values.
# - Every flag under sourceConfig.config can be set to true or false;
#   the values below are only examples.
config = """
{
    "source":{
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "postgres",
                "password": "postgres",
                "hostPort": "192.168.1.55:5432",
                "database": "surveillance_hub"
            }
        },
        "sourceConfig":{
            "config": {
                "enableDataProfiler": false,
                "markDeletedTables": true,
                "includeTables": true,
                "includeViews": true,
                "generateSampleData": true
            }
        }
    },
    "sink":{
        "type": "metadata-rest",
        "config": {}
    },
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth"
        }
    }
}
"""

def metadata_ingestion_workflow():
    # Parse the JSON above, run the ingestion, then report and clean up.
    workflow_config = json.loads(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "sample_data",
    default_args=default_args,
    description="An example DAG which runs an OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval='*/5 * * * *',  # every five minutes
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )

# Lets `python postgres_demo.py` run the ingestion once, outside the scheduler.
if __name__ == "__main__":
    metadata_ingestion_workflow()
```

2. Run the DAG:
`python postgres_demo.py`

- Alternatively, we could skip the Airflow SDK and use OpenMetadata's own CLI. Run `metadata ingest -c /Your_Path_To_Json/.json`
The JSON configuration is exactly the same as the JSON configuration in the DAG.
- Or, we could also run it with `metadata ingest -c /Your_Path_To_Yaml/.yaml`
The YAML configuration has the same content, written in YAML syntax without the curly brackets and the double quotes.

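Since the JSON file passed to `metadata ingest -c` has the same content as the `config` string in the DAG, one way to keep the two in sync is to generate the file from a Python dictionary. The sketch below makes several assumptions: the connection values are placeholders, the helper `write_config` is hypothetical, and the list of required top-level keys is simply taken from the example in this document, not from the OpenMetadata schema.

```python
import json
from pathlib import Path

# Same structure as the `config` string in the DAG; values are placeholders.
CONFIG = {
    "source": {
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "your_username",
                "password": "your_password",
                "hostPort": "your_host:5432",
                "database": "your_database",
            }
        },
        "sourceConfig": {"config": {"enableDataProfiler": False}},
    },
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth",
        }
    },
}

# Assumption: these are the sections used by the example in this document.
REQUIRED_TOP_LEVEL_KEYS = ("source", "sink", "workflowConfig")


def write_config(config: dict, path: Path) -> Path:
    """Check the top-level sections, then write the config as JSON."""
    missing = [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing top-level keys: {missing}")
    # json.dumps emits real JSON booleans (true/false) for Python bools.
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return path
```

For example, `write_config(CONFIG, Path("postgres_demo.json"))` would produce a file that can then be passed to `metadata ingest -c postgres_demo.json`.
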
## To run a profiler workflow on Postgres data
1. Prepare the DAG OR configure the yaml/json:
- To configure the quality tests in json/yaml, see https://docs.open-metadata.org/data-quality/data-quality-overview/tests
- To prepare the DAG, see https://github.com/open-metadata/OpenMetadata/tree/0.10.3-release/data-quality/data-quality-overview

Example yaml I was using:
```
source:
  type: postgres
  serviceName: your_service_name
  serviceConnection:
    config:
      type: Postgres
      username: your_username
      password: your_password
      hostPort: your_host:your_port
      database: your_database
  sourceConfig:
    config:
      type: Profiler

processor:
  type: orm-profiler
  config:
    test_suite:
      name: demo_test
      tests:
        - table: your_table_name  # must be the fully qualified name (FQN)
          column_tests:
            - columnName: id
              testCase:
                columnTestType: columnValuesToBeBetween
                config:
                  minValue: 0
                  maxValue: 10
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
Note that the table name must be the fully qualified name (FQN) and must match exactly the table path shown in the OpenMetadata UI.

2. Run it with
`metadata profile -c /path_to_yaml/.yaml`

Make sure to refresh the OpenMetadata UI and click on the Data Quality tab to see the results.
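
To make the `columnValuesToBeBetween` test from the example YAML concrete, here is a plain-Python illustration of the condition it expresses with `minValue: 0` and `maxValue: 10`. This is not how the profiler computes it (the profiler evaluates the column in the database); treating the bounds as inclusive and skipping nulls are assumptions made for this sketch.

```python
from typing import Iterable, Optional


def column_values_to_be_between(
    values: Iterable[Optional[float]], min_value: float, max_value: float
) -> bool:
    """Return True when every non-null value lies in [min_value, max_value]."""
    # Assumption: nulls are ignored and both bounds are inclusive.
    present = [v for v in values if v is not None]
    return all(min_value <= v <= max_value for v in present)


if __name__ == "__main__":
    print(column_values_to_be_between([0, 3, 10], 0, 10))  # True
    print(column_values_to_be_between([0, 3, 11], 0, 10))  # False
```
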