# Installation and deployment instructions (using Postgres as example)

Below are the instructions for connecting a Postgres server. The installation steps are the same for all kinds of servers; different servers only require different configurations in the .yaml or DAG files. See https://docs.open-metadata.org/integrations/connectors for your configuration.

# Goal: To run Postgres metadata ingestion and quality tests with OpenMetadata using the Airflow scheduler

Note: This procedure does not support Windows, because Windows does not implement "signal.SIGALRM". **It is highly recommended to use WSL 2 if you are on Windows.**

## Requirements:
See the "Requirements" section of https://docs.open-metadata.org/overview/run-openmetadata-with-prefect

## Installation:
1. Clone this GitHub repo:
`git clone https://github.com/open-metadata/OpenMetadata.git`

2. cd to ~/.../openmetadata/docker/metadata

3. Start the OpenMetadata containers. This will allow you to run OpenMetadata in Docker:
`docker compose up -d`
- To check the status of services, run `docker compose ps`
- To access the UI: http://localhost:8585

4. Install the OpenMetadata ingestion package.
- (optional but highly recommended): Before installing this package, create and activate a virtual environment:
`python -m venv env` and `source env/bin/activate`

- To install the OpenMetadata ingestion package:
`pip install --upgrade "openmetadata-ingestion[docker]==0.10.3"` (specify the release version to ensure compatibility)

5. Install Airflow:
- 5A: Install the Airflow Lineage Backend: `pip3 install "openmetadata-ingestion[airflow-container]"==0.10.3`
- 5B: Install the Airflow Postgres connector module: `pip3 install "openmetadata-ingestion[postgres]"==0.10.3`
- 5C: Install the Airflow APIs: `pip3 install "openmetadata-airflow-managed-apis"==0.10.3`
- 5D: Install the necessary Airflow plugins:
    - 1) Download the latest openmetadata-airflow-apis-plugins release from https://github.com/open-metadata/OpenMetadata/releases
    - 2) Untar it under your {AIRFLOW_HOME} directory (usually ~/airflow). This will create and set up a plugins directory under {AIRFLOW_HOME}.
    - 3) `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`
    - 4) `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`
    - 5) (re)start the Airflow webserver and scheduler

6. Configure Airflow:
- 6A: Configure airflow.cfg in your AIRFLOW_HOME directory. Check that all the folder settings point to the right places; for instance, dags_folder = YOUR_AIRFLOW_HOME/dags
- 6B: Configure openmetadata.yaml and update the airflowConfiguration section. See https://docs.open-metadata.org/integrations/airflow/configure-airflow-in-the-openmetadata-server
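For orientation, here is a minimal sketch of what that airflowConfiguration section can look like. The field names below are assumptions based on the 0.10.x docs, not a definitive schema; confirm them against the link above:

```
airflowConfiguration:
  apiEndpoint: http://localhost:8080            # assumption: where your Airflow webserver listens
  username: admin                               # assumption: your Airflow credentials
  password: admin
  metadataApiEndpoint: http://localhost:8585/api
  authProvider: no-auth
```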

## To run a metadata ingestion workflow with Airflow ingestion DAGs on Postgres data:

1. Prepare the Ingestion DAG:
For a more complete tutorial on the ingestion DAG, see https://docs.open-metadata.org/integrations/connectors/postgres/run-postgres-connector-with-the-airflow-sdk
In brief, below is my own DAG. Copy and paste the following into a Python file (postgres_demo.py):
```
import json
from datetime import timedelta

from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from airflow.utils.dates import days_ago

from metadata.ingestion.api.workflow import Workflow

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60)
}

# Change username, password, hostPort, and database to your own values.
# Every flag under sourceConfig accepts "true" or "false".
config = """
{
    "source": {
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "postgres",
                "password": "postgres",
                "hostPort": "192.168.1.55:5432",
                "database": "surveillance_hub"
            }
        },
        "sourceConfig": {
            "config": {
                "enableDataProfiler": "false",
                "markDeletedTables": "true",
                "includeTables": "true",
                "includeViews": "true",
                "generateSampleData": "true"
            }
        }
    },
    "sink": {
        "type": "metadata-rest",
        "config": {}
    },
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth"
        }
    }
}
"""

def metadata_ingestion_workflow():
    workflow_config = json.loads(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()

with DAG(
    "sample_data",
    default_args=default_args,
    description="An example DAG which runs an OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval='*/5 * * * *',
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )

if __name__ == "__main__":
    metadata_ingestion_workflow()
```

2. Run the DAG:
`python postgres_demo.py`
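Note that running the file directly only executes `metadata_ingestion_workflow()` once via the `__main__` guard; to have Airflow run it on the five-minute schedule configured above, copy postgres_demo.py into your dags_folder and check that it appears in `airflow dags list`.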

- Alternatively, we can skip the Airflow SDK and use OpenMetadata's own CLI: run `metadata ingest -c /Your_Path_To_Json/.json`
The JSON configuration is exactly the same as the JSON configuration in the DAG.
- Or we can run it with `metadata ingest -c /Your_Path_To_Yaml/.yaml`
The YAML configuration carries exactly the same fields, just without the curly brackets and double quotes; see the sketch below.
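For reference, a minimal sketch of the same ingestion config in YAML form (a direct translation of the JSON above, with the same placeholder values for you to replace):

```
source:
  type: postgres
  serviceName: postgres_demo
  serviceConnection:
    config:
      type: Postgres
      username: postgres
      password: postgres
      hostPort: 192.168.1.55:5432
      database: surveillance_hub
  sourceConfig:
    config:
      enableDataProfiler: "false"
      markDeletedTables: "true"
      includeTables: "true"
      includeViews: "true"
      generateSampleData: "true"
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```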

## To run a profiler workflow on Postgres data
1. Prepare the DAG OR configure the yaml/json:
- To configure the quality tests in json/yaml, see https://docs.open-metadata.org/data-quality/data-quality-overview/tests
- To prepare the DAG, see https://github.com/open-metadata/OpenMetadata/tree/0.10.3-release/data-quality/data-quality-overview

Example yaml I was using:
```
source:
  type: postgres
  serviceName: your_service_name
  serviceConnection:
    config:
      type: Postgres
      username: your_username
      password: your_password
      hostPort: your_hostPort
      database: your_database
  sourceConfig:
    config:
      type: Profiler

processor:
  type: orm-profiler
  config:
    test_suite:
      name: demo_test
      tests:
        - table: your_table_name  # must be the fully qualified name (FQN)
          column_tests:
            - columnName: id
              testCase:
                columnTestType: columnValuesToBeBetween
                config:
                  minValue: 0
                  maxValue: 10
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
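In this example the test suite defines a single column test: columnValuesToBeBetween checks that every value in the id column falls between 0 and 10.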
Note that the table name must be the FQN and must exactly match the table's path in the OpenMetadata UI. For a Postgres service it generally follows service.database.schema.table, e.g. your_service_name.your_database.public.your_table_name (schema name hypothetical).

2. Run it with
`metadata profile -c /path_to_yaml/.yaml`

Make sure to refresh the OpenMetadata UI and click on the Data Quality tab to see the results.