From f3c007c153757971fc5efb09fc30e636063c498a Mon Sep 17 00:00:00 2001
From: Bohan Hou <64535834+bohanhou14@users.noreply.github.com>
Date: Thu, 13 Apr 2023 19:08:45 -0400
Subject: [PATCH] added a installation manual (postgres demo) (#5855)

Co-authored-by: Sriharsha Chintalapani
---
 installation_deployment_postgres_demo.md | 199 +++++++++++++++++++++++
 1 file changed, 199 insertions(+)
 create mode 100644 installation_deployment_postgres_demo.md

diff --git a/installation_deployment_postgres_demo.md b/installation_deployment_postgres_demo.md
new file mode 100644
index 00000000000..260593bda10
--- /dev/null
+++ b/installation_deployment_postgres_demo.md
@@ -0,0 +1,199 @@
# Installation and deployment instructions (using Postgres as an example)

Below are the instructions for connecting to a Postgres server. The installation steps are the same for all supported services; different services only require different configurations in the .yaml or DAG files. See https://docs.open-metadata.org/integrations/connectors for your service's configuration.

# Goal: To run Postgres metadata ingestion and quality tests with OpenMetadata using the Airflow scheduler

Note: This procedure does not support Windows, because Windows does not implement `signal.SIGALRM`. **It is highly recommended to use WSL 2 if you are on Windows**.

## Requirements:
See the "Requirements" section of https://docs.open-metadata.org/overview/run-openmetadata-with-prefect

## Installation:
1. Clone the OpenMetadata GitHub repo:
`git clone https://github.com/open-metadata/OpenMetadata.git`

2. `cd` into the `docker/metadata` directory of the cloned repo (i.e. `~/.../openmetadata/docker/metadata`)

3. Start the OpenMetadata containers. This lets you run OpenMetadata in Docker:
`docker compose up -d`
- To check the status of the services, run `docker compose ps`
- To access the UI, open http://localhost:8585

4. Install the OpenMetadata ingestion package.
- (optional but highly recommended): Before installing this package, create and activate a virtual environment. To do this, run:
`python -m venv env` and `source env/bin/activate`

- To install the OpenMetadata ingestion package:
`pip install --upgrade "openmetadata-ingestion[docker]==0.10.3"` (pin the release version to ensure compatibility)

5. Install Airflow:
- 5A: Install the Airflow Lineage Backend: `pip3 install "openmetadata-ingestion[airflow-container]"==0.10.3`
- 5B: Install the Airflow Postgres connector module: `pip3 install "openmetadata-ingestion[postgres]"==0.10.3`
- 5C: Install the Airflow APIs: `pip3 install "openmetadata-airflow-managed-apis"==0.10.3`
- 5D: Install the necessary Airflow plugins:
  - 1) Download the latest openmetadata-airflow-apis-plugins release from https://github.com/open-metadata/OpenMetadata/releases
  - 2) Untar it under your {AIRFLOW_HOME} directory (usually /c/Users/<your-name>/airflow on WSL). This creates and sets up a plugins directory under {AIRFLOW_HOME}.
  - 3) `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`
  - 4) `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`
  - 5) (re)start the Airflow webserver and scheduler

6. Configure Airflow:
- 6A: Configure airflow.cfg in your AIRFLOW_HOME directory. Make sure all directory settings point to the right places; for instance, dags_folder = YOUR_AIRFLOW_HOME/dags
- 6B: Configure openmetadata.yaml and update the airflowConfiguration section (a sketch is shown below). See: https://docs.open-metadata.org/integrations/airflow/configure-airflow-in-the-openmetadata-server
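For reference, in the 0.10.x release the airflowConfiguration block of openmetadata.yaml looks roughly like the sketch below. The exact keys and defaults here are assumptions from that release line, so defer to the linked page for your version:

```
airflowConfiguration:
  apiEndpoint: http://localhost:8080           # Airflow webserver (assumed default)
  username: admin                              # Airflow user OpenMetadata calls the APIs with
  password: admin
  metadataApiEndpoint: http://localhost:8585/api
  authProvider: no-auth
```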
## To run a metadata ingestion workflow with Airflow ingestion DAGs on Postgres data:

1. Prepare the ingestion DAG:
For a more complete tutorial on ingestion DAGs, see https://docs.open-metadata.org/integrations/connectors/postgres/run-postgres-connector-with-the-airflow-sdk
In brief, below is my own DAG. Copy and paste the following into a Python file (postgres_demo.py):

```
import json
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from metadata.ingestion.api.workflow import Workflow

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60),
}

# Change username, password, hostPort, and database to match your own
# Postgres service. Each flag under sourceConfig.config accepts "true"
# or "false".
config = """
{
    "source": {
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "postgres",
                "password": "postgres",
                "hostPort": "192.168.1.55:5432",
                "database": "surveillance_hub"
            }
        },
        "sourceConfig": {
            "config": {
                "enableDataProfiler": "false",
                "markDeletedTables": "true",
                "includeTables": "true",
                "includeViews": "true",
                "generateSampleData": "true"
            }
        }
    },
    "sink": {
        "type": "metadata-rest",
        "config": {}
    },
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth"
        }
    }
}
"""


def metadata_ingestion_workflow():
    workflow_config = json.loads(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "sample_data",
    default_args=default_args,
    description="An example DAG which runs an OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval="*/5 * * * *",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )

if __name__ == "__main__":
    metadata_ingestion_workflow()
```

2. Run the DAG:
`python postgres_demo.py`
This runs the ingestion workflow once, directly. To have Airflow schedule it instead, copy postgres_demo.py into the dags_folder configured in step 6A and let the scheduler pick it up.

- Alternatively, you can run the ingestion without the Airflow SDK, using OpenMetadata's own CLI. Run `metadata ingest -c /Your_Path_To_Json/.json`
The JSON file holds exactly the same configuration as the `config` string in the DAG.
- Or, you can run it with `metadata ingest -c /Your_Path_To_Yaml/.yaml`
The YAML configuration carries the same content, just in YAML syntax instead of JSON (see the example below).
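For example, the JSON `config` string from the DAG above translates into the following YAML:

```
source:
  type: postgres
  serviceName: postgres_demo
  serviceConnection:
    config:
      type: Postgres
      username: postgres
      password: postgres
      hostPort: 192.168.1.55:5432
      database: surveillance_hub
  sourceConfig:
    config:
      enableDataProfiler: "false"
      markDeletedTables: "true"
      includeTables: "true"
      includeViews: "true"
      generateSampleData: "true"
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```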
## To run a profiler workflow on Postgres data
1. Prepare the DAG OR configure the yaml/json:
- To configure the quality tests in json/yaml, see https://docs.open-metadata.org/data-quality/data-quality-overview/tests
- To prepare the DAG, see https://github.com/open-metadata/OpenMetadata/tree/0.10.3-release/data-quality/data-quality-overview

Example YAML I was using:
```
source:
  type: postgres
  serviceName: your_service_name
  serviceConnection:
    config:
      type: Postgres
      username: your_username
      password: your_password
      hostPort: your_host:your_port
      database: your_database
  sourceConfig:
    config:
      type: Profiler

processor:
  type: orm-profiler
  config:
    test_suite:
      name: demo_test
      tests:
        - table: your_table_name  # must be the fully qualified name (FQN)
          column_tests:
            - columnName: id
              testCase:
                columnTestType: columnValuesToBeBetween
                config:
                  minValue: 0
                  maxValue: 10
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
Note that the table name must be a fully qualified name (FQN) and must exactly match the table path shown in the OpenMetadata UI.

2. Run it with
`metadata profile -c /path_to_yaml/.yaml`

Make sure to refresh the OpenMetadata UI and click on the Data Quality tab to see the results.
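If you would rather schedule the profiler with Airflow (the DAG option from step 1) than run it from the CLI, below is a minimal sketch. The `ProfilerWorkflow` import path is my assumption for the 0.10.x release and the YAML path is a placeholder, so check your installed version and adjust:

```
from datetime import timedelta

import yaml
from airflow import DAG
from airflow.utils.dates import days_ago

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

# Assumed 0.10.x import path; later releases may move this class.
from metadata.orm_profiler.api.workflow import ProfilerWorkflow

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}


def profiler_workflow():
    # Placeholder path: point this at the profiler YAML shown above.
    with open("/path_to_yaml/.yaml") as f:
        workflow_config = yaml.safe_load(f)
    workflow = ProfilerWorkflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "postgres_profiler_demo",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    profile_task = PythonOperator(
        task_id="profile_using_yaml",
        python_callable=profiler_workflow,
    )
```

As with the ingestion DAG, dropping this file into your dags_folder lets the scheduler run the profiler on the cadence set by schedule_interval.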