diff --git a/openmetadata-docs/content/deployment/airflow/index.md b/openmetadata-docs/content/deployment/airflow/index.md index 78c96558434..343f37ec20f 100644 --- a/openmetadata-docs/content/deployment/airflow/index.md +++ b/openmetadata-docs/content/deployment/airflow/index.md @@ -3,11 +3,156 @@ title: Airflow Deployment slug: /deployment/airflow --- -# Airflow Deployment +# Airflow + +This section will show you how to configure your Airflow instance to run the OpenMetadata workflows. + +Moreover, we will show the required steps to connect your Airflow instance to the OpenMetadata server +so that you can deploy workflows from the OpenMetadata UI directly to your instance. + +1. If you do not have an Airflow service up and running on your platform, we provide a custom + [Docker](https://hub.docker.com/r/openmetadata/ingestion) image, which already contains the OpenMetadata ingestion + packages and custom [Airflow APIs](https://github.com/open-metadata/openmetadata-airflow-apis) to + deploy workflows from the UI as well. +2. If you already have Airflow up and running and want to use it for the metadata ingestion, you will + need to install the ingestion modules on the host. You can find more information on how to do this + in the Custom Airflow Installation section. ## Custom Airflow Installation -## Configure in the OpenMetadata Server +If you already have an Airflow instance up and running, you might want to reuse it to host the metadata workflows as +well. Here we will guide you through the different aspects to consider when configuring an existing Airflow instance. -Show all security info here +There are three different angles here: +1. Installing the ingestion modules directly on the host to enable the [Airflow Lineage Backend](/openmetadata/connectors/pipeline/airflow/lineage-backend). +2. Installing connector modules on the host to run specific workflows. +3. Installing the Airflow APIs to enable the workflow deployment through the UI. + +Depending on what you wish to use, you may only need some of these installations. Note that the installation +commands shown below need to be run on the Airflow instances. + +### Airflow Lineage Backend + +Goals: + +- Ingest DAGs and Tasks as Pipeline Entities when they run. +- Track DAG and Task status. +- Document lineage as code directly on the DAG definition and ingest it when the DAGs run. + +Get the necessary information to install and extract metadata from the Lineage Backend [here](/openmetadata/connectors/pipeline/airflow/lineage-backend). + +### Connector Modules + +Goal: + +- Ingest metadata from specific sources. + +Our current approach is to prepare the metadata ingestion DAGs as `PythonOperators`. This means that +the packages need to be present on the Airflow instances. + +You will need to install: + +```commandline +pip3 install "openmetadata-ingestion[<plugin>]" +``` + +And then run the DAG as explained in each [Connector](/openmetadata/connectors). + +### Airflow APIs + +Goal: + +- Deploy metadata ingestion workflows directly from the UI. + +This process consists of three steps: + +1. Install the APIs module, +2. Install the required plugins, and +3. Configure the OpenMetadata server. + +The goal of this module is to add some HTTP endpoints that the UI calls for deploying the Airflow DAGs. +The first step can be achieved by running: + +```commandline +pip3 install "openmetadata-airflow-managed-apis" +``` + +Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the +necessary plugins.
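+As a quick sketch of how the two installations can be combined (assuming here, purely as an illustration, that your workflows will ingest metadata from MySQL; swap the `mysql` extra for the plugins matching your own sources):
+
+```commandline
+# APIs module plus the ingestion package with the connector plugins you need
+pip3 install "openmetadata-airflow-managed-apis" "openmetadata-ingestion[mysql]"
+```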
They are necessary because even if we install the APIs, the Airflow instance needs to have the +required libraries to connect to each source. + +On top of this installation, you'll need to follow these steps: + +1. Download the latest `openmetadata-airflow-apis-plugins` release from [here](https://github.com/open-metadata/OpenMetadata/releases). +2. Untar it under the `{AIRFLOW_HOME}` directory. This will create and set up a `plugins` directory under `{AIRFLOW_HOME}`. +3. `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`. +4. `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`. +5. (Re)start the Airflow webserver and scheduler. + +### Configure in the OpenMetadata Server + +After installing the Airflow APIs, you will need to update your OpenMetadata Server. + +The OpenMetadata server takes all of its configuration from a YAML file, which you can find in our [repo](https://github.com/open-metadata/OpenMetadata/tree/main/conf). In +`openmetadata.yaml`, update the `airflowConfiguration` section accordingly. + +```yaml +[...] + +airflowConfiguration: + apiEndpoint: http://${AIRFLOW_HOST:-localhost}:${AIRFLOW_PORT:-8080} + username: ${AIRFLOW_USERNAME:-admin} + password: ${AIRFLOW_PASSWORD:-admin} + metadataApiEndpoint: http://${SERVER_HOST:-localhost}:${SERVER_PORT:-8585}/api + authProvider: "no-auth" +``` + +Note that we also support picking up these values from environment variables, so you can safely set them on the +machine hosting the OpenMetadata server. + +If you are running OpenMetadata with security enabled, you can take a look at the server +configuration for each security mode: + +- Configure Auth0 SSO to access the UI and APIs +- Configure Azure SSO to access the UI and APIs +- Configure a Custom OIDC SSO to access the UI and APIs +- Configure Google SSO to access the UI and APIs +- Configure Okta SSO to access the UI and APIs diff --git a/openmetadata-docs/content/menu.md b/openmetadata-docs/content/menu.md index 353bdf83133..8ff3f8b9fab 100644 --- a/openmetadata-docs/content/menu.md +++ b/openmetadata-docs/content/menu.md @@ -316,6 +316,26 @@ site_menu: - category: OpenMetadata / Connectors / Pipeline url: /openmetadata/connectors/pipeline + - category: OpenMetadata / Connectors / Pipeline / Airflow + url: /openmetadata/connectors/pipeline/airflow + - category: OpenMetadata / Connectors / Pipeline / Airflow / CLI + url: /openmetadata/connectors/pipeline/airflow/cli + - category: OpenMetadata / Connectors / Pipeline / Airflow / GCS Composer + url: /openmetadata/connectors/pipeline/airflow/gcs + - category: OpenMetadata / Connectors / Pipeline / Airflow / Lineage Backend + url: /openmetadata/connectors/pipeline/airflow/lineage-backend + - category: OpenMetadata / Connectors / Pipeline / Airbyte + url: /openmetadata/connectors/pipeline/airbyte + - category: OpenMetadata / Connectors / Pipeline / Airbyte / Airflow + url: /openmetadata/connectors/pipeline/airbyte/airflow + - category: OpenMetadata / Connectors / Pipeline / Airbyte / CLI + url: /openmetadata/connectors/pipeline/airbyte/cli + - category: OpenMetadata / Connectors / Pipeline / Glue + url: /openmetadata/connectors/pipeline/glue + - category: OpenMetadata / Connectors / Pipeline / Glue / Airflow + url: /openmetadata/connectors/pipeline/glue/airflow + - category: OpenMetadata / Connectors / Pipeline / Glue / CLI + url: /openmetadata/connectors/pipeline/glue/cli - category: OpenMetadata / Ingestion url: /openmetadata/ingestion diff --git 
a/openmetadata-docs/content/openmetadata/connectors/database/glue/airflow.md b/openmetadata-docs/content/openmetadata/connectors/database/glue/airflow.md index 287757a969d..06b5c0bbcc6 100644 --- a/openmetadata-docs/content/openmetadata/connectors/database/glue/airflow.md +++ b/openmetadata-docs/content/openmetadata/connectors/database/glue/airflow.md @@ -17,8 +17,6 @@ slug: /openmetadata/connectors/database/glue/airflow - **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token. - **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL. - **storageServiceName**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue. -- **pipelineServiceName**: OpenMetadata associates each pipeline entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified. -- **database**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities. - **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs. - **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs. - In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"` diff --git a/openmetadata-docs/content/openmetadata/connectors/database/glue/cli.md b/openmetadata-docs/content/openmetadata/connectors/database/glue/cli.md index e59cdf86794..8b77afa95a4 100644 --- a/openmetadata-docs/content/openmetadata/connectors/database/glue/cli.md +++ b/openmetadata-docs/content/openmetadata/connectors/database/glue/cli.md @@ -17,8 +17,6 @@ slug: /openmetadata/connectors/database/glue/cli - **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token. - **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL. - **storageServiceName**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue. -- **pipelineServiceName**: OpenMetadata associates each pipeline entity with a unique namespace. 
To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified. -- **database**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities. - **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs. - **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs. - In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"` diff --git a/openmetadata-docs/content/openmetadata/connectors/database/glue/index.md b/openmetadata-docs/content/openmetadata/connectors/database/glue/index.md index e3569fad9d0..d5600cf3df1 100644 --- a/openmetadata-docs/content/openmetadata/connectors/database/glue/index.md +++ b/openmetadata-docs/content/openmetadata/connectors/database/glue/index.md @@ -16,10 +16,6 @@ slug: /openmetadata/connectors/database/glue - **AWS Region**: Enter the location of the amazon cluster that your data and account are associated with. - **AWS Session Token (optional)**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token. - **Endpoint URL (optional)**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL. -- **Database (optional)**: The database of the data source is an optional parameter if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. -- **Storage Service Name**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue. -- **Pipeline Service Name**: OpenMetadata associates each pipeline entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified. -- **Database (Optional)**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities. - **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs. 
- **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs. - In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"` diff --git a/openmetadata-docs/content/openmetadata/connectors/index.md b/openmetadata-docs/content/openmetadata/connectors/index.md index 3d4c417c512..e71977f26a9 100644 --- a/openmetadata-docs/content/openmetadata/connectors/index.md +++ b/openmetadata-docs/content/openmetadata/connectors/index.md @@ -42,3 +42,9 @@ OpenMetadata can extract metadata from the following list of connectors: ## Messaging Services - [Kafka](/openmetadata/connectors/messaging/kafka) + +## Pipeline Services + +- [Airbyte](/openmetadata/connectors/pipeline/airbyte) +- [Airflow](/openmetadata/connectors/pipeline/airflow) +- [Glue](/openmetadata/connectors/pipeline/glue) diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/airflow.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/airflow.md new file mode 100644 index 00000000000..b84ae5a49d8 --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/airflow.md @@ -0,0 +1,16 @@ +--- +title: Run Airbyte Connector using Airflow SDK +slug: /openmetadata/connectors/pipeline/airbyte/airflow +--- + + + + + + + +

Source Configuration - Service Connection

+ +- **hostPort**: Pipeline Service Management UI URL. + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/cli.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/cli.md new file mode 100644 index 00000000000..cf4b99e6387 --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/cli.md @@ -0,0 +1,16 @@ +--- +title: Run Airbyte Connector using the CLI +slug: /openmetadata/connectors/pipeline/airbyte/cli +--- + + + + + + + +

Source Configuration - Service Connection

+ +- **hostPort**: Pipeline Service Management UI URL. + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/index.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/index.md new file mode 100644 index 00000000000..ba2fe3b1f4c --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airbyte/index.md @@ -0,0 +1,18 @@ +--- +title: Airbyte +slug: /openmetadata/connectors/pipeline/airbyte +--- + + + + + + + +

Connection Options

+ +- **Host and Port**: Pipeline Service Management UI URL. + + + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/cli.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/cli.md new file mode 100644 index 00000000000..9b4054fcb0e --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/cli.md @@ -0,0 +1,27 @@ +--- +title: Run Airflow Connector using the CLI +slug: /openmetadata/connectors/pipeline/airflow/cli +--- + + + + + + + +

Source Configuration - Service Connection

+ +- **hostPort**: URL to the Airflow instance. +- **numberOfStatus**: Number of past statuses to look back to on each ingestion run (e.g., past executions of a DAG). +- **connection**: Airflow metadata database connection. See + these [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html) + for supported backends. + +For `connection`, we support the following options: + +- `backend`: Should not be used from the UI. This is only applicable when ingesting Airflow metadata locally by running + the ingestion from a DAG. It will use the current Airflow SQLAlchemy connection to extract the data. +- `MySQL`, `Postgres`, `MSSQL` and `SQLite`: Pass the required credentials to reach each of these services. We will + create a connection to the specified database and read the Airflow data from there. + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/gcs.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/gcs.md new file mode 100644 index 00000000000..3b2bb84eaaa --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/gcs.md @@ -0,0 +1,127 @@ +--- +title: Extract GCS Composer Metadata +slug: /openmetadata/connectors/pipeline/airflow/gcs +--- + +# Extract GCS Composer Metadata + + + +This approach has been tested against Airflow 2.1.4. If you have any issues or questions, +please do not hesitate to reach out! + + + +The most convenient way to extract metadata out of GCS Composer is to create a DAG directly in Composer +that will handle the connection to the metadata database automatically and push the contents +to your OpenMetadata server. + +## Install the Requirements + +In your environment you will need to install the following packages: + +- `openmetadata-ingestion==0.11.1` +- `sqlalchemy==1.4.27`: This is needed to align the OpenMetadata version with Composer's internal requirements. +- `flask-appbuilder==3.4.5`: Again, this is just an alignment of versions so that `openmetadata-ingestion` can + work with GCS Composer internals. + +## Prepare the DAG! + +Note that this DAG is a regular connector DAG, just using the Airflow service with the `Backend` connection. + +As an example of a DAG pushing data to OpenMetadata under Google SSO, we could have: + +```python +""" +This DAG can be used directly in your Airflow instance after installing +the `openmetadata-ingestion` package. Its purpose +is to connect to the underlying database, retrieve the information +and push it to OpenMetadata. 
+""" +from datetime import timedelta + +import yaml +from airflow import DAG + +try: + from airflow.operators.python import PythonOperator +except ModuleNotFoundError: + from airflow.operators.python_operator import PythonOperator + +from airflow.utils.dates import days_ago + +from metadata.ingestion.api.workflow import Workflow + +default_args = { + "owner": "user_name", + "email": ["username@org.com"], + "email_on_failure": False, + "retries": 3, + "retry_delay": timedelta(minutes=5), + "execution_timeout": timedelta(minutes=60), +} + +config = """ +source: + type: airflow + serviceName: airflow_gcs_composer + serviceConnection: + config: + type: Airflow + hostPort: http://localhost:8080 + numberOfStatus: 10 + connection: + type: Backend + sourceConfig: + config: + type: PipelineMetadata +sink: + type: metadata-rest + config: {} +workflowConfig: + loggerLevel: INFO + openMetadataServerConfig: + hostPort: https://sandbox-beta.open-metadata.org/api + authProvider: google + securityConfig: + credentials: + gcsConfig: + type: service_account + projectId: ... + privateKeyId: ... + privateKey: | + -----BEGIN PRIVATE KEY----- + ... + -----END PRIVATE KEY----- + clientEmail: ... + clientId: ... + authUri: https://accounts.google.com/o/oauth2/auth + tokenUri: https://oauth2.googleapis.com/token + authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs + clientX509CertUrl: ... +""" + + +def metadata_ingestion_workflow(): + workflow_config = yaml.safe_load(config) + workflow = Workflow.create(workflow_config) + workflow.execute() + workflow.raise_from_status() + workflow.print_status() + workflow.stop() + + +with DAG( + "airflow_metadata_extraction", + default_args=default_args, + description="An example DAG which pushes Airflow data to OM", + start_date=days_ago(1), + is_paused_upon_creation=True, + schedule_interval="*/5 * * * *", + catchup=False, +) as dag: + ingest_task = PythonOperator( + task_id="ingest_using_recipe", + python_callable=metadata_ingestion_workflow, + ) +``` diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/index.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/index.md new file mode 100644 index 00000000000..a5003168639 --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/index.md @@ -0,0 +1,28 @@ +--- +title: Airflow +slug: /openmetadata/connectors/pipeline/airflow +--- + + + + + + + +

Connection Options

+ +- **Host and Port**: URL to the Airflow instance. +- **Number of Status**: Number of past statuses to look back to on each ingestion run (e.g., past executions of a DAG). +- **Connection**: Airflow metadata database connection. See these [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html) + for supported backends. + +For `connection`, we support the following options: + +- `backend`: Should not be used from the UI. This is only applicable when ingesting Airflow metadata locally + by running the ingestion from a DAG. It will use the current Airflow SQLAlchemy connection to extract the data. +- `MySQL`, `Postgres`, `MSSQL` and `SQLite`: Pass the required credentials to reach each of these services. We + will create a connection to the specified database and read the Airflow data from there. + + + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/lineage-backend.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/lineage-backend.md new file mode 100644 index 00000000000..405e9ba7f6b --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/airflow/lineage-backend.md @@ -0,0 +1,111 @@ +--- +title: Airflow Lineage Backend +slug: /openmetadata/connectors/pipeline/airflow/lineage-backend +--- + +# Airflow Lineage Backend + +Learn how to capture lineage information directly from Airflow DAGs using the OpenMetadata Lineage Backend. + +## Introduction + +Obtaining metadata should be as simple as possible. Not only that, we want developers to be able to keep using their +tools without any major changes. + +We can directly use [Airflow code](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html#lineage-backend) +to help us track data lineage. What we want to achieve through this backend is the ability to link OpenMetadata Table Entities with the pipelines that have those instances as inputs or outputs. + +Being able to control and monitor these relationships can play a major role in helping discover and communicate issues +to your company's data practitioners and stakeholders. + +This document will guide you through the installation, configuration, and internals of the process to help you unlock as +much value as possible from within your Airflow pipelines. + +## Quickstart + +### Installation + +The Lineage Backend can be directly installed on the Airflow instances as part of the usual OpenMetadata Python +distribution: + +```commandline +pip3 install "openmetadata-ingestion[airflow-container]" +``` + +### Adding Lineage Config + +After the installation, we need to update the Airflow configuration. This can be done following this example in +`airflow.cfg`: + +```ini +[lineage] +backend = airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend +airflow_service_name = local_airflow +openmetadata_api_endpoint = http://localhost:8585/api +auth_provider_type = no-auth +``` + +Or we can directly provide environment variables: + +```env +AIRFLOW__LINEAGE__BACKEND="airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend" +AIRFLOW__LINEAGE__AIRFLOW_SERVICE_NAME="local_airflow" +AIRFLOW__LINEAGE__OPENMETADATA_API_ENDPOINT="http://localhost:8585/api" +AIRFLOW__LINEAGE__AUTH_PROVIDER_TYPE="no-auth" +``` + +We can choose the option that best adapts to our current architecture. Find more information on Airflow configurations +[here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html). 
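+As a minimal sketch of the environment-variable route when Airflow runs under Docker Compose (assuming a layout like the official Airflow compose file, where all Airflow services share one environment block; the `openmetadata-server` hostname is illustrative):
+
+```yaml
+# docker-compose.yaml (fragment): the same lineage settings injected as
+# environment variables into every Airflow container.
+x-airflow-common:
+  &airflow-common
+  environment:
+    &airflow-common-env
+    AIRFLOW__LINEAGE__BACKEND: airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend
+    AIRFLOW__LINEAGE__AIRFLOW_SERVICE_NAME: local_airflow
+    AIRFLOW__LINEAGE__OPENMETADATA_API_ENDPOINT: http://openmetadata-server:8585/api
+    AIRFLOW__LINEAGE__AUTH_PROVIDER_TYPE: no-auth
+```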
+ +In the following sections, we'll show how to adapt our pipelines to help us build the lineage information. + +## Lineage Backend + +You can find the source code [here](https://github.com/open-metadata/OpenMetadata/tree/main/ingestion/src/airflow_provider_openmetadata). + +### Pipeline Service + +The backend will look for a Pipeline Service Entity with the name specified in the configuration under +`airflow_service_name`. If it cannot find the instance, it will create one based on the following information: + +- `airflow_service_name` as the name. If not provided, the default value will be `airflow`. +- It will use the `webserver` base URL as the URL of the service. + +### Pipeline Entity + +Each DAG processed by the backend will be created or updated as a Pipeline Entity linked to the above Pipeline Service. + +We are going to extract the task information and add it to the Pipeline task property list. Then, given a +DAG with tasks defined as in the following example: + +```python +t1 >> [t2, t3] +``` + +we will capture this information as well, showing that the DAG contains three tasks t1, t2 and t3, with t2 and +t3 as downstream tasks of t1. + +### Adding Lineage + +Airflow [Operators](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/baseoperator/index.html) +contain the attributes `inlets` and `outlets`. When creating our tasks, we can pass either of these two +parameters as follows: + +```python +BashOperator( + task_id='print_date', + bash_command='date', + outlets={ + "tables": ["service.database.schema.table"] + } +) +``` + +Note how in this example we are defining a Python `dict` with the key `tables` and a `list` as its value. +This list should contain the FQN of tables ingested through any of our connectors or APIs. + +When each task is processed, we will use the OpenMetadata client to add the lineage information (upstream for inlets +and downstream for outlets) between the Pipeline and Table Entities. + +It is important to get the naming right, as we will fetch the Table Entity by its FQN. If no lineage information is +specified, we will just ingest the Pipeline Entity without adding further information. diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/airflow.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/airflow.md new file mode 100644 index 00000000000..3f469098400 --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/airflow.md @@ -0,0 +1,23 @@ +--- +title: Run Glue Connector using Airflow SDK +slug: /openmetadata/connectors/pipeline/glue/airflow +--- + + + + + + + +

Source Configuration - Service Connection

+ +- **awsAccessKeyId**: Enter your secure access key ID for your Glue connection. The specified key ID should be + authorized to read all databases you want to include in the metadata ingestion workflow. +- **awsSecretAccessKey**: Enter the Secret Access Key (the secret key paired with the key ID above). +- **awsRegion**: Enter the location of the Amazon cluster that your data and account are associated with. +- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary + session token. +- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You + may override this behavior by entering a value for the endpoint URL. + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/cli.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/cli.md new file mode 100644 index 00000000000..ab145302d03 --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/cli.md @@ -0,0 +1,23 @@ +--- +title: Run Glue Connector using the CLI +slug: /openmetadata/connectors/pipeline/glue/cli +--- + + + + + + + +

Source Configuration - Service Connection

+ +- **awsAccessKeyId**: Enter your secure access key ID for your Glue connection. The specified key ID should be + authorized to read all databases you want to include in the metadata ingestion workflow. +- **awsSecretAccessKey**: Enter the Secret Access Key (the secret key paired with the key ID above). +- **awsRegion**: Enter the location of the Amazon cluster that your data and account are associated with. +- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary + session token. +- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You + may override this behavior by entering a value for the endpoint URL. + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/index.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/index.md new file mode 100644 index 00000000000..6430ef8bd7d --- /dev/null +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/glue/index.md @@ -0,0 +1,25 @@ +--- +title: Glue +slug: /openmetadata/connectors/pipeline/glue +--- + + + + + + + +

Connection Options

+ +- **AWS Access Key ID**: Enter your secure access key ID for your Glue connection. The specified key ID should be + authorized to read all databases you want to include in the metadata ingestion workflow. +- **AWS Secret Access Key**: Enter the Secret Access Key (the secret key paired with the key ID above). +- **AWS Region**: Enter the location of the Amazon cluster that your data and account are associated with. +- **AWS Session Token (optional)**: The AWS session token is an optional parameter. If you want, enter the details of + your temporary session token. +- **Endpoint URL (optional)**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the + region. You may override this behavior by entering a value for the endpoint URL. + + + + diff --git a/openmetadata-docs/content/openmetadata/connectors/pipeline/index.md b/openmetadata-docs/content/openmetadata/connectors/pipeline/index.md index f304da8843f..c0e142239ae 100644 --- a/openmetadata-docs/content/openmetadata/connectors/pipeline/index.md +++ b/openmetadata-docs/content/openmetadata/connectors/pipeline/index.md @@ -4,3 +4,7 @@ slug: /openmetadata/connectors/pipeline --- # Pipeline Services + +- [Airbyte](/openmetadata/connectors/pipeline/airbyte) +- [Airflow](/openmetadata/connectors/pipeline/airflow) +- [Glue](/openmetadata/connectors/pipeline/glue) diff --git a/openmetadata-docs/images/openmetadata/connectors/airbyte/add-new-service.png b/openmetadata-docs/images/openmetadata/connectors/airbyte/add-new-service.png new file mode 100644 index 00000000000..4300031b33f Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airbyte/add-new-service.png differ diff --git a/openmetadata-docs/images/openmetadata/connectors/airbyte/select-service.png b/openmetadata-docs/images/openmetadata/connectors/airbyte/select-service.png new file mode 100644 index 00000000000..b4af0196b57 Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airbyte/select-service.png differ diff --git a/openmetadata-docs/images/openmetadata/connectors/airbyte/service-connection.png b/openmetadata-docs/images/openmetadata/connectors/airbyte/service-connection.png new file mode 100644 index 00000000000..b1267ada36c Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airbyte/service-connection.png differ diff --git a/openmetadata-docs/images/openmetadata/connectors/airflow/add-new-service.png b/openmetadata-docs/images/openmetadata/connectors/airflow/add-new-service.png new file mode 100644 index 00000000000..772e0021b40 Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airflow/add-new-service.png differ diff --git a/openmetadata-docs/images/openmetadata/connectors/airflow/select-service.png b/openmetadata-docs/images/openmetadata/connectors/airflow/select-service.png new file mode 100644 index 00000000000..f510df63a2c Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airflow/select-service.png differ diff --git a/openmetadata-docs/images/openmetadata/connectors/airflow/service-connection.png b/openmetadata-docs/images/openmetadata/connectors/airflow/service-connection.png new file mode 100644 index 00000000000..29f82ffbab4 Binary files /dev/null and b/openmetadata-docs/images/openmetadata/connectors/airflow/service-connection.png differ diff --git a/openmetadata-docs/ingestion/connectors/airbyte/ingestion.yaml b/openmetadata-docs/ingestion/connectors/airbyte/ingestion.yaml new file mode 100644 index 
00000000000..dbe1db1b088 --- /dev/null +++ b/openmetadata-docs/ingestion/connectors/airbyte/ingestion.yaml @@ -0,0 +1,17 @@ +source: + type: airbyte + serviceName: airbyte_source + serviceConnection: + config: + type: Airbyte + hostPort: http://localhost:8000 + sourceConfig: + config: + type: PipelineMetadata +sink: + type: metadata-rest + config: { } +workflowConfig: + openMetadataServerConfig: + hostPort: http://localhost:8585/api + authProvider: no-auth diff --git a/openmetadata-docs/ingestion/connectors/airflow/ingestion.yaml b/openmetadata-docs/ingestion/connectors/airflow/ingestion.yaml new file mode 100644 index 00000000000..eab2197fea6 --- /dev/null +++ b/openmetadata-docs/ingestion/connectors/airflow/ingestion.yaml @@ -0,0 +1,25 @@ +source: + type: airflow + serviceName: airflow_source + serviceConnection: + config: + type: Airflow + hostPort: http://localhost:8080 + numberOfStatus: 10 + connection: + type: Mysql + username: airflow_user + password: airflow_pass + databaseSchema: airflow_db + hostPort: localhost:3306 + sourceConfig: + config: + type: PipelineMetadata +sink: + type: metadata-rest + config: { } +workflowConfig: + loggerLevel: DEBUG + openMetadataServerConfig: + hostPort: http://localhost:8585/api + authProvider: no-auth
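For reference, and assuming the `openmetadata-ingestion` package with the relevant plugins is installed on the host, sample workflow files like the two above would typically be run with the `metadata` CLI:

```commandline
# Point the CLI at the workflow definition shown above
metadata ingest -c /path/to/ingestion.yaml
```

The same YAML can instead be loaded into a `Workflow` inside a `PythonOperator`, as in the GCS Composer example earlier in this change.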