Docs - Airflow and pipeline connectors (#6008)

* Airflow and pipeline connectors

* Revert dockerfile
Pere Miquel Brull 2022-07-11 16:54:40 +02:00 committed by GitHub
parent 6c8adb8014
commit 321cb4626c
25 changed files with 634 additions and 11 deletions


@ -3,11 +3,156 @@ title: Airflow Deployment
slug: /deployment/airflow
---
# Airflow
This section will show you how to configure your Airflow instance to run the OpenMetadata workflows.
Moreover, we will show the required steps to connect your Airflow instance to the OpenMetadata server
so that you can deploy workflows to it directly from the OpenMetadata UI.
1. If you do not have an Airflow service up and running on your platform, we provide a custom
[Docker](https://hub.docker.com/r/openmetadata/ingestion) image, which already contains the OpenMetadata ingestion
packages and custom [Airflow APIs](https://github.com/open-metadata/openmetadata-airflow-apis) to
deploy Workflows from the UI as well.
2. If you already have Airflow up and running and want to use it for the metadata ingestion, you will
need to install the ingestion modules to the host. You can find more information on how to do this
in the Custom Airflow Installation section.
## Custom Airflow Installation
If you already have an Airflow instance up and running, you might want to reuse it to host the metadata workflows as
well. Here we will guide you on the different aspects to consider when configuring an existing Airflow.
There are three different angles here:
1. Installing the ingestion modules directly on the host to enable the [Airflow Lineage Backend](/openmetadata/connectors/pipeline/airflow/lineage-backend).
2. Installing connector modules on the host to run specific workflows.
3. Installing the Airflow APIs to enable the workflow deployment through the UI.
Depending on what you wish to use, you might just need some of these installations. Note that the installation
commands shown below need to be run on the Airflow instances.
### Airflow Lineage Backend
Goals:
- Ingest DAGs and Tasks as Pipeline Entities when they run.
- Track DAG and Task status.
- Document lineage as code directly on the DAG definition and ingest it when the DAGs run.
Get the necessary information to install and extract metadata from the Lineage Backend [here](/openmetadata/connectors/pipeline/airflow/lineage-backend).
### Connector Modules
Goal:
- Ingest metadata from specific sources.
The approach we currently follow is to prepare the metadata ingestion DAGs as `PythonOperator`s. This means that
the packages need to be present on the Airflow instances.
You will need to install:
```commandline
pip3 install "openmetadata-ingestion[<connector-name>]"
```
And then run the DAG as explained in each [Connector](/openmetadata/connectors).
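As a reference, here is a minimal sketch of such a DAG using the Airbyte connector (installed via
`pip3 install "openmetadata-ingestion[airbyte]"`). The service name, hosts and schedule are placeholders,
and the recipe follows the same `Workflow` pattern used throughout these docs:

```python
from datetime import timedelta

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import
from airflow.utils.dates import days_ago

from metadata.ingestion.api.workflow import Workflow

# Placeholder recipe: point the hostPort values to your Airbyte
# instance and OpenMetadata server.
CONFIG = """
source:
  type: airbyte
  serviceName: airbyte_source
  serviceConnection:
    config:
      type: Airbyte
      hostPort: http://localhost:8000
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
"""


def metadata_ingestion_workflow():
    # Build and run the ingestion workflow from the YAML recipe above.
    workflow = Workflow.create(yaml.safe_load(CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "airbyte_metadata_ingestion",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_airbyte_metadata",
        python_callable=metadata_ingestion_workflow,
    )
```

The YAML recipe is embedded as a string so the DAG stays self-contained; you can equally load it from a file shipped
alongside your DAGs.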
### Airflow APIs
Goal:
- Deploy metadata ingestion workflows directly from the UI.
This process consists of three steps:
1. Install the APIs module,
2. Install the required plugins, and
3. Configure the OpenMetadata server.
The goal of this module is to add some HTTP endpoints that the UI calls for deploying the Airflow DAGs.
The first step can be achieved by running:
```commandline
pip3 install "openmetadata-airflow-managed-apis"
```
Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the
necessary plugins. They are necessary because even if we install the APIs, the Airflow instance needs to have the
required libraries to connect to each source.
On top of this installation, you'll need to follow these steps:
1. Download the latest `openmetadata-airflow-apis-plugins` release from [here](https://github.com/open-metadata/OpenMetadata/releases).
2. Untar it under the `{AIRFLOW_HOME}` directory. This will create and set up a `plugins` directory under `{AIRFLOW_HOME}`.
3. `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`.
4. `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`.
5. (Re)start the Airflow webserver and scheduler.
### Configure in the OpenMetadata Server
After installing the Airflow APIs, you will need to update your OpenMetadata Server.
The OpenMetadata server takes all its configurations from a YAML file. You can find them in our [repo](https://github.com/open-metadata/OpenMetadata/tree/main/conf). In
`openmetadata.yaml`, update the `airflowConfiguration` section accordingly.
```yaml
[...]
airflowConfiguration:
  apiEndpoint: http://${AIRFLOW_HOST:-localhost}:${AIRFLOW_PORT:-8080}
  username: ${AIRFLOW_USERNAME:-admin}
  password: ${AIRFLOW_PASSWORD:-admin}
  metadataApiEndpoint: http://${SERVER_HOST:-localhost}:${SERVER_PORT:-8585}/api
  authProvider: "no-auth"
```
Note that we also support picking up these values from environment variables, so you can safely set that up in the
machine hosting the OpenMetadata server.
If you are running OpenMetadata with the security enabled, you can take a look at the server
configuration for each security mode:
<InlineCalloutContainer>
<InlineCallout
color="violet-70"
bold="Auth0 SSO"
icon="add_moderator"
href="/deployment/security/auth0"
>
Configure Auth0 SSO to access the UI and APIs
</InlineCallout>
<InlineCallout
color="violet-70"
bold="Azure SSO"
icon="add_moderator"
href="/deployment/security/azure"
>
Configure Azure SSO to access the UI and APIs
</InlineCallout>
<InlineCallout
color="violet-70"
bold="Custom OIDC SSO"
icon="add_moderator"
href="/deployment/security/custom-oidc"
>
Configure a Custom OIDC SSO to access the UI and APIs
</InlineCallout>
<InlineCallout
color="violet-70"
bold="Google SSO"
icon="add_moderator"
href="/deployment/security/google"
>
Configure Google SSO to access the UI and APIs
</InlineCallout>
<InlineCallout
color="violet-70"
bold="Okta SSO"
icon="add_moderator"
href="/deployment/security/okta"
>
Configure Okta SSO to access the UI and APIs
</InlineCallout>
</InlineCalloutContainer>


@ -316,6 +316,26 @@ site_menu:
  - category: OpenMetadata / Connectors / Pipeline
    url: /openmetadata/connectors/pipeline
  - category: OpenMetadata / Connectors / Pipeline / Airflow
    url: /openmetadata/connectors/pipeline/airflow
  - category: OpenMetadata / Connectors / Pipeline / Airflow / CLI
    url: /openmetadata/connectors/pipeline/airflow/cli
  - category: OpenMetadata / Connectors / Pipeline / Airflow / GCS Composer
    url: /openmetadata/connectors/pipeline/airflow/gcs
  - category: OpenMetadata / Connectors / Pipeline / Airflow / Lineage Backend
    url: /openmetadata/connectors/pipeline/airflow/lineage-backend
  - category: OpenMetadata / Connectors / Pipeline / Airbyte
    url: /openmetadata/connectors/pipeline/airbyte
  - category: OpenMetadata / Connectors / Pipeline / Airbyte / Airflow
    url: /openmetadata/connectors/pipeline/airbyte/airflow
  - category: OpenMetadata / Connectors / Pipeline / Airbyte / CLI
    url: /openmetadata/connectors/pipeline/airbyte/cli
  - category: OpenMetadata / Connectors / Pipeline / Glue
    url: /openmetadata/connectors/pipeline/glue
  - category: OpenMetadata / Connectors / Pipeline / Glue / Airflow
    url: /openmetadata/connectors/pipeline/glue/airflow
  - category: OpenMetadata / Connectors / Pipeline / Glue / CLI
    url: /openmetadata/connectors/pipeline/glue/cli
  - category: OpenMetadata / Ingestion
    url: /openmetadata/ingestion


@ -17,8 +17,6 @@ slug: /openmetadata/connectors/database/glue/airflow
- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token.
- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL.
- **storageServiceName**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue.
- **pipelineServiceName**: OpenMetadata associates each pipeline entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified.
- **database**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities.
- **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"`


@ -17,8 +17,6 @@ slug: /openmetadata/connectors/database/glue/cli
- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token.
- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL.
- **storageServiceName**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue.
- **pipelineServiceName**: OpenMetadata associates each pipeline entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified.
- **database**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities.
- **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"`


@ -16,10 +16,6 @@ slug: /openmetadata/connectors/database/glue
- **AWS Region**: Enter the location of the Amazon cluster that your data and account are associated with.
- **AWS Session Token (optional)**: The AWS session token is an optional parameter. If you want, enter the details of your temporary session token.
- **Endpoint URL (optional)**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You may override this behavior by entering a value to the endpoint URL.
- **Storage Service Name**: OpenMetadata associates objects for each object store entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for the object stores you are using through AWS Glue.
- **Pipeline Service Name**: OpenMetadata associates each pipeline entity with a unique namespace. To ensure your data is well-organized and findable, choose a unique name by which you would like to identify the metadata for pipelines you are using through AWS Glue. When this metadata has been ingested you will find it in the OpenMetadata UI pipelines view under the name you have specified.
- **Database (Optional)**: The database of the data source is an optional parameter, if you would like to restrict the metadata reading to a single database. If left blank, OpenMetadata ingestion attempts to scan all the databases. For Glue, we use the Catalog ID as the database when mapping Glue metadata to OpenMetadata Entities.
- **Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- **Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Glue during the connection. These details must be added as Key-Value pairs.
- In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"`


@ -42,3 +42,9 @@ OpenMetadata can extract metadata from the following list of connectors:
## Messaging Services
- [Kafka](/openmetadata/connectors/messaging/kafka)
## Pipeline Services
- [Airbyte](/openmetadata/connectors/pipeline/airbyte)
- [Airflow](/openmetadata/connectors/pipeline/airflow)
- [Glue](/openmetadata/connectors/pipeline/glue)


@ -0,0 +1,16 @@
---
title: Run Airbyte Connector using Airflow SDK
slug: /openmetadata/connectors/pipeline/airbyte/airflow
---
<ConnectorIntro connector="Airbyte" goal="Airflow"/>
<Requirements />
<MetadataIngestionServiceDev service="pipeline" connector="Airbyte" goal="Airflow"/>
<h4>Source Configuration - Service Connection</h4>
- **hostPort**: Pipeline Service Management UI URL
<MetadataIngestionConfig service="pipeline" connector="Airbyte" goal="Airflow" />


@ -0,0 +1,16 @@
---
title: Run Airbyte Connector using the CLI
slug: /openmetadata/connectors/pipeline/airbyte/cli
---
<ConnectorIntro connector="Airbyte" goal="CLI"/>
<Requirements />
<MetadataIngestionServiceDev service="pipeline" connector="Airbyte" goal="CLI"/>
<h4>Source Configuration - Service Connection</h4>
- **hostPort**: Pipeline Service Management UI URL.
<MetadataIngestionConfig service="pipeline" connector="Airbyte" goal="CLI" />


@ -0,0 +1,18 @@
---
title: Airbyte
slug: /openmetadata/connectors/pipeline/airbyte
---
<ConnectorIntro service="pipeline" connector="Airbyte"/>
<Requirements />
<MetadataIngestionService connector="Airbyte"/>
<h4>Connection Options</h4>
- **Host and Port**: Pipeline Service Management UI URL
<IngestionScheduleAndDeploy />
<ConnectorOutro connector="Airbyte" />


@ -0,0 +1,27 @@
---
title: Run Airflow Connector using the CLI
slug: /openmetadata/connectors/pipeline/airflow/cli
---
<ConnectorIntro connector="Airflow" goal="CLI"/>
<Requirements />
<MetadataIngestionServiceDev service="pipeline" connector="Airflow" goal="CLI"/>
<h4>Source Configuration - Service Connection</h4>
- **hostPort**: URL to the Airflow instance.
- **numberOfStatus**: Number of past statuses to look back at in every ingestion (e.g., past executions of a DAG).
- **connection**: Airflow metadata database connection. See
these [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html)
for supported backends.
For `connection` we support the following options:
- `backend`: Should not be used from the UI. This is only applicable when ingesting Airflow metadata locally by running
  the ingestion from a DAG. It will use the current Airflow SQLAlchemy connection to extract the data.
- `MySQL`, `Postgres`, `MSSQL` and `SQLite`: Pass the required credentials to reach each of these services. We will
  create a connection to the specified database and read the Airflow data from there (see the sketch after this list).
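A minimal sketch of running this recipe programmatically with the `Workflow` class, using the MySQL backend option with
placeholder credentials that mirror the sample configuration shipped with these docs:

```python
import yaml

from metadata.ingestion.api.workflow import Workflow

# Placeholder credentials for the Airflow metadata database.
CONFIG = """
source:
  type: airflow
  serviceName: airflow_source
  serviceConnection:
    config:
      type: Airflow
      hostPort: http://localhost:8080
      numberOfStatus: 10
      connection:
        type: Mysql
        username: airflow_user
        password: airflow_pass
        databaseSchema: airflow_db
        hostPort: localhost:3306
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
"""

if __name__ == "__main__":
    # Build, run and report on the ingestion workflow.
    workflow = Workflow.create(yaml.safe_load(CONFIG))
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()
```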
<MetadataIngestionConfig service="pipeline" connector="Airflow" goal="CLI" />


@ -0,0 +1,127 @@
---
title: Extract GCS Composer Metadata
slug: /openmetadata/connectors/pipeline/airflow/gcs
---
# Extract GCS Composer Metadata
<Note>
This approach has been tested against Airflow 2.1.4. If you have any issues or questions,
please do not hesitate to reach out!
</Note>
The most convenient way to extract metadata from GCS Composer is to create a DAG directly in the environment. It will
handle the connection to the metadata database automatically and push the contents to your OpenMetadata server.
## Install the Requirements
In your environment you will need to install the following packages:
- `openmetadata-ingestion==0.11.1`
- `sqlalchemy==1.4.27`: This is needed to align the OpenMetadata version with Composer's internal requirements.
- `flask-appbuilder==3.4.5`: Again, this is just an alignment of versions so that `openmetadata-ingestion` can
work with GCS Composer internals.
## Prepare the DAG!
Note that this DAG is a regular connector DAG, just using the Airflow service with the `Backend` connection.
As an example of a DAG pushing data to OpenMetadata under Google SSO, we could have:
```python
"""
This DAG can be used directly in your Airflow instance after installing
the `openmetadata-ingestion` package. Its purpose
is to connect to the underlying database, retrieve the information
and push it to OpenMetadata.
"""
from datetime import timedelta

import yaml
from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from airflow.utils.dates import days_ago

from metadata.ingestion.api.workflow import Workflow

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60),
}

config = """
source:
  type: airflow
  serviceName: airflow_gcs_composer
  serviceConnection:
    config:
      type: Airflow
      hostPort: http://localhost:8080
      numberOfStatus: 10
      connection:
        type: Backend
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: INFO
  openMetadataServerConfig:
    hostPort: https://sandbox-beta.open-metadata.org/api
    authProvider: google
    securityConfig:
      credentials:
        gcsConfig:
          type: service_account
          projectId: ...
          privateKeyId: ...
          privateKey: |
            -----BEGIN PRIVATE KEY-----
            ...
            -----END PRIVATE KEY-----
          clientEmail: ...
          clientId: ...
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: ...
"""


def metadata_ingestion_workflow():
    workflow_config = yaml.safe_load(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "airflow_metadata_extraction",
    default_args=default_args,
    description="An example DAG which pushes Airflow data to OM",
    start_date=days_ago(1),
    is_paused_upon_creation=True,
    schedule_interval="*/5 * * * *",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )
```


@ -0,0 +1,28 @@
---
title: Airflow
slug: /openmetadata/connectors/pipeline/airflow
---
<ConnectorIntro service="pipeline" connector="Airflow"/>
<Requirements />
<MetadataIngestionService connector="Airflow"/>
<h4>Connection Options</h4>
- **Host and Port**: URL to the Airflow instance.
- **Number of Status**: Number of past statuses to look back at in every ingestion (e.g., past executions of a DAG).
- **Connection**: Airflow metadata database connection. See these [docs](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html)
for supported backends.
For `connection` we support the following options:
- `backend`: Should not be used from the UI. This is only applicable when ingesting Airflow metadata locally
  by running the ingestion from a DAG. It will use the current Airflow SQLAlchemy connection to extract the data.
- `MySQL`, `Postgres`, `MSSQL` and `SQLite`: Pass the required credentials to reach each of these services. We
  will create a connection to the specified database and read the Airflow data from there.
<IngestionScheduleAndDeploy />
<ConnectorOutro connector="Airflow" />


@ -0,0 +1,111 @@
---
title: Airflow Lineage Backend
slug: /openmetadata/connectors/pipeline/airflow/lineage-backend
---
# Airflow Lineage Backend
Learn how to capture lineage information directly from Airflow DAGs using the OpenMetadata Lineage Backend.
## Introduction
Obtaining metadata should be as simple as possible. Not only that, we want developers to be able to keep using their
tools without any major changes.
We can directly use [Airflow code](https://airflow.apache.org/docs/apache-airflow/stable/lineage.html#lineage-backend)
to help us track data lineage. What we want to achieve through this backend is the ability to link OpenMetadata Table Entities and the pipelines that have those instances as inputs or outputs.
Being able to control and monitor these relationships can play a major role in helping discover and communicate issues
to your company data practitioners and stakeholders.
This document will guide you through the installation, configuration and internals of the process to help you unlock as
much value as possible from within your Airflow pipelines.
## Quickstart
### Installation
The Lineage Backend can be installed directly on the Airflow instances as part of the usual OpenMetadata Python
distribution:
```commandline
pip3 install "openmetadata-ingestion[airflow-container]"
```
### Adding Lineage Config
After the installation, we need to update the Airflow configuration. This can be done following this example on
`airflow.cfg`:
```ini
[lineage]
backend = airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend
airflow_service_name = local_airflow
openmetadata_api_endpoint = http://localhost:8585/api
auth_provider_type = no-auth
```
Or we can directly provide environment variables:
```env
AIRFLOW__LINEAGE__BACKEND="airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend"
AIRFLOW__LINEAGE__AIRFLOW_SERVICE_NAME="local_airflow"
AIRFLOW__LINEAGE__OPENMETADATA_API_ENDPOINT="http://localhost:8585/api"
AIRFLOW__LINEAGE__AUTH_PROVIDER_TYPE="no-auth"
```
We can choose the option that best adapts to our current architecture. Find more information on Airflow configurations
[here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html).
In the following sections, we'll show how to adapt our pipelines to help us build the lineage information.
## Lineage Backend
You can find the source code [here](https://github.com/open-metadata/OpenMetadata/tree/main/ingestion/src/airflow_provider_openmetadata).
### Pipeline Service
The backend will look for a Pipeline Service Entity with the name specified in the configuration under
`airflow_service_name`. If it cannot find the instance, it will create one based on the following information:
- `airflow_service_name` as the service name. If not provided, the default value will be `airflow`.
- It will use the `webserver` base URL as the URL of the service.
### Pipeline Entity
Each DAG processed by the backend will be created or updated as a Pipeline Entity linked to the above Pipeline Service.
We are going to extract the task information and add it to the Pipeline task property list. For example, given a
DAG defining the following task dependencies:
```commandline
t1 >> [t2, t3]
```
We will capture this information as well, showing that the DAG contains three tasks t1, t2 and t3, with t2 and t3 as
downstream tasks of t1.
### Adding Lineage
Airflow [Operators](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/baseoperator/index.html)
contain the attributes `inlets` and `outlets`. When creating our tasks, we can pass any of these two
parameters as follows:
```python
BashOperator(
    task_id='print_date',
    bash_command='date',
    outlets={
        "tables": ["service.database.schema.table"]
    }
)
```
Note how in this example we are defining a Python `dict` with the key `tables` and a `list` as its value.
This list should contain the FQN of tables ingested through any of our connectors or APIs.
When each task is processed, we will use the OpenMetadata client to add the lineage information (upstream for inlets
and downstream for outlets) between the Pipeline and Table Entities.
It is important to get the naming right, as we will fetch the Table Entity by its FQN. If no information is specified
in terms of lineage, we will just ingest the Pipeline Entity without adding further information.
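Putting this together, a minimal sketch of a DAG declaring both `inlets` and `outlets` could look like the following;
the table FQNs are placeholders and must match Table Entities already ingested in OpenMetadata:

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    "lineage_example",
    start_date=days_ago(1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The Lineage Backend will register `source_table` as upstream
    # of the Pipeline Entity created for this DAG...
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extracting",
        inlets={"tables": ["service.database.schema.source_table"]},
    )

    # ...and `target_table` as downstream of it.
    load = BashOperator(
        task_id="load",
        bash_command="echo loading",
        outlets={"tables": ["service.database.schema.target_table"]},
    )

    extract >> load
```

When these tasks run, the backend fetches each table by its FQN and links it to the Pipeline Entity created for the DAG.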


@ -0,0 +1,23 @@
---
title: Run Glue Connector using Airflow SDK
slug: /openmetadata/connectors/pipeline/glue/airflow
---
<ConnectorIntro connector="Glue" goal="Airflow"/>
<Requirements />
<MetadataIngestionServiceDev service="pipeline" connector="Glue" goal="Airflow"/>
<h4>Source Configuration - Service Connection</h4>
- **awsAccessKeyId**: Enter your secure access key ID for your Glue connection. The specified key ID should be
authorized to read all databases you want to include in the metadata ingestion workflow.
- **awsSecretAccessKey**: Enter the Secret Access Key (the passcode key pair to the key ID from above).
- **awsRegion**: Enter the location of the Amazon cluster that your data and account are associated with.
- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary
session token.
- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You
may override this behavior by entering a value to the endpoint URL.
<MetadataIngestionConfig service="pipeline" connector="Glue" goal="Airflow" />


@ -0,0 +1,23 @@
---
title: Run Glue Connector using the CLI
slug: /openmetadata/connectors/pipeline/glue/cli
---
<ConnectorIntro connector="Glue" goal="CLI"/>
<Requirements />
<MetadataIngestionServiceDev service="pipeline" connector="Glue" goal="CLI"/>
<h4>Source Configuration - Service Connection</h4>
- **awsAccessKeyId**: Enter your secure access key ID for your Glue connection. The specified key ID should be
authorized to read all databases you want to include in the metadata ingestion workflow.
- **awsSecretAccessKey**: Enter the Secret Access Key (the passcode key pair to the key ID from above).
- **awsRegion**: Enter the location of the Amazon cluster that your data and account are associated with.
- **awsSessionToken**: The AWS session token is an optional parameter. If you want, enter the details of your temporary
session token.
- **endPointURL**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the region. You
may override this behavior by entering a value to the endpoint URL.
<MetadataIngestionConfig service="pipeline" connector="Glue" goal="CLI" />


@ -0,0 +1,25 @@
---
title: Glue
slug: /openmetadata/connectors/pipeline/glue
---
<ConnectorIntro service="pipeline" connector="Glue"/>
<Requirements />
<MetadataIngestionService connector="Glue"/>
<h4>Connection Options</h4>
- **AWS Access Key ID**: Enter your secure access key ID for your Glue connection. The specified key ID should be
authorized to read all databases you want to include in the metadata ingestion workflow.
- **AWS Secret Access Key**: Enter the Secret Access Key (the passcode key pair to the key ID from above).
- **AWS Region**: Enter the location of the Amazon cluster that your data and account are associated with.
- **AWS Session Token (optional)**: The AWS session token is an optional parameter. If you want, enter the details of
your temporary session token.
- **Endpoint URL (optional)**: Your Glue connector will automatically determine the AWS Glue endpoint URL based on the
region. You may override this behavior by entering a value to the endpoint URL.
<IngestionScheduleAndDeploy />
<ConnectorOutro connector="Glue" />


@ -4,3 +4,7 @@ slug: /openmetadata/connectors/pipeline
---
# Pipeline Services
- [Airbyte](/openmetadata/connectors/pipeline/airbyte)
- [Airflow](/openmetadata/connectors/pipeline/airflow)
- [Glue](/openmetadata/connectors/pipeline/glue)

6 binary image files added (not shown).


@ -0,0 +1,17 @@
source:
  type: airbyte
  serviceName: airbyte_source
  serviceConnection:
    config:
      type: Airbyte
      hostPort: http://localhost:8000
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: { }
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth


@ -0,0 +1,25 @@
source:
  type: airflow
  serviceName: airflow_source
  serviceConnection:
    config:
      type: Airflow
      hostPort: http://localhost:8080
      numberOfStatus: 10
      connection:
        type: Mysql
        username: airflow_user
        password: airflow_pass
        databaseSchema: airflow_db
        hostPort: localhost:3306
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: { }
workflowConfig:
  loggerLevel: DEBUG
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth