GitBook: [#128] Airflow ingestion

This commit is contained in:
pmbrull 2022-03-23 09:56:47 +00:00 committed by Sriharsha Chintalapani
parent 7971380d89
commit c91d5622bf
9 changed files with 166 additions and 14 deletions

View File

@@ -12,7 +12,7 @@ OpenMetadata provides connectors that enable you to perform metadata ingestion f
| A-H | I-M | N-R | S-Z |
| -------------------------------------------------------------------- | ----------------------------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------------- |
-| [Airflow](integrations/connectors/airflow/airflow.md) | [IBM Db2](integrations/connectors/ibm-db2.md) | [Oracle](integrations/connectors/oracle.md) | [Salesforce](integrations/connectors/salesforce.md) |
+| [Airflow](integrations/airflow/airflow.md) | [IBM Db2](integrations/connectors/ibm-db2.md) | [Oracle](integrations/connectors/oracle.md) | [Salesforce](integrations/connectors/salesforce.md) |
| Amundsen | [Kafka](integrations/connectors/kafka.md) | [Postgres](integrations/connectors/postgres/) | [SingleStore](integrations/connectors/singlestore.md) |
| Apache Atlas | LDAP | Power BI | [Snowflake](integrations/connectors/snowflake/) |
| Apache Druid | [Looker](integrations/connectors/looker.md) | Prefect | [Snowflake Usage](integrations/connectors/snowflake/snowflake-usage.md) |

View File

@@ -15,9 +15,6 @@
* [Metadata Ingestion Overview](../metadata-ingestion/metadata-ingestion.md)
* [Connectors](integrations/connectors/README.md)
-* [Airflow](integrations/connectors/airflow/README.md)
-  * [Airflow Metadata Ingestion](integrations/connectors/airflow/airflow.md)
-  * [Airflow Lineage](integrations/connectors/airflow/airflow-lineage.md)
* [Athena](integrations/connectors/athena.md)
* [Azure SQL](integrations/connectors/azure-sql.md)
* [BigQuery](integrations/connectors/bigquery/README.md)
@@ -61,6 +58,11 @@
* [Vertica](integrations/connectors/vertica.md)
* [Troubleshoot Connectors](integrations/troubleshoot-connectors.md)
* [Ingest Sample Data](../metadata-ingestion/ingest-sample-data.md)
+* [Airflow](integrations/airflow/README.md)
+  * [Custom Airflow Installation](integrations/airflow/custom-airflow-installation.md)
+  * [Airflow Lineage](integrations/airflow/airflow-lineage.md)
+  * [Configure Airflow in the OpenMetadata Server](integrations/airflow/configure-airflow-in-the-openmetadata-server.md)
+  * [Example Metadata Ingestion](integrations/airflow/airflow.md)
## 📐 Metadata Standard

View File

@@ -0,0 +1,30 @@
---
description: >-
  Learn how to configure Airflow to run metadata ingestion, the recommended
  way to schedule ingestion workflows.
---
# Airflow
Use Airflow to define and deploy your metadata ingestion Workflows. This can be done in two ways:

1. If you do not have an Airflow service up and running on your platform, we provide a custom Docker [image](https://hub.docker.com/r/openmetadata/ingestion) that already contains the OpenMetadata ingestion packages and the custom [Airflow APIs](https://github.com/open-metadata/openmetadata-airflow-apis) needed to deploy Workflows from the UI. A minimal way to run this image is sketched after this list.
2. If you already have Airflow up and running and want to use it for the metadata ingestion, you will need to install the ingestion modules on the host. You can find more information on how to do this in the [Custom Airflow Installation](custom-airflow-installation.md) section.
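For option 1, the quickest path is to pull and run the image. This is only a sketch: the container name and host port mapping below are assumptions, so adjust them to your environment and check the image documentation for the supported options.

```bash
# Pull the ingestion image, which bundles Airflow, the OpenMetadata
# ingestion packages, and the custom Airflow APIs
docker pull openmetadata/ingestion

# Run it and expose the Airflow webserver (port and name are illustrative)
docker run -d --name openmetadata-ingestion -p 8080:8080 openmetadata/ingestion
```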
## Guides
{% content-ref url="custom-airflow-installation.md" %}
[custom-airflow-installation.md](custom-airflow-installation.md)
{% endcontent-ref %}
{% content-ref url="airflow-lineage.md" %}
[airflow-lineage.md](airflow-lineage.md)
{% endcontent-ref %}
{% content-ref url="configure-airflow-in-the-openmetadata-server.md" %}
[configure-airflow-in-the-openmetadata-server.md](configure-airflow-in-the-openmetadata-server.md)
{% endcontent-ref %}
{% content-ref url="airflow.md" %}
[airflow.md](airflow.md)
{% endcontent-ref %}

View File

@@ -25,7 +25,7 @@ The Lineage Backend can be directly installed to the Airflow instances as part o
{% tabs %}
{% tab title="Install Using PyPI" %}
```bash
-pip install openmetadata-ingestion
+pip install "openmetadata-ingestion[airflow-container]"
```
{% endtab %}
{% endtabs %}
@@ -82,7 +82,7 @@ t1 >> [t2, t3]
The backend will capture this information as well, showing that the DAG contains three tasks `t1`, `t2`, and `t3`, with `t2` and `t3` as downstream tasks of `t1`.
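For reference, a self-contained DAG matching that description could look like the following sketch; the `BashOperator` tasks, DAG id, and scheduling values are illustrative assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lineage_example",
    start_date=datetime(2022, 3, 1),
    schedule_interval=None,
) as dag:
    # Three simple tasks; when the DAG runs, the Lineage Backend ingests
    # the DAG as a Pipeline Entity and each task with its status
    t1 = BashOperator(task_id="t1", bash_command="echo t1")
    t2 = BashOperator(task_id="t2", bash_command="echo t2")
    t3 = BashOperator(task_id="t3", bash_command="echo t3")

    # t2 and t3 run downstream of t1
    t1 >> [t2, t3]
```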
-![Pipeline and Tasks example](<../../../.gitbook/assets/image (13).png>)
+![Pipeline and Tasks example](<../../.gitbook/assets/image (13).png>)
### Adding Lineage

View File

@@ -4,7 +4,9 @@ description: >-
Connectors.
---
-# Airflow Metadata Ingestion
+# Example Metadata Ingestion
This is an example of how to create an Airflow DAG to ingest the sample data provided in the git [repository](https://github.com/open-metadata/OpenMetadata/tree/main/ingestion/examples/sample\_data).
## Airflow Example for Sample Data
@@ -90,4 +92,4 @@ def metadata_ingestion_workflow():
    workflow.stop()
```
-Create a Workflow instance and pass a sample-data configuration which will read metadata from Json files and ingest it into the OpenMetadata Server. You can customize this configuration or add different connectors please refer to our [examples](https://github.com/open-metadata/OpenMetadata/tree/main/ingestion/examples/workflows) and refer to [Connectors](../).
+Create a Workflow instance and pass a sample-data configuration that will read metadata from JSON files and ingest it into the OpenMetadata Server. To customize this configuration or add different connectors, please refer to our [examples](https://github.com/open-metadata/OpenMetadata/tree/main/ingestion/examples/workflows) and to [Connectors](../connectors/).
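For orientation, the body of such a workflow function is sketched below, condensed from the linked examples. The exact recipe keys and the import path depend on your `openmetadata-ingestion` version, so treat the details as assumptions to verify.

```python
import json

# Import path as used in the ingestion examples; verify for your version
from metadata.ingestion.api.workflow import Workflow

# Minimal sample-data recipe; endpoints and paths are illustrative
config = """
{
  "source": {
    "type": "sample-data",
    "config": {"sample_data_folder": "./examples/sample_data"}
  },
  "sink": {"type": "metadata-rest", "config": {}},
  "metadata_server": {
    "type": "metadata-server",
    "config": {
      "api_endpoint": "http://localhost:8585/api",
      "auth_provider_type": "no-auth"
    }
  }
}
"""


def metadata_ingestion_workflow():
    # Build the Workflow from the JSON recipe, run it, and clean up
    workflow = Workflow.create(json.loads(config))
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()
```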

View File

@@ -0,0 +1,36 @@
---
description: >-
Learn how to use the workflow deployment from the UI with a simple
configuration.
---
# Configure Airflow in the OpenMetadata Server
## Prerequisites
This page will guide you through setting up the link between your Airflow host and the OpenMetadata server. Note that to enable workflow deployment from the UI, you need the Airflow APIs correctly installed on your Airflow host.
This can be done either by using our custom Docker [image](https://hub.docker.com/r/openmetadata/ingestion) or following this [guide](custom-airflow-installation.md) to set up your existing Airflow service.
## Configuration
The OpenMetadata server takes all of its configuration from a YAML file. You can find these files in our [repo](https://github.com/open-metadata/OpenMetadata/tree/main/conf). In either `openmetadata-security.yaml` or `openmetadata.yaml`, update the `airflowConfiguration` section accordingly.
{% code title="openmetadata.yaml" %}
```yaml
[...]
airflowConfiguration:
apiEndpoint: http://${AIRFLOW_HOST:-localhost}:${AIRFLOW_PORT:-8080}
username: ${AIRFLOW_USERNAME:-admin}
password: ${AIRFLOW_PASSWORD:-admin}
metadataApiEndpoint: http://${SERVER_HOST:-localhost}:${SERVER_PORT:-8585}/api
authProvider: "no-auth"
```
{% endcode %}
Note that these values can also be picked up from environment variables, so you can safely set them on the machine hosting the OpenMetadata server. For example:
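The variable names below come straight from the `airflowConfiguration` section above; the values are placeholders for your actual hosts and credentials.

```bash
# Point these at your actual Airflow and OpenMetadata hosts
export AIRFLOW_HOST=airflow.internal
export AIRFLOW_PORT=8080
export AIRFLOW_USERNAME=admin
export AIRFLOW_PASSWORD=admin
export SERVER_HOST=openmetadata.internal
export SERVER_PORT=8585
```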
## Connectors
Once this configuration is done, you can head to any of the [Connectors](../connectors/) to find detailed guides on how to deploy an ingestion workflow from the UI.

View File

@@ -0,0 +1,87 @@
---
description: How to install OpenMetadata ingestion modules to an existing Airflow host.
---
# Custom Airflow Installation
If you already have an Airflow instance up and running, you might want to reuse it to host the metadata workflows as well. This page will guide you through the different aspects to consider when configuring an existing Airflow instance.
There are three different angles here:
1. Installing the ingestion modules directly on the host to enable the [Airflow Lineage Backend](airflow-lineage.md).
2. Installing connector modules on the host to run specific workflows.
3. Installing the Airflow APIs to enable the workflow deployment through the UI.
Depending on what you wish to use, you might only need some of these installations. Note that the installation commands shown below need to be run on the **Airflow instances**.
## Airflow Lineage Backend
Goals:
* Ingest DAGs and Tasks as Pipeline Entities when they run.
* Track DAG and Task status.
* Document lineage as code directly on the DAG definition and ingest it when the DAGs run.
You can find the full information in [Airflow Lineage Backend](airflow-lineage.md), but as a quick summary, you need to run:
{% tabs %}
{% tab title="Install Using PyPI" %}
```bash
pip install "openmetadata-ingestion[airflow-container]"
```
{% endtab %}
{% endtabs %}
This adds the full core `openmetadata-ingestion` package plus version alignments for a few libraries to keep everything compatible with Airflow.
Afterward, you can configure `airflow.cfg` following the guide above.
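As an illustration, the lineage section of `airflow.cfg` typically ends up looking like the sketch below. The backend class path and the extra keys shown here are assumptions; copy the exact values from the [Airflow Lineage](airflow-lineage.md) guide for your version.

{% code title="airflow.cfg" %}
```
[lineage]
# Assumed class path and keys; verify against the Airflow Lineage guide
backend = airflow_provider_openmetadata.lineage.openmetadata.OpenMetadataLineageBackend
airflow_service_name = airflow
openmetadata_api_endpoint = http://localhost:8585/api
auth_provider_type = no-auth
```
{% endcode %}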
## Connector Modules
Goal:
* Ingest metadata from specific sources.
The current approach is to prepare the metadata ingestion DAGs as `PythonOperator`s. This means that the packages need to be present on the Airflow instances; a minimal sketch of such a DAG is shown at the end of this section.
> Note that we are working towards preparing specific `DockerOperator`s that will simplify this process and reduce requirement inconsistencies on the Airflow host. We do not yet have clear support or a guide here, as it might depend on specific architectures. We are working on this, and it will be part of future releases.
Then, you can just follow the guides for each [Connector](../connectors/). In the end, the installation process will look like this:
{% tabs %}
{% tab title="Install Using PyPI" %}
```bash
pip install "openmetadata-ingestion[<connector-name>]"
```
{% endtab %}
{% endtabs %}
If you have skipped the Airflow Lineage configuration, you will need to install the `airflow-container` plugin as well.
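As mentioned above, here is a minimal sketch of an ingestion DAG built on `PythonOperator`. The DAG id, schedule, and the placeholder callable are assumptions; in practice, the callable creates and runs an OpenMetadata Workflow, as shown in the [Example Metadata Ingestion](airflow.md) page.

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def metadata_ingestion_workflow():
    # Placeholder: in a real DAG this builds and runs an OpenMetadata
    # Workflow from a JSON recipe (see the Example Metadata Ingestion page)
    ...


with DAG(
    "sample_data_ingestion",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="ingest_sample_data",
        python_callable=metadata_ingestion_workflow,
    )
```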
## Airflow APIs
Goal:
* Deploy metadata ingestion workflows directly from the UI.
This process consists of three steps:
1. Install the APIs module,
2. Install the required plugins, and
3. Configure the OpenMetadata server.
The goal of this module is to add some HTTP endpoints that the UI calls for deploying the Airflow DAGs. The first step can be achieved by running:
{% tabs %}
{% tab title="Install Using PyPI" %}
```bash
pip install "openmetadata-airflow-managed-apis"
```
{% endtab %}
{% endtabs %}
Then, check the Connector Modules guide above to learn how to install the `openmetadata-ingestion` package with the necessary plugins. These are required because, even with the APIs installed, the Airflow instance still needs the libraries to connect to each source.
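Putting both steps together for a single source, the installation could look like the sketch below; the `mysql` extra is an assumed example, so take the exact extra name from the corresponding connector guide:

```bash
# APIs module for UI-driven workflow deployment
pip install "openmetadata-airflow-managed-apis"

# Ingestion package with the Airflow alignments and an example connector
pip install "openmetadata-ingestion[airflow-container,mysql]"
```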
### Configure the OpenMetadata Server
Finally, you can check how to [Configure Airflow in the OpenMetadata Server](configure-airflow-in-the-openmetadata-server.md) to enable workflow deployments from the UI.

View File

@@ -1,5 +0,0 @@
-# Airflow
-{% content-ref url="airflow-lineage.md" %}
-[airflow-lineage.md](airflow-lineage.md)
-{% endcontent-ref %}

View File

@@ -7,4 +7,4 @@ OpenMetadata Ingestion is a simple framework to build connectors and ingest meta
* [Ingest Sample Data](ingest-sample-data.md)
* [Explore Connectors & Install](../docs/integrations/connectors/)
* [Ingest Sample Data](ingest-sample-data.md)
-* [Configure Airflow](../docs/integrations/connectors/airflow/airflow.md)
+* [Configure Airflow](../docs/integrations/airflow/)