Where `x.y.z` is the version of the OpenMetadata ingestion package. Note that the version needs to match the server version. If we are using the server at 1.1.0, then the ingestion package needs to also be 1.1.0.
The plugin parameter is a list of the sources that we want to ingest. An example would look like this `openmetadata-ingestion[mysql,snowflake,s3]==1.1.0`.
- The same logic as above applies. The `x.y.z` version needs to match the server version. For example, `docker.getcollate.io/openmetadata/ingestion-base:0.13.2`
We have tested this process with a Task Memory of 512MB and Task CPU (unit) of 256. This can be tuned depending on the amount of metadata that needs to be ingested.
When creating the ECS Cluster, take note of the log groups assigned, as we will need them to prepare the MWAA Executor Role policies.
### 2. Update MWAA Executor Role policies
- Identify your MWAA executor role. This can be obtained from the details view of your MWAA environment.
- Add two policies to the role: the first with ECS permissions, and the second granting access to the log groups noted in the previous step. A sketch of a DAG that triggers the ECS task follows below.
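With the cluster created and the executor role updated, the ingestion image can be launched from a DAG running in MWAA. The snippet below is only a sketch, not the exact setup from this guide: the cluster, task definition, subnets, security group, and log group names are placeholders for the resources created in the previous steps, and on older releases of the Amazon provider the operator is exposed as `EcsOperator`/`ECSOperator` rather than `EcsRunTaskOperator`.

```python
from datetime import timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator
from airflow.utils.dates import days_ago

with DAG(
    "openmetadata_ecs_ingestion",
    description="Runs the OpenMetadata ingestion container as an ECS task",
    start_date=days_ago(1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    run_ingestion = EcsRunTaskOperator(
        task_id="run_openmetadata_ingestion",
        cluster="openmetadata-ingestion",               # ECS cluster created above (placeholder name)
        task_definition="openmetadata-ingestion-task",  # task definition using the ingestion-base image
        launch_type="FARGATE",
        overrides={"containerOverrides": []},           # override the container command/env here if needed
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "securityGroups": ["sg-xxxxxxxx"],
                "assignPublicIp": "ENABLED",
            }
        },
        awslogs_group="/ecs/openmetadata-ingestion",    # log group noted when creating the cluster
        awslogs_stream_prefix="ecs",
    )
```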
## Ingestion Workflows as a Python Virtualenv Operator
### PROs
- Installation does not clash with existing libraries
- Simpler than ECS
### CONs
- We need to install an additional plugin in MWAA
- DAGs take longer to run, since the virtualenv must be built from scratch on every run.
We need to update the `requirements.txt` file from the MWAA environment to add the following line:
```
virtualenv
```
Then, we need to set up a custom plugin in MWAA. Create a file named `virtual_python_plugin.py`. Note that you may need to update the Python version (e.g., python3.7 -> python3.10) depending on what your MWAA environment is running.
```python
"""
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.
"""
from airflow.plugins_manager import AirflowPlugin
import airflow.utils.python_virtualenv
from typing import List


# Patch Airflow so that PythonVirtualenvOperator builds its environments with
# the `virtualenv` package installed through requirements.txt.
def _generate_virtualenv_cmd(tmp_dir: str, python_bin: str, system_site_packages: bool) -> List[str]:
    cmd = ['python3', '/usr/local/airflow/.local/lib/python3.7/site-packages/virtualenv', tmp_dir]
    if system_site_packages:
        cmd.append('--system-site-packages')
    if python_bin is not None:
        cmd.append(f'--python={python_bin}')
    return cmd


airflow.utils.python_virtualenv._generate_virtualenv_cmd = _generate_virtualenv_cmd


class VirtualPythonPlugin(AirflowPlugin):
    name = 'virtual_python_plugin'
```
This is modified from the [AWS sample](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-virtualenv.html).
Next, create the plugins.zip file and upload it according to [AWS docs](https://docs.aws.amazon.com/mwaa/latest/userguide/configuring-dag-import-plugins.html). You will also need to [disable lazy plugin loading in MWAA](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-virtualenv.html#samples-virtualenv-airflow-config) by setting the `core.lazy_load_plugins` Airflow configuration option to `False` for the environment.
A DAG deployed using the PythonVirtualenvOperator would then look like:
```python
from datetime import timedelta
from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator
from airflow.utils.dates import days_ago
default_args = {
    "retries": 3,
    "retry_delay": timedelta(seconds=10),
    "execution_timeout": timedelta(minutes=60),
}


def metadata_ingestion_workflow():
    from metadata.ingestion.api.workflow import Workflow
    import yaml

    config = """
    YAML config
    """

    workflow_config = yaml.safe_load(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "redshift_ingestion",
    default_args=default_args,
    description="An example DAG which runs an OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    catchup=False,
) as dag:
    ingest_task = PythonVirtualenvOperator(
        task_id="ingest_redshift",
        python_callable=metadata_ingestion_workflow,
        requirements=[
            'openmetadata-ingestion==1.0.5.0',
            'apache-airflow==2.4.3',  # note, v2.4.3 is the first version that does not conflict with OpenMetadata's 'tabulate' requirements
            'apache-airflow-providers-amazon==6.0.0',  # the Amazon Airflow provider is necessary for MWAA
            'watchtower',
        ],
        system_site_packages=False,
        dag=dag,
    )
```
Where you can update the YAML configuration and workflow classes accordingly. Further examples on how to
run the ingestion can be found on the documentation (e.g., [Snowflake](https://docs.open-metadata.org/connectors/database/snowflake)).
You will also need to determine the OpenMetadata ingestion extras and Airflow providers you need. Note that the OpenMetadata version needs to match the server version. If we are using the server at 0.12.2, then the ingestion package needs to also be 0.12.2. An example of the extras would look like `openmetadata-ingestion[mysql,snowflake,s3]==0.12.2.2`.
For Airflow providers, you will want to pull the provider versions from [the matching constraints file](https://raw.githubusercontent.com/apache/airflow/constraints-2.4.3/constraints-3.7.txt). Since this example installs Airflow v2.4.3 on Python 3.7, we use that constraints file.
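For instance, a pinned requirements list for the operator could look like the following sketch; the extras and versions here are illustrative, so match the extras to your sources and take the exact provider pins from the constraints file above.

```python
# Illustrative pins only: the extras depend on the sources you ingest, and the
# provider versions should come from the Airflow constraints file for your
# Airflow and Python versions.
requirements = [
    "openmetadata-ingestion[mysql,snowflake,s3]==0.12.2.2",
    "apache-airflow==2.4.3",
    "apache-airflow-providers-amazon==6.0.0",
    "watchtower",
]
```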
Also note that the ingestion workflow function must be entirely self-contained, as it will run by itself inside the virtualenv. Any imports it needs, including the configuration, must exist within the function itself.
### Extracting MWAA Metadata
As the ingestion process will be happening locally in MWAA, we can prepare a DAG with the following YAML
configuration:
```yaml
source:
  type: airflow
  serviceName: airflow_mwaa
  serviceConnection:
    config:
      type: Airflow
      hostPort: http://localhost:8080
      numberOfStatus: 10
      connection:
        type: Backend
  sourceConfig:
    config:
      type: PipelineMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: INFO
  openMetadataServerConfig:
    hostPort: <OpenMetadata host and port>
    authProvider: <OpenMetadata auth provider>
```
Using the connection type `Backend` will pick up the Airflow database session directly at runtime.