mirror of
				https://github.com/open-metadata/OpenMetadata.git
				synced 2025-10-25 07:42:40 +00:00 
			
		
		
		
	
		
			
	
	
		
			319 lines
		
	
	
		
			8.3 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
		
		
			
		
	
	
			319 lines
		
	
	
		
			8.3 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
|   | --- | ||
|  | title: Run Redpanda Connector using Airflow SDK | ||
|  | slug: /connectors/messaging/redpanda/airflow | ||
|  | --- | ||
|  | 
 | ||
|  | # Run Redpanda using the Airflow SDK
 | ||
|  | 
 | ||
|  | In this section, we provide guides and references to use the Redpanda connector. | ||
|  | 
 | ||
|  | Configure and schedule Redpanda metadata and profiler workflows from the OpenMetadata UI: | ||
|  | 
 | ||
|  | - [Requirements](#requirements) | ||
|  | - [Metadata Ingestion](#metadata-ingestion) | ||
|  | 
 | ||
|  | ## Requirements
 | ||
|  | 
 | ||
|  | {%inlineCallout icon="description" bold="OpenMetadata 0.12 or later" href="/deployment"%} | ||
|  | To deploy OpenMetadata, check the Deployment guides. | ||
|  | {%/inlineCallout%} | ||
|  | 
 | ||
|  | To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with | ||
|  | custom Airflow plugins to handle the workflow deployment. | ||
|  | 
 | ||
|  | ### Python Requirements
 | ||
|  | 
 | ||
|  | To run the Redpanda ingestion, you will need to install: | ||
|  | 
 | ||
|  | ```bash | ||
|  | pip3 install "openmetadata-ingestion[redpanda]" | ||
|  | ``` | ||
|  | 
 | ||
|  | ## Metadata Ingestion
 | ||
|  | 
 | ||
|  | All connectors are defined as JSON Schemas. | ||
|  | [Here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/connections/messaging/redpandaConnection.json) | ||
|  | you can find the structure to create a connection to Redpanda. | ||
|  | 
 | ||
|  | In order to create and run a Metadata Ingestion workflow, we will follow | ||
|  | the steps to create a YAML configuration able to connect to the source, | ||
|  | process the Entities if needed, and reach the OpenMetadata server. | ||
|  | 
 | ||
|  | The workflow is modeled around the following | ||
|  | [JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/workflow.json) | ||
|  | 
 | ||
|  | ### 1. Define the YAML Config
 | ||
|  | 
 | ||
|  | This is a sample config for Redpanda: | ||
|  | 
 | ||
|  | {% codePreview %} | ||
|  | 
 | ||
|  | {% codeInfoContainer %} | ||
|  | 
 | ||
|  | #### Source Configuration - Service Connection
 | ||
|  | 
 | ||
|  | {% codeInfo srNumber=1 %} | ||
|  | 
 | ||
|  | **bootstrapServers**: Redpanda bootstrap servers.  | ||
|  | 
 | ||
|  | Add them in comma separated values ex: host1:9092,host2:9092. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=2 %} | ||
|  | 
 | ||
|  | **schemaRegistryURL**: Confluent Redpanda Schema Registry URL. URI format. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=3 %} | ||
|  | 
 | ||
|  | **consumerConfig**: Confluent Redpanda Consumer Config. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=4 %} | ||
|  | 
 | ||
|  | **schemaRegistryConfig**:Confluent Redpanda Schema Registry Config. | ||
|  | 
 | ||
|  | **Note:** To ingest the topic schema `schemaRegistryURL` must be passed | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | 
 | ||
|  | 
 | ||
|  | 
 | ||
|  | #### Source Configuration - Source Config
 | ||
|  | 
 | ||
|  | {% codeInfo srNumber=5 %} | ||
|  | 
 | ||
|  | The sourceConfig is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/messagingServiceMetadataPipeline.json): | ||
|  | 
 | ||
|  | **generateSampleData:** Option to turn on/off generating sample data during metadata extraction. | ||
|  | 
 | ||
|  | **topicFilterPattern:** Note that the `topicFilterPattern` supports regex as include or exclude. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | #### Sink Configuration
 | ||
|  | 
 | ||
|  | {% codeInfo srNumber=6 %} | ||
|  | 
 | ||
|  | To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | #### Workflow Configuration
 | ||
|  | 
 | ||
|  | {% codeInfo srNumber=7 %} | ||
|  | 
 | ||
|  | The main property here is the `openMetadataServerConfig`, where you can define the host and security provider of your OpenMetadata installation. | ||
|  | 
 | ||
|  | For a simple, local installation using our docker containers, this looks like: | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% /codeInfoContainer %} | ||
|  | 
 | ||
|  | {% codeBlock fileName="filename.yaml" %} | ||
|  | 
 | ||
|  | 
 | ||
|  | ```yaml | ||
|  | source: | ||
|  |   type: redpanda | ||
|  |   serviceName: local_redpanda | ||
|  |   serviceConnection: | ||
|  |     config: | ||
|  |       type: Redpanda | ||
|  | ``` | ||
|  | ```yaml {% srNumber=1 %} | ||
|  |       bootstrapServers: localhost:9092 | ||
|  | ``` | ||
|  | ```yaml {% srNumber=2 %} | ||
|  |       schemaRegistryURL: http://localhost:8081  # Needs to be a URI | ||
|  | ``` | ||
|  | ```yaml {% srNumber=3 %} | ||
|  |       consumerConfig: {} | ||
|  | ``` | ||
|  | ```yaml {% srNumber=4 %} | ||
|  |       schemaRegistryConfig: {} | ||
|  | ``` | ||
|  | ```yaml {% srNumber=5 %} | ||
|  |   sourceConfig: | ||
|  |     config: | ||
|  |       type: MessagingMetadata | ||
|  |       topicFilterPattern: | ||
|  |         excludes: | ||
|  |           - _confluent.* | ||
|  |         # includes: | ||
|  |         #   - topic1 | ||
|  |       # generateSampleData: true | ||
|  | 
 | ||
|  | ``` | ||
|  | ```yaml {% srNumber=6 %} | ||
|  | sink: | ||
|  |   type: metadata-rest | ||
|  |   config: {} | ||
|  | ``` | ||
|  | 
 | ||
|  | ```yaml {% srNumber=7 %} | ||
|  | workflowConfig: | ||
|  |   openMetadataServerConfig: | ||
|  |     hostPort: "http://localhost:8585/api" | ||
|  |     authProvider: openmetadata | ||
|  |     securityConfig: | ||
|  |       jwtToken: "{bot_jwt_token}" | ||
|  | ``` | ||
|  | 
 | ||
|  | {% /codeBlock %} | ||
|  | 
 | ||
|  | {% /codePreview %} | ||
|  | 
 | ||
|  | ### Workflow Configs for Security Provider
 | ||
|  | 
 | ||
|  | We support different security providers. You can find their definitions [here](https://github.com/open-metadata/OpenMetadata/tree/main/openmetadata-spec/src/main/resources/json/schema/security/client). | ||
|  | 
 | ||
|  | ## Openmetadata JWT Auth
 | ||
|  | 
 | ||
|  | - JWT tokens will allow your clients to authenticate against the OpenMetadata server. To enable JWT Tokens, you will get more details [here](/deployment/security/enable-jwt-tokens). | ||
|  | 
 | ||
|  | ```yaml | ||
|  | workflowConfig: | ||
|  |   openMetadataServerConfig: | ||
|  |     hostPort: "http://localhost:8585/api" | ||
|  |     authProvider: openmetadata | ||
|  |     securityConfig: | ||
|  |       jwtToken: "{bot_jwt_token}" | ||
|  | ``` | ||
|  | 
 | ||
|  | - You can refer to the JWT Troubleshooting section [link](/deployment/security/jwt-troubleshooting) for any issues in your JWT configuration. If you need information on configuring the ingestion with other security providers in your bots, you can follow this doc [link](/deployment/security/workflow-config-auth). | ||
|  | 
 | ||
|  | 
 | ||
|  | ### 2. Prepare the Ingestion DAG
 | ||
|  | 
 | ||
|  | Create a Python file in your Airflow DAGs directory with the following contents: | ||
|  | 
 | ||
|  | {% codePreview %} | ||
|  | 
 | ||
|  | {% codeInfoContainer %} | ||
|  | 
 | ||
|  | 
 | ||
|  | {% codeInfo srNumber=8 %} | ||
|  | 
 | ||
|  | #### Import necessary modules
 | ||
|  | 
 | ||
|  | The `Workflow` class that is being imported is a part of a metadata ingestion framework, which defines a process of getting data from different sources and ingesting it into a central metadata repository. | ||
|  | 
 | ||
|  | Here we are also importing all the basic requirements to parse YAMLs, handle dates and build our DAG. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=9 %} | ||
|  | 
 | ||
|  | **Default arguments for all tasks in the Airflow DAG.**  | ||
|  | 
 | ||
|  | - Default arguments dictionary contains default arguments for tasks in the DAG, including the owner's name, email address, number of retries, retry delay, and execution timeout. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=10 %} | ||
|  | 
 | ||
|  | - **config**: Specifies config for the metadata ingestion as we prepare above. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=11 %} | ||
|  | 
 | ||
|  | - **metadata_ingestion_workflow()**: This code defines a function `metadata_ingestion_workflow()` that loads a YAML configuration, creates a `Workflow` object, executes the workflow, checks its status, prints the status to the console, and stops the workflow. | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | {% codeInfo srNumber=12 %} | ||
|  | 
 | ||
|  | - **DAG**: creates a DAG using the Airflow framework, and tune the DAG configurations to whatever fits with your requirements | ||
|  | - For more Airflow DAGs creation details visit [here](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#declaring-a-dag). | ||
|  | 
 | ||
|  | {% /codeInfo %} | ||
|  | 
 | ||
|  | Note that from connector to connector, this recipe will always be the same. | ||
|  | By updating the `YAML configuration`, you will be able to extract metadata from different sources. | ||
|  | 
 | ||
|  | {% /codeInfoContainer %} | ||
|  | 
 | ||
|  | {% codeBlock fileName="filename.py" %} | ||
|  | 
 | ||
|  | ```python {% srNumber=8 %} | ||
|  | import pathlib | ||
|  | import yaml | ||
|  | from datetime import timedelta | ||
|  | from airflow import DAG | ||
|  | from metadata.config.common import load_config_file | ||
|  | from metadata.ingestion.api.workflow import Workflow | ||
|  | from airflow.utils.dates import days_ago | ||
|  | 
 | ||
|  | try: | ||
|  |     from airflow.operators.python import PythonOperator | ||
|  | except ModuleNotFoundError: | ||
|  |     from airflow.operators.python_operator import PythonOperator | ||
|  | 
 | ||
|  | 
 | ||
|  | ``` | ||
|  | 
 | ||
|  | ```python {% srNumber=9 %} | ||
|  | default_args = { | ||
|  |     "owner": "user_name", | ||
|  |     "email": ["username@org.com"], | ||
|  |     "email_on_failure": False, | ||
|  |     "retries": 3, | ||
|  |     "retry_delay": timedelta(minutes=5), | ||
|  |     "execution_timeout": timedelta(minutes=60) | ||
|  | } | ||
|  | 
 | ||
|  | 
 | ||
|  | ``` | ||
|  | 
 | ||
|  | ```python {% srNumber=10 %} | ||
|  | config = """ | ||
|  | <your YAML configuration> | ||
|  | """ | ||
|  | 
 | ||
|  | 
 | ||
|  | ``` | ||
|  | 
 | ||
|  | ```python {% srNumber=11 %} | ||
|  | def metadata_ingestion_workflow(): | ||
|  |     workflow_config = yaml.safe_load(config) | ||
|  |     workflow = Workflow.create(workflow_config) | ||
|  |     workflow.execute() | ||
|  |     workflow.raise_from_status() | ||
|  |     workflow.print_status() | ||
|  |     workflow.stop() | ||
|  | 
 | ||
|  | 
 | ||
|  | ``` | ||
|  | 
 | ||
|  | ```python {% srNumber=12 %} | ||
|  | with DAG( | ||
|  |     "sample_data", | ||
|  |     default_args=default_args, | ||
|  |     description="An example DAG which runs a OpenMetadata ingestion workflow", | ||
|  |     start_date=days_ago(1), | ||
|  |     is_paused_upon_creation=False, | ||
|  |     schedule_interval='*/5 * * * *', | ||
|  |     catchup=False, | ||
|  | ) as dag: | ||
|  |     ingest_task = PythonOperator( | ||
|  |         task_id="ingest_using_recipe", | ||
|  |         python_callable=metadata_ingestion_workflow, | ||
|  |     ) | ||
|  | 
 | ||
|  | 
 | ||
|  | ``` | ||
|  | 
 | ||
|  | {% /codeBlock %} | ||
|  | 
 | ||
|  | {% /codePreview %} | ||
|  | 
 | ||
|  | 
 | ||
|  | 
 |