---
title: Run DeltaLake Connector using the CLI
slug: /connectors/database/deltalake/cli
---

# Run Deltalake using the metadata CLI

{% multiTablesWrapper %}

| Feature            | Status                       |
| :----------------- | :--------------------------- |
| Stage              | PROD                         |
| Metadata           | {% icon iconName="check" /%} |
| Query Usage        | {% icon iconName="cross" /%} |
| Data Profiler      | {% icon iconName="cross" /%} |
| Data Quality       | {% icon iconName="cross" /%} |
| Lineage            | Partially via Views          |
| DBT                | {% icon iconName="cross" /%} |
| Supported Versions | --                           |

| Feature      | Status                       |
| :----------- | :--------------------------- |
| Lineage      | Partially via Views          |
| Table-level  | {% icon iconName="check" /%} |
| Column-level | {% icon iconName="check" /%} |

{% /multiTablesWrapper %}

In this section, we provide guides and references to use the Deltalake connector.

Configure and schedule Deltalake metadata workflows from the CLI:

- [Requirements](#requirements)
- [Metadata Ingestion](#metadata-ingestion)
- [dbt Integration](#dbt-integration)

## Requirements

{%inlineCallout icon="description" bold="OpenMetadata 0.12 or later" href="/deployment"%}
To deploy OpenMetadata, check the Deployment guides.
{%/inlineCallout%}

To run the Ingestion via the UI you'll need to use the OpenMetadata Ingestion Container, which comes shipped with
custom Airflow plugins to handle the workflow deployment.

### Python Requirements

To run the Deltalake ingestion, you will need to install:

```bash
pip3 install "openmetadata-ingestion[deltalake]"
```

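To confirm the installation before moving on, a quick sanity check (assuming `pip3` and the installed `metadata` CLI are available on your PATH) could be:

```bash
# confirm the plugin is installed and the metadata CLI entrypoint resolves
pip3 show openmetadata-ingestion
metadata --help
```
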
## Metadata Ingestion

All connectors are defined as JSON Schemas.
[Here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/connections/database/deltaLakeConnection.json)
you can find the structure to create a connection to Deltalake.

In order to create and run a Metadata Ingestion workflow, we will follow these steps to create a YAML configuration
that can connect to the source, process the Entities if needed, and reach the OpenMetadata server.

The workflow is modeled around the following
[JSON Schema](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/workflow.json).

### 1. Define the YAML Config

This is a sample config for Deltalake:

{% codePreview %}

{% codeInfoContainer %}

#### Source Configuration - Service Connection

{% codeInfo srNumber=1 %}

**Metastore Host Port**: Enter the Host & Port of the Hive Metastore Service to configure the Spark Session. One of `metastoreHostPort`, `metastoreDb`, or `metastoreFilePath` is required.

**Metastore File Path**: Enter the file path to a local Metastore in case the Spark cluster is running locally. One of `metastoreHostPort`, `metastoreDb`, or `metastoreFilePath` is required.

**Metastore DB**: The JDBC connection to the underlying Hive metastore DB. One of `metastoreHostPort`, `metastoreDb`, or `metastoreFilePath` is required.

**appName (Optional)**: Enter the app name of the Spark session.

**Connection Arguments (Optional)**: Key-Value pairs that will be used to pass extra `config` elements to the Spark Session builder.

We are internally running with `pyspark` 3.x and `delta-lake` 2.0.0. This means that we need to consider the Spark configuration options for 3.x.

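For instance, a hypothetical `connectionArguments` entry passing the Spark 3.x settings that Delta Lake 2.x typically relies on could look like the following sketch (whether you need these exact keys depends on your environment):

```yaml
# hypothetical example: extra Spark confs forwarded to the Spark Session builder
connectionArguments:
  spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
  spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
```
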
##### Metastore Host Port

When connecting to an external Metastore by passing the `Metastore Host Port` parameter, we will prepare a Spark Session with the following configuration:

```
.config("hive.metastore.uris", "thrift://{connection.metastoreHostPort}")
```

Then, we will use the `catalog` functions from the Spark Session to pick up the metadata exposed by the Hive Metastore.

##### Metastore File Path

If we instead use a local file path that contains the metastore information (e.g., for local testing with the default `metastore_db` directory), we will set

```
.config("spark.driver.extraJavaOptions", "-Dderby.system.home={connection.metastoreFilePath}")
```

to update the `Derby` information. You can find more background on this in this [SO thread](https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell).

- You can find all supported configurations [here](https://spark.apache.org/docs/latest/configuration.html).
- If you need further information regarding the Hive metastore, you can find it [here](https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-hive-tables.html), and in The Internals of Spark SQL [book](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hive-metastore.html).

{% /codeInfo %}

#### Source Configuration - Source Config

{% codeInfo srNumber=4 %}

The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceMetadataPipeline.json):

**markDeletedTables**: Flags tables as soft-deleted if they are no longer present in the source system.

**includeTables**: true or false, to ingest table data. Default is true.

**includeViews**: true or false, to ingest view definitions.

**databaseFilterPattern**, **schemaFilterPattern**, **tableFilterPattern**: Note that they support regex as include or exclude patterns.

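For example, a hypothetical `tableFilterPattern` that only ingests tables whose names start with `fact_` could look like:

```yaml
# hypothetical filter: ingest only tables whose names begin with "fact_"
tableFilterPattern:
  includes:
    - ^fact_.*
```
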
{% /codeInfo %}

#### Sink Configuration

{% codeInfo srNumber=5 %}

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.

{% /codeInfo %}

#### Workflow Configuration

{% codeInfo srNumber=6 %}

The main property here is the `openMetadataServerConfig`, where you can define the host and security provider of your OpenMetadata installation.

For a simple, local installation using our docker containers, this looks like:

{% /codeInfo %}

#### Advanced Configuration

{% codeInfo srNumber=2 %}

**Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to Deltalake during the connection. These details must be added as Key-Value pairs.

{% /codeInfo %}

{% codeInfo srNumber=3 %}

**Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to Deltalake during the connection. These details must be added as Key-Value pairs.

- In case you are using Single-Sign-On (SSO) for authentication, add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"`
- In case you authenticate with SSO using an external browser popup, then add the `authenticator` details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "externalbrowser"`

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

```yaml
source:
  type: deltalake
  serviceName: "<service name>"
  serviceConnection:
    config:
      type: DeltaLake
```
```yaml {% srNumber=1 %}
      metastoreConnection:
        # Pick only one of the three
        metastoreHostPort: "<metastore host port>"
        # metastoreDb: jdbc:mysql://localhost:3306/demo_hive
        # metastoreFilePath: "<path_to_metastore>/metastore_db"
      appName: MyApp
```
```yaml {% srNumber=2 %}
      # connectionOptions:
      #   key: value
```
```yaml {% srNumber=3 %}
      # connectionArguments:
      #   key: value
```

```yaml {% srNumber=4 %}
      sourceConfig:
        config:
          type: DatabaseMetadata
          markDeletedTables: true
          includeTables: true
          includeViews: true
          # includeTags: true
          # databaseFilterPattern:
          #   includes:
          #     - database1
          #     - database2
          #   excludes:
          #     - database3
          #     - database4
          # schemaFilterPattern:
          #   includes:
          #     - schema1
          #     - schema2
          #   excludes:
          #     - schema3
          #     - schema4
          # tableFilterPattern:
          #   includes:
          #     - users
          #     - type_test
          #   excludes:
          #     - table3
          #     - table4
```

```yaml {% srNumber=5 %}
sink:
  type: metadata-rest
  config: {}
```

```yaml {% srNumber=6 %}
workflowConfig:
  openMetadataServerConfig:
    hostPort: "http://localhost:8585/api"
    authProvider: openmetadata
    securityConfig:
      jwtToken: "{bot_jwt_token}"
```

{% /codeBlock %}

{% /codePreview %}

### Workflow Configs for Security Provider

We support different security providers. You can find their definitions [here](https://github.com/open-metadata/OpenMetadata/tree/main/openmetadata-spec/src/main/resources/json/schema/security/client).

## OpenMetadata JWT Auth

- JWT tokens allow your clients to authenticate against the OpenMetadata server. To enable JWT tokens, you can find more details [here](/deployment/security/enable-jwt-tokens).

```yaml
workflowConfig:
  openMetadataServerConfig:
    hostPort: "http://localhost:8585/api"
    authProvider: openmetadata
    securityConfig:
      jwtToken: "{bot_jwt_token}"
```

- You can refer to the JWT troubleshooting section [link](/deployment/security/jwt-troubleshooting) for any issues in your JWT configuration. If you need information on configuring the ingestion with other security providers for your bots, you can follow this doc [link](/deployment/security/workflow-config-auth).

### 2. Run with the CLI

First, we will need to save the YAML file. Afterward, and with all requirements installed, we can run:

```bash
metadata ingest -c <path-to-yaml>
```

Note that from connector to connector, this recipe will always be the same. By updating the YAML configuration, you will be able to extract metadata from different sources.

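For instance, assuming the configuration above was saved as `deltalake.yaml` (a hypothetical file name), the run could look like the following sketch, optionally rescheduled via cron:

```bash
# run the ingestion once, using the hypothetical file name deltalake.yaml
metadata ingest -c deltalake.yaml

# optional: a crontab entry to re-run the ingestion nightly at 2 AM (adjust paths for your environment)
# 0 2 * * * /usr/local/bin/metadata ingest -c /path/to/deltalake.yaml >> /var/log/om-deltalake.log 2>&1
```
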
## dbt Integration

{% tilesContainer %}

{% tile
  icon="mediation"
  title="dbt Integration"
  description="Learn more about how to ingest dbt models' definitions and their lineage."
  link="/connectors/ingestion/workflows/dbt" /%}

{% /tilesContainer %}

## Related

{% tilesContainer %}

{% tile
    title="Ingest with Airflow"
    description="Configure the ingestion using Airflow SDK"
    link="/connectors/database/deltalake/airflow"
  /%}

{% /tilesContainer %}