---
title: Run the Databricks Connector Externally
slug: /connectors/database/databricks/yaml
---

{% connectorDetailsHeader name="Databricks" stage="PROD" platform="OpenMetadata" availableFeatures=["Metadata", "Query Usage", "Lineage", "Column-level Lineage", "Data Profiler", "Data Quality", "dbt", "Tags", "Sample Data", "Reverse Metadata (Collate Only)"] unavailableFeatures=["Owners", "Stored Procedures"] / %}

{% note %} As per the documentation here, we only support metadata tag extraction for Databricks version 13.3 and higher. {% /note %}

In this section, we provide guides and references to use the Databricks connector.

Configure and schedule Databricks metadata and profiler workflows from the OpenMetadata UI:

Requirements

Python Requirements

{% partial file="/v1.7/connectors/python-requirements.md" /%}

To run the Databricks ingestion, you will need to install:

pip3 install "openmetadata-ingestion[databricks]"
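
Before writing the workflow YAML, it can help to confirm that the package is visible in the Python environment that will run the ingestion. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of OpenMetadata.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "openmetadata-ingestion" is the distribution name used in the pip command above.
    version = installed_version("openmetadata-ingestion")
    if version is None:
        print("openmetadata-ingestion is not installed in this environment")
    else:
        print(f"openmetadata-ingestion {version} is installed")
```

If the check reports that the package is missing, the usual cause is that `pip3` and the interpreter running the script belong to different environments.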

Metadata Ingestion

All connectors are defined as JSON Schemas. Here you can find the structure to create a connection to Databricks.

To create and run a Metadata Ingestion workflow, we will follow the steps to create a YAML configuration that connects to the source, processes the Entities if needed, and reaches the OpenMetadata server.

The workflow is modeled around the following JSON Schema.

1. Define the YAML Config

This is a sample config for Databricks:

{% codePreview %}

{% codeInfoContainer %}

Source Configuration - Service Connection

{% codeInfo srNumber=1 %}

catalog: Catalog of the data source (example: hive_metastore). This is an optional parameter; set it if you would like to restrict metadata reading to a single catalog. When left blank, OpenMetadata Ingestion attempts to scan all catalogs.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

databaseSchema: Database schema of the data source. This is an optional parameter; set it if you would like to restrict metadata reading to a single database schema. When left blank, OpenMetadata Ingestion attempts to scan all database schemas.

{% /codeInfo %}

{% codeInfo srNumber=3 %}

hostPort: Enter the fully qualified hostname and port number for your Databricks deployment in the Host and Port field.

{% /codeInfo %}

{% codeInfo srNumber=4 %}

token: Generated token used to connect to Databricks.

{% /codeInfo %}

{% codeInfo srNumber=5 %}

httpPath: Databricks compute resources URL.

{% /codeInfo %}

{% codeInfo srNumber=6 %}

connectionTimeout: The maximum amount of time (in seconds) to wait for a successful connection to the data source. If the connection attempt takes longer than this timeout period, an error will be returned.

{% /codeInfo %}

{% partial file="/v1.7/connectors/yaml/database/source-config-def.md" /%}

{% partial file="/v1.7/connectors/yaml/ingestion-sink-def.md" /%}

{% partial file="/v1.7/connectors/yaml/workflow-config-def.md" /%}

Advanced Configuration

{% codeInfo srNumber=7 %}

Connection Options (Optional): Enter the details for any additional connection options that can be sent to the database during the connection. These details must be added as Key-Value pairs.

{% /codeInfo %}

{% codeInfo srNumber=8 %}

Connection Arguments (Optional): Enter the details for any additional connection arguments, such as security or protocol configs, that can be sent to the database during the connection. These details must be added as Key-Value pairs.

  • If you are using Single Sign-On (SSO) for authentication, add the authenticator details to the Connection Arguments as a Key-Value pair, as follows: "authenticator" : "sso_login_url"

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

source:
  type: databricks
  serviceName: local_databricks
  serviceConnection:
    config:
      type: Databricks
      catalog: hive_metastore
      databaseSchema: default
      token: <databricks token>
      hostPort: <databricks connection host & port>
      httpPath: <http path of databricks cluster>
      connectionTimeout: 120
      # connectionOptions:
      #   key: value
      # connectionArguments:
      #   key: value

{% partial file="/v1.7/connectors/yaml/database/source-config.md" /%}

{% partial file="/v1.7/connectors/yaml/ingestion-sink.md" /%}

{% partial file="/v1.7/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}
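
When the config is generated from a script rather than written by hand, a quick sanity check before running the workflow can catch simple mistakes. The following is a minimal sketch, not OpenMetadata's own validation: the helper names `build_source` and `find_problems`, the set of keys treated as mandatory, and the example host and HTTP path are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List, Optional

# Keys this sketch insists on. This is an assumption for illustration,
# not a statement of what the connector's JSON Schema requires.
REQUIRED_KEYS = ("hostPort", "token", "httpPath")


def build_source(
    host_port: str,
    token: str,
    http_path: str,
    catalog: Optional[str] = None,
    database_schema: Optional[str] = None,
    connection_timeout: int = 120,
) -> Dict[str, Any]:
    """Build the `source` section with the same field names as the sample YAML."""
    config: Dict[str, Any] = {
        "type": "Databricks",
        "hostPort": host_port,
        "token": token,
        "httpPath": http_path,
        "connectionTimeout": connection_timeout,
    }
    # Optional filters: leave them out to scan every catalog / schema.
    if catalog:
        config["catalog"] = catalog
    if database_schema:
        config["databaseSchema"] = database_schema
    return {
        "type": "databricks",
        "serviceName": "local_databricks",
        "serviceConnection": {"config": config},
    }


def find_problems(source: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in the connection config."""
    config = source["serviceConnection"]["config"]
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not config.get(key)]
    host_port = str(config.get("hostPort") or "")
    host, sep, port = host_port.rpartition(":")
    if host_port and not (sep and host and port.isdigit()):
        problems.append("hostPort should look like <host>:<port>")
    timeout = config.get("connectionTimeout")
    if not isinstance(timeout, int) or timeout <= 0:
        problems.append("connectionTimeout should be a positive number of seconds")
    return problems


if __name__ == "__main__":
    # Hypothetical values; replace them with your own deployment details.
    source = build_source(
        "example.cloud.databricks.com:443",
        "dummy-token",
        "/sql/1.0/warehouses/abc123",
        catalog="hive_metastore",
    )
    print(json.dumps({"source": source}, indent=2))
    print(find_problems(source) or "no problems found")
```

The check only looks at shape (presence of keys, a `host:port` pattern, a positive timeout); it does not contact Databricks, so a config that passes can still fail to connect.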

{% partial file="/v1.7/connectors/yaml/ingestion-cli.md" /%}

{% partial file="/v1.7/connectors/yaml/query-usage.md" variables={connector: "databricks"} /%}

{% partial file="/v1.7/connectors/yaml/lineage.md" variables={connector: "databricks"} /%}

{% partial file="/v1.7/connectors/yaml/data-profiler.md" variables={connector: "databricks"} /%}

{% partial file="/v1.7/connectors/yaml/auto-classification.md" variables={connector: "databricks"} /%}

{% partial file="/v1.7/connectors/yaml/data-quality.md" /%}

dbt Integration

You can learn more about how to ingest dbt models' definitions and their lineage here.