---
title: Run the Databricks Connector Externally
slug: /connectors/database/databricks/yaml
---

{% connectorDetailsHeader name="Databricks" stage="PROD" platform="OpenMetadata" availableFeatures=["Metadata", "Query Usage", "Lineage", "Column-level Lineage", "Data Profiler", "Data Quality", "dbt", "Tags"] unavailableFeatures=["Owners", "Stored Procedures"] / %}

{% note %} As per the documentation here, note that we only support metadata tag extraction for Databricks version 13.3 and higher. {% /note %}

In this section, we provide guides and references to use the Databricks connector.

Configure and schedule Databricks metadata and profiler workflows from the CLI:

{% partial file="/v1.5/connectors/external-ingestion-deployment.md" /%}

## Requirements

### Python Requirements

{% partial file="/v1.5/connectors/python-requirements.md" /%}

To run the Databricks ingestion, you will need to install:

```bash
pip3 install "openmetadata-ingestion[databricks]"
```

## Metadata Ingestion

All connectors are defined as JSON Schemas. Here you can find the structure to create a connection to Databricks.

In order to create and run a Metadata Ingestion workflow, we will follow the steps to create a YAML configuration that connects to the source, processes the Entities if needed, and reaches the OpenMetadata server.

The workflow is modeled around the following JSON Schema.

### 1. Define the YAML Config

This is a sample config for Databricks:

{% codePreview %}

{% codeInfoContainer %}

#### Source Configuration - Service Connection

{% codeInfo srNumber=1 %}

**catalog**: Catalog of the data source (e.g., `hive_metastore`). This is an optional parameter; set it if you would like to restrict the metadata reading to a single catalog. When left blank, OpenMetadata Ingestion attempts to scan all the catalogs.

{% /codeInfo %}

{% codeInfo srNumber=2 %}

**databaseSchema**: databaseSchema of the data source. This is an optional parameter; set it if you would like to restrict the metadata reading to a single databaseSchema. When left blank, OpenMetadata Ingestion attempts to scan all the databaseSchemas.

{% /codeInfo %}

{% codeInfo srNumber=3 %}

**hostPort**: Enter the fully qualified hostname and port number for your Databricks deployment in the Host and Port field.
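
For example, a typical value looks like the following (the workspace hostname below is illustrative; use your own deployment's URL):

```yaml
      hostPort: adb-1234567890123456.7.azuredatabricks.net:443
```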

{% /codeInfo %}

{% codeInfo srNumber=4 %}

**token**: Generated Token to connect to Databricks.

{% /codeInfo %}

{% codeInfo srNumber=5 %}

**httpPath**: HTTP path of the Databricks compute resource (cluster or SQL warehouse) to connect to.
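
For example, for a SQL warehouse the value typically looks like the following (the warehouse ID is illustrative):

```yaml
      httpPath: /sql/1.0/warehouses/abc1234567890def
```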

{% /codeInfo %}

{% codeInfo srNumber=6 %}

**connectionTimeout**: The maximum amount of time (in seconds) to wait for a successful connection to the data source. If the connection attempt takes longer than this timeout period, an error will be returned.

{% /codeInfo %}

{% partial file="/v1.5/connectors/yaml/database/source-config-def.md" /%}

{% partial file="/v1.5/connectors/yaml/ingestion-sink-def.md" /%}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

#### Advanced Configuration

{% codeInfo srNumber=7 %}

**Connection Options (Optional)**: Enter the details for any additional connection options that can be sent to the database during the connection. These details must be added as Key-Value pairs.

{% /codeInfo %}

{% codeInfo srNumber=8 %}

**Connection Arguments (Optional)**: Enter the details for any additional connection arguments such as security or protocol configs that can be sent to the database during the connection. These details must be added as Key-Value pairs.

- In case you are using Single-Sign-On (SSO) for authentication, add the authenticator details in the Connection Arguments as a Key-Value pair as follows: `"authenticator" : "sso_login_url"`, as shown in the sketch below.
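
In YAML, that Key-Value pair sits under `connectionArguments` (the value is a placeholder; use your own SSO login URL):

```yaml
      connectionArguments:
        authenticator: <sso_login_url>
```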

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

```yaml
source:
  type: databricks
  serviceName: local_databricks
  serviceConnection:
    config:
      type: Databricks
      catalog: hive_metastore
      databaseSchema: default
      token: <databricks token>
      hostPort: <databricks connection host & port>
      httpPath: <http path of databricks cluster>
      connectionTimeout: 120
      # connectionOptions:
      #   key: value
      # connectionArguments:
      #   key: value
```

{% partial file="/v1.5/connectors/yaml/database/source-config.md" /%}

{% partial file="/v1.5/connectors/yaml/ingestion-sink.md" /%}

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}

{% partial file="/v1.5/connectors/yaml/ingestion-cli.md" /%}

{% partial file="/v1.5/connectors/yaml/query-usage.md" variables={connector: "databricks"} /%}

{% partial file="/v1.5/connectors/yaml/lineage.md" variables={connector: "databricks"} /%}

{% partial file="/v1.5/connectors/yaml/data-profiler.md" variables={connector: "databricks"} /%}

{% partial file="/v1.5/connectors/yaml/data-quality.md" /%}

## dbt Integration

You can learn more about how to ingest dbt models' definitions and their lineage here.
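
As a starting point, here is a minimal sketch of a dbt ingestion workflow, assuming locally available `manifest.json` and `catalog.json` artifacts (the file paths and JWT token are illustrative, and exact fields may vary by OpenMetadata version):

```yaml
source:
  type: dbt
  serviceName: local_databricks  # should match the database service name used above
  sourceConfig:
    config:
      type: DBT
      dbtConfigSource:
        dbtConfigType: local
        # Paths to the artifacts produced by your dbt runs
        dbtManifestFilePath: ./dbt/manifest.json
        dbtCatalogFilePath: ./dbt/catalog.json
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <jwt token>
```

It can then be run with the same CLI used for the metadata workflow, e.g. `metadata ingest -c dbt.yaml`.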