From 2feb7b1ce4f44b50ad25a3502eab1cd5a6626d0d Mon Sep 17 00:00:00 2001
From: Keshav Mohta
Date: Fri, 19 Sep 2025 15:38:42 +0530
Subject: [PATCH] refactor: add auth type changes in databricks.md

---
 .../locales/en-US/Database/Databricks.md      | 161 +++++-------------
 1 file changed, 41 insertions(+), 120 deletions(-)

diff --git a/openmetadata-ui/src/main/resources/ui/public/locales/en-US/Database/Databricks.md b/openmetadata-ui/src/main/resources/ui/public/locales/en-US/Database/Databricks.md
index 5080491bd63..40f46e88ef0 100644
--- a/openmetadata-ui/src/main/resources/ui/public/locales/en-US/Database/Databricks.md
+++ b/openmetadata-ui/src/main/resources/ui/public/locales/en-US/Database/Databricks.md
@@ -4,22 +4,16 @@ In this section, we provide guides and references to use the Databricks connecto
 
 ## Requirements
 
-Databricks is a unified analytics platform for big data and AI. To connect to Databricks, you'll need:
-- A Databricks workspace (AWS, Azure, or GCP)
-- SQL Warehouse or All-Purpose Cluster with SQL endpoint
-- Appropriate authentication credentials (Personal Access Token, OAuth, or Azure AD)
+To learn more about the Databricks Connection Details (`hostPort`, `token`, `http_path`), visit these docs.
 
 $$note
-We support Databricks runtime version 9 and above. Ensure your cluster or SQL warehouse is running a compatible version.
+We support Databricks runtime version 9 and above.
 $$
 
 ### Usage & Lineage
 
 $$note
-To extract Query Usage and Lineage details, you need:
-- Databricks Premium or higher tier account
-- Access to system.query.history table
-- Proper permissions to read SQL Warehouse query history via REST API
+To get Query Usage and Lineage details, you need a Databricks Premium account, since we will be extracting this information from your SQL Warehouse's history API.
 $$
 
 You can find further information on the Databricks connector in the docs.
@@ -33,149 +27,76 @@ $$
 
 $$section
 ### Host Port $(id="hostPort")
-The hostname and port of your Databricks workspace server. This is the URL where your Databricks workspace is hosted, combined with the HTTPS port (typically 443).
+This parameter specifies the host and port of the Databricks instance. This should be specified as a string in the format `hostname:port`. For example, you might set the `hostPort` parameter to `adb-xyz.azuredatabricks.net:443`.
 
-**Format:** `hostname:port`
-**Example:** `adb-xyz.azuredatabricks.net:443` (Azure), `dbc-xyz.cloud.databricks.com:443` (AWS)
-
-To find this:
-1. Navigate to your Databricks workspace
-2. Copy the URL from your browser
-3. Remove the `https://` prefix and any path
-4. Add `:443` for the port
-
-$$note
-If running OpenMetadata ingestion in Docker with Databricks on localhost, use `host.docker.internal:443` instead of `localhost:443`.
-$$
+If you are running the OpenMetadata ingestion in Docker and your services are hosted on `localhost`, then use `host.docker.internal:3000` as the value.
 $$
 
 $$section
 ### Authentication Type $(id="authType")
-Select the authentication method to connect to your Databricks workspace. Different methods are available depending on your Databricks deployment (AWS, Azure, GCP) and security requirements.
+Select the authentication method to connect to your Databricks workspace.
 
-#### Personal Access Token $(id="token")
-The simplest authentication method using a Databricks-generated token.
+- **Personal Access Token**: A Personal Access Token generated from your Databricks workspace.
 
-**Token**: Personal Access Token (PAT) generated from your Databricks workspace.
-- Navigate to User Settings → Developer → Access Tokens
-- Click "Generate New Token"
-- Set token lifetime (90 days max for most workspaces)
-- Copy and securely store the token
-- Example format: `dapi1234567890abcdef`
+- **Databricks OAuth**: OAuth2 Machine-to-Machine authentication using a Service Principal.
 
-$$note
-Personal Access Tokens are user-specific and inherit the user's permissions. For production, consider using Service Principal authentication.
+- **Azure AD Setup**: For Azure Databricks workspaces that use Azure Active Directory for identity management. Authentication uses an Azure Service Principal through Azure AD.
 $$
 
-#### Databricks OAuth $(id="clientId,clientSecret")
-OAuth2 Machine-to-Machine authentication using Service Principal (recommended for production).
+$$section
+### Token $(id="token")
+Personal Access Token (PAT) for authenticating with your Databricks workspace.
+(e.g., `dapi1234567890abcdef`)
+$$
 
-**Client ID** $(id="clientId"): The Application ID of your Service Principal
-- Created in Databricks Account Console → Service Principals
-- Format: UUID (e.g., `12345678-1234-1234-1234-123456789abc`)
+$$section
+### Client ID $(id="clientId")
+The Application ID of your Databricks Service Principal for OAuth2 authentication.
+(e.g., `12345678-1234-1234-1234-123456789abc`)
+$$
 
-**Client Secret** $(id="clientSecret"): OAuth secret for the Service Principal
-- Generated after creating the Service Principal
-- Navigate to the Service Principal → Generate Secret
-- Valid for up to 2 years
-- Store securely - cannot be retrieved after creation
+$$section
+### Client Secret $(id="clientSecret")
+OAuth secret for the Databricks Service Principal.
+$$
 
-#### Azure AD Setup $(id="azureClientId,azureClientSecret,azureTenantId")
-For Azure Databricks workspaces using Azure Active Directory authentication.
+$$section
+### Azure Client ID $(id="azureClientId")
+Azure Active Directory Application (client) ID for Azure Databricks authentication.
+(e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)
+$$
 
-**Azure Client ID** $(id="azureClientId"): Azure AD Application (client) ID
-- Found in Azure Portal → App Registrations → Your App → Overview
-- Format: UUID
+$$section
+### Azure Client Secret $(id="azureClientSecret")
+Secret key for the Azure AD Application.
+$$
 
-**Azure Client Secret** $(id="azureClientSecret"): Secret created for the Azure AD App
-- Azure Portal → App Registrations → Your App → Certificates & Secrets
-- Create new client secret with appropriate expiry
-
-**Azure Tenant ID** $(id="azureTenantId"): Your Azure AD tenant identifier
-- Azure Portal → Azure Active Directory → Overview
-- Format: UUID
+$$section
+### Azure Tenant ID $(id="azureTenantId")
+Your Azure Active Directory tenant identifier.
+(e.g., `98765432-dcba-4321-abcd-1234567890ab`)
 $$
 
 $$section
 ### HTTP Path $(id="httpPath")
-The HTTP endpoint path for your Databricks compute resource (SQL Warehouse or Cluster). This path routes queries to the appropriate compute engine.
-
-**For SQL Warehouses:**
-- Format: `/sql/1.0/warehouses/{warehouse_id}`
-- Find in: SQL Warehouses → Your Warehouse → Connection Details → HTTP Path
-- Example: `/sql/1.0/warehouses/abc123def456`
-
-**For All-Purpose Clusters:**
-- Format: `/sql/protocolv1/o/{workspace_id}/{cluster_id}`
-- Find in: Compute → Your Cluster → Advanced Options → JDBC/ODBC → HTTP Path
-- Example: `/sql/protocolv1/o/1234567890/0123-456789-abcde12`
-
-$$note
-SQL Warehouses are recommended for metadata extraction as they provide better performance and cost efficiency for SQL workloads.
-$$
+The HTTP path of your Databricks compute resource (SQL Warehouse or cluster). E.g., `/sql/1.0/warehouses/xyz123`.
 $$
 
 $$section
 ### Catalog $(id="catalog")
-Unity Catalog name to restrict metadata extraction scope. Unity Catalog is Databricks' data governance solution that organizes data assets.
-
-**Optional field** - Leave blank to scan all accessible catalogs.
-
-Common values:
-- `main` - Default catalog in Unity Catalog-enabled workspaces
-- `hive_metastore` - Legacy Hive metastore (pre-Unity Catalog)
-- Custom catalog names created in your workspace
-
-$$note
-Unity Catalog requires Databricks Premium or higher. Legacy workspaces only have `hive_metastore`.
-$$
+Catalog of the data source. This is an optional parameter; use it if you would like to restrict the metadata reading to a single catalog. When left blank, OpenMetadata Ingestion attempts to scan all the catalogs. E.g., `hive_metastore`
 $$
 
 $$section
 ### Database Schema $(id="databaseSchema")
-Specific schema (database) within a catalog to limit metadata extraction.
-
-**Optional field** - Leave blank to scan all schemas in the specified catalog(s).
-
-Format: `schema_name` (not `catalog.schema`)
-Example: `default`, `bronze`, `silver`, `gold`
-
-Use this to:
-- Reduce extraction time for large workspaces
-- Focus on specific business domains
-- Exclude development or temporary schemas
-$$
-
-$$section
-### Query History Table $(id="queryHistoryTable")
-System table containing query execution history for usage and lineage extraction.
-
-**Default:** `system.query.history`
-
-This table stores:
-- Query text and execution plans
-- User and timestamp information
-- Resource usage metrics
-- Data lineage relationships
-
-$$note
-Requires Databricks Premium tier and appropriate permissions on the system catalog. The user must have SELECT permission on this table.
-$$
+Schema of the data source. This is an optional parameter; use it if you would like to restrict the metadata reading to a single schema. When left blank, OpenMetadata Ingestion attempts to scan all the schemas.
 $$
 
 $$section
 ### Connection Timeout $(id="connectionTimeout")
-Maximum seconds to wait for Databricks cluster startup and connection establishment.
+The maximum amount of time (in seconds) to wait for a successful connection to the data source. If the connection attempt takes longer than this timeout period, an error will be returned.
 
-**Default:** 120 seconds
-**Recommended:**
-- 120-180 for SQL Warehouses (usually pre-warmed)
-- 300-600 for All-Purpose Clusters (cold start can take 5-10 minutes)
-
-Increase this value if you see timeout errors, especially when:
-- Using auto-scaling clusters that start from zero nodes
-- Connecting to clusters in auto-pause mode
-- Network latency is high between OpenMetadata and Databricks
+If your connection fails because your cluster has not had enough time to start, you can try increasing this value.
 $$
 
 $$section
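
For reviewers who want to sanity-check the connection details this page now documents (`hostPort`, `httpPath`, and the personal access `token`) before running an ingestion workflow, a small standalone script against the Databricks SQL connector can help. This is a minimal sketch, assuming the `databricks-sql-connector` Python package is installed; the hostname, warehouse path, and token below are placeholders, not values from this patch.

```python
# Quick connectivity check for the documented parameters (hostPort, httpPath, token).
# Assumes: pip install databricks-sql-connector; all values below are placeholders.
from databricks import sql

SERVER_HOSTNAME = "adb-xyz.azuredatabricks.net"  # hostPort without the ":443" suffix
HTTP_PATH = "/sql/1.0/warehouses/xyz123"         # httpPath of the SQL Warehouse or cluster
ACCESS_TOKEN = "dapi..."                         # Personal Access Token (placeholder)

with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    access_token=ACCESS_TOKEN,
) as connection:
    with connection.cursor() as cursor:
        # The `catalog` and `databaseSchema` connector fields restrict ingestion
        # to a subset of what this query returns.
        cursor.execute("SHOW CATALOGS")
        for row in cursor.fetchall():
            print(row)
```

If the script connects and lists catalogs, the same `hostPort`, `httpPath`, and `token` values should work in the connector form described above.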