refactor: add auth type changes in databricks.md

Commit 2feb7b1ce4 (parent 228f99517c) in the mirror of https://github.com/open-metadata/OpenMetadata.git

In this section, we provide guides and references to use the Databricks connector.
## Requirements

Databricks is a unified analytics platform for big data and AI. To connect to Databricks, you need:

- A Databricks workspace (AWS, Azure, or GCP)
- A SQL Warehouse or an All-Purpose Cluster with a SQL endpoint
- Appropriate authentication credentials (Personal Access Token, OAuth, or Azure AD)

To learn more about the Databricks Connection Details (`hostPort`, `token`, `http_path`), visit these <a href="https://docs.open-metadata.org/connectors/database/databricks/troubleshooting" target="_blank">docs</a>.

$$note
We support Databricks runtime version 9 and above. Ensure your cluster or SQL warehouse is running a compatible version.
$$

### Usage & Lineage

$$note
To get Query Usage and Lineage details, you need:

- A Databricks Premium or higher tier account, since this information is extracted from your SQL Warehouse's history API
- Access to the `system.query.history` table
- Proper permissions to read SQL Warehouse query history via the REST API
$$

You can find further information on the Databricks connector in the <a href="https://docs.open-metadata.org/connectors/database/databricks" target="_blank">docs</a>.
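
To verify these connection details before running ingestion, you can probe the warehouse directly. A minimal sketch, assuming the `databricks-sql-connector` Python package; the hostname, HTTP path, and token are placeholder values from the sections below:

```python
# Minimal connectivity check; substitute your own workspace values.
from databricks import sql  # pip install databricks-sql-connector

connection = sql.connect(
    server_hostname="adb-xyz.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
)
cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())  # (1,) if the compute resource is reachable
cursor.close()
connection.close()
```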

## Connection Details

$$section
### Host Port $(id="hostPort")
This parameter specifies the host and port of the Databricks instance. It should be specified as a string in the format `hostname:port`, for example `adb-xyz.azuredatabricks.net:443` (Azure) or `dbc-xyz.cloud.databricks.com:443` (AWS).

To find this value, navigate to your Databricks workspace, copy the URL from your browser, remove the `https://` prefix and any path, and add `:443` for the port.

$$note
If you are running the OpenMetadata ingestion in Docker and your services are hosted on `localhost`, use `host.docker.internal:443` instead of `localhost:443`.
$$
$$
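
The same steps can be scripted. A small sketch using only the Python standard library; the workspace URL is a placeholder:

```python
from urllib.parse import urlparse

def host_port_from_workspace_url(url: str) -> str:
    """Drop the scheme and any path from a workspace URL, keep host:port."""
    parsed = urlparse(url)
    return f"{parsed.hostname}:{parsed.port or 443}"

# e.g. the URL copied from the browser, including query parameters:
print(host_port_from_workspace_url("https://adb-xyz.azuredatabricks.net/?o=123"))
# -> adb-xyz.azuredatabricks.net:443
```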

$$section
### Authentication Type $(id="authType")
Select the authentication method to connect to your Databricks workspace. Different methods are available depending on your Databricks deployment (AWS, Azure, GCP) and security requirements; the sketch after this section illustrates the fields each method requires.

- **Personal Access Token**: Generated Personal Access Token for Databricks workspace authentication. This is the simplest method, but tokens are user-specific and inherit the user's permissions, so consider Service Principal authentication for production.
- **Databricks OAuth**: OAuth2 Machine-to-Machine authentication using a Service Principal (recommended for production).
- **Azure AD Setup**: Specifically for Azure Databricks workspaces that use Azure Active Directory for identity management. Uses Azure Service Principal authentication through Azure AD.
$$
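
As an illustration only (this is not the exact ingestion schema), the three field sets look roughly like this, using the example values documented in the sections below:

```python
# Illustrative field sets per authType; keys mirror the $(id=...) anchors below.
personal_access_token = {
    "token": "dapi1234567890abcdef",
}

databricks_oauth = {
    "clientId": "12345678-1234-1234-1234-123456789abc",
    "clientSecret": "<service-principal-secret>",
}

azure_ad_setup = {
    "azureClientId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "azureClientSecret": "<azure-app-secret>",
    "azureTenantId": "98765432-dcba-4321-abcd-1234567890ab",
}
```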

$$section
### Token $(id="token")
Personal Access Token (PAT) for authenticating with the Databricks workspace. Generate one under User Settings → Developer → Access Tokens via "Generate New Token", set the token lifetime (90 days max for most workspaces), and copy and store it securely.
(e.g., `dapi1234567890abcdef`)
$$

$$section
### Client ID $(id="clientId")
The Application ID of your Databricks Service Principal for OAuth2 authentication, created in the Databricks Account Console → Service Principals.
(e.g., `12345678-1234-1234-1234-123456789abc`)
$$

$$section
### Client Secret $(id="clientSecret")
OAuth secret for the Databricks Service Principal, generated after creating the Service Principal (navigate to the Service Principal → Generate Secret). Secrets are valid for up to 2 years; store the value securely, as it cannot be retrieved after creation.
$$
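
To sanity-check Service Principal credentials outside OpenMetadata, you can request a token yourself. A sketch, assuming the `requests` package and Databricks' standard OAuth machine-to-machine flow (`/oidc/v1/token` endpoint, `all-apis` scope); the host and credentials are the placeholder values above:

```python
import requests

# Client credentials grant against the workspace token endpoint.
resp = requests.post(
    "https://adb-xyz.azuredatabricks.net/oidc/v1/token",
    auth=("12345678-1234-1234-1234-123456789abc", "<service-principal-secret>"),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
)
resp.raise_for_status()
print(resp.json()["access_token"][:16], "...")  # short-lived bearer token
```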

$$section
### Azure Client ID $(id="azureClientId")
Azure Active Directory Application (client) ID for Azure Databricks authentication, found in Azure Portal → App Registrations → Your App → Overview.
(e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)
$$

$$section
### Azure Client Secret $(id="azureClientSecret")
Secret key for the Azure AD Application, created under Azure Portal → App Registrations → Your App → Certificates & Secrets with an appropriate expiry.
$$

$$section
### Azure Tenant ID $(id="azureTenantId")
Your Azure Active Directory tenant identifier, found in Azure Portal → Azure Active Directory → Overview.
(e.g., `98765432-dcba-4321-abcd-1234567890ab`)
$$
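
The Azure credentials can likewise be verified by acquiring a token for the Azure Databricks resource. A sketch, assuming the `azure-identity` package; `2ff814a6-3304-4ab8-85cb-cd0e6f879c1d` is the well-known Azure Databricks resource ID, and the credential values are the placeholders above:

```python
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="98765432-dcba-4321-abcd-1234567890ab",
    client_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    client_secret="<azure-app-secret>",
)
# Scope is the Azure Databricks resource ID plus /.default
token = credential.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default")
print(token.expires_on)
```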

$$section
### HTTP Path $(id="httpPath")
The HTTP endpoint path of the Databricks compute resource (SQL Warehouse or cluster) that queries are routed to.

**For SQL Warehouses:** the format is `/sql/1.0/warehouses/{warehouse_id}`, e.g. `/sql/1.0/warehouses/abc123def456`. Find it under SQL Warehouses → Your Warehouse → Connection Details → HTTP Path.

**For All-Purpose Clusters:** the format is `/sql/protocolv1/o/{workspace_id}/{cluster_id}`, e.g. `/sql/protocolv1/o/1234567890/0123-456789-abcde12`. Find it under Compute → Your Cluster → Advanced Options → JDBC/ODBC → HTTP Path.

$$note
SQL Warehouses are recommended for metadata extraction as they provide better performance and cost efficiency for SQL workloads.
$$
$$
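
A quick format check for either shape (a sketch using only the standard library; the patterns approximate the examples above):

```python
import re

# Approximate patterns for the two HTTP Path shapes described above.
WAREHOUSE_PATH = re.compile(r"^/sql/1\.0/warehouses/[0-9a-f]+$")
CLUSTER_PATH = re.compile(r"^/sql/protocolv1/o/\d+/[0-9a-zA-Z-]+$")

def looks_like_http_path(path: str) -> bool:
    return bool(WAREHOUSE_PATH.match(path) or CLUSTER_PATH.match(path))

assert looks_like_http_path("/sql/1.0/warehouses/abc123def456")
assert looks_like_http_path("/sql/protocolv1/o/1234567890/0123-456789-abcde12")
```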

$$section
### Catalog $(id="catalog")
Catalog of the data source. This is an optional parameter: set it if you would like to restrict the metadata reading to a single catalog. When left blank, OpenMetadata Ingestion attempts to scan all the catalogs.

Common values:
- `main`: the default catalog in Unity Catalog-enabled workspaces
- `hive_metastore`: the legacy Hive metastore (pre-Unity Catalog)
- custom catalog names created in your workspace

$$note
Unity Catalog requires Databricks Premium or higher. Legacy workspaces only have `hive_metastore`.
$$
$$

$$section
### Database Schema $(id="databaseSchema")
Schema of the data source. This is an optional parameter: set it if you would like to restrict the metadata reading to a single schema. When left blank, OpenMetadata Ingestion attempts to scan all the schemas in the specified catalog(s).

Use the bare schema name (e.g., `default`, `bronze`, `silver`, `gold`), not `catalog.schema`. Restricting the scope helps reduce extraction time for large workspaces, focus on specific business domains, and exclude development or temporary schemas.
$$
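
To preview what an unrestricted scan would cover, you can enumerate catalogs and schemas yourself. A sketch with the same placeholder connection as earlier:

```python
from databricks import sql

connection = sql.connect(
    server_hostname="adb-xyz.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
)
cursor = connection.cursor()
cursor.execute("SHOW CATALOGS")
for (catalog,) in cursor.fetchall():
    cursor.execute(f"SHOW SCHEMAS IN `{catalog}`")
    print(catalog, "->", [row[0] for row in cursor.fetchall()])
connection.close()
```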

$$section
### Query History Table $(id="queryHistoryTable")
System table containing query execution history for usage and lineage extraction. **Default:** `system.query.history`.

This table stores:
- Query text and execution plans
- User and timestamp information
- Resource usage metrics
- Data lineage relationships

$$note
Requires Databricks Premium tier and appropriate permissions on the system catalog. The user must have SELECT permission on this table.
$$
$$
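
To confirm the permissions described in the note, a simple probe works. A sketch reusing the cursor from the earlier connectivity example; the selected columns follow Databricks' documented `system.query.history` schema:

```python
# Assumes `cursor` was opened as in the connectivity sketch above.
cursor.execute(
    "SELECT statement_text, executed_by, start_time "
    "FROM system.query.history "
    "ORDER BY start_time DESC LIMIT 5"
)
for statement_text, executed_by, start_time in cursor.fetchall():
    print(start_time, executed_by, statement_text[:60])
```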

$$section
### Connection Timeout $(id="connectionTimeout")
The maximum amount of time (in seconds) to wait for a successful connection to the data source. If the connection attempt takes longer than this timeout period, an error is returned. **Default:** 120 seconds.

If your connection fails because your cluster has not had enough time to start, try a bigger value: 120-180 seconds is usually enough for SQL Warehouses (typically pre-warmed), while 300-600 seconds suits All-Purpose Clusters, whose cold start can take 5-10 minutes. Timeouts are most common with auto-scaling clusters that start from zero nodes, clusters in auto-pause mode, or high network latency between OpenMetadata and Databricks.
$$
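
If you need to tolerate cluster cold starts outside of the connector's own timeout, a simple retry loop is one option. A sketch; the helper name and retry policy are illustrative:

```python
import time
from databricks import sql

def connect_with_retry(host: str, http_path: str, token: str,
                       timeout_s: int = 300):
    """Keep retrying until the compute resource is up or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return sql.connect(server_hostname=host,
                               http_path=http_path,
                               access_token=token)
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(15)  # cluster cold starts can take 5-10 minutes
```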