feat(ingestion): Added Databricks support to Fivetran source (#14897)
parent a043d15193
commit 40b51ac2da
@@ -9,9 +9,10 @@ This source extracts the following:
 
 ## Configuration Notes
 
-1. Fivetran supports the fivetran platform connector to dump the log events and connectors, destinations, users and roles metadata in your destination.
+1. Fivetran supports the [fivetran platform connector](https://fivetran.com/docs/logs/fivetran-platform) to dump the log events and connector, destination, user, and role metadata into your destination.
 2. You need to set up and start the initial sync of the fivetran platform connector before using this source. Refer to the [setup guide](https://fivetran.com/docs/logs/fivetran-platform/setup-guide).
 3. Once the initial sync of your fivetran platform connector is done, provide the platform connector's destination platform and its configuration in the recipe (see the recipe sketch below).
+4. We expect users to keep automatic schema updates enabled (the default) in the fivetran platform connector configured for DataHub; this ensures the latest schema changes are applied and avoids inconsistent data syncs.
 
 ## Concept mapping
 
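For step 3, a minimal recipe sketch showing where the destination platform and its configuration go. This is illustrative only: the field names follow the Snowflake destination config used by this source, and all values are placeholders.

```yaml
source:
  type: fivetran
  config:
    fivetran_log_config:
      destination_platform: snowflake  # or "bigquery" / "databricks"
      snowflake_destination_config:
        account_id: "my_account_id"
        warehouse: "my_warehouse"
        username: "my_user"
        password: "my_password"
        role: "my_role"
        database: "my_database"
        log_schema: "fivetran_log"
```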
@@ -30,6 +31,7 @@ Works only for
 
 - Snowflake destination
 - Bigquery destination
+- Databricks destination
 
 ## Snowflake destination Configuration Guide
 

@@ -58,6 +60,22 @@ grant role fivetran_datahub to user snowflake_user;
 1. If your fivetran platform connector destination is bigquery, you need to set up a ServiceAccount as per the [BigQuery docs](https://cloud.google.com/iam/docs/creating-managing-service-accounts#iam-service-accounts-create-console) and grant it the BigQuery Data Viewer and BigQuery Job User IAM roles.
 2. Create and download a service account JSON keyfile, and provide the bigquery connection credential in the bigquery destination config.
 
+## Databricks destination Configuration Guide
+
+1. Get your Databricks instance's [workspace url](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids).
+2. Create a [Databricks Service Principal](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#what-is-a-service-principal).
+   1. You can skip this step and use your own account to get things running quickly, but we strongly recommend creating a dedicated service principal for production use.
+3. Generate a Databricks Personal Access Token by following one of these guides:
+   1. [Service Principals](https://docs.databricks.com/administration-guide/users-groups/service-principals.html#personal-access-tokens)
+   2. [Personal Access Tokens](https://docs.databricks.com/dev-tools/auth.html#databricks-personal-access-tokens)
+4. Provision your service principal: to ingest your workspace's metadata and lineage, it must have all of the following:
+   1. One of: metastore admin role, ownership of, or `USE CATALOG` privilege on any catalogs you want to ingest
+   2. One of: metastore admin role, ownership of, or `USE SCHEMA` privilege on any schemas you want to ingest
+   3. Ownership of or `SELECT` privilege on any tables and views you want to ingest
+   4. [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
+   5. [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
+5. Check the starter recipe below and replace `workspace_url` and `token` with the values from the previous steps (a minimal sketch follows this guide as well).
 
 ## Advanced Configurations
 
 ### Working with Platform Instances
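As a quick reference for step 5, here is a sketch of the `fivetran_log_config` block for a Databricks destination. It is based on the `DatabricksDestinationConfig` fields introduced in this commit; the values are placeholders, and the starter recipe change further below is the authoritative layout.

```yaml
fivetran_log_config:
  destination_platform: databricks
  databricks_destination_config:
    # Credentials
    token: "<databricks-personal-access-token>"
    workspace_url: "https://my-workspace.cloud.databricks.com"
    warehouse_id: "<sql-warehouse-id>"
    # Coordinates
    catalog: "fivetran_catalog"
    log_schema: "fivetran_log"
```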
@@ -26,6 +26,17 @@ source:
           client_id: "client_id"
           private_key: "private_key"
         dataset: "fivetran_log_dataset"
+      # Optional - If destination platform is 'databricks', provide databricks configuration.
+      databricks_destination_config:
+        # Credentials
+        credential:
+          token: "token"
+          workspace_url: "workspace_url"
+          warehouse_id: "warehouse_id"
+
+        # Coordinates
+        catalog: "fivetran_catalog"
+        log_schema: "fivetran_log"
 
     # Optional - filter for certain connector names instead of ingesting everything.
     # connector_patterns:

@@ -365,6 +365,10 @@ slack = {
     "tenacity>=8.0.1",
 }
 
+databricks_common = {
+    "databricks-sqlalchemy~=1.0",  # Note: This is pinned to 1.0 for compatibility with SQLAlchemy 1.x which is default for fivetran
+}
+
 databricks = {
     # 0.1.11 appears to have authentication issues with azure databricks
     # 0.22.0 has support for `include_browse` in metadata list apis

@@ -466,7 +470,14 @@ plugins: Dict[str, Set[str]] = {
     # https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/release-notes.html#rn-7-14-0
     # https://github.com/elastic/elasticsearch-py/issues/1639#issuecomment-883587433
     "elasticsearch": {"elasticsearch==7.13.4", *cachetools_lib},
-    "excel": {"openpyxl>=3.1.5", "pandas", *aws_common, *abs_base, *cachetools_lib, *data_lake_profiling},
+    "excel": {
+        "openpyxl>=3.1.5",
+        "pandas",
+        *aws_common,
+        *abs_base,
+        *cachetools_lib,
+        *data_lake_profiling,
+    },
     "cassandra": {
         "cassandra-driver>=3.28.0",
         # We were seeing an error like this `numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject`

@@ -582,7 +593,11 @@ plugins: Dict[str, Set[str]] = {
     "unity-catalog": databricks | sql_common,
     # databricks is alias for unity-catalog and needs to be kept in sync
     "databricks": databricks | sql_common,
-    "fivetran": snowflake_common | bigquery_common | sqlalchemy_lib | sqlglot_lib,
+    "fivetran": snowflake_common
+    | bigquery_common
+    | databricks_common
+    | sqlalchemy_lib
+    | sqlglot_lib,
     "snaplogic": set(),
     "qlik-sense": sqlglot_lib | {"requests", "websocket-client"},
     "sigma": sqlglot_lib | {"requests"},
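With `databricks_common` folded into the `fivetran` extra above, installing the plugin also pulls in the pinned `databricks-sqlalchemy` dependency. Assuming the usual acryl-datahub extras install pattern, that is simply:

```bash
pip install 'acryl-datahub[fivetran]'
```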
@@ -737,7 +752,7 @@ base_dev_requirements = {
             "cassandra",
             "neo4j",
             "vertexai",
-            "mssql-odbc"
+            "mssql-odbc",
         ]
         if plugin
         for dependency in plugins[plugin]
@@ -29,6 +29,9 @@ from datahub.ingestion.source.state.stale_entity_removal_handler import (
 from datahub.ingestion.source.state.stateful_ingestion_base import (
     StatefulIngestionConfigBase,
 )
+from datahub.ingestion.source.unity.config import (
+    UnityCatalogConnectionConfig,
+)
 from datahub.utilities.lossy_collections import LossyList
 from datahub.utilities.perf_timer import PerfTimer
 

@@ -56,8 +59,8 @@ class Constant:
     STATUS = "status"
     USER_ID = "user_id"
     EMAIL = "email"
-    CONNECTOR_ID = "connector_id"
-    CONNECTOR_NAME = "connector_name"
+    CONNECTOR_ID = "connection_id"
+    CONNECTOR_NAME = "connection_name"
     CONNECTOR_TYPE_ID = "connector_type_id"
     PAUSED = "paused"
     SYNC_FREQUENCY = "sync_frequency"

@@ -85,11 +88,24 @@ class BigQueryDestinationConfig(BigQueryConnectionConfig):
     dataset: str = Field(description="The fivetran connector log dataset.")
 
 
+class DatabricksDestinationConfig(UnityCatalogConnectionConfig):
+    catalog: str = Field(description="The fivetran connector log catalog.")
+    log_schema: str = Field(description="The fivetran connector log schema.")
+
+    @pydantic.validator("warehouse_id")
+    def warehouse_id_should_not_be_empty(cls, warehouse_id: Optional[str]) -> str:
+        if warehouse_id is None or (warehouse_id and warehouse_id.strip() == ""):
+            raise ValueError("Fivetran requires warehouse_id to be set")
+        return warehouse_id
+
+
 class FivetranLogConfig(ConfigModel):
-    destination_platform: Literal["snowflake", "bigquery"] = pydantic.Field(
+    destination_platform: Literal["snowflake", "bigquery", "databricks"] = (
+        pydantic.Field(
             default="snowflake",
             description="The destination platform where fivetran connector log tables are dumped.",
         )
+    )
     snowflake_destination_config: Optional[SnowflakeDestinationConfig] = pydantic.Field(
         default=None,
         description="If destination platform is 'snowflake', provide snowflake configuration.",

@@ -98,6 +114,12 @@ class FivetranLogConfig(ConfigModel):
         default=None,
         description="If destination platform is 'bigquery', provide bigquery configuration.",
     )
+    databricks_destination_config: Optional[DatabricksDestinationConfig] = (
+        pydantic.Field(
+            default=None,
+            description="If destination platform is 'databricks', provide databricks configuration.",
+        )
+    )
     _rename_destination_config = pydantic_renamed_field(
         "destination_config", "snowflake_destination_config"
     )

@@ -115,6 +137,11 @@ class FivetranLogConfig(ConfigModel):
                 raise ValueError(
                     "If destination platform is 'bigquery', user must provide bigquery destination configuration in the recipe."
                 )
+        elif destination_platform == "databricks":
+            if "databricks_destination_config" not in values:
+                raise ValueError(
+                    "If destination platform is 'databricks', user must provide databricks destination configuration in the recipe."
+                )
         else:
             raise ValueError(
                 f"Destination platform '{destination_platform}' is not yet supported."
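To illustrate how the new `DatabricksDestinationConfig` and the validation branch above fit together, here is a small sketch. The import path follows the usual DataHub fivetran module layout and is an assumption, and all values are placeholders.

```python
# Sketch only: parse a fivetran_log_config block that targets Databricks.
# Module path and values are assumptions/placeholders, not taken from this commit.
from datahub.ingestion.source.fivetran.config import FivetranLogConfig

log_config = FivetranLogConfig.parse_obj(
    {
        "destination_platform": "databricks",
        "databricks_destination_config": {
            "workspace_url": "https://my-workspace.cloud.databricks.com",
            "token": "<personal-access-token>",
            "warehouse_id": "<sql-warehouse-id>",
            "catalog": "fivetran_catalog",
            "log_schema": "fivetran_log",
        },
    }
)
# Omitting databricks_destination_config, or supplying an empty warehouse_id,
# triggers the ValueError paths added in this commit.
```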
@@ -66,7 +66,6 @@ logger = logging.getLogger(__name__)
 class FivetranSource(StatefulIngestionSourceBase):
     """
     This plugin extracts fivetran users, connectors, destinations and sync history.
-    This plugin is in beta and has only been tested on Snowflake connector.
     """
 
     config: FivetranSourceConfig

@@ -73,6 +73,19 @@ class FivetranLogAPI:
             if result is None:
                 raise ValueError("Failed to retrieve BigQuery project ID")
             fivetran_log_database = result[0]
+        elif destination_platform == "databricks":
+            databricks_destination_config = (
+                self.fivetran_log_config.databricks_destination_config
+            )
+            if databricks_destination_config is not None:
+                engine = create_engine(
+                    databricks_destination_config.get_sql_alchemy_url(
+                        databricks_destination_config.catalog
+                    ),
+                    **databricks_destination_config.get_options(),
+                )
+                fivetran_log_query.set_schema(databricks_destination_config.log_schema)
+                fivetran_log_database = databricks_destination_config.catalog
         else:
             raise ConfigurationError(
                 f"Destination platform '{destination_platform}' is not yet supported."

@@ -6,6 +6,21 @@ MAX_COLUMN_LINEAGE_PER_CONNECTOR = 1000
 MAX_JOBS_PER_CONNECTOR = 500
 
 
+"""
+------------------------------------------------------------------------------------------------------------
+Fivetran Platform Connector Handling
+------------------------------------------------------------------------------------------------------------
+Current Query Change Log: August 2025 (See: https://fivetran.com/docs/changelog/2025/august-2025)
+
+All queries must be updated whenever a Fivetran Platform Connector release changes the schema. We expect
+customers to keep the platform connector configured for DataHub with auto sync enabled to pick up the latest changes.
+
+References:
+- Fivetran Release Notes: https://fivetran.com/docs/changelog (Look for "Fivetran Platform Connector")
+- Latest Platform Connector Schema: https://fivetran.com/docs/logs/fivetran-platform?erdModal=open
+"""
+
+
 class FivetranLogQuery:
     # Note: All queries are written in Snowflake SQL.
     # They will be transpiled to the target database's SQL dialect at runtime.

@@ -30,17 +45,17 @@ class FivetranLogQuery:
     def get_connectors_query(self) -> str:
         return f"""\
 SELECT
-  connector_id,
+  connection_id,
   connecting_user_id,
   connector_type_id,
-  connector_name,
+  connection_name,
   paused,
   sync_frequency,
   destination_id
-FROM {self.schema_clause}connector
+FROM {self.schema_clause}connection
 WHERE
   _fivetran_deleted = FALSE
-QUALIFY ROW_NUMBER() OVER (PARTITION BY connector_id ORDER BY _fivetran_synced DESC) = 1
+QUALIFY ROW_NUMBER() OVER (PARTITION BY connection_id ORDER BY _fivetran_synced DESC) = 1
 """
 
     def get_users_query(self) -> str:

@@ -63,20 +78,20 @@ FROM {self.schema_clause}user
         return f"""\
 WITH ranked_syncs AS (
   SELECT
-    connector_id,
+    connection_id,
     sync_id,
     MAX(CASE WHEN message_event = 'sync_start' THEN time_stamp END) as start_time,
     MAX(CASE WHEN message_event = 'sync_end' THEN time_stamp END) as end_time,
     MAX(CASE WHEN message_event = 'sync_end' THEN message_data END) as end_message_data,
-    ROW_NUMBER() OVER (PARTITION BY connector_id ORDER BY MAX(time_stamp) DESC) as rn
+    ROW_NUMBER() OVER (PARTITION BY connection_id ORDER BY MAX(time_stamp) DESC) as rn
   FROM {self.schema_clause}log
   WHERE message_event in ('sync_start', 'sync_end')
   AND time_stamp > CURRENT_TIMESTAMP - INTERVAL '{syncs_interval} days'
-  AND connector_id IN ({formatted_connector_ids})
-  GROUP BY connector_id, sync_id
+  AND connection_id IN ({formatted_connector_ids})
+  GROUP BY connection_id, sync_id
 )
 SELECT
-  connector_id,
+  connection_id,
   sync_id,
   start_time,
   end_time,

@@ -85,7 +100,7 @@ FROM ranked_syncs
 WHERE rn <= {MAX_JOBS_PER_CONNECTOR}
   AND start_time IS NOT NULL
   AND end_time IS NOT NULL
-ORDER BY connector_id, end_time DESC
+ORDER BY connection_id, end_time DESC
 """
 
     def get_table_lineage_query(self, connector_ids: List[str]) -> str:

@@ -97,7 +112,7 @@ SELECT
   *
 FROM (
   SELECT
-    stm.connector_id as connector_id,
+    stm.connection_id as connection_id,
     stm.id as source_table_id,
     stm.name as source_table_name,
     ssm.name as source_schema_name,

@@ -105,18 +120,18 @@ FROM (
     dtm.name as destination_table_name,
     dsm.name as destination_schema_name,
     tl.created_at as created_at,
-    ROW_NUMBER() OVER (PARTITION BY stm.connector_id, stm.id, dtm.id ORDER BY tl.created_at DESC) as table_combo_rn
+    ROW_NUMBER() OVER (PARTITION BY stm.connection_id, stm.id, dtm.id ORDER BY tl.created_at DESC) as table_combo_rn
   FROM {self.schema_clause}table_lineage as tl
-  JOIN {self.schema_clause}source_table_metadata as stm on tl.source_table_id = stm.id
-  JOIN {self.schema_clause}destination_table_metadata as dtm on tl.destination_table_id = dtm.id
-  JOIN {self.schema_clause}source_schema_metadata as ssm on stm.schema_id = ssm.id
-  JOIN {self.schema_clause}destination_schema_metadata as dsm on dtm.schema_id = dsm.id
-  WHERE stm.connector_id IN ({formatted_connector_ids})
+  JOIN {self.schema_clause}source_table as stm on tl.source_table_id = stm.id -- stm: source_table_metadata
+  JOIN {self.schema_clause}destination_table as dtm on tl.destination_table_id = dtm.id -- dtm: destination_table_metadata
+  JOIN {self.schema_clause}source_schema as ssm on stm.schema_id = ssm.id -- ssm: source_schema_metadata
+  JOIN {self.schema_clause}destination_schema as dsm on dtm.schema_id = dsm.id -- dsm: destination_schema_metadata
+  WHERE stm.connection_id IN ({formatted_connector_ids})
 )
 -- Ensure that we only get back one entry per source and destination pair.
 WHERE table_combo_rn = 1
-QUALIFY ROW_NUMBER() OVER (PARTITION BY connector_id ORDER BY created_at DESC) <= {MAX_TABLE_LINEAGE_PER_CONNECTOR}
-ORDER BY connector_id, created_at DESC
+QUALIFY ROW_NUMBER() OVER (PARTITION BY connection_id ORDER BY created_at DESC) <= {MAX_TABLE_LINEAGE_PER_CONNECTOR}
+ORDER BY connection_id, created_at DESC
 """
 
     def get_column_lineage_query(self, connector_ids: List[str]) -> str:

@@ -131,25 +146,25 @@ SELECT
   destination_column_name
 FROM (
   SELECT
-    stm.connector_id as connector_id,
+    stm.connection_id as connection_id,
     scm.table_id as source_table_id,
     dcm.table_id as destination_table_id,
     scm.name as source_column_name,
     dcm.name as destination_column_name,
     cl.created_at as created_at,
-    ROW_NUMBER() OVER (PARTITION BY stm.connector_id, cl.source_column_id, cl.destination_column_id ORDER BY cl.created_at DESC) as column_combo_rn
+    ROW_NUMBER() OVER (PARTITION BY stm.connection_id, cl.source_column_id, cl.destination_column_id ORDER BY cl.created_at DESC) as column_combo_rn
   FROM {self.schema_clause}column_lineage as cl
-  JOIN {self.schema_clause}source_column_metadata as scm
+  JOIN {self.schema_clause}source_column as scm -- scm: source_column_metadata
     ON cl.source_column_id = scm.id
-  JOIN {self.schema_clause}destination_column_metadata as dcm
+  JOIN {self.schema_clause}destination_column as dcm -- dcm: destination_column_metadata
     ON cl.destination_column_id = dcm.id
-  -- Only joining source_table_metadata to get the connector_id.
-  JOIN {self.schema_clause}source_table_metadata as stm
+  -- Only joining source_table to get the connection_id.
+  JOIN {self.schema_clause}source_table as stm -- stm: source_table_metadata
     ON scm.table_id = stm.id
-  WHERE stm.connector_id IN ({formatted_connector_ids})
+  WHERE stm.connection_id IN ({formatted_connector_ids})
 )
 -- Ensure that we only get back one entry per (connector, source column, destination column) pair.
 WHERE column_combo_rn = 1
-QUALIFY ROW_NUMBER() OVER (PARTITION BY connector_id ORDER BY created_at DESC) <= {MAX_COLUMN_LINEAGE_PER_CONNECTOR}
-ORDER BY connector_id, created_at DESC
+QUALIFY ROW_NUMBER() OVER (PARTITION BY connection_id ORDER BY created_at DESC) <= {MAX_COLUMN_LINEAGE_PER_CONNECTOR}
+ORDER BY connection_id, created_at DESC
 """
@@ -132,14 +132,13 @@ class UnityCatalogGEProfilerConfig(UnityCatalogProfilerConfig, GEProfilingConfig
     )
 
 
-class UnityCatalogSourceConfig(
-    SQLCommonConfig,
-    StatefulIngestionConfigBase,
-    BaseUsageConfig,
-    DatasetSourceConfigMixin,
-    StatefulProfilingConfigMixin,
-    LowerCaseDatasetUrnConfigMixin,
-):
+class UnityCatalogConnectionConfig(ConfigModel):
+    """
+    Configuration for connecting to Databricks Unity Catalog.
+    Contains only connection-related fields that can be reused across different sources.
+    """
+
+    scheme: str = DATABRICKS
+
     token: str = pydantic.Field(description="Databricks personal access token")
     workspace_url: str = pydantic.Field(
         description="Databricks workspace url. e.g. https://my-workspace.cloud.databricks.com"

@@ -156,15 +155,41 @@ class UnityCatalogSourceConfig(
             "When warehouse_id is missing, these features will be automatically disabled (with warnings) to allow ingestion to continue."
         ),
     )
-    include_hive_metastore: bool = pydantic.Field(
-        default=INCLUDE_HIVE_METASTORE_DEFAULT,
-        description="Whether to ingest legacy `hive_metastore` catalog. This requires executing queries on SQL warehouse.",
-    )
-    workspace_name: Optional[str] = pydantic.Field(
-        default=None,
-        description="Name of the workspace. Default to deployment name present in workspace_url",
+    extra_client_options: Dict[str, Any] = Field(
+        default={},
+        description="Additional options to pass to Databricks SQLAlchemy client.",
     )
 
+    def __init__(self, **data: Any):
+        super().__init__(**data)
+
+    def get_sql_alchemy_url(self, database: Optional[str] = None) -> str:
+        uri_opts = {"http_path": f"/sql/1.0/warehouses/{self.warehouse_id}"}
+        if database:
+            uri_opts["catalog"] = database
+        return make_sqlalchemy_uri(
+            scheme=self.scheme,
+            username="token",
+            password=self.token,
+            at=urlparse(self.workspace_url).netloc,
+            db=database,
+            uri_opts=uri_opts,
+        )
+
+    def get_options(self) -> dict:
+        return self.extra_client_options
+
+
+class UnityCatalogSourceConfig(
+    UnityCatalogConnectionConfig,
+    SQLCommonConfig,
+    StatefulIngestionConfigBase,
+    BaseUsageConfig,
+    DatasetSourceConfigMixin,
+    StatefulProfilingConfigMixin,
+    LowerCaseDatasetUrnConfigMixin,
+):
     include_metastore: bool = pydantic.Field(
         default=False,
         description=(
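For reference, a rough usage sketch of the connection config above. The exact URI string depends on `make_sqlalchemy_uri`, so treat the commented output as an approximation; values are placeholders.

```python
# Sketch: building the SQLAlchemy URL the Fivetran source now uses for Databricks.
conn = UnityCatalogConnectionConfig(
    token="<personal-access-token>",
    workspace_url="https://my-workspace.cloud.databricks.com",
    warehouse_id="abc123def456",
)
url = conn.get_sql_alchemy_url(database="fivetran_catalog")
# Roughly: databricks://token:<personal-access-token>@my-workspace.cloud.databricks.com/fivetran_catalog
#          ?http_path=/sql/1.0/warehouses/abc123def456&catalog=fivetran_catalog
```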
@@ -344,7 +369,15 @@ class UnityCatalogSourceConfig(
     _forced_disable_tag_extraction: bool = pydantic.PrivateAttr(default=False)
     _forced_disable_hive_metastore_extraction = pydantic.PrivateAttr(default=False)
 
-    scheme: str = DATABRICKS
+    include_hive_metastore: bool = pydantic.Field(
+        default=INCLUDE_HIVE_METASTORE_DEFAULT,
+        description="Whether to ingest legacy `hive_metastore` catalog. This requires executing queries on SQL warehouse.",
+    )
+
+    workspace_name: Optional[str] = pydantic.Field(
+        default=None,
+        description="Name of the workspace. Default to deployment name present in workspace_url",
+    )
 
     def __init__(self, **data):
         # First, let the parent handle the root validators and field processing

@@ -386,19 +419,6 @@ class UnityCatalogSourceConfig(
             forced_disable_hive_metastore_extraction
         )
 
-    def get_sql_alchemy_url(self, database: Optional[str] = None) -> str:
-        uri_opts = {"http_path": f"/sql/1.0/warehouses/{self.warehouse_id}"}
-        if database:
-            uri_opts["catalog"] = database
-        return make_sqlalchemy_uri(
-            scheme=self.scheme,
-            username="token",
-            password=self.token,
-            at=urlparse(self.workspace_url).netloc,
-            db=database,
-            uri_opts=uri_opts,
-        )
-
     def is_profiling_enabled(self) -> bool:
         return self.profiling.enabled and is_profiling_enabled(
             self.profiling.operation_config
@@ -26,19 +26,19 @@ FROZEN_TIME = "2022-06-07 17:00:00"
 
 default_connector_query_results = [
     {
-        "connector_id": "calendar_elected",
+        "connection_id": "calendar_elected",
         "connecting_user_id": "reapply_phone",
         "connector_type_id": "postgres",
-        "connector_name": "postgres",
+        "connection_name": "postgres",
         "paused": False,
         "sync_frequency": 1440,
         "destination_id": "interval_unconstitutional",
     },
     {
-        "connector_id": "my_confluent_cloud_connector_id",
+        "connection_id": "my_confluent_cloud_connector_id",
         "connecting_user_id": "reapply_phone",
         "connector_type_id": "confluent_cloud",
-        "connector_name": "confluent_cloud",
+        "connection_name": "confluent_cloud",
         "paused": False,
         "sync_frequency": 1440,
         "destination_id": "my_confluent_cloud_connector_id",

@@ -60,7 +60,7 @@ def default_query_results(
 ):
     return [
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "source_table_id": "10040",
             "source_table_name": "employee",
             "source_schema_name": "public",

@@ -69,7 +69,7 @@ def default_query_results(
             "destination_schema_name": "postgres_public",
         },
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "source_table_id": "10041",
             "source_table_name": "company",
             "source_schema_name": "public",

@@ -78,7 +78,7 @@ def default_query_results(
             "destination_schema_name": "postgres_public",
         },
         {
-            "connector_id": "my_confluent_cloud_connector_id",
+            "connection_id": "my_confluent_cloud_connector_id",
             "source_table_id": "10042",
             "source_table_name": "my-source-topic",
             "source_schema_name": "confluent_cloud",

@@ -131,28 +131,28 @@ def default_query_results(
 ):
     return [
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "sync_id": "4c9a03d6-eded-4422-a46a-163266e58243",
             "start_time": datetime.datetime(2023, 9, 20, 6, 37, 32, 606000),
             "end_time": datetime.datetime(2023, 9, 20, 6, 38, 5, 56000),
             "end_message_data": '"{\\"status\\":\\"SUCCESSFUL\\"}"',
         },
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "sync_id": "f773d1e9-c791-48f4-894f-8cf9b3dfc834",
             "start_time": datetime.datetime(2023, 10, 3, 14, 35, 30, 345000),
             "end_time": datetime.datetime(2023, 10, 3, 14, 35, 31, 512000),
             "end_message_data": '"{\\"reason\\":\\"Sync has been cancelled because of a user action in the dashboard.Standard Config updated.\\",\\"status\\":\\"CANCELED\\"}"',
         },
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "sync_id": "63c2fc85-600b-455f-9ba0-f576522465be",
             "start_time": datetime.datetime(2023, 10, 3, 14, 35, 55, 401000),
             "end_time": datetime.datetime(2023, 10, 3, 14, 36, 29, 678000),
             "end_message_data": '"{\\"reason\\":\\"java.lang.RuntimeException: FATAL: too many connections for role \\\\\\"hxwraqld\\\\\\"\\",\\"taskType\\":\\"reconnect\\",\\"status\\":\\"FAILURE_WITH_TASK\\"}"',
         },
         {
-            "connector_id": "my_confluent_cloud_connector_id",
+            "connection_id": "my_confluent_cloud_connector_id",
             "sync_id": "d9a03d6-eded-4422-a46a-163266e58244",
             "start_time": datetime.datetime(2023, 9, 20, 6, 37, 32, 606000),
             "end_time": datetime.datetime(2023, 9, 20, 6, 38, 5, 56000),

@@ -360,19 +360,19 @@ def test_fivetran_with_snowflake_dest_and_null_connector_user(pytestconfig, tmp_
 
     connector_query_results = [
         {
-            "connector_id": "calendar_elected",
+            "connection_id": "calendar_elected",
             "connecting_user_id": None,
             "connector_type_id": "postgres",
-            "connector_name": "postgres",
+            "connection_name": "postgres",
             "paused": False,
             "sync_frequency": 1440,
             "destination_id": "interval_unconstitutional",
         },
         {
-            "connector_id": "my_confluent_cloud_connector_id",
+            "connection_id": "my_confluent_cloud_connector_id",
             "connecting_user_id": None,
             "connector_type_id": "confluent_cloud",
-            "connector_name": "confluent_cloud",
+            "connection_name": "confluent_cloud",
             "paused": False,
             "sync_frequency": 1440,
             "destination_id": "interval_unconstitutional",

@@ -134,6 +134,9 @@ def test_warehouse_id_must_be_set_if_include_hive_metastore_is_true():
     assert config.warehouse_id is None
 
 
+@pytest.mark.skip(
+    reason="This test is making actual network calls with retries taking ~5 mins, needs to be mocked"
+)
 def test_warehouse_id_must_be_present_test_connection():
     """Test that connection succeeds when hive_metastore gets auto-disabled."""
     config_dict = {