mirror of https://github.com/datahub-project/datahub.git (synced 2026-01-03 13:24:42 +00:00)
refactor(snowflake): move snowflake-beta to certified snowflake source (#5923)
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
parent 93681d7a71
commit 2558129391
@@ -94,7 +94,6 @@ We use a plugin architecture so that you can install only the dependencies you a
 | [redshift](./generated/ingestion/sources/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
 | [sagemaker](./generated/ingestion/sources/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
 | [snowflake](./generated/ingestion/sources/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
-| [snowflake-usage](./generated/ingestion/sources/snowflake.md#module-snowflake-usage) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
 | [sqlalchemy](./generated/ingestion/sources/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
 | [superset](./generated/ingestion/sources/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
 | [tableau](./generated/ingestion/sources/tableau.md) | `pip install 'acryl-datahub[tableau]'` | Tableau source |
@@ -9,6 +9,7 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
 
 - Browse Paths have been upgraded to a new format to align more closely with the intention of the feature.
   Learn more about the changes, including steps on upgrading, here: https://datahubproject.io/docs/advanced/browse-paths-upgrade
 - The dbt ingestion source's `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
+- The `snowflake` source now uses a newer, faster implementation (previously `snowflake-beta`). The config properties `provision_role` and `check_role_grants` are no longer supported. The older `snowflake` and `snowflake-usage` sources are available as the `snowflake-legacy` and `snowflake-usage-legacy` sources respectively.
 
 ### Potential Downtime
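For existing recipes this rename is mechanical: only the source `type` changes. A minimal before/after sketch (the `account_id` value is illustrative):

```yaml
# Before: the fast implementation was opt-in as snowflake-beta
source:
  type: snowflake-beta
  config:
    account_id: "abc48144"

# After: the fast implementation is the default `snowflake` source;
# the old one remains available as `snowflake-legacy`
source:
  type: snowflake
  config:
    account_id: "abc48144"
```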
@@ -105,7 +105,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
 
 #### Sample Configuration
 
 ```yaml
 source:
-  type: "snowflake-usage"
+  type: "snowflake-usage-legacy"
   config:
     username: <user_name>
     password: <password>
@@ -1,9 +1,7 @@
-Ingesting metadata from Snowflake requires either using the **snowflake-beta** module with just one recipe (recommended) or the two separate modules **snowflake** and **snowflake-usage** (soon to be deprecated) with two separate recipes.
+Ingesting metadata from Snowflake requires either using the **snowflake** module with just one recipe (recommended) or the two separate modules **snowflake-legacy** and **snowflake-usage-legacy** (soon to be deprecated) with two separate recipes.
 
 All three modules are described on this page.
 
-We encourage you to try out the new **snowflake-beta** plugin as alternative to running both **snowflake** and **snowflake-usage** plugins and share feedback. `snowflake-beta` is much faster than `snowflake` for extracting metadata.
-
 ## Snowflake Ingestion through the UI
 
 The following video shows you how to ingest Snowflake metadata through the UI.
@@ -1,11 +1,20 @@
 source:
-  type: snowflake-beta
+  type: snowflake-legacy
   config:
+
+    check_role_grants: True
+    provision_role: # Optional
+      enabled: false
+      dry_run: true
+      run_ingestion: false
+      admin_username: "${SNOWFLAKE_ADMIN_USER}"
+      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
+
     # This option is recommended to be used for the first time to ingest all lineage
     ignore_start_time_lineage: true
     # This is an alternative option to specify the start_time for lineage
     # if you don't want to look back since beginning
-    start_time: "2022-03-01T00:00:00Z"
+    start_time: '2022-03-01T00:00:00Z'
 
     # Coordinates
     account_id: "abc48144"
@@ -21,7 +30,9 @@ source:
       allow:
         - "^ACCOUNTING_DB$"
         - "^MARKETING_DB$"
+
     schema_pattern:
       deny:
         - "information_schema.*"
+
     table_pattern:
       allow:
         # If you want to ingest only few tables with name revenue and sales
@@ -31,10 +42,12 @@ source:
     profiling:
       # Change to false to disable profiling
       enabled: true
+      profile_table_level_only: true
     profile_pattern:
       allow:
-        - "ACCOUNTING_DB.*.*"
-        - "MARKETING_DB.*.*"
+        - 'ACCOUNTING_DB.*.*'
+        - 'MARKETING_DB.*.*'
       deny:
         - '.*information_schema.*'
 
     # Default sink is datahub-rest and doesn't need to be configured
     # See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
@@ -1,6 +1,6 @@
 ### Prerequisites
 
-In order to execute the `snowflake-usage` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, using which the DataHub source extracts information. Assuming you've followed the steps outlined in `snowflake` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see [Snowflake docs for more details](https://docs.snowflake.com/en/user-guide/data-share-consumers.html).
+In order to execute the `snowflake-usage-legacy` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, using which the DataHub source extracts information. Assuming you've followed the steps outlined in `snowflake` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see [Snowflake docs for more details](https://docs.snowflake.com/en/user-guide/data-share-consumers.html).
 
 ```sql
 grant imported privileges on database snowflake to role datahub_role;
@@ -16,7 +16,7 @@ This plugin extracts the following:
 
 :::note
 
-This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source described above.
+This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake-legacy` source described above.
 
 :::
@@ -1,5 +1,5 @@
 source:
-  type: snowflake-usage
+  type: snowflake-usage-legacy
   config:
     # Coordinates
     account_id: account_name
@@ -1,20 +1,11 @@
 source:
   type: snowflake
   config:
-
-    check_role_grants: True
-    provision_role: # Optional
-      enabled: false
-      dry_run: true
-      run_ingestion: false
-      admin_username: "${SNOWFLAKE_ADMIN_USER}"
-      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
-
     # This option is recommended to be used for the first time to ingest all lineage
     ignore_start_time_lineage: true
     # This is an alternative option to specify the start_time for lineage
     # if you don't want to look back since beginning
-    start_time: '2022-03-01T00:00:00Z'
+    start_time: "2022-03-01T00:00:00Z"
 
     # Coordinates
     account_id: "abc48144"
@@ -25,14 +16,12 @@ source:
     password: "${SNOWFLAKE_PASS}"
     role: "datahub_role"
 
-    # Change these as per your database names. Remove to all all databases
+    # Change these as per your database names. Remove to get all databases
     database_pattern:
       allow:
         - "^ACCOUNTING_DB$"
         - "^MARKETING_DB$"
     schema_pattern:
       deny:
         - "information_schema.*"
-
     table_pattern:
       allow:
         # If you want to ingest only few tables with name revenue and sales
@@ -42,12 +31,10 @@ source:
     profiling:
       # Change to false to disable profiling
       enabled: true
       turn_off_expensive_profiling_metrics: true
     profile_pattern:
       allow:
-        - 'ACCOUNTING_DB.*.*'
-        - 'MARKETING_DB.*.*'
-      deny:
-        - '.*information_schema.*'
+        - "ACCOUNTING_DB.*.*"
+        - "MARKETING_DB.*.*"
 
     # Default sink is datahub-rest and doesn't need to be configured
     # See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options
@@ -1,7 +1,7 @@
 ---
 # see https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#config-details-1 for complete documentation
 source:
-  type: snowflake-usage
+  type: snowflake
   config:
     env: PROD
     # Coordinates
@@ -290,13 +290,13 @@ plugins: Dict[str, Set[str]] = {
     "s3": {*s3_base, *data_lake_profiling},
     "sagemaker": aws_common,
     "salesforce": {"simple-salesforce"},
-    "snowflake": snowflake_common,
-    "snowflake-usage": snowflake_common
+    "snowflake-legacy": snowflake_common,
+    "snowflake-usage-legacy": snowflake_common
     | usage_common
     | {
         "more-itertools>=8.12.0",
     },
-    "snowflake-beta": snowflake_common | usage_common,
+    "snowflake": snowflake_common | usage_common,
     "sqlalchemy": sql_common,
     "superset": {
         "requests",
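The plugin table above is plain Python: each pip extra maps to a set of requirements, and set union composes shared dependency groups. A self-contained sketch of the pattern (the contents of `snowflake_common` and `usage_common` here are illustrative, not the real sets from setup.py):

```python
# Illustrative stand-ins for the shared dependency sets in setup.py;
# the real contents of snowflake_common / usage_common differ.
snowflake_common = {"snowflake-sqlalchemy", "cryptography"}
usage_common = {"sqlparse"}

plugins = {
    # legacy sources keep their old dependency sets
    "snowflake-legacy": snowflake_common,
    "snowflake-usage-legacy": snowflake_common
    | usage_common
    | {"more-itertools>=8.12.0"},
    # the certified source needs both metadata and usage dependencies
    "snowflake": snowflake_common | usage_common,
}
```

Because `|` returns a new set, the shared groups stay unmodified no matter how many plugins reference them.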
@@ -505,9 +505,9 @@ entry_points = {
         "redash = datahub.ingestion.source.redash:RedashSource",
         "redshift = datahub.ingestion.source.sql.redshift:RedshiftSource",
         "redshift-usage = datahub.ingestion.source.usage.redshift_usage:RedshiftUsageSource",
-        "snowflake = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
-        "snowflake-usage = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
-        "snowflake-beta = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
+        "snowflake-legacy = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
+        "snowflake-usage-legacy = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
+        "snowflake = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
         "superset = datahub.ingestion.source.superset:SupersetSource",
         "tableau = datahub.ingestion.source.tableau:TableauSource",
         "openapi = datahub.ingestion.source.openapi:OpenApiSource",
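Entry points are plain `name = module:attr` strings, so the rename re-points the `snowflake` name at `SnowflakeV2Source` without moving any code. A hedged sketch of how such a spec is split apart (this is not DataHub's actual loader, just an illustration of the format):

```python
def parse_entry_point(spec: str):
    """Split a setuptools entry-point spec "name = module:attr" into parts."""
    name, _, target = spec.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()
```

At runtime a registry would `importlib.import_module(module)` and `getattr` the named class to construct the source.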
@@ -478,7 +478,7 @@ class SnowflakeQuery:
             query_text
         QUALIFY row_number() over ( partition by bucket_start_time, object_name
             order by
-                total_queries desc ) <= {top_n_queries}
+                total_queries desc, query_text asc ) <= {top_n_queries}
         )
         select
             basic_usage_counts.object_name AS "OBJECT_NAME",
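The extra `query_text asc` tie-breaker makes the top-N selection deterministic when two queries have the same count, which matters for stable golden-file comparisons. The same idea in plain Python (a sketch, not code from the source):

```python
def top_n_queries(rows, n):
    # rows are (total_queries, query_text) pairs; sort by count descending,
    # then by query text ascending so ties resolve the same way every run.
    return sorted(rows, key=lambda r: (-r[0], r[1]))[:n]
```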
@@ -486,9 +486,9 @@ class SnowflakeQuery:
             ANY_VALUE(basic_usage_counts.object_domain) AS "OBJECT_DOMAIN",
             ANY_VALUE(basic_usage_counts.total_queries) AS "TOTAL_QUERIES",
             ANY_VALUE(basic_usage_counts.total_users) AS "TOTAL_USERS",
-            ARRAY_AGG( distinct top_queries.query_text) AS "TOP_SQL_QUERIES",
-            ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
-            ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
+            ARRAY_UNIQUE_AGG(top_queries.query_text) AS "TOP_SQL_QUERIES",
+            ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
+            ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
         from
             basic_usage_counts basic_usage_counts
         left join
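`ARRAY_UNIQUE_AGG` collects values into an array with duplicates removed. A rough Python analogue for hashable values (illustrative only; it sorts for a stable result, whereas Snowflake does not guarantee element order):

```python
def array_unique_agg(values):
    # Collect values into a deduplicated list; sorted() gives a stable
    # order for testing, which Snowflake itself does not guarantee.
    return sorted(set(values))
```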
@@ -154,13 +154,7 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
             if self.config.include_top_n_queries
             else None,
             userCounts=self._map_user_counts(json.loads(row["USER_COUNTS"])),
-            fieldCounts=[
-                DatasetFieldUsageCounts(
-                    fieldPath=self.snowflake_identifier(field_count["col"]),
-                    count=field_count["total"],
-                )
-                for field_count in json.loads(row["FIELD_COUNTS"])
-            ],
+            fieldCounts=self._map_field_counts(json.loads(row["FIELD_COUNTS"])),
         )
         dataset_urn = make_dataset_urn_with_platform_instance(
             self.platform,
@@ -180,12 +174,14 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
         budget_per_query: int = int(
             total_budget_for_query_list / self.config.top_n_queries
         )
-        return [
-            trim_query(format_sql_query(query), budget_per_query)
-            if self.config.format_sql_queries
-            else query
-            for query in top_sql_queries
-        ]
+        return sorted(
+            [
+                trim_query(format_sql_query(query), budget_per_query)
+                if self.config.format_sql_queries
+                else query
+                for query in top_sql_queries
+            ]
+        )
 
     def _map_user_counts(self, user_counts: Dict) -> List[DatasetUserUsageCounts]:
         filtered_user_counts = []
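The per-query budget splits a total character budget evenly across the top-N queries before trimming, and the new `sorted(...)` wrapper makes the emitted list order-stable across runs. A simplified sketch of that logic (function and parameter names here are illustrative, not the source's):

```python
def budgeted_top_queries(queries, top_n, total_budget):
    # Give each query an equal share of the character budget, trim to it,
    # and sort so repeated runs emit the queries in the same order.
    budget_per_query = int(total_budget / top_n)
    return sorted(q[:budget_per_query] for q in queries)
```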
@@ -209,7 +205,19 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
                     userEmail=user_email,
                 )
             )
-        return filtered_user_counts
+        return sorted(filtered_user_counts, key=lambda v: v.user)
+
+    def _map_field_counts(self, field_counts: Dict) -> List[DatasetFieldUsageCounts]:
+        return sorted(
+            [
+                DatasetFieldUsageCounts(
+                    fieldPath=self.snowflake_identifier(field_count["col"]),
+                    count=field_count["total"],
+                )
+                for field_count in field_counts
+            ],
+            key=lambda v: v.fieldPath,
+        )
 
     def _get_snowflake_history(
         self, conn: SnowflakeConnection
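The new `_map_field_counts` mirrors `_map_user_counts`: build typed objects from the parsed JSON rows, then sort on a stable key so the emitted usage aspect is deterministic. A self-contained sketch with a stand-in record type and a lower-casing stand-in for `snowflake_identifier` (both are assumptions for illustration):

```python
from collections import namedtuple

# Stand-in for DatasetFieldUsageCounts
FieldCount = namedtuple("FieldCount", ["fieldPath", "count"])

def map_field_counts(raw_counts):
    # Build typed records from JSON rows and sort by field path so the
    # emitted usage aspect is identical across runs.
    return sorted(
        (FieldCount(fieldPath=d["col"].lower(), count=d["total"]) for d in raw_counts),
        key=lambda v: v.fieldPath,
    )
```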
@@ -146,7 +146,7 @@ SNOWFLAKE_FIELD_TYPE_MAPPINGS = {
 
 @platform_name("Snowflake", doc_order=1)
 @config_class(SnowflakeV2Config)
-@support_status(SupportStatus.INCUBATING)
+@support_status(SupportStatus.CERTIFIED)
 @capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
 @capability(SourceCapability.DOMAINS, "Supported via the `domain` config field")
 @capability(SourceCapability.CONTAINERS, "Enabled by default")
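The `@support_status` decorator attaches metadata that documentation tooling reads, so flipping `INCUBATING` to `CERTIFIED` changes no runtime behavior. A minimal sketch of such a class decorator (not DataHub's actual implementation):

```python
def support_status(status):
    # Class decorator that stamps a support level onto a source class
    # so docs generators can read it later.
    def wrap(cls):
        cls.__support_status__ = status
        return cls
    return wrap

@support_status("CERTIFIED")
class ExampleSource:
    pass
```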
@@ -241,8 +241,9 @@ def test_snowflake_basic(pytestconfig, tmp_path, mock_time, mock_datahub_graph):
 
     pipeline = Pipeline(
         config=PipelineConfig(
+            run_id="snowflake-beta-2022_06_07-17_00_00",
             source=SourceConfig(
-                type="snowflake-beta",
+                type="snowflake",
                 config=SnowflakeV2Config(
                     account_id="ABC12345",
                     username="TST_USR",