refactor(snowflake): move snowflake-beta to certified snowflake source (#5923)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Mayuri Nehate 2022-09-15 22:23:54 +05:30 committed by GitHub
parent 93681d7a71
commit 2558129391
16 changed files with 67 additions and 60 deletions

View File

@@ -94,7 +94,6 @@ We use a plugin architecture so that you can install only the dependencies you a
| [redshift](./generated/ingestion/sources/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| [sagemaker](./generated/ingestion/sources/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| [snowflake](./generated/ingestion/sources/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| [snowflake-usage](./generated/ingestion/sources/snowflake.md#module-snowflake-usage) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| [sqlalchemy](./generated/ingestion/sources/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| [superset](./generated/ingestion/sources/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
| [tableau](./generated/ingestion/sources/tableau.md) | `pip install 'acryl-datahub[tableau]'` | Tableau source |

View File

@@ -9,6 +9,7 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
- Browse Paths have been upgraded to a new format to align more closely with the intention of the feature.
Learn more about the changes, including steps on upgrading, here: https://datahubproject.io/docs/advanced/browse-paths-upgrade
- The dbt ingestion source's `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
- The `snowflake` source now uses the newer, faster implementation (previously `snowflake-beta`). The config properties `provision_role` and `check_role_grants` are no longer supported. The older `snowflake` and `snowflake-usage` sources are available as the `snowflake-legacy` and `snowflake-usage-legacy` sources, respectively.
### Potential Downtime

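For anyone driving ingestion from Python rather than a YAML recipe, the rename only requires switching the `type` string. Below is a minimal sketch using DataHub's programmatic `Pipeline` API; the account id, credentials, and sink address are placeholders, not values from this commit:

```python
# Hypothetical migration example: the only required change is the source type
# "snowflake-beta" -> "snowflake"; provision_role / check_role_grants are gone.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",  # previously "snowflake-beta"
            "config": {
                "account_id": "abc48144",  # placeholder account
                "username": "${SNOWFLAKE_USER}",
                "password": "${SNOWFLAKE_PASS}",
                "role": "datahub_role",
                "ignore_start_time_lineage": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder server
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```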
View File

@@ -105,7 +105,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
#### Sample Configuration
```yaml
source:
  type: "snowflake-usage"
  type: "snowflake-usage-legacy"
  config:
    username: <user_name>
    password: <password>

View File

@@ -1,9 +1,7 @@
Ingesting metadata from Snowflake requires either using the **snowflake-beta** module with just one recipe (recommended) or the two separate modules **snowflake** and **snowflake-usage** (soon to be deprecated) with two separate recipes.
Ingesting metadata from Snowflake requires using either the **snowflake** module with a single recipe (recommended) or the two separate modules **snowflake-legacy** and **snowflake-usage-legacy** (soon to be deprecated) with two separate recipes.
All three modules are described on this page.
We encourage you to try out the new **snowflake-beta** plugin as alternative to running both **snowflake** and **snowflake-usage** plugins and share feedback. `snowflake-beta` is much faster than `snowflake` for extracting metadata.
## Snowflake Ingestion through the UI
The following video shows you how to ingest Snowflake metadata through the UI.

View File

@@ -1,11 +1,20 @@
source:
  type: snowflake-beta
  type: snowflake-legacy
  config:
    check_role_grants: True
    provision_role: # Optional
      enabled: false
      dry_run: true
      run_ingestion: false
      admin_username: "${SNOWFLAKE_ADMIN_USER}"
      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
    # This option is recommended for the first run, to ingest all historical lineage
    ignore_start_time_lineage: true
    # Alternatively, specify a start_time for lineage
    # if you don't want to look back to the beginning
    start_time: "2022-03-01T00:00:00Z"
    start_time: '2022-03-01T00:00:00Z'
    # Coordinates
    account_id: "abc48144"
@@ -21,7 +30,9 @@ source:
      allow:
        - "^ACCOUNTING_DB$"
        - "^MARKETING_DB$"
    schema_pattern:
      deny:
        - "information_schema.*"
    table_pattern:
      allow:
        # If you want to ingest only a few tables named revenue and sales
@@ -31,10 +42,12 @@ source:
    profiling:
      # Change to false to disable profiling
      enabled: true
      profile_table_level_only: true
    profile_pattern:
      allow:
        - "ACCOUNTING_DB.*.*"
        - "MARKETING_DB.*.*"
        - 'ACCOUNTING_DB.*.*'
        - 'MARKETING_DB.*.*'
      deny:
        - '.*information_schema.*'
# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options

View File

@@ -1,6 +1,6 @@
### Prerequisites
In order to execute the `snowflake-usage` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, using which the DataHub source extracts information. Assuming you've followed the steps outlined in `snowflake` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see [Snowflake docs for more details](https://docs.snowflake.com/en/user-guide/data-share-consumers.html).
In order to execute the `snowflake-usage-legacy` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, from which the DataHub source extracts information. Assuming you've followed the steps outlined in the `snowflake-legacy` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see the [Snowflake docs](https://docs.snowflake.com/en/user-guide/data-share-consumers.html) for more details.
```sql
grant imported privileges on database snowflake to role datahub_role;
@@ -16,7 +16,7 @@ This plugin extracts the following:
:::note
This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source described above.
This source only extracts usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake-legacy` source described above.
:::

View File

@@ -1,5 +1,5 @@
source:
  type: snowflake-usage
  type: snowflake-usage-legacy
  config:
    # Coordinates
    account_id: account_name

View File

@@ -1,20 +1,11 @@
source:
  type: snowflake
  config:
    check_role_grants: True
    provision_role: # Optional
      enabled: false
      dry_run: true
      run_ingestion: false
      admin_username: "${SNOWFLAKE_ADMIN_USER}"
      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
    # This option is recommended for the first run, to ingest all historical lineage
    ignore_start_time_lineage: true
    # Alternatively, specify a start_time for lineage
    # if you don't want to look back to the beginning
    start_time: '2022-03-01T00:00:00Z'
    start_time: "2022-03-01T00:00:00Z"
    # Coordinates
    account_id: "abc48144"
@@ -25,14 +16,12 @@ source:
    password: "${SNOWFLAKE_PASS}"
    role: "datahub_role"
    # Change these as per your database names. Remove to all all databases
    # Change these as per your database names. Remove to get all databases
    database_pattern:
      allow:
        - "^ACCOUNTING_DB$"
        - "^MARKETING_DB$"
    schema_pattern:
      deny:
        - "information_schema.*"
    table_pattern:
      allow:
        # If you want to ingest only a few tables named revenue and sales
@@ -42,12 +31,10 @@ source:
    profiling:
      # Change to false to disable profiling
      enabled: true
      turn_off_expensive_profiling_metrics: true
    profile_pattern:
      allow:
        - 'ACCOUNTING_DB.*.*'
        - 'MARKETING_DB.*.*'
      deny:
        - '.*information_schema.*'
        - "ACCOUNTING_DB.*.*"
        - "MARKETING_DB.*.*"
# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options

View File

@@ -1,7 +1,7 @@
---
# see https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#config-details-1 for complete documentation
source:
  type: snowflake-usage
  type: snowflake
  config:
    env: PROD
    # Coordinates

View File

@@ -290,13 +290,13 @@ plugins: Dict[str, Set[str]] = {
"s3": {*s3_base, *data_lake_profiling},
"sagemaker": aws_common,
"salesforce": {"simple-salesforce"},
"snowflake": snowflake_common,
"snowflake-usage": snowflake_common
"snowflake-legacy": snowflake_common,
"snowflake-usage-legacy": snowflake_common
| usage_common
| {
"more-itertools>=8.12.0",
},
"snowflake-beta": snowflake_common | usage_common,
"snowflake": snowflake_common | usage_common,
"sqlalchemy": sql_common,
"superset": {
"requests",
@@ -505,9 +505,9 @@ entry_points = {
"redash = datahub.ingestion.source.redash:RedashSource",
"redshift = datahub.ingestion.source.sql.redshift:RedshiftSource",
"redshift-usage = datahub.ingestion.source.usage.redshift_usage:RedshiftUsageSource",
"snowflake = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
"snowflake-usage = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
"snowflake-beta = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
"snowflake-legacy = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
"snowflake-usage-legacy = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
"snowflake = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
"superset = datahub.ingestion.source.superset:SupersetSource",
"tableau = datahub.ingestion.source.tableau:TableauSource",
"openapi = datahub.ingestion.source.openapi:OpenApiSource",

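The `entry_points` mapping above is what resolves a recipe's `type` string to a source class at runtime. A rough sketch of that lookup; the entry-point group name is assumed from DataHub's setup.py, and the `group=` keyword requires Python 3.10+:

```python
# Sketch of how "type: snowflake" resolves to SnowflakeV2Source after this
# commit. Assumes acryl-datahub is installed; the group name is an assumption.
from importlib.metadata import entry_points

eps = entry_points(group="datahub.ingestion.source.plugins")
snowflake_ep = next(ep for ep in eps if ep.name == "snowflake")
source_cls = snowflake_ep.load()
print(source_cls.__name__)  # expected: SnowflakeV2Source
```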
View File

@@ -478,7 +478,7 @@ class SnowflakeQuery:
        query_text
    QUALIFY row_number() over ( partition by bucket_start_time, object_name
        order by
            total_queries desc ) <= {top_n_queries}
            total_queries desc, query_text asc ) <= {top_n_queries}
)
select
    basic_usage_counts.object_name AS "OBJECT_NAME",
@@ -486,9 +486,9 @@
    ANY_VALUE(basic_usage_counts.object_domain) AS "OBJECT_DOMAIN",
    ANY_VALUE(basic_usage_counts.total_queries) AS "TOTAL_QUERIES",
    ANY_VALUE(basic_usage_counts.total_users) AS "TOTAL_USERS",
    ARRAY_AGG( distinct top_queries.query_text) AS "TOP_SQL_QUERIES",
    ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
    ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
    ARRAY_UNIQUE_AGG(top_queries.query_text) AS "TOP_SQL_QUERIES",
    ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
    ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
from
    basic_usage_counts basic_usage_counts
left join

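Both query changes above are about determinism: the `query_text asc` tie-breaker makes the top-N window selection stable when two queries have the same `total_queries`, and `ARRAY_UNIQUE_AGG` replaces `ARRAY_AGG(distinct ...)` for the deduplicated aggregations. A small Python analogue of the tie-breaker, illustrative only and not code from this commit:

```python
# Without a tie-breaker, rows with equal total_queries can come back in any
# order, so top-N output (and golden-file tests) may differ between runs.
rows = [
    {"query_text": "SELECT b", "total_queries": 5},
    {"query_text": "SELECT a", "total_queries": 5},
    {"query_text": "SELECT c", "total_queries": 3},
]
top_n = 2
# Deterministic equivalent of: order by total_queries desc, query_text asc
top = sorted(rows, key=lambda r: (-r["total_queries"], r["query_text"]))[:top_n]
print([r["query_text"] for r in top])  # ['SELECT a', 'SELECT b'] on every run
```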
View File

@@ -154,13 +154,7 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
            if self.config.include_top_n_queries
            else None,
            userCounts=self._map_user_counts(json.loads(row["USER_COUNTS"])),
            fieldCounts=[
                DatasetFieldUsageCounts(
                    fieldPath=self.snowflake_identifier(field_count["col"]),
                    count=field_count["total"],
                )
                for field_count in json.loads(row["FIELD_COUNTS"])
            ],
            fieldCounts=self._map_field_counts(json.loads(row["FIELD_COUNTS"])),
        )
        dataset_urn = make_dataset_urn_with_platform_instance(
            self.platform,
@@ -180,12 +174,14 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
        budget_per_query: int = int(
            total_budget_for_query_list / self.config.top_n_queries
        )
        return [
            trim_query(format_sql_query(query), budget_per_query)
            if self.config.format_sql_queries
            else query
            for query in top_sql_queries
        ]
        return sorted(
            [
                trim_query(format_sql_query(query), budget_per_query)
                if self.config.format_sql_queries
                else query
                for query in top_sql_queries
            ]
        )

    def _map_user_counts(self, user_counts: Dict) -> List[DatasetUserUsageCounts]:
        filtered_user_counts = []
@@ -209,7 +205,19 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
                        userEmail=user_email,
                    )
                )
        return filtered_user_counts
        return sorted(filtered_user_counts, key=lambda v: v.user)

    def _map_field_counts(self, field_counts: Dict) -> List[DatasetFieldUsageCounts]:
        return sorted(
            [
                DatasetFieldUsageCounts(
                    fieldPath=self.snowflake_identifier(field_count["col"]),
                    count=field_count["total"],
                )
                for field_count in field_counts
            ],
            key=lambda v: v.fieldPath,
        )

    def _get_snowflake_history(
        self, conn: SnowflakeConnection

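The extractor changes follow the same theme: top queries, user counts, and the new `_map_field_counts` helper all sort their output so that repeated runs emit identical aspects. A self-contained sketch of the pattern; the dataclass is a simplified stand-in for DataHub's `DatasetFieldUsageCounts`, and `lower()` stands in for `snowflake_identifier`:

```python
# Simplified stand-in demonstrating the sorted-output pattern used above.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetFieldUsageCounts:  # stand-in for the real metadata class
    fieldPath: str
    count: int


def map_field_counts(field_counts: List[Dict]) -> List[DatasetFieldUsageCounts]:
    # Sorting by fieldPath makes the emitted order independent of how
    # Snowflake happened to order the aggregated JSON array.
    return sorted(
        [
            DatasetFieldUsageCounts(fieldPath=fc["col"].lower(), count=fc["total"])
            for fc in field_counts
        ],
        key=lambda v: v.fieldPath,
    )


print(map_field_counts([{"col": "ID", "total": 9}, {"col": "AMOUNT", "total": 7}]))
# [DatasetFieldUsageCounts(fieldPath='amount', count=7),
#  DatasetFieldUsageCounts(fieldPath='id', count=9)] -- stable order
```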
View File

@@ -146,7 +146,7 @@ SNOWFLAKE_FIELD_TYPE_MAPPINGS = {
@platform_name("Snowflake", doc_order=1)
@config_class(SnowflakeV2Config)
@support_status(SupportStatus.INCUBATING)
@support_status(SupportStatus.CERTIFIED)
@capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
@capability(SourceCapability.DOMAINS, "Supported via the `domain` config field")
@capability(SourceCapability.CONTAINERS, "Enabled by default")

View File

@@ -241,8 +241,9 @@ def test_snowflake_basic(pytestconfig, tmp_path, mock_time, mock_datahub_graph):
    pipeline = Pipeline(
        config=PipelineConfig(
            run_id="snowflake-beta-2022_06_07-17_00_00",
            source=SourceConfig(
                type="snowflake-beta",
                type="snowflake",
                config=SnowflakeV2Config(
                    account_id="ABC12345",
                    username="TST_USR",