refactor(snowflake): move snowflake-beta to certified snowflake source (#5923)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Mayuri Nehate 2022-09-15 22:23:54 +05:30 committed by GitHub
parent 93681d7a71
commit 2558129391
16 changed files with 67 additions and 60 deletions

View File

@@ -94,7 +94,6 @@ We use a plugin architecture so that you can install only the dependencies you a
| [redshift](./generated/ingestion/sources/redshift.md) | `pip install 'acryl-datahub[redshift]'` | Redshift source |
| [sagemaker](./generated/ingestion/sources/sagemaker.md) | `pip install 'acryl-datahub[sagemaker]'` | AWS SageMaker source |
| [snowflake](./generated/ingestion/sources/snowflake.md) | `pip install 'acryl-datahub[snowflake]'` | Snowflake source |
| [snowflake-usage](./generated/ingestion/sources/snowflake.md#module-snowflake-usage) | `pip install 'acryl-datahub[snowflake-usage]'` | Snowflake usage statistics source |
| [sqlalchemy](./generated/ingestion/sources/sqlalchemy.md) | `pip install 'acryl-datahub[sqlalchemy]'` | Generic SQLAlchemy source |
| [superset](./generated/ingestion/sources/superset.md) | `pip install 'acryl-datahub[superset]'` | Superset source |
| [tableau](./generated/ingestion/sources/tableau.md) | `pip install 'acryl-datahub[tableau]'` | Tableau source |

View File

@@ -9,6 +9,7 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
- Browse Paths have been upgraded to a new format to align more closely with the intention of the feature.
Learn more about the changes, including steps on upgrading, here: https://datahubproject.io/docs/advanced/browse-paths-upgrade
- The dbt ingestion source's `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
- The `snowflake` source now uses the newer, faster implementation (previously `snowflake-beta`). The config properties `provision_role` and `check_role_grants` are no longer supported. The older `snowflake` and `snowflake-usage` sources are available as the `snowflake-legacy` and `snowflake-usage-legacy` sources, respectively.
### Potential Downtime

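For anyone driving ingestion from Python rather than a YAML recipe, the rename only requires switching the `type` string. Below is a minimal sketch using DataHub's programmatic `Pipeline` API; the account id, credentials, and sink address are placeholders, not values from this commit:

```python
# Hypothetical migration example: the only required change is the source type
# "snowflake-beta" -> "snowflake"; provision_role / check_role_grants are gone.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",  # previously "snowflake-beta"
            "config": {
                "account_id": "abc48144",  # placeholder account
                "username": "${SNOWFLAKE_USER}",
                "password": "${SNOWFLAKE_PASS}",
                "role": "datahub_role",
                "ignore_start_time_lineage": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder server
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```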
View File

@@ -105,7 +105,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
#### Sample Configuration
```yaml
source:
  type: "snowflake-usage"
  type: "snowflake-usage-legacy"
  config:
    username: <user_name>
    password: <password>

View File

@@ -1,9 +1,7 @@
Ingesting metadata from Snowflake requires either using the **snowflake-beta** module with just one recipe (recommended) or the two separate modules **snowflake** and **snowflake-usage** (soon to be deprecated) with two separate recipes.
Ingesting metadata from Snowflake requires using either the **snowflake** module with a single recipe (recommended) or the two separate modules **snowflake-legacy** and **snowflake-usage-legacy** (soon to be deprecated) with two separate recipes.
All three modules are described on this page.
We encourage you to try out the new **snowflake-beta** plugin as alternative to running both **snowflake** and **snowflake-usage** plugins and share feedback. `snowflake-beta` is much faster than `snowflake` for extracting metadata.
## Snowflake Ingestion through the UI
The following video shows you how to ingest Snowflake metadata through the UI.

View File

@@ -1,11 +1,20 @@
source:
  type: snowflake-beta
  type: snowflake-legacy
  config:
    check_role_grants: True
    provision_role: # Optional
      enabled: false
      dry_run: true
      run_ingestion: false
      admin_username: "${SNOWFLAKE_ADMIN_USER}"
      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
    # This option is recommended for the first run, to ingest all historical lineage
    ignore_start_time_lineage: true
    # Alternatively, specify a start_time for lineage
    # if you don't want to look back to the beginning
    start_time: "2022-03-01T00:00:00Z"
    start_time: '2022-03-01T00:00:00Z'
    # Coordinates
    account_id: "abc48144"
@@ -21,7 +30,9 @@ source:
      allow:
        - "^ACCOUNTING_DB$"
        - "^MARKETING_DB$"
    schema_pattern:
      deny:
        - "information_schema.*"
    table_pattern:
      allow:
        # If you want to ingest only a few tables named revenue and sales
@@ -31,10 +42,12 @@ source:
    profiling:
      # Change to false to disable profiling
      enabled: true
      profile_table_level_only: true
    profile_pattern:
      allow:
        - "ACCOUNTING_DB.*.*"
        - "MARKETING_DB.*.*"
        - 'ACCOUNTING_DB.*.*'
        - 'MARKETING_DB.*.*'
      deny:
        - '.*information_schema.*'
# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options

View File

@@ -1,6 +1,6 @@
### Prerequisites
In order to execute the `snowflake-usage` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, using which the DataHub source extracts information. Assuming you've followed the steps outlined in `snowflake` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see [Snowflake docs for more details](https://docs.snowflake.com/en/user-guide/data-share-consumers.html).
In order to execute the `snowflake-usage-legacy` source, your Snowflake user will need to have specific privileges granted to it. Specifically, you'll need to grant access to the [Account Usage](https://docs.snowflake.com/en/sql-reference/account-usage.html) system tables, from which the DataHub source extracts information. Assuming you've followed the steps outlined in the `snowflake-legacy` plugin to create a DataHub-specific User & Role, you'll simply need to execute the following commands in Snowflake. This will require a user with the `ACCOUNTADMIN` role (or a role granted the IMPORT SHARES global privilege). Please see the [Snowflake docs](https://docs.snowflake.com/en/user-guide/data-share-consumers.html) for more details.
```sql
grant imported privileges on database snowflake to role datahub_role;
@@ -16,7 +16,7 @@ This plugin extracts the following:
:::note
This source only does usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake` source described above.
This source only extracts usage statistics. To get the tables, views, and schemas in your Snowflake warehouse, ingest using the `snowflake-legacy` source described above.
:::

View File

@@ -1,5 +1,5 @@
source:
  type: snowflake-usage
  type: snowflake-usage-legacy
  config:
    # Coordinates
    account_id: account_name

View File

@@ -1,20 +1,11 @@
source:
  type: snowflake
  config:
    check_role_grants: True
    provision_role: # Optional
      enabled: false
      dry_run: true
      run_ingestion: false
      admin_username: "${SNOWFLAKE_ADMIN_USER}"
      admin_password: "${SNOWFLAKE_ADMIN_PASS}"
    # This option is recommended for the first run, to ingest all historical lineage
    ignore_start_time_lineage: true
    # Alternatively, specify a start_time for lineage
    # if you don't want to look back to the beginning
    start_time: '2022-03-01T00:00:00Z'
    start_time: "2022-03-01T00:00:00Z"
    # Coordinates
    account_id: "abc48144"
@@ -25,14 +16,12 @@ source:
    password: "${SNOWFLAKE_PASS}"
    role: "datahub_role"
    # Change these as per your database names. Remove to all all databases
    # Change these as per your database names. Remove to get all databases
    database_pattern:
      allow:
        - "^ACCOUNTING_DB$"
        - "^MARKETING_DB$"
    schema_pattern:
      deny:
        - "information_schema.*"
    table_pattern:
      allow:
        # If you want to ingest only a few tables named revenue and sales
@@ -42,12 +31,10 @@ source:
    profiling:
      # Change to false to disable profiling
      enabled: true
      turn_off_expensive_profiling_metrics: true
    profile_pattern:
      allow:
        - 'ACCOUNTING_DB.*.*'
        - 'MARKETING_DB.*.*'
      deny:
        - '.*information_schema.*'
        - "ACCOUNTING_DB.*.*"
        - "MARKETING_DB.*.*"
# Default sink is datahub-rest and doesn't need to be configured
# See https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for customization options

View File

@@ -1,7 +1,7 @@
---
# see https://datahubproject.io/docs/generated/ingestion/sources/snowflake/#config-details-1 for complete documentation
source:
  type: snowflake-usage
  type: snowflake
  config:
    env: PROD
    # Coordinates

View File

@@ -290,13 +290,13 @@ plugins: Dict[str, Set[str]] = {
"s3": {*s3_base, *data_lake_profiling},
"sagemaker": aws_common,
"salesforce": {"simple-salesforce"},
"snowflake": snowflake_common,
"snowflake-usage": snowflake_common
"snowflake-legacy": snowflake_common,
"snowflake-usage-legacy": snowflake_common
| usage_common
| {
"more-itertools>=8.12.0",
},
"snowflake-beta": snowflake_common | usage_common,
"snowflake": snowflake_common | usage_common,
"sqlalchemy": sql_common,
"superset": {
"requests",
@@ -505,9 +505,9 @@ entry_points = {
"redash = datahub.ingestion.source.redash:RedashSource",
"redshift = datahub.ingestion.source.sql.redshift:RedshiftSource",
"redshift-usage = datahub.ingestion.source.usage.redshift_usage:RedshiftUsageSource",
"snowflake = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
"snowflake-usage = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
"snowflake-beta = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
"snowflake-legacy = datahub.ingestion.source.sql.snowflake:SnowflakeSource",
"snowflake-usage-legacy = datahub.ingestion.source.usage.snowflake_usage:SnowflakeUsageSource",
"snowflake = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
"superset = datahub.ingestion.source.superset:SupersetSource",
"tableau = datahub.ingestion.source.tableau:TableauSource",
"openapi = datahub.ingestion.source.openapi:OpenApiSource",

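The `entry_points` mapping above is what resolves a recipe's `type` string to a source class at runtime. A rough sketch of that lookup; the entry-point group name is assumed from DataHub's setup.py, and the `group=` keyword requires Python 3.10+:

```python
# Sketch of how "type: snowflake" resolves to SnowflakeV2Source after this
# commit. Assumes acryl-datahub is installed; the group name is an assumption.
from importlib.metadata import entry_points

eps = entry_points(group="datahub.ingestion.source.plugins")
snowflake_ep = next(ep for ep in eps if ep.name == "snowflake")
source_cls = snowflake_ep.load()
print(source_cls.__name__)  # expected: SnowflakeV2Source
```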
View File

@@ -478,7 +478,7 @@ class SnowflakeQuery:
        query_text
    QUALIFY row_number() over ( partition by bucket_start_time, object_name
        order by
            total_queries desc ) <= {top_n_queries}
            total_queries desc, query_text asc ) <= {top_n_queries}
)
select
    basic_usage_counts.object_name AS "OBJECT_NAME",
@@ -486,9 +486,9 @@
    ANY_VALUE(basic_usage_counts.object_domain) AS "OBJECT_DOMAIN",
    ANY_VALUE(basic_usage_counts.total_queries) AS "TOTAL_QUERIES",
    ANY_VALUE(basic_usage_counts.total_users) AS "TOTAL_USERS",
    ARRAY_AGG( distinct top_queries.query_text) AS "TOP_SQL_QUERIES",
    ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
    ARRAY_AGG( distinct OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
    ARRAY_UNIQUE_AGG(top_queries.query_text) AS "TOP_SQL_QUERIES",
    ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'col', field_usage_counts.column_name, 'total', field_usage_counts.total_queries ) ) AS "FIELD_COUNTS",
    ARRAY_UNIQUE_AGG(OBJECT_CONSTRUCT( 'user_name', user_usage_counts.user_name, 'email', user_usage_counts.user_email, 'total', user_usage_counts.total_queries ) ) AS "USER_COUNTS"
from
    basic_usage_counts basic_usage_counts
left join

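Both query changes above are about determinism: the `query_text asc` tie-breaker makes the top-N window selection stable when two queries have the same `total_queries`, and `ARRAY_UNIQUE_AGG` replaces `ARRAY_AGG(distinct ...)` for the deduplicated aggregations. A small Python analogue of the tie-breaker, illustrative only and not code from this commit:

```python
# Without a tie-breaker, rows with equal total_queries can come back in any
# order, so top-N output (and golden-file tests) may differ between runs.
rows = [
    {"query_text": "SELECT b", "total_queries": 5},
    {"query_text": "SELECT a", "total_queries": 5},
    {"query_text": "SELECT c", "total_queries": 3},
]
top_n = 2
# Deterministic equivalent of: order by total_queries desc, query_text asc
top = sorted(rows, key=lambda r: (-r["total_queries"], r["query_text"]))[:top_n]
print([r["query_text"] for r in top])  # ['SELECT a', 'SELECT b'] on every run
```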
View File

@@ -154,13 +154,7 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
            if self.config.include_top_n_queries
            else None,
            userCounts=self._map_user_counts(json.loads(row["USER_COUNTS"])),
            fieldCounts=[
                DatasetFieldUsageCounts(
                    fieldPath=self.snowflake_identifier(field_count["col"]),
                    count=field_count["total"],
                )
                for field_count in json.loads(row["FIELD_COUNTS"])
            ],
            fieldCounts=self._map_field_counts(json.loads(row["FIELD_COUNTS"])),
        )
        dataset_urn = make_dataset_urn_with_platform_instance(
            self.platform,
@@ -180,12 +174,14 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
        budget_per_query: int = int(
            total_budget_for_query_list / self.config.top_n_queries
        )
        return [
            trim_query(format_sql_query(query), budget_per_query)
            if self.config.format_sql_queries
            else query
            for query in top_sql_queries
        ]
        return sorted(
            [
                trim_query(format_sql_query(query), budget_per_query)
                if self.config.format_sql_queries
                else query
                for query in top_sql_queries
            ]
        )

    def _map_user_counts(self, user_counts: Dict) -> List[DatasetUserUsageCounts]:
        filtered_user_counts = []
@@ -209,7 +205,19 @@ class SnowflakeUsageExtractor(SnowflakeQueryMixin, SnowflakeCommonMixin):
                        userEmail=user_email,
                    )
                )
        return filtered_user_counts
        return sorted(filtered_user_counts, key=lambda v: v.user)

    def _map_field_counts(self, field_counts: Dict) -> List[DatasetFieldUsageCounts]:
        return sorted(
            [
                DatasetFieldUsageCounts(
                    fieldPath=self.snowflake_identifier(field_count["col"]),
                    count=field_count["total"],
                )
                for field_count in field_counts
            ],
            key=lambda v: v.fieldPath,
        )

    def _get_snowflake_history(
        self, conn: SnowflakeConnection

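The extractor changes follow the same theme: top queries, user counts, and the new `_map_field_counts` helper all sort their output so that repeated runs emit identical aspects. A self-contained sketch of the pattern; the dataclass is a simplified stand-in for DataHub's `DatasetFieldUsageCounts`, and `lower()` stands in for `snowflake_identifier`:

```python
# Simplified stand-in demonstrating the sorted-output pattern used above.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetFieldUsageCounts:  # stand-in for the real metadata class
    fieldPath: str
    count: int


def map_field_counts(field_counts: List[Dict]) -> List[DatasetFieldUsageCounts]:
    # Sorting by fieldPath makes the emitted order independent of how
    # Snowflake happened to order the aggregated JSON array.
    return sorted(
        [
            DatasetFieldUsageCounts(fieldPath=fc["col"].lower(), count=fc["total"])
            for fc in field_counts
        ],
        key=lambda v: v.fieldPath,
    )


print(map_field_counts([{"col": "ID", "total": 9}, {"col": "AMOUNT", "total": 7}]))
# [DatasetFieldUsageCounts(fieldPath='amount', count=7),
#  DatasetFieldUsageCounts(fieldPath='id', count=9)] -- stable order
```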
View File

@@ -146,7 +146,7 @@ SNOWFLAKE_FIELD_TYPE_MAPPINGS = {
@platform_name("Snowflake", doc_order=1)
@config_class(SnowflakeV2Config)
@support_status(SupportStatus.INCUBATING)
@support_status(SupportStatus.CERTIFIED)
@capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
@capability(SourceCapability.DOMAINS, "Supported via the `domain` config field")
@capability(SourceCapability.CONTAINERS, "Enabled by default")

View File

@@ -241,8 +241,9 @@ def test_snowflake_basic(pytestconfig, tmp_path, mock_time, mock_datahub_graph):
    pipeline = Pipeline(
        config=PipelineConfig(
            run_id="snowflake-beta-2022_06_07-17_00_00",
            source=SourceConfig(
                type="snowflake-beta",
                type="snowflake",
                config=SnowflakeV2Config(
                    account_id="ABC12345",
                    username="TST_USR",