feat(ingest): bigquery - Promoting bigquery-beta to bigquery source (#6222)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Tamas Nemeth 2022-10-26 20:15:44 +02:00 committed by GitHub
parent 31f90a4b52
commit 94fae0a464
11 changed files with 45 additions and 35 deletions


@ -11,6 +11,7 @@ CLI release is made through a different repository and release notes can be found
If a server with version `0.8.28` is being used, then the CLI used to connect to it should be `0.8.28.x`. Tests of the new CLI are not run against older server versions, so it is not recommended to update the CLI if the server is not updated.
## Installation
### Using pip
We recommend Python virtual environments (venvs) to namespace pip modules. The folks over at [Acryl Data](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion. Here's an example setup:
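As a minimal sketch of such a setup (the environment name `datahub-env` is an illustrative choice, not a requirement):
```shell
# create and activate an isolated virtual environment
python3 -m venv datahub-env
source datahub-env/bin/activate

# upgrade packaging tooling, then install the DataHub CLI from PyPI
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub

# verify the CLI is installed and on the PATH
datahub version
```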
@ -66,7 +67,6 @@ We use a plugin architecture so that you can install only the dependencies you a
| [file](./generated/ingestion/sources/file.md) | _included by default_ | File source and sink |
| [athena](./generated/ingestion/sources/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](./generated/ingestion/sources/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| [bigquery-usage](./generated/ingestion/sources/bigquery.md#module-bigquery-usage) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| [datahub-lineage-file](./generated/ingestion/sources/file-based-lineage.md) | _no additional dependencies_ | Lineage File source |
| [datahub-business-glossary](./generated/ingestion/sources/business-glossary.md) | _no additional dependencies_ | Business Glossary File source |
| [dbt](./generated/ingestion/sources/dbt.md) | _no additional dependencies_ | dbt source |
@ -126,7 +126,9 @@ datahub check plugins
[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites
## Environment variables supported
The environment variables take precedence over what is in the DataHub CLI config created through the `init` command. The supported environment variables are as follows:
- `DATAHUB_SKIP_CONFIG` (default `false`) - Set to `true` to skip creating the configuration file.
- `DATAHUB_GMS_URL` (default `http://localhost:8080`) - Set to the URL of the GMS instance
- `DATAHUB_GMS_HOST` (default `localhost`) - Set to the host of the GMS instance. Prefer using `DATAHUB_GMS_URL` to set the URL.
@ -136,7 +138,7 @@ The env variables take precedence over what is in the DataHub CLI config created
- `DATAHUB_TELEMETRY_ENABLED` (default `true`) - Set to `false` to disable telemetry. If the CLI is run in an environment with no access to the public internet, this should be disabled.
- `DATAHUB_TELEMETRY_TIMEOUT` (default `10`) - Set to a custom integer value to specify the timeout in seconds when sending telemetry.
- `DATAHUB_DEBUG` (default `false`) - Set to `true` to enable debug logging for the CLI. This can also be achieved through the `--debug` option of the CLI.
- `DATAHUB_VERSION` (default `head`) - Set to a specific version to run quickstart with that version of the Docker images.
- `ACTIONS_VERSION` (default `head`) - Set to a specific version to run quickstart with that image tag of `datahub-actions` container.
```shell
@ -271,6 +273,7 @@ datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PR
The `put` group of commands allows you to write metadata into DataHub. This is a flexible way for you to issue edits to metadata from the command line.
#### put aspect
The **put aspect** (also the default `put`) command instructs `datahub` to set a specific aspect for an entity to a specified value.
For example, the command shown below sets the `ownership` aspect of the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)` to the value in the file `ownership.json`.
The JSON in the `ownership.json` file needs to conform to the [`Ownership`](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) Aspect model as shown below.
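A typical invocation looks roughly like the following; `ownership.json` here is just a local file containing the aspect payload described above:
```shell
# set the ownership aspect of the dataset from a local JSON file
datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" \
    --aspect ownership -d ownership.json
```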
@ -299,6 +302,7 @@ Update succeeded with status 200
```
#### put platform
The **put platform** command (available in versions > 0.8.44.4) instructs `datahub` to create or update metadata about a data platform. This is very useful if you are using a custom data platform and want to set up its logo and display name for a native UI experience.
```shell
@ -306,7 +310,6 @@ datahub put platform --name longtail_schemas --display_name "Long Tail Schemas"
✅ Successfully wrote data platform metadata for urn:li:dataPlatform:longtail_schemas to DataHub (DataHubRestEmitter: configured to talk to https://longtailcompanions.acryl.io/api/gms with token: eyJh**********Cics)
```
### migrate
The `migrate` group of commands allows you to perform certain kinds of migrations.
@ -316,6 +319,7 @@ The `migrate` group of commands allows you to perform certain kinds of migration
The `dataplatform2instance` migration command allows you to migrate your entities from an instance-agnostic platform identifier to an instance-specific platform identifier. If you have ingested metadata in the past for this platform and would like to transfer any important metadata over to the new instance-specific entities, then you should use this command. For example, if your users have added documentation or added tags or terms to your datasets, then you should run this command to transfer this metadata over to the new entities. For further context, read the Platform Instance Guide [here](./platform-instances.md).
A few important options worth calling out:
- `--dry-run` / `-n`: Use this to get a report of what will be migrated before running
- `--force` / `-F`: Use this if you know what you are doing and do not want a confirmation prompt before the migration starts
- `--keep`: When enabled, preserves the old entities instead of deleting them. The default behavior is to soft-delete old entities.
@ -324,6 +328,7 @@ A few important options worth calling out:
**_Note_**: Timeseries aspects such as Usage Statistics and Dataset Profiles are not migrated over to the new entity instances; new data points will be created when you re-run ingestion with the usage sources or with profiling turned on.
##### Dry Run
```console
datahub migrate dataplatform2instance --platform elasticsearch --instance prod_index --dry-run
Starting migration: platform:elasticsearch, instance=prod_index, force=False, dry-run=True
@ -341,6 +346,7 @@ Starting migration: platform:elasticsearch, instance=prod_index, force=False, dr
```
##### Real Migration (with soft-delete)
```
datahub migrate dataplatform2instance --platform hive --instance warehouse
@ -373,7 +379,7 @@ to get the raw JSON difference in addition to the API output you can add the `--
```console
datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)" --category TAG --start 7daysago
2022-02-17 14:03:42 - 0.0.0-computed
MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:03:42.0
2022-02-17 14:17:30 - 0.0.0-computed
MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:17:30.118
```


@ -1 +1 @@
To get all metadata from BigQuery you need to use two plugins `bigquery` and `bigquery-usage`. Both of them are described in this page. These will require 2 separate recipes. We understand this is not ideal and we plan to make this easier in the future.
Ingesting metadata from BigQuery requires either the **bigquery** module with a single recipe (recommended), or the two separate modules **bigquery-legacy** and **bigquery-usage-legacy** (soon to be deprecated) with two separate recipes.


@ -1,10 +1,14 @@
source:
  type: bigquery-beta
  type: bigquery-legacy
  config:
    # Coordinates
    project_id: my_project_id
    # `schema_pattern` for BQ Datasets
    schema_pattern:
      allow:
        - finance_bq_dataset
    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name
@ -12,11 +16,6 @@ source:
        # project_id.dataset_name.table_name
        # We will improve this in the future
        - .*revenue_table_name
    include_table_lineage: true
    include_usage_statistics: true
    profiling:
      enabled: true
      profile_table_level_only: true
sink:
  # sink configs


@ -1,5 +1,5 @@
source:
  type: bigquery-usage
  type: bigquery-usage-legacy
  config:
    # Coordinates
    projects:


@ -15,14 +15,14 @@ There are two important concepts to understand and identify:
##### Basic Requirements (needed for metadata ingestion)
1. Identify your Extractor Project where the service account will run queries to extract metadata.
| permission                       | Description                                                                                                                                                                                                     | Capability |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| `bigquery.jobs.create`           | Run jobs (e.g. queries) within the project. *This is only needed for the extractor project where the service account belongs.*                                                                                  |            |
| `bigquery.jobs.list`             | Manage the queries that the service account has sent. *This is only needed for the extractor project where the service account belongs.*                                                                        |            |
| `bigquery.readsessions.create`   | Create a session for streaming large results. *This is only needed for the extractor project where the service account belongs.*                                                                                |            |
| `bigquery.readsessions.getData`  | Get data from the read session. *This is only needed for the extractor project where the service account belongs.*                                                                                              |            |
| `bigquery.tables.create`         | Create temporary tables when profiling tables. Tip: Use `profiling.bigquery_temp_table_schema` to ensure that all temp tables (across multiple projects) are created in this project under a specific dataset.  | Profiling  |
| `bigquery.tables.delete`         | Delete temporary tables when profiling tables. Tip: Use `profiling.bigquery_temp_table_schema` to ensure that all temp tables (across multiple projects) are created in this project under a specific dataset.  | Profiling  |
2. Grant the following permissions to the Service Account on every project from which you would like to extract metadata
:::info


@ -1,14 +1,10 @@
source:
  type: bigquery
  config:
    # Coordinates
    project_id: my_project_id
    # `schema_pattern` for BQ Datasets
    schema_pattern:
      allow:
        - finance_bq_dataset
    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name
@ -16,6 +12,11 @@ source:
        # project_id.dataset_name.table_name
        # We will improve this in the future
        - .*revenue_table_name
    include_table_lineage: true
    include_usage_statistics: true
    profiling:
      enabled: true
      profile_table_level_only: true
sink:
  # sink configs
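Assuming the recipe above is saved locally (the filename below is arbitrary), a sketch of running it against the promoted source is simply installing the plugin and invoking `datahub ingest`:
```shell
# install the BigQuery plugin and run the recipe with the promoted `bigquery` source
pip install 'acryl-datahub[bigquery]'
datahub ingest -c bigquery_recipe.yml
```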


@ -224,13 +224,16 @@ plugins: Dict[str, Set[str]] = {
# PyAthena is pinned with exact version because we use private method in PyAthena
"athena": sql_common | {"PyAthena[SQLAlchemy]==2.4.1"},
"azure-ad": set(),
"bigquery": sql_common
"bigquery-legacy": sql_common
| bigquery_common
| {"sqlalchemy-bigquery>=1.4.1", "sqllineage==1.3.6", "sqlparse"},
"bigquery-usage": bigquery_common | usage_common | {"cachetools"},
"bigquery-beta": sql_common
"bigquery-usage-legacy": bigquery_common | usage_common | {"cachetools"},
"bigquery": sql_common
| bigquery_common
| {"sqllineage==1.3.6", "sql_metadata"},
"bigquery-beta": sql_common
| bigquery_common
| {"sqllineage==1.3.6", "sql_metadata"}, # deprecated, but keeping the extra for backwards compatibility
"clickhouse": sql_common | {"clickhouse-sqlalchemy==0.1.8"},
"clickhouse-usage": sql_common
| usage_common
@ -379,7 +382,8 @@ base_dev_requirements = {
dependency
for plugin in [
"bigquery",
"bigquery-usage",
"bigquery-legacy",
"bigquery-usage-legacy",
"clickhouse",
"clickhouse-usage",
"delta-lake",
@ -480,9 +484,9 @@ entry_points = {
"sqlalchemy = datahub.ingestion.source.sql.sql_generic:SQLAlchemyGenericSource",
"athena = datahub.ingestion.source.sql.athena:AthenaSource",
"azure-ad = datahub.ingestion.source.identity.azure_ad:AzureADSource",
"bigquery = datahub.ingestion.source.sql.bigquery:BigQuerySource",
"bigquery-beta = datahub.ingestion.source.bigquery_v2.bigquery:BigqueryV2Source",
"bigquery-usage = datahub.ingestion.source.usage.bigquery_usage:BigQueryUsageSource",
"bigquery-legacy = datahub.ingestion.source.sql.bigquery:BigQuerySource",
"bigquery = datahub.ingestion.source.bigquery_v2.bigquery:BigqueryV2Source",
"bigquery-usage-legacy = datahub.ingestion.source.usage.bigquery_usage:BigQueryUsageSource",
"clickhouse = datahub.ingestion.source.sql.clickhouse:ClickHouseSource",
"clickhouse-usage = datahub.ingestion.source.usage.clickhouse_usage:ClickHouseUsageSource",
"delta-lake = datahub.ingestion.source.delta_lake:DeltaLakeSource",


@ -118,7 +118,7 @@ def cleanup(config: BigQueryV2Config) -> None:
@platform_name("BigQuery", doc_order=1)
@config_class(BigQueryV2Config)
@support_status(SupportStatus.INCUBATING)
@support_status(SupportStatus.CERTIFIED)
@capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
@capability(SourceCapability.DOMAINS, "Supported via the `domain` config field")
@capability(SourceCapability.CONTAINERS, "Enabled by default")


@ -92,7 +92,7 @@ def test_bq_usage_source(pytestconfig, tmp_path):
{
"run_id": "test-bigquery-usage",
"source": {
"type": "bigquery-usage",
"type": "bigquery-usage-legacy",
"config": {
"projects": ["sample-bigquery-project-1234"],
"start_time": "2021-01-01T00:00Z",
@ -160,7 +160,7 @@ def test_bq_usage_source_with_read_events(pytestconfig, tmp_path):
{
"run_id": "test-bigquery-usage",
"source": {
"type": "bigquery-usage",
"type": "bigquery-usage-legacy",
"config": {
"projects": ["sample-bigquery-project-1234"],
"start_time": "2021-01-01T00:00Z",