mirror of https://github.com/datahub-project/datahub.git
synced 2026-01-07 15:27:05 +00:00

feat(ingest/databricks): ingest hive metastore by default, more docs (#9601)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

parent 6cb3dc839c
commit f2e78db92e
@ -10,6 +10,26 @@ This file documents any backwards-incompatible changes in DataHub and assists pe

- Neo4j 5.x, may require migration from 4.x
- Build requires JDK17 (Runtime Java 11)
- Build requires Docker Compose > 2.20
- #9601 - The Unity Catalog (UC) ingestion source config `include_metastore` is now disabled by default. This change will affect the urns of all entities in the workspace.<br/>
  Entity hierarchy with `include_metastore: true` (old):

  ```
  - UC Metastore
    - Catalog
      - Schema
        - Table
  ```

  Entity hierarchy with `include_metastore: false` (new):

  ```
  - Catalog
    - Schema
      - Table
  ```

  We recommend using `platform_instance` to differentiate across metastores.

  If stateful ingestion is enabled, running ingestion with the latest CLI version will perform all required cleanup. Otherwise, we recommend soft deleting all Databricks data via the DataHub CLI (`datahub delete --platform databricks --soft`) and then re-ingesting with the latest CLI version.
- #9601 - The Unity Catalog (UC) ingestion source config `include_hive_metastore` is now enabled by default. This requires the config `warehouse_id` to be set. You can disable `include_hive_metastore` by setting it to `False` to avoid ingesting the legacy Hive metastore catalog in Databricks.
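The urn impact of this change can be seen in a small sketch. This is a hypothetical illustration, not DataHub's actual urn-building code: it shows that the container path each entity's urn is derived from loses its outermost component, so every urn in the workspace changes.

```python
from typing import List, Optional

def container_path(catalog: str, schema: str, table: str,
                   metastore: Optional[str] = None) -> List[str]:
    """Return the container hierarchy for a table, outermost first."""
    path = [catalog, schema, table]
    if metastore is not None:  # include_metastore: true (old behavior)
        path.insert(0, metastore)
    return path

# Hypothetical names for illustration only.
old = container_path("my_catalog", "my_schema", "my_table", metastore="uc_metastore")
new = container_path("my_catalog", "my_schema", "my_table")
print(old)  # ['uc_metastore', 'my_catalog', 'my_schema', 'my_table']
print(new)  # ['my_catalog', 'my_schema', 'my_table']
```

Since urns embed this hierarchy, re-ingesting with the new default produces entirely new entities, which is why the soft-delete step above is recommended.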
### Potential Downtime
@ -1,11 +1,33 @@

#### Troubleshooting

##### No data lineage captured or missing lineage

### Advanced

#### Multiple Databricks Workspaces

If you have multiple Databricks workspaces **that point to the same Unity Catalog metastore**, we suggest using separate recipes for ingesting the workspace-specific Hive metastore catalog and the Unity Catalog metastore's information schema.

To ingest the Hive metastore information schema:

- Set up one ingestion recipe per workspace
- Use a platform instance equivalent to the workspace name
- Ingest only the hive_metastore catalog in the recipe using the config `catalogs: ["hive_metastore"]`

To ingest the Unity Catalog information schema:

- Disable Hive metastore catalog ingestion in the recipe using the config `include_hive_metastore: False`
- Ideally, ingest from just one workspace
- To ingest from both workspaces (e.g. if each workspace has different permissions and therefore a restricted view of the UC metastore):
  - Use the same platform instance for all workspaces sharing the UC metastore
  - Ingest usage from only one workspace (you will lose usage from the other workspaces)
  - Use filters to ingest each catalog only once, though this shouldn't be necessary
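The split described above can be sketched as two recipe shapes. These are shown as Python dicts for illustration (in practice they would be YAML recipes); the workspace names and URLs are hypothetical, and only the fields discussed above are included.

```python
def hive_metastore_recipe(workspace_name: str, workspace_url: str) -> dict:
    """Per-workspace recipe: ingest only the legacy hive_metastore catalog."""
    return {
        "source": {
            "type": "unity-catalog",
            "config": {
                "workspace_url": workspace_url,
                "platform_instance": workspace_name,  # one instance per workspace
                "catalogs": ["hive_metastore"],       # only the legacy catalog
            },
        }
    }

def unity_catalog_recipe(workspace_url: str, platform_instance: str) -> dict:
    """Single recipe for the shared UC metastore's information schema."""
    return {
        "source": {
            "type": "unity-catalog",
            "config": {
                "workspace_url": workspace_url,
                "platform_instance": platform_instance,  # same for all workspaces
                "include_hive_metastore": False,  # covered by per-workspace recipes
            },
        }
    }
```

The key design point is that `platform_instance` is per-workspace for the Hive metastore recipes but shared for the UC recipe, since the UC metastore itself is shared.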
### Troubleshooting

#### No data lineage captured or missing lineage

Check that you meet the [Unity Catalog lineage requirements](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html#requirements).

Also check the [Unity Catalog limitations](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html#limitations) to make sure that lineage would be expected to exist in this case.

##### Lineage extraction is too slow
#### Lineage extraction is too slow

Currently, there is no way to get table or column lineage in bulk from the Databricks Unity Catalog REST API. Table lineage requires one API call per table, and column lineage requires one API call per column. If you find metadata extraction taking too long, you can turn off column-level lineage extraction via the `include_column_lineage` config flag.
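A back-of-the-envelope sketch shows why `include_column_lineage` dominates extraction time. This is a hypothetical helper, not part of DataHub; it just applies the per-table / per-column call counts described above.

```python
def lineage_api_calls(num_tables: int, avg_columns_per_table: int,
                      include_column_lineage: bool) -> int:
    """One call per table, plus one call per column when column lineage is on."""
    calls = num_tables  # table-lineage calls
    if include_column_lineage:
        calls += num_tables * avg_columns_per_table  # column-lineage calls
    return calls

# Illustrative numbers: 1,000 tables averaging 20 columns each.
print(lineage_api_calls(1000, 20, include_column_lineage=True))   # 21000
print(lineage_api_calls(1000, 20, include_column_lineage=False))  # 1000
```

Disabling column lineage here cuts the API-call volume by a factor of roughly the average column count.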
@ -13,6 +13,11 @@

* Ownership of or `SELECT` privilege on any tables and views you want to ingest
  * [Ownership documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/ownership.html)
  * [Privileges documentation](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html)
+ To ingest the legacy hive_metastore catalog (`include_hive_metastore` - enabled by default), your service principal must have all of the following:
  * `READ_METADATA` and `USAGE` privilege on the `hive_metastore` catalog
  * `READ_METADATA` and `USAGE` privilege on schemas you want to ingest
  * `READ_METADATA` and `USAGE` privilege on tables and views you want to ingest
  * [Hive Metastore Privileges documentation](https://docs.databricks.com/en/sql/language-manual/sql-ref-privileges-hms.html)
+ To ingest your workspace's notebooks and their lineage, your service principal must have `CAN_READ` privileges on the folders containing the notebooks you want to ingest: [guide](https://docs.databricks.com/en/security/auth-authz/access-control/workspace-acl.html#folder-permissions).
+ To `include_usage_statistics` (enabled by default), your service principal must have `CAN_MANAGE` permissions on any SQL Warehouses you want to ingest: [guide](https://docs.databricks.com/security/auth-authz/access-control/sql-endpoint-acl.html).
+ To ingest `profiling` information with `method: ge`, you need `SELECT` privileges on all profiled tables.
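For auditing a service principal, the checklist above can be condensed into a small helper. This is a hypothetical function, not part of DataHub; it only maps the config options named above to the privileges they require.

```python
from typing import Optional, Set

def required_privileges(include_hive_metastore: bool = True,
                        include_usage_statistics: bool = True,
                        profiling_method: Optional[str] = None) -> Set[str]:
    """Map enabled source options to the privilege requirements listed above."""
    needs = {"SELECT (or ownership) on ingested tables and views"}
    if include_hive_metastore:  # enabled by default per this change
        needs.add("READ_METADATA + USAGE on hive_metastore catalog, schemas, tables")
    if include_usage_statistics:  # enabled by default
        needs.add("CAN_MANAGE on SQL Warehouses")
    if profiling_method == "ge":
        needs.add("SELECT on all profiled tables")
    return needs
```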
@ -126,7 +126,7 @@ class UnityCatalogSourceConfig(
        description="SQL Warehouse id, for running queries. If not set, will use the default warehouse.",
    )
    include_hive_metastore: bool = pydantic.Field(
-       default=False,
+       default=True,
        description="Whether to ingest legacy `hive_metastore` catalog. This requires executing queries on SQL warehouse.",
    )
    workspace_name: Optional[str] = pydantic.Field(
@ -135,12 +135,12 @@ class UnityCatalogSourceConfig(
    )

    include_metastore: bool = pydantic.Field(
-       default=True,
+       default=False,
        description=(
            "Whether to ingest the workspace's metastore as a container and include it in all urns."
            " Changing this will affect the urns of all entities in the workspace."
-           " This will be disabled by default in the future,"
-           " so it is recommended to set this to `False` for new ingestions."
+           " This config is deprecated and will be removed in the future,"
+           " so it is recommended to not set this to `True` for new ingestions."
            " If you have an existing unity catalog ingestion, you'll want to avoid duplicates by soft deleting existing data."
            " If stateful ingestion is enabled, running with `include_metastore: false` should be sufficient."
            " Otherwise, we recommend deleting via the cli: `datahub delete --platform databricks` and re-ingesting with `include_metastore: false`."
@ -299,7 +299,7 @@ class UnityCatalogSourceConfig(
        if v:
            msg = (
                "`include_metastore` is enabled."
-               " This is not recommended and will be disabled by default in the future, which is a breaking change."
+               " This is not recommended and this option will be removed in the future, which is a breaking change."
                " All databricks urns will change if you re-ingest with this disabled."
                " We recommend soft deleting all databricks data and re-ingesting with `include_metastore` set to `False`."
            )
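The validation change above can be sketched with the standard library alone. This is a simplified, stdlib-only sketch (DataHub itself uses a pydantic validator and its own warning machinery): warn loudly whenever `include_metastore` is enabled, since the option is now deprecated.

```python
import warnings
from dataclasses import dataclass

@dataclass
class UnityCatalogConfigSketch:
    include_metastore: bool = False  # new default per this change

    def __post_init__(self) -> None:
        # Mirror the validator's message: enabling this option is deprecated.
        if self.include_metastore:
            warnings.warn(
                "`include_metastore` is enabled. This option will be removed in "
                "the future; all databricks urns will change if you re-ingest "
                "with it disabled.",
                DeprecationWarning,
                stacklevel=2,
            )
```

A `DeprecationWarning` (rather than a hard error) keeps existing recipes working while steering new ones toward the default.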
File diff suppressed because it is too large
@ -15,6 +15,7 @@ def test_within_thirty_days():
            "token": "token",
            "workspace_url": "https://workspace_url",
            "include_usage_statistics": True,
+           "include_hive_metastore": False,
            "start_time": FROZEN_TIME - timedelta(days=30),
        }
    )
@ -38,6 +39,7 @@ def test_profiling_requires_warehouses_id():
        {
            "token": "token",
            "workspace_url": "https://workspace_url",
+           "include_hive_metastore": False,
            "profiling": {
                "enabled": True,
                "method": "ge",
@ -51,6 +53,7 @@ def test_profiling_requires_warehouses_id():
        {
            "token": "token",
            "workspace_url": "https://workspace_url",
+           "include_hive_metastore": False,
            "profiling": {"enabled": False, "method": "ge"},
        }
    )
@ -60,6 +63,7 @@
    UnityCatalogSourceConfig.parse_obj(
        {
            "token": "token",
+           "include_hive_metastore": False,
            "workspace_url": "workspace_url",
        }
    )
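The tests above disable `include_hive_metastore` because the new default pulls in a constraint: per the changelog entry, the flag requires `warehouse_id` to be set. A stdlib sketch of that constraint (not DataHub's actual pydantic model; the `warehouse_id` value below is a hypothetical placeholder):

```python
def validate_config(config: dict) -> dict:
    """Apply the new default and enforce the warehouse_id requirement."""
    cfg = {"include_hive_metastore": True, **config}  # new default: True
    if cfg["include_hive_metastore"] and not cfg.get("warehouse_id"):
        raise ValueError(
            "include_hive_metastore requires warehouse_id, since legacy "
            "hive_metastore ingestion executes queries on a SQL warehouse"
        )
    return cfg

# Passing a hypothetical warehouse id satisfies the constraint:
validate_config({"warehouse_id": "abc123"})
# Opting out of hive_metastore ingestion also works without one:
validate_config({"include_hive_metastore": False})
```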