Mirror of https://github.com/datahub-project/datahub.git, synced 2025-10-26 16:34:44 +00:00
feat(cli): delete cli v2 (#8068)
This commit is contained in:
parent
3c0d720eb6
commit
afd65e16fb
@ -25,8 +25,8 @@ DataHub Docker Images:

Do not use `latest` or `debug` tags for any of the images, as those are not supported and are present only for legacy reasons. Please use `head` or version-specific tags like `v0.8.40`. For production, we recommend using version-specific tags rather than `head`.

* [linkedin/datahub-ingestion](https://hub.docker.com/r/linkedin/datahub-ingestion/) - This contains the Python CLI. If you are looking for a Docker image for every minor CLI release, you can find them under [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/).
* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/).
* [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/)
* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/)
* [linkedin/datahub-frontend-react](https://hub.docker.com/repository/docker/linkedin/datahub-frontend-react/)
* [linkedin/datahub-mae-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mae-consumer/)
* [linkedin/datahub-mce-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mce-consumer/)
15 docs/cli.md
@ -138,14 +138,9 @@ The `check` command allows you to check if all plugins are loaded correctly as w
|
||||
|
||||
### delete
|
||||
|
||||
The `delete` command allows you to delete metadata from DataHub. Read this [guide](./how/delete-metadata.md) to understand how you can delete metadata from DataHub.
|
||||
:::info
|
||||
Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Container, it will not automatically delete the children; you will instead need to delete each child by URN in addition to deleting the parent.
|
||||
:::
|
||||
The `delete` command allows you to delete metadata from DataHub.
|
||||
|
||||
```console
|
||||
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --soft
|
||||
```
|
||||
The [metadata deletion guide](./how/delete-metadata.md) covers the various options for the delete command.
|
||||
|
||||
### exists
|
||||
|
||||
@ -534,11 +529,11 @@ Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_event
|
||||
|
||||
### Using docker
|
||||
|
||||
[linkedin/datahub-ingestion on Docker Hub](https://hub.docker.com/r/linkedin/datahub-ingestion)
[docker-ingestion build workflow](https://github.com/datahub-project/datahub/actions/workflows/docker-ingestion.yml)
[acryldata/datahub-ingestion on Docker Hub](https://hub.docker.com/r/acryldata/datahub-ingestion)
[docker-ingestion build workflow (acryldata)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml)
|
||||
|
||||
If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
|
||||
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.
|
||||
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/acryldata/datahub-ingestion). All plugins will be installed and enabled automatically.
|
||||
|
||||
You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). If you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image, log onto a shell on the pod, and you will have access to the DataHub CLI in your Kubernetes cluster.
|
||||
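As a rough sketch, assuming the prebuilt image exposes the `datahub` CLI as its entrypoint, you can also invoke one-off CLI commands directly from the image without installing anything locally:

```shell
# Run a one-off CLI command from the prebuilt image; pick a tag that matches your server version.
docker run --rm acryldata/datahub-ingestion:head version
```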
|
||||
|
||||
@ -1,130 +1,236 @@
|
||||
# Removing Metadata from DataHub
|
||||
|
||||
:::tip
|
||||
To follow this guide, you'll need the [DataHub CLI](../cli.md).
|
||||
:::
|
||||
|
||||
There are two ways to delete metadata from DataHub:
|
||||
|
||||
1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of entities
|
||||
2. Delete metadata created by a single ingestion run
|
||||
1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of urns (delete CLI).
|
||||
2. Delete metadata created by a single ingestion run (rollback).
|
||||
|
||||
To follow this guide you need to use [DataHub CLI](../cli.md).
|
||||
:::caution Be careful when deleting metadata
|
||||
|
||||
Read on to find out how to perform these kinds of deletes.
|
||||
- Always use `--dry-run` to test your delete command before executing it.
|
||||
- Prefer reversible soft deletes (`--soft`) over irreversible hard deletes (`--hard`).
|
||||
|
||||
_Note: Deleting metadata should only be done with care. Always use `--dry-run` to understand what will be deleted before proceeding. Prefer soft-deletes (`--soft`) unless you really want to nuke metadata rows. Hard deletes will actually delete rows in the primary store and recovering them will require using backups of the primary metadata store. Make sure you understand the implications of issuing soft-deletes versus hard-deletes before proceeding._
|
||||
:::
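For example, a cautious workflow previews the matched entities first and only then issues a reversible soft delete (the filter values here are illustrative):

```shell
# Preview what would be deleted, without changing anything.
datahub delete --platform snowflake --env DEV --dry-run

# Then perform a reversible soft delete of the same selection.
datahub delete --platform snowflake --env DEV --soft
```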
|
||||
|
||||
## Delete CLI Usage
|
||||
|
||||
:::info
|
||||
Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Domain, it will not delete those children; you will instead need to delete each child by URN in addition to deleting the parent.
|
||||
|
||||
Deleting metadata using DataHub's CLI is a simple, systems-level action. If you attempt to delete an entity with children, such as a container, it will not delete those children. Instead, you will need to delete each child by URN in addition to deleting the parent.
|
||||
|
||||
:::
|
||||
## Delete By Urn
|
||||
|
||||
To delete all the data related to a single entity, run
|
||||
All the commands below support the following options:
|
||||
|
||||
### Soft Delete (the default)
|
||||
- `-n/--dry-run`: Execute a dry run instead of the actual delete.
|
||||
- `--force`: Skip confirmation prompts.
|
||||
|
||||
This sets the `Status` aspect of the entity to `Removed`, which hides the entity and all its aspects from being returned by the UI.
|
||||
```
|
||||
### Selecting entities to delete
|
||||
|
||||
You can either provide a single urn to delete, or use filters to select a set of entities to delete.
|
||||
|
||||
```shell
|
||||
# Soft delete a single urn.
|
||||
datahub delete --urn "<my urn>"
|
||||
```
|
||||
or
|
||||
```
|
||||
datahub delete --urn "<my urn>" --soft
|
||||
|
||||
# Soft delete using a filter.
|
||||
datahub delete --platform snowflake
|
||||
|
||||
# Filters can be combined, which will select entities that match all filters.
|
||||
datahub delete --platform looker --entity-type chart
|
||||
datahub delete --platform bigquery --env PROD
|
||||
```
|
||||
|
||||
### Hard Delete
|
||||
When performing hard deletes, you can optionally add the `--only-soft-deleted` flag to only hard delete entities that were previously soft deleted.
|
||||
|
||||
This physically deletes all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
|
||||
### Performing the delete
|
||||
|
||||
#### Soft delete an entity (default)
|
||||
|
||||
By default, the delete command will perform a soft delete.
|
||||
|
||||
This will set the `status` aspect's `removed` field to `true`, which will hide the entity from the UI. However, you'll still be able to view the entity's metadata in the UI with a direct link.
|
||||
|
||||
```shell
|
||||
# The `--soft` flag is redundant since it's the default.
|
||||
datahub delete --urn "<urn>" --soft
|
||||
# or using a filter
|
||||
datahub delete --platform snowflake --soft
|
||||
```
|
||||
|
||||
#### Hard delete an entity
|
||||
|
||||
This will physically delete all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
|
||||
|
||||
```shell
|
||||
datahub delete --urn "<my urn>" --hard
|
||||
# or using a filter
|
||||
datahub delete --platform snowflake --hard
|
||||
```
|
||||
|
||||
As of datahub v0.8.35 doing a hard delete by urn will also provide you with a way to remove references to the urn being deleted across the metadata graph. This is important to use if you don't want to have ghost references in your metadata model and want to save space in the graph database.
|
||||
For now, this behaviour must be opted into by a prompt that will appear for you to manually accept or deny.
|
||||
As of datahub v0.10.2.3, hard deleting tags, glossary terms, users, and groups will also remove references to those entities across the metadata graph.
|
||||
|
||||
You can optionally add `-n` or `--dry-run` to execute a dry run before issuing the final delete command.
|
||||
You can optionally add `-f` or `--force` to skip confirmations.
You can optionally add the `--only-soft-deleted` flag to remove soft-deleted items only.
|
||||
#### Hard delete a timeseries aspect
|
||||
|
||||
:::note
|
||||
It's also possible to delete a range of timeseries aspect data for an entity without deleting the entire entity.
|
||||
|
||||
Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
|
||||
For these deletes, the aspect and time ranges are required. You can delete all data for a timeseries aspect by providing `--start-time min --end-time max`.
|
||||
|
||||
```shell
|
||||
datahub delete --urn "<my urn>" --aspect <aspect name> --start-time '-30 days' --end-time '-7 days'
|
||||
# or using a filter
|
||||
datahub delete --platform snowflake --entity-type dataset --aspect datasetProfile --start-time '0' --end-time '2023-01-01'
|
||||
```
|
||||
|
||||
The start and end time fields filter on the `timestampMillis` field of the timeseries aspect. The allowed start and end time formats are shown below, with an example after the list:
|
||||
|
||||
- `YYYY-MM-DD`: a specific date
|
||||
- `YYYY-MM-DD HH:mm:ss`: a specific timestamp, assumed to be in UTC unless otherwise specified
|
||||
- `+/-<number> <unit>` (e.g. `-7 days`): a relative time, where `<number>` is an integer and `<unit>` is one of `days`, `hours`, `minutes`, `seconds`
|
||||
- `ddddddddd` (e.g. `1684384045`): a unix timestamp
|
||||
- `min`, `max`, `now`: special keywords
|
||||
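The formats can also be mixed within a single command; the platform and aspect below are illustrative:

```shell
# Delete datasetProfile rows between a unix timestamp and a point 7 days ago.
datahub delete --platform snowflake --entity-type dataset --aspect datasetProfile --start-time '1672531200' --end-time '-7 days'
```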
|
||||
## Delete CLI Examples
|
||||
|
||||
:::note
|
||||
|
||||
Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
|
||||
|
||||
:::
|
||||
|
||||
If you wish to hard-delete using a curl request, you can use something like the example below. Replace the URN with the URN that you wish to delete.
|
||||
_Note: All of the commands below support `--dry-run` and `--force` (skips confirmation prompts)._
|
||||
|
||||
#### Soft delete a single entity
|
||||
|
||||
```shell
|
||||
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
|
||||
```
|
||||
|
||||
#### Hard delete a single entity
|
||||
|
||||
```shell
|
||||
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --hard
|
||||
```
|
||||
|
||||
#### Delete everything from the Snowflake DEV environment
|
||||
|
||||
```shell
|
||||
datahub delete --platform snowflake --env DEV
|
||||
```
|
||||
|
||||
#### Delete all BigQuery datasets in the PROD environment
|
||||
|
||||
```shell
|
||||
# Note: this will leave BigQuery containers intact.
|
||||
datahub delete --env PROD --entity-type dataset --platform bigquery
|
||||
```
|
||||
|
||||
#### Delete all pipelines and tasks from Airflow
|
||||
|
||||
```shell
|
||||
datahub delete --platform "airflow"
|
||||
```
|
||||
|
||||
#### Delete all containers for a particular platform
|
||||
|
||||
```shell
|
||||
datahub delete --entity-type container --platform s3
|
||||
```
|
||||
|
||||
#### Delete everything in the DEV environment
|
||||
|
||||
```shell
|
||||
# This is a pretty broad filter, so make sure you know what you're doing!
|
||||
datahub delete --env DEV
|
||||
```
|
||||
|
||||
#### Delete all Looker dashboards and charts
|
||||
|
||||
```shell
|
||||
datahub delete --platform looker
|
||||
```
|
||||
|
||||
#### Delete all Looker charts (but not dashboards)
|
||||
|
||||
```shell
|
||||
datahub delete --platform looker --entity-type chart
|
||||
```
|
||||
|
||||
#### Clean up old datasetProfiles
|
||||
|
||||
```shell
|
||||
datahub delete --entity-type dataset --aspect datasetProfile --start-time 'min' --end-time '-60 days'
|
||||
```
|
||||
|
||||
#### Delete a tag
|
||||
|
||||
```shell
|
||||
# Soft delete.
|
||||
datahub delete --urn 'urn:li:tag:Legacy' --soft
|
||||
|
||||
# Or, using a hard delete. This will automatically clean up all tag associations.
|
||||
datahub delete --urn 'urn:li:tag:Legacy' --hard
|
||||
```
|
||||
|
||||
#### Delete all datasets that match a query
|
||||
|
||||
```shell
|
||||
# Note: the query is an advanced feature, but can sometimes select extra entities - use it with caution!
|
||||
datahub delete --entity-type dataset --query "_tmp"
|
||||
```
|
||||
|
||||
#### Hard delete everything in Snowflake that was previously soft deleted
|
||||
|
||||
```shell
|
||||
datahub delete --platform snowflake --only-soft-deleted --hard
|
||||
```
|
||||
|
||||
## Deletes using the SDK and APIs
|
||||
|
||||
The Python SDK's [DataHubGraph](../../python-sdk/clients.md) client supports deletes via the following methods:
|
||||
|
||||
- `soft_delete_entity`
|
||||
- `hard_delete_entity`
|
||||
- `hard_delete_timeseries_aspect`
|
||||
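A minimal sketch of these methods, assuming a GMS instance at `http://localhost:8080`; the dataset urn is purely illustrative:

```python
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Connect to the DataHub GMS server (adjust the URL for your deployment).
graph = DataHubGraph(config=DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"

# Reversible soft delete: marks the entity as removed but keeps its rows.
graph.soft_delete_entity(urn=dataset_urn)

# Irreversible hard delete: physically removes all rows for all aspects.
# graph.hard_delete_entity(urn=dataset_urn)
```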
|
||||
Deletes via the REST API are also possible, although we recommend using the SDK instead.
|
||||
|
||||
```shell
|
||||
# hard delete an entity by urn
|
||||
curl "http://localhost:8080/entities?action=delete" -X POST --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"}'
|
||||
```
|
||||
|
||||
## Delete by filters
|
||||
|
||||
_Note: All of the commands below support the soft-delete option (`-s/--soft`) as well as the dry-run option (`-n/--dry-run`)._
|
||||
|
||||
|
||||
### Delete all Datasets from the Snowflake platform
|
||||
```
|
||||
datahub delete --entity_type dataset --platform snowflake
|
||||
```
|
||||
|
||||
### Delete all containers for a particular platform
|
||||
```
|
||||
datahub delete --entity_type container --platform s3
|
||||
```
|
||||
|
||||
### Delete all datasets in the DEV environment
|
||||
```
|
||||
datahub delete --env DEV --entity_type dataset
|
||||
```
|
||||
|
||||
### Delete all Pipelines and Tasks in the DEV environment
|
||||
```
|
||||
datahub delete --env DEV --entity_type "dataJob"
|
||||
datahub delete --env DEV --entity_type "dataFlow"
|
||||
```
|
||||
|
||||
### Delete all bigquery datasets in the PROD environment
|
||||
```
|
||||
datahub delete --env PROD --entity_type dataset --platform bigquery
|
||||
```
|
||||
|
||||
### Delete all looker dashboards and charts
|
||||
```
|
||||
datahub delete --entity_type dashboard --platform looker
|
||||
datahub delete --entity_type chart --platform looker
|
||||
```
|
||||
|
||||
### Delete all datasets that match a query
|
||||
```
|
||||
datahub delete --entity_type dataset --query "_tmp"
|
||||
```
|
||||
|
||||
## Rollback Ingestion Run
|
||||
|
||||
The second way to delete metadata is to identify entities (and the aspects affected) by using an ingestion `run-id`. Whenever you run `datahub ingest -c ...`, all the metadata ingested with that run will have the same run id.
|
||||
|
||||
To view the ids of the most recent set of ingestion batches, execute
|
||||
|
||||
```
|
||||
```shell
|
||||
datahub ingest list-runs
|
||||
```
|
||||
|
||||
That will print out a table of all the runs. Once you have an idea of which run you want to roll back, run
|
||||
|
||||
```
|
||||
```shell
|
||||
datahub ingest show --run-id <run-id>
|
||||
```
|
||||
|
||||
to see more information about the run.
|
||||
|
||||
Alternatively, you can execute a dry-run rollback to achieve the same outcome.
|
||||
|
||||
```shell
|
||||
datahub ingest rollback --dry-run --run-id <run-id>
|
||||
```
|
||||
|
||||
Finally, once you are sure you want to delete this data forever, run
|
||||
|
||||
```
|
||||
```shell
|
||||
datahub ingest rollback --run-id <run-id>
|
||||
```
|
||||
|
||||
@ -133,10 +239,9 @@ This deletes both the versioned and the timeseries aspects associated with these
|
||||
|
||||
### Unsafe Entities and Rollback
|
||||
|
||||
> **_NOTE:_** Preservation of unsafe entities has been added in datahub `0.8.32`. Read on to understand what it means and how it works.
|
||||
|
||||
In some cases, entities that were initially ingested by a run might have had further modifications to their metadata (e.g. adding terms, tags, or documentation) through the UI or other means. During a rollback of the ingestion that initially created these entities (technically, if the key aspects for these entities are being rolled back), the ingestion process will analyse the metadata graph for aspects that will be left "dangling" and will:
|
||||
1. Leave these aspects untouched in the database, and soft-delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
|
||||
|
||||
1. Leave these aspects untouched in the database, and soft delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
|
||||
2. The datahub cli will save information about these unsafe entities as a CSV for operators to later review and decide on next steps (keep or remove).
|
||||
|
||||
The rollback command will report how many entities have such aspects and will save the urns of these entities as a CSV under a rollback reports directory. This directory defaults to `rollback_reports` under the current directory where the CLI is run, and can be configured further using the `--reports-dir` command line argument.
|
||||
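For example (the report directory name is illustrative):

```shell
# Roll back a run and write the unsafe-entity report to a custom directory.
datahub ingest rollback --run-id <run-id> --reports-dir ./my_rollback_reports
```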
|
||||
@ -7,6 +7,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
|
||||
### Breaking Changes
|
||||
|
||||
- #7900: The `catalog_pattern` and `schema_pattern` options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex `^` in your patterns, this should not affect you.
|
||||
- #8068: In the `datahub delete` CLI, if an `--entity-type` filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset.
|
||||
- #8068: In the `datahub delete` CLI, the `--start-time` and `--end-time` parameters are now required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use `--start-time min --end-time max`.
|
||||
|
||||
### Potential Downtime
|
||||
|
||||
|
||||
@ -1,15 +1,20 @@
|
||||
import logging
|
||||
|
||||
from datahub.cli import delete_cli
|
||||
from datahub.emitter.mce_builder import make_dataset_urn
|
||||
from datahub.emitter.rest_emitter import DatahubRestEmitter
|
||||
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
|
||||
graph = DataHubGraph(
|
||||
config=DatahubClientConfig(
|
||||
server="http://localhost:8080",
|
||||
)
|
||||
)
|
||||
|
||||
dataset_urn = make_dataset_urn(name="fct_users_created", platform="hive")
|
||||
|
||||
delete_cli._delete_one_urn(urn=dataset_urn, soft=True, cached_emitter=rest_emitter)
|
||||
# Soft-delete the dataset.
|
||||
graph.delete_entity(urn=dataset_urn, hard=False)
|
||||
|
||||
log.info(f"Deleted dataset {dataset_urn}")
|
||||
|
||||
@ -10,6 +10,7 @@ from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union
|
||||
import click
|
||||
import requests
|
||||
import yaml
|
||||
from deprecated import deprecated
|
||||
from pydantic import BaseModel, ValidationError
|
||||
from requests.models import Response
|
||||
from requests.sessions import Session
|
||||
@ -317,50 +318,7 @@ def post_rollback_endpoint(
|
||||
)
|
||||
|
||||
|
||||
def post_delete_references_endpoint(
|
||||
payload_obj: dict,
|
||||
path: str,
|
||||
cached_session_host: Optional[Tuple[Session, str]] = None,
|
||||
) -> Tuple[int, List[Dict]]:
|
||||
session, gms_host = cached_session_host or get_session_and_host()
|
||||
url = gms_host + path
|
||||
|
||||
payload = json.dumps(payload_obj)
|
||||
response = session.post(url, payload)
|
||||
summary = parse_run_restli_response(response)
|
||||
reference_count = summary.get("total", 0)
|
||||
related_aspects = summary.get("relatedAspects", [])
|
||||
return reference_count, related_aspects
|
||||
|
||||
|
||||
def post_delete_endpoint(
|
||||
payload_obj: dict,
|
||||
path: str,
|
||||
cached_session_host: Optional[Tuple[Session, str]] = None,
|
||||
) -> typing.Tuple[str, int, int]:
|
||||
session, gms_host = cached_session_host or get_session_and_host()
|
||||
url = gms_host + path
|
||||
|
||||
return post_delete_endpoint_with_session_and_url(session, url, payload_obj)
|
||||
|
||||
|
||||
def post_delete_endpoint_with_session_and_url(
|
||||
session: Session,
|
||||
url: str,
|
||||
payload_obj: dict,
|
||||
) -> typing.Tuple[str, int, int]:
|
||||
payload = json.dumps(payload_obj)
|
||||
|
||||
response = session.post(url, payload)
|
||||
|
||||
summary = parse_run_restli_response(response)
|
||||
urn: str = summary.get("urn", "")
|
||||
rows_affected: int = summary.get("rows", 0)
|
||||
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
|
||||
|
||||
return urn, rows_affected, timeseries_rows_affected
|
||||
|
||||
|
||||
@deprecated(reason="Use DataHubGraph.get_urns_by_filter instead")
|
||||
def get_urns_by_filter(
|
||||
platform: Optional[str],
|
||||
env: Optional[str] = None,
|
||||
|
||||
@ -1,65 +1,99 @@
|
||||
import logging
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from datetime import datetime
|
||||
from random import choices
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
from typing import Dict, List, Optional
|
||||
|
||||
import click
|
||||
import humanfriendly
|
||||
import progressbar
|
||||
from click_default_group import DefaultGroup
|
||||
from requests import sessions
|
||||
from tabulate import tabulate
|
||||
|
||||
from datahub.cli import cli_utils
|
||||
from datahub.emitter import rest_emitter
|
||||
from datahub.emitter.mcp import MetadataChangeProposalWrapper
|
||||
from datahub.metadata.schema_classes import StatusClass, SystemMetadataClass
|
||||
from datahub.configuration.datetimes import ClickDatetime
|
||||
from datahub.emitter.aspect import ASPECT_MAP, TIMESERIES_ASPECT_MAP
|
||||
from datahub.ingestion.graph.client import (
|
||||
DataHubGraph,
|
||||
RemovedStatusFilter,
|
||||
get_default_graph,
|
||||
)
|
||||
from datahub.telemetry import telemetry
|
||||
from datahub.upgrade import upgrade
|
||||
from datahub.utilities.perf_timer import PerfTimer
|
||||
from datahub.utilities.urns.urn import guess_entity_type
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
RUN_TABLE_COLUMNS = ["urn", "aspect name", "created at"]
|
||||
_RUN_TABLE_COLUMNS = ["urn", "aspect name", "created at"]
|
||||
_UNKNOWN_NUM_RECORDS = -1
|
||||
|
||||
UNKNOWN_NUM_RECORDS = -1
|
||||
_DELETE_WITH_REFERENCES_TYPES = {
|
||||
"tag",
|
||||
"corpuser",
|
||||
"corpGroup",
|
||||
"domain",
|
||||
"glossaryTerm",
|
||||
"glossaryNode",
|
||||
}
|
||||
|
||||
|
||||
@click.group(cls=DefaultGroup, default="by-filter")
|
||||
def delete() -> None:
|
||||
"""Delete metadata from DataHub."""
|
||||
"""Delete metadata from DataHub.
|
||||
|
||||
See https://datahubproject.io/docs/how/delete-metadata for more detailed docs.
|
||||
"""
|
||||
pass
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeletionResult:
|
||||
start_time: int = int(time.time() * 1000.0)
|
||||
end_time: int = 0
|
||||
num_records: int = 0
|
||||
num_timeseries_records: int = 0
|
||||
num_entities: int = 0
|
||||
sample_records: Optional[List[List[str]]] = None
|
||||
|
||||
def start(self) -> None:
|
||||
self.start_time = int(time.time() * 1000.0)
|
||||
|
||||
def end(self) -> None:
|
||||
self.end_time = int(time.time() * 1000.0)
|
||||
num_referenced_entities: int = 0
|
||||
|
||||
def merge(self, another_result: "DeletionResult") -> None:
|
||||
self.end_time = another_result.end_time
|
||||
self.num_records = (
|
||||
self.num_records + another_result.num_records
|
||||
if another_result.num_records != UNKNOWN_NUM_RECORDS
|
||||
else UNKNOWN_NUM_RECORDS
|
||||
self.num_records = self._sum_handle_unknown(
|
||||
self.num_records, another_result.num_records
|
||||
)
|
||||
self.num_timeseries_records += another_result.num_timeseries_records
|
||||
self.num_entities += another_result.num_entities
|
||||
if another_result.sample_records:
|
||||
if not self.sample_records:
|
||||
self.sample_records = []
|
||||
self.sample_records.extend(another_result.sample_records)
|
||||
self.num_timeseries_records = self._sum_handle_unknown(
|
||||
self.num_timeseries_records, another_result.num_timeseries_records
|
||||
)
|
||||
self.num_entities = self._sum_handle_unknown(
|
||||
self.num_entities, another_result.num_entities
|
||||
)
|
||||
self.num_referenced_entities = self._sum_handle_unknown(
|
||||
self.num_referenced_entities, another_result.num_referenced_entities
|
||||
)
|
||||
|
||||
def format_message(self, *, dry_run: bool, soft: bool, time_sec: float) -> str:
|
||||
counters = (
|
||||
f"{self.num_entities} entities"
|
||||
f" (impacts {self._value_or_unknown(self.num_records)} versioned rows"
|
||||
f" and {self._value_or_unknown(self.num_timeseries_records)} timeseries aspect rows)"
|
||||
)
|
||||
if self.num_referenced_entities > 0:
|
||||
counters += (
|
||||
f" and cleaned up {self.num_referenced_entities} referenced entities"
|
||||
)
|
||||
|
||||
if not dry_run:
|
||||
delete_type = "Soft deleted" if soft else "Hard deleted"
|
||||
return f"{delete_type} {counters} in {humanfriendly.format_timespan(time_sec)}."
|
||||
else:
|
||||
return f"[Dry-run] Would delete {counters}."
|
||||
|
||||
@classmethod
|
||||
def _value_or_unknown(cls, value: int) -> str:
|
||||
return str(value) if value != _UNKNOWN_NUM_RECORDS else "an unknown number of"
|
||||
|
||||
@classmethod
|
||||
def _sum_handle_unknown(cls, value1: int, value2: int) -> int:
|
||||
if value1 == _UNKNOWN_NUM_RECORDS or value2 == _UNKNOWN_NUM_RECORDS:
|
||||
return _UNKNOWN_NUM_RECORDS
|
||||
return value1 + value2
|
||||
|
||||
|
||||
@delete.command()
|
||||
@ -79,7 +113,7 @@ def by_registry(
|
||||
registry_id: str,
|
||||
soft: bool,
|
||||
dry_run: bool,
|
||||
) -> DeletionResult:
|
||||
) -> None:
|
||||
"""
|
||||
Delete all metadata written using the given registry id and version pair.
|
||||
"""
|
||||
@ -89,35 +123,96 @@ def by_registry(
|
||||
"Soft-deleting with a registry-id is not yet supported. Try --dry-run to see what you will be deleting, before issuing a hard-delete using the --hard flag"
|
||||
)
|
||||
|
||||
deletion_result = DeletionResult()
|
||||
deletion_result.num_entities = 1
|
||||
deletion_result.num_records = UNKNOWN_NUM_RECORDS # Default is unknown
|
||||
registry_delete = {"registryId": registry_id, "dryRun": dry_run, "soft": soft}
|
||||
(
|
||||
structured_rows,
|
||||
entities_affected,
|
||||
aspects_affected,
|
||||
unsafe_aspects,
|
||||
unsafe_entity_count,
|
||||
unsafe_entities,
|
||||
) = cli_utils.post_rollback_endpoint(registry_delete, "/entities?action=deleteAll")
|
||||
deletion_result.num_entities = entities_affected
|
||||
deletion_result.num_records = aspects_affected
|
||||
deletion_result.sample_records = structured_rows
|
||||
deletion_result.end()
|
||||
return deletion_result
|
||||
with PerfTimer() as timer:
|
||||
registry_delete = {"registryId": registry_id, "dryRun": dry_run, "soft": soft}
|
||||
(
|
||||
structured_rows,
|
||||
entities_affected,
|
||||
aspects_affected,
|
||||
unsafe_aspects,
|
||||
unsafe_entity_count,
|
||||
unsafe_entities,
|
||||
) = cli_utils.post_rollback_endpoint(
|
||||
registry_delete, "/entities?action=deleteAll"
|
||||
)
|
||||
|
||||
if not dry_run:
|
||||
message = "soft delete" if soft else "hard delete"
|
||||
click.echo(
|
||||
f"Took {timer.elapsed_seconds()} seconds to {message}"
|
||||
f" {aspects_affected} versioned rows"
|
||||
f" for {entities_affected} entities."
|
||||
)
|
||||
else:
|
||||
click.echo(
|
||||
f"{entities_affected} entities with {aspects_affected} rows will be affected. "
|
||||
f"Took {timer.elapsed_seconds()} seconds to evaluate."
|
||||
)
|
||||
if structured_rows:
|
||||
click.echo(tabulate(structured_rows, _RUN_TABLE_COLUMNS, tablefmt="grid"))
|
||||
|
||||
|
||||
@delete.command()
|
||||
@click.option("--urn", required=True, type=str, help="the urn of the entity")
|
||||
@click.option("-n", "--dry-run", required=False, is_flag=True)
|
||||
@click.option(
|
||||
"-f", "--force", required=False, is_flag=True, help="force the delete if set"
|
||||
)
|
||||
@telemetry.with_telemetry()
|
||||
def references(urn: str, dry_run: bool, force: bool) -> None:
|
||||
"""
|
||||
Delete all references to an entity (but not the entity itself).
|
||||
"""
|
||||
|
||||
graph = get_default_graph()
|
||||
logger.info(f"Using graph: {graph}")
|
||||
|
||||
references_count, related_aspects = graph.delete_references_to_urn(
|
||||
urn=urn,
|
||||
dry_run=True,
|
||||
)
|
||||
|
||||
if references_count == 0:
|
||||
click.echo(f"No references to {urn} found")
|
||||
return
|
||||
|
||||
click.echo(f"Found {references_count} references to {urn}")
|
||||
sample_msg = (
|
||||
"\nSample of references\n"
|
||||
+ tabulate(
|
||||
[x.values() for x in related_aspects],
|
||||
["relationship", "entity", "aspect"],
|
||||
)
|
||||
+ "\n"
|
||||
)
|
||||
click.echo(sample_msg)
|
||||
|
||||
if dry_run:
|
||||
logger.info(f"[Dry-run] Would remove {references_count} references to {urn}")
|
||||
else:
|
||||
if not force:
|
||||
click.confirm(
|
||||
f"This will delete {references_count} references to {urn} from DataHub. Do you want to continue?",
|
||||
abort=True,
|
||||
)
|
||||
|
||||
references_count, _ = graph.delete_references_to_urn(
|
||||
urn=urn,
|
||||
dry_run=False,
|
||||
)
|
||||
logger.info(f"Deleted {references_count} references to {urn}")
|
||||
|
||||
|
||||
@delete.command()
|
||||
@click.option("--urn", required=False, type=str, help="the urn of the entity")
|
||||
@click.option(
|
||||
"-a",
|
||||
# option with `_` is inconsistent with rest of CLI but kept for backward compatibility
|
||||
"--aspect_name",
|
||||
"--aspect",
|
||||
# This option is inconsistent with rest of CLI but kept for backward compatibility
|
||||
"--aspect-name",
|
||||
required=False,
|
||||
type=str,
|
||||
help="the aspect name associated with the entity(only for timeseries aspects)",
|
||||
help="the aspect name associated with the entity",
|
||||
)
|
||||
@click.option(
|
||||
"-f", "--force", required=False, is_flag=True, help="force the delete if set"
|
||||
@ -136,40 +231,37 @@ def by_registry(
|
||||
"-p", "--platform", required=False, type=str, help="the platform of the entity"
|
||||
)
|
||||
@click.option(
|
||||
# option with `_` is inconsistent with rest of CLI but kept for backward compatibility
|
||||
"--entity_type",
|
||||
"--entity-type",
|
||||
required=False,
|
||||
type=str,
|
||||
default="dataset",
|
||||
help="the entity type of the entity",
|
||||
)
|
||||
@click.option("--query", required=False, type=str)
|
||||
@click.option(
|
||||
"--start-time",
|
||||
required=False,
|
||||
type=click.DateTime(),
|
||||
help="the start time(only for timeseries aspects)",
|
||||
type=ClickDatetime(),
|
||||
help="the start time (only for timeseries aspects)",
|
||||
)
|
||||
@click.option(
|
||||
"--end-time",
|
||||
required=False,
|
||||
type=click.DateTime(),
|
||||
help="the end time(only for timeseries aspects)",
|
||||
type=ClickDatetime(),
|
||||
help="the end time (only for timeseries aspects)",
|
||||
)
|
||||
@click.option("-n", "--dry-run", required=False, is_flag=True)
|
||||
@click.option("--only-soft-deleted", required=False, is_flag=True, default=False)
|
||||
@upgrade.check_upgrade
|
||||
@telemetry.with_telemetry()
|
||||
def by_filter(
|
||||
urn: str,
|
||||
aspect_name: Optional[str],
|
||||
urn: Optional[str],
|
||||
aspect: Optional[str],
|
||||
force: bool,
|
||||
soft: bool,
|
||||
env: str,
|
||||
platform: str,
|
||||
entity_type: str,
|
||||
query: str,
|
||||
env: Optional[str],
|
||||
platform: Optional[str],
|
||||
entity_type: Optional[str],
|
||||
query: Optional[str],
|
||||
start_time: Optional[datetime],
|
||||
end_time: Optional[datetime],
|
||||
dry_run: bool,
|
||||
@ -177,23 +269,15 @@ def by_filter(
|
||||
) -> None:
|
||||
"""Delete metadata from datahub using a single urn or a combination of filters"""
|
||||
|
||||
cli_utils.test_connectivity_complain_exit("delete")
|
||||
# one of these must be provided
|
||||
if not urn and not platform and not env and not query:
|
||||
raise click.UsageError(
|
||||
"You must provide one of urn / platform / env / query in order to delete entities."
|
||||
)
|
||||
|
||||
include_removed: bool
|
||||
if soft:
|
||||
# For soft-delete include-removed does not make any sense
|
||||
include_removed = False
|
||||
else:
|
||||
# For hard-delete we always include the soft-deleted items
|
||||
include_removed = True
|
||||
|
||||
# default query is set to "*" if not provided
|
||||
query = "*" if query is None else query
|
||||
# Validate the cli arguments.
|
||||
_validate_user_urn_and_filters(
|
||||
urn=urn, entity_type=entity_type, platform=platform, env=env, query=query
|
||||
)
|
||||
soft_delete_filter = _validate_user_soft_delete_flags(
|
||||
soft=soft, aspect=aspect, only_soft_deleted=only_soft_deleted
|
||||
)
|
||||
_validate_user_aspect_flags(aspect=aspect, start_time=start_time, end_time=end_time)
|
||||
# TODO: add some validation on entity_type
|
||||
|
||||
if not force and not soft and not dry_run:
|
||||
click.confirm(
|
||||
@ -201,305 +285,241 @@ def by_filter(
|
||||
abort=True,
|
||||
)
|
||||
|
||||
graph = get_default_graph()
|
||||
logger.info(f"Using {graph}")
|
||||
|
||||
# Determine which urns to delete.
|
||||
if urn:
|
||||
# Single urn based delete
|
||||
session, host = cli_utils.get_session_and_host()
|
||||
entity_type = guess_entity_type(urn=urn)
|
||||
logger.info(f"DataHub configured with {host}")
|
||||
|
||||
if not aspect_name:
|
||||
references_count, related_aspects = delete_references(
|
||||
urn, dry_run=True, cached_session_host=(session, host)
|
||||
)
|
||||
remove_references: bool = False
|
||||
|
||||
if (not force) and references_count > 0:
|
||||
click.echo(
|
||||
f"This urn was referenced in {references_count} other aspects across your metadata graph:"
|
||||
)
|
||||
click.echo(
|
||||
tabulate(
|
||||
[x.values() for x in related_aspects],
|
||||
["relationship", "entity", "aspect"],
|
||||
tablefmt="grid",
|
||||
)
|
||||
)
|
||||
remove_references = click.confirm(
|
||||
"Do you want to delete these references?"
|
||||
)
|
||||
|
||||
if force or remove_references:
|
||||
delete_references(
|
||||
urn, dry_run=False, cached_session_host=(session, host)
|
||||
)
|
||||
|
||||
deletion_result: DeletionResult = delete_one_urn_cmd(
|
||||
urn,
|
||||
aspect_name=aspect_name,
|
||||
soft=soft,
|
||||
dry_run=dry_run,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
cached_session_host=(session, host),
|
||||
)
|
||||
|
||||
if not dry_run:
|
||||
if deletion_result.num_records == 0:
|
||||
click.echo(f"Nothing deleted for {urn}")
|
||||
else:
|
||||
click.echo(
|
||||
f"Successfully deleted {urn}. {deletion_result.num_records} rows deleted"
|
||||
)
|
||||
|
||||
delete_by_urn = True
|
||||
urns = [urn]
|
||||
else:
|
||||
# Filter based delete
|
||||
deletion_result = delete_with_filters(
|
||||
env=env,
|
||||
platform=platform,
|
||||
dry_run=dry_run,
|
||||
soft=soft,
|
||||
entity_type=entity_type,
|
||||
search_query=query,
|
||||
force=force,
|
||||
include_removed=include_removed,
|
||||
aspect_name=aspect_name,
|
||||
only_soft_deleted=only_soft_deleted,
|
||||
)
|
||||
|
||||
if not dry_run:
|
||||
message = "soft delete" if soft else "hard delete"
|
||||
click.echo(
|
||||
f"Took {(deletion_result.end_time-deletion_result.start_time)/1000.0} seconds to {message}"
|
||||
f" {deletion_result.num_records} versioned rows"
|
||||
f" and {deletion_result.num_timeseries_records} timeseries aspect rows"
|
||||
f" for {deletion_result.num_entities} entities."
|
||||
)
|
||||
else:
|
||||
click.echo(
|
||||
f"{deletion_result.num_entities} entities with {deletion_result.num_records if deletion_result.num_records != UNKNOWN_NUM_RECORDS else 'unknown'} rows will be affected. Took {(deletion_result.end_time-deletion_result.start_time)/1000.0} seconds to evaluate."
|
||||
)
|
||||
if deletion_result.sample_records:
|
||||
click.echo(
|
||||
tabulate(deletion_result.sample_records, RUN_TABLE_COLUMNS, tablefmt="grid")
|
||||
)
|
||||
|
||||
|
||||
def _get_current_time() -> int:
|
||||
return int(time.time() * 1000.0)
|
||||
|
||||
|
||||
@telemetry.with_telemetry()
|
||||
def delete_with_filters(
|
||||
dry_run: bool,
|
||||
soft: bool,
|
||||
force: bool,
|
||||
include_removed: bool,
|
||||
aspect_name: Optional[str] = None,
|
||||
search_query: str = "*",
|
||||
entity_type: str = "dataset",
|
||||
env: Optional[str] = None,
|
||||
platform: Optional[str] = None,
|
||||
only_soft_deleted: Optional[bool] = False,
|
||||
) -> DeletionResult:
|
||||
session, gms_host = cli_utils.get_session_and_host()
|
||||
token = cli_utils.get_token()
|
||||
|
||||
logger.info(f"datahub configured with {gms_host}")
|
||||
emitter = rest_emitter.DatahubRestEmitter(gms_server=gms_host, token=token)
|
||||
batch_deletion_result = DeletionResult()
|
||||
|
||||
urns: List[str] = []
|
||||
if not only_soft_deleted:
|
||||
delete_by_urn = False
|
||||
urns = list(
|
||||
cli_utils.get_urns_by_filter(
|
||||
env=env,
|
||||
graph.get_urns_by_filter(
|
||||
entity_types=[entity_type] if entity_type else None,
|
||||
platform=platform,
|
||||
search_query=search_query,
|
||||
entity_type=entity_type,
|
||||
include_removed=False,
|
||||
env=env,
|
||||
query=query,
|
||||
status=soft_delete_filter,
|
||||
)
|
||||
)
|
||||
|
||||
soft_deleted_urns: List[str] = []
|
||||
if include_removed or only_soft_deleted:
|
||||
soft_deleted_urns = list(
|
||||
cli_utils.get_urns_by_filter(
|
||||
env=env,
|
||||
platform=platform,
|
||||
search_query=search_query,
|
||||
entity_type=entity_type,
|
||||
only_soft_deleted=True,
|
||||
if len(urns) == 0:
|
||||
click.echo(
|
||||
"Found no urns to delete. Maybe you want to change your filters to be something different?"
|
||||
)
|
||||
return
|
||||
|
||||
urns_by_type: Dict[str, List[str]] = {}
|
||||
for urn in urns:
|
||||
entity_type = guess_entity_type(urn)
|
||||
urns_by_type.setdefault(entity_type, []).append(urn)
|
||||
if len(urns_by_type) > 1:
|
||||
# Display a breakdown of urns by entity type if there's multiple.
|
||||
click.echo("Filter matched urns of multiple entity types")
|
||||
for entity_type, entity_urns in urns_by_type.items():
|
||||
click.echo(
|
||||
f"- {len(entity_urns)} {entity_type} urn(s). Sample: {choices(entity_urns, k=min(5, len(entity_urns)))}"
|
||||
)
|
||||
else:
|
||||
click.echo(
|
||||
f"Filter matched {len(urns)} {entity_type} urn(s). Sample: {choices(urns, k=min(5, len(urns)))}"
|
||||
)
|
||||
|
||||
if not force and not dry_run:
|
||||
click.confirm(
|
||||
f"This will delete {len(urns)} entities from DataHub. Do you want to continue?",
|
||||
abort=True,
|
||||
)
|
||||
|
||||
urns_iter = urns
|
||||
if not delete_by_urn and not dry_run:
|
||||
urns_iter = progressbar.progressbar(urns, redirect_stdout=True)
|
||||
|
||||
# Run the deletion.
|
||||
deletion_result = DeletionResult()
|
||||
with PerfTimer() as timer:
|
||||
for urn in urns_iter:
|
||||
one_result = _delete_one_urn(
|
||||
graph=graph,
|
||||
urn=urn,
|
||||
aspect_name=aspect,
|
||||
soft=soft,
|
||||
dry_run=dry_run,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
deletion_result.merge(one_result)
|
||||
|
||||
# Report out a summary of the deletion result.
|
||||
click.echo(
|
||||
deletion_result.format_message(
|
||||
dry_run=dry_run, soft=soft, time_sec=timer.elapsed_seconds()
|
||||
)
|
||||
|
||||
final_message = ""
|
||||
if len(urns) > 0:
|
||||
final_message = f"{len(urns)} "
|
||||
if len(urns) > 0 and len(soft_deleted_urns) > 0:
|
||||
final_message += "and "
|
||||
if len(soft_deleted_urns) > 0:
|
||||
final_message = f"{len(soft_deleted_urns)} (soft-deleted) "
|
||||
|
||||
logger.info(
|
||||
f"Filter matched {final_message} {entity_type} entities of {platform}. Sample: {choices(urns, k=min(5, len(urns)))}"
|
||||
)
|
||||
if len(urns) == 0 and len(soft_deleted_urns) == 0:
|
||||
click.echo(
|
||||
f"No urns to delete. Maybe you want to change entity_type={entity_type} or platform={platform} to be something different?"
|
||||
)
|
||||
return DeletionResult(end_time=int(time.time() * 1000.0))
|
||||
|
||||
if not force and not dry_run:
|
||||
type_delete = "soft" if soft else "permanently"
|
||||
click.confirm(
|
||||
f"This will {type_delete} delete {len(urns)} entities. Are you sure?",
|
||||
abort=True,
|
||||
|
||||
def _validate_user_urn_and_filters(
|
||||
urn: Optional[str],
|
||||
entity_type: Optional[str],
|
||||
platform: Optional[str],
|
||||
env: Optional[str],
|
||||
query: Optional[str],
|
||||
) -> None:
|
||||
# Check urn / filters options.
|
||||
if urn:
|
||||
if entity_type or platform or env or query:
|
||||
raise click.UsageError(
|
||||
"You cannot provide both an urn and a filter rule (entity-type / platform / env / query)."
|
||||
)
|
||||
elif not urn and not (entity_type or platform or env or query):
|
||||
raise click.UsageError(
|
||||
"You must provide either an urn or at least one filter (entity-type / platform / env / query) in order to delete entities."
|
||||
)
|
||||
elif query:
|
||||
logger.warning(
|
||||
"Using --query is an advanced feature and can easily delete unintended entities. Please use with caution."
|
||||
)
|
||||
elif env and not (platform or entity_type):
|
||||
logger.warning(
|
||||
f"Using --env without other filters will delete all metadata in the {env} environment. Please use with caution."
|
||||
)
|
||||
|
||||
if len(urns) > 0:
|
||||
for urn in progressbar.progressbar(urns, redirect_stdout=True):
|
||||
one_result = _delete_one_urn(
|
||||
urn,
|
||||
soft=soft,
|
||||
aspect_name=aspect_name,
|
||||
dry_run=dry_run,
|
||||
cached_session_host=(session, gms_host),
|
||||
cached_emitter=emitter,
|
||||
)
|
||||
batch_deletion_result.merge(one_result)
|
||||
|
||||
if len(soft_deleted_urns) > 0 and not soft:
|
||||
click.echo("Starting to delete soft-deleted URNs")
|
||||
for urn in progressbar.progressbar(soft_deleted_urns, redirect_stdout=True):
|
||||
one_result = _delete_one_urn(
|
||||
urn,
|
||||
soft=soft,
|
||||
dry_run=dry_run,
|
||||
cached_session_host=(session, gms_host),
|
||||
cached_emitter=emitter,
|
||||
is_soft_deleted=True,
|
||||
)
|
||||
batch_deletion_result.merge(one_result)
|
||||
batch_deletion_result.end()
|
||||
def _validate_user_soft_delete_flags(
|
||||
soft: bool, aspect: Optional[str], only_soft_deleted: bool
|
||||
) -> RemovedStatusFilter:
|
||||
# Check soft / hard delete flags.
|
||||
# Note: aspect not None ==> hard delete,
|
||||
# but aspect is None ==> could be either soft or hard delete
|
||||
|
||||
return batch_deletion_result
|
||||
if soft:
|
||||
if aspect:
|
||||
raise click.UsageError(
|
||||
"You cannot provide an aspect name when performing a soft delete. Use --hard to perform a hard delete."
|
||||
)
|
||||
|
||||
if only_soft_deleted:
|
||||
raise click.UsageError(
|
||||
"You cannot provide --only-soft-deleted when performing a soft delete. Use --hard to perform a hard delete."
|
||||
)
|
||||
|
||||
soft_delete_filter = RemovedStatusFilter.NOT_SOFT_DELETED
|
||||
else:
|
||||
# For hard deletes, we will always include the soft-deleted entities, and
|
||||
# can optionally filter to exclude non-soft-deleted entities.
|
||||
if only_soft_deleted:
|
||||
soft_delete_filter = RemovedStatusFilter.ONLY_SOFT_DELETED
|
||||
else:
|
||||
soft_delete_filter = RemovedStatusFilter.ALL
|
||||
|
||||
return soft_delete_filter
|
||||
|
||||
|
||||
def _validate_user_aspect_flags(
|
||||
aspect: Optional[str],
|
||||
start_time: Optional[datetime],
|
||||
end_time: Optional[datetime],
|
||||
) -> None:
|
||||
# Check the aspect name.
|
||||
if aspect and aspect not in ASPECT_MAP:
|
||||
logger.info(f"Supported aspects: {list(sorted(ASPECT_MAP.keys()))}")
|
||||
raise click.UsageError(
|
||||
f"Unknown aspect {aspect}. Ensure the aspect is in the above list."
|
||||
)
|
||||
|
||||
# Check that start/end time are set if and only if the aspect is a timeseries aspect.
|
||||
if aspect and aspect in TIMESERIES_ASPECT_MAP:
|
||||
if not start_time or not end_time:
|
||||
raise click.UsageError(
|
||||
"You must provide both --start-time and --end-time when deleting a timeseries aspect."
|
||||
)
|
||||
elif start_time or end_time:
|
||||
raise click.UsageError(
|
||||
"You can only provide --start-time and --end-time when deleting a timeseries aspect."
|
||||
)
|
||||
elif aspect:
|
||||
raise click.UsageError(
|
||||
"Aspect-specific deletion is only supported for timeseries aspects. Please delete the full entity or use a rollback instead."
|
||||
)
|
||||
|
||||
|
||||
def _delete_one_urn(
|
||||
graph: DataHubGraph,
|
||||
urn: str,
|
||||
soft: bool = False,
|
||||
dry_run: bool = False,
|
||||
aspect_name: Optional[str] = None,
|
||||
start_time: Optional[datetime] = None,
|
||||
end_time: Optional[datetime] = None,
|
||||
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
|
||||
cached_emitter: Optional[rest_emitter.DatahubRestEmitter] = None,
|
||||
run_id: str = "delete-run-id",
|
||||
deletion_timestamp: Optional[int] = None,
|
||||
is_soft_deleted: Optional[bool] = None,
|
||||
run_id: str = "__datahub-delete-cli",
|
||||
) -> DeletionResult:
|
||||
deletion_timestamp = deletion_timestamp or _get_current_time()
|
||||
soft_delete_msg: str = ""
|
||||
if dry_run and is_soft_deleted:
|
||||
soft_delete_msg = "(soft-deleted)"
|
||||
|
||||
deletion_result = DeletionResult()
|
||||
deletion_result.num_entities = 1
|
||||
deletion_result.num_records = UNKNOWN_NUM_RECORDS # Default is unknown
|
||||
rows_affected: int = 0
|
||||
ts_rows_affected: int = 0
|
||||
referenced_entities_affected: int = 0
|
||||
|
||||
if soft:
|
||||
if aspect_name:
|
||||
raise click.UsageError(
|
||||
"Please provide --hard flag, as aspect values cannot be soft deleted."
|
||||
)
|
||||
# Add removed aspect
|
||||
if cached_emitter:
|
||||
emitter = cached_emitter
|
||||
else:
|
||||
_, gms_host = cli_utils.get_session_and_host()
|
||||
token = cli_utils.get_token()
|
||||
emitter = rest_emitter.DatahubRestEmitter(gms_server=gms_host, token=token)
|
||||
# Soft delete of entity.
|
||||
assert not aspect_name, "aspects cannot be soft deleted"
|
||||
|
||||
if not dry_run:
|
||||
emitter.emit_mcp(
|
||||
MetadataChangeProposalWrapper(
|
||||
entityUrn=urn,
|
||||
aspect=StatusClass(removed=True),
|
||||
systemMetadata=SystemMetadataClass(
|
||||
runId=run_id, lastObserved=deletion_timestamp
|
||||
),
|
||||
)
|
||||
)
|
||||
graph.soft_delete_entity(urn=urn, run_id=run_id)
|
||||
else:
|
||||
logger.info(f"[Dry-run] Would soft-delete {urn}")
|
||||
elif not dry_run:
|
||||
payload_obj: Dict[str, Any] = {"urn": urn}
|
||||
if aspect_name:
|
||||
payload_obj["aspectName"] = aspect_name
|
||||
if start_time:
|
||||
payload_obj["startTimeMillis"] = int(round(start_time.timestamp() * 1000))
|
||||
if end_time:
|
||||
payload_obj["endTimeMillis"] = int(round(end_time.timestamp() * 1000))
|
||||
rows_affected: int
|
||||
ts_rows_affected: int
|
||||
urn, rows_affected, ts_rows_affected = cli_utils.post_delete_endpoint(
|
||||
payload_obj,
|
||||
"/entities?action=delete",
|
||||
cached_session_host=cached_session_host,
|
||||
)
|
||||
deletion_result.num_records = rows_affected
|
||||
deletion_result.num_timeseries_records = ts_rows_affected
|
||||
else:
|
||||
if aspect_name:
|
||||
logger.info(
|
||||
f"[Dry-run] Would hard-delete aspect {aspect_name} of {urn} {soft_delete_msg}"
|
||||
|
||||
rows_affected = 1
|
||||
ts_rows_affected = 0
|
||||
|
||||
elif aspect_name and aspect_name in TIMESERIES_ASPECT_MAP:
|
||||
# Hard delete of timeseries aspect.
|
||||
|
||||
if not dry_run:
|
||||
ts_rows_affected = graph.hard_delete_timeseries_aspect(
|
||||
urn=urn,
|
||||
aspect_name=aspect_name,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
else:
|
||||
logger.info(f"[Dry-run] Would hard-delete {urn} {soft_delete_msg}")
|
||||
deletion_result.num_records = (
|
||||
UNKNOWN_NUM_RECORDS # since we don't know how many rows will be affected
|
||||
logger.info(
|
||||
f"[Dry-run] Would hard-delete {urn} timeseries aspect {aspect_name}"
|
||||
)
|
||||
ts_rows_affected = _UNKNOWN_NUM_RECORDS
|
||||
|
||||
elif aspect_name:
|
||||
# Hard delete of non-timeseries aspect.
|
||||
|
||||
# TODO: The backend doesn't support this yet.
|
||||
raise NotImplementedError(
|
||||
"Delete by aspect is not supported yet for non-timeseries aspects. Please delete the full entity or use rollback instead."
|
||||
)
|
||||
|
||||
deletion_result.end()
|
||||
return deletion_result
|
||||
else:
|
||||
# Full entity hard delete.
|
||||
assert not soft and not aspect_name
|
||||
|
||||
if not dry_run:
|
||||
rows_affected, ts_rows_affected = graph.hard_delete_entity(
|
||||
urn=urn,
|
||||
)
|
||||
else:
|
||||
logger.info(f"[Dry-run] Would hard-delete {urn}")
|
||||
rows_affected = _UNKNOWN_NUM_RECORDS
|
||||
ts_rows_affected = _UNKNOWN_NUM_RECORDS
|
||||
|
||||
@telemetry.with_telemetry()
|
||||
def delete_one_urn_cmd(
|
||||
urn: str,
|
||||
aspect_name: Optional[str] = None,
|
||||
soft: bool = False,
|
||||
dry_run: bool = False,
|
||||
start_time: Optional[datetime] = None,
|
||||
end_time: Optional[datetime] = None,
|
||||
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
|
||||
cached_emitter: Optional[rest_emitter.DatahubRestEmitter] = None,
|
||||
) -> DeletionResult:
|
||||
"""
|
||||
Wrapper around delete_one_urn because it is also called in a loop via delete_with_filters.
|
||||
# For full entity deletes, we also might clean up references to the entity.
|
||||
if guess_entity_type(urn) in _DELETE_WITH_REFERENCES_TYPES:
|
||||
referenced_entities_affected, _ = graph.delete_references_to_urn(
|
||||
urn=urn,
|
||||
dry_run=dry_run,
|
||||
)
|
||||
if dry_run and referenced_entities_affected > 0:
|
||||
logger.info(
|
||||
f"[Dry-run] Would remove {referenced_entities_affected} references to {urn}"
|
||||
)
|
||||
|
||||
This is a separate function that is called only when a single URN is deleted via the CLI.
|
||||
"""
|
||||
|
||||
return _delete_one_urn(
|
||||
urn,
|
||||
soft,
|
||||
dry_run,
|
||||
aspect_name,
|
||||
start_time,
|
||||
end_time,
|
||||
cached_session_host,
|
||||
cached_emitter,
|
||||
)
|
||||
|
||||
|
||||
def delete_references(
|
||||
urn: str,
|
||||
dry_run: bool = False,
|
||||
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
|
||||
) -> Tuple[int, List[Dict]]:
|
||||
payload_obj = {"urn": urn, "dryRun": dry_run}
|
||||
return cli_utils.post_delete_references_endpoint(
|
||||
payload_obj,
|
||||
"/entities?action=deleteReferences",
|
||||
cached_session_host=cached_session_host,
|
||||
return DeletionResult(
|
||||
num_entities=1,
|
||||
num_records=rows_affected,
|
||||
num_timeseries_records=ts_rows_affected,
|
||||
num_referenced_entities=referenced_entities_affected,
|
||||
)
|
||||
|
||||
@ -23,6 +23,7 @@ from datahub.emitter.mcp_builder import (
|
||||
SchemaKey,
|
||||
)
|
||||
from datahub.emitter.rest_emitter import DatahubRestEmitter
|
||||
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
|
||||
from datahub.metadata.schema_classes import (
|
||||
ContainerKeyClass,
|
||||
ContainerPropertiesClass,
|
||||
@ -141,8 +142,8 @@ def dataplatform2instance_func(
|
||||
system_metadata = SystemMetadataClass(runId=run_id)
|
||||
|
||||
if not dry_run:
|
||||
rest_emitter = DatahubRestEmitter(
|
||||
gms_server=cli_utils.get_session_and_host()[1]
|
||||
graph = DataHubGraph(
|
||||
config=DataHubGraphConfig(server=cli_utils.get_session_and_host()[1])
|
||||
)
|
||||
|
||||
urns_to_migrate = []
|
||||
@ -214,11 +215,11 @@ def dataplatform2instance_func(
|
||||
run_id=run_id,
|
||||
):
|
||||
if not dry_run:
|
||||
rest_emitter.emit_mcp(mcp)
|
||||
graph.emit_mcp(mcp)
|
||||
migration_report.on_entity_create(mcp.entityUrn, mcp.aspectName) # type: ignore
|
||||
|
||||
if not dry_run:
|
||||
rest_emitter.emit_mcp(
|
||||
graph.emit_mcp(
|
||||
MetadataChangeProposalWrapper(
|
||||
entityUrn=new_urn,
|
||||
aspect=DataPlatformInstanceClass(
|
||||
@ -252,14 +253,16 @@ def dataplatform2instance_func(
|
||||
aspect=aspect,
|
||||
)
|
||||
if not dry_run:
|
||||
rest_emitter.emit_mcp(mcp)
|
||||
graph.emit_mcp(mcp)
|
||||
migration_report.on_entity_affected(mcp.entityUrn, mcp.aspectName) # type: ignore
|
||||
else:
|
||||
log.debug(f"Didn't find aspect {aspect_name} for urn {target_urn}")
|
||||
|
||||
if not dry_run and not keep:
|
||||
log.info(f"will {'hard' if hard else 'soft'} delete {src_entity_urn}")
|
||||
delete_cli._delete_one_urn(src_entity_urn, soft=not hard, run_id=run_id)
|
||||
delete_cli._delete_one_urn(
|
||||
graph, src_entity_urn, soft=not hard, run_id=run_id
|
||||
)
|
||||
migration_report.on_entity_migrated(src_entity_urn, "status") # type: ignore
|
||||
|
||||
click.echo(f"{migration_report}")
|
||||
@ -270,7 +273,7 @@ def dataplatform2instance_func(
|
||||
instance=instance,
|
||||
platform=platform,
|
||||
keep=keep,
|
||||
rest_emitter=rest_emitter,
|
||||
rest_emitter=graph,
|
||||
)
|
||||
|
||||
|
||||
@ -281,7 +284,7 @@ def migrate_containers(
|
||||
hard: bool,
|
||||
instance: str,
|
||||
keep: bool,
|
||||
rest_emitter: DatahubRestEmitter,
|
||||
rest_emitter: DataHubGraph,
|
||||
) -> None:
|
||||
run_id: str = f"container-migrate-{uuid.uuid4()}"
|
||||
migration_report = MigrationReport(run_id, dry_run, keep)
|
||||
@ -369,7 +372,9 @@ def migrate_containers(
|
||||
|
||||
if not dry_run and not keep:
|
||||
log.info(f"will {'hard' if hard else 'soft'} delete {src_urn}")
|
||||
delete_cli._delete_one_urn(src_urn, soft=not hard, run_id=run_id)
|
||||
delete_cli._delete_one_urn(
|
||||
rest_emitter, src_urn, soft=not hard, run_id=run_id
|
||||
)
|
||||
migration_report.on_entity_migrated(src_urn, "status") # type: ignore
|
||||
|
||||
click.echo(f"{migration_report}")
|
||||
|
||||
@ -13,7 +13,6 @@ import click
|
||||
from click_default_group import DefaultGroup
|
||||
|
||||
from datahub.api.entities.dataproduct.dataproduct import DataProduct
|
||||
from datahub.cli.delete_cli import delete_one_urn_cmd, delete_references
|
||||
from datahub.cli.specific.file_loader import load_file
|
||||
from datahub.emitter.mce_builder import make_group_urn, make_user_urn
|
||||
from datahub.ingestion.graph.client import DataHubGraph, get_default_graph
|
||||
@ -213,6 +212,7 @@ def delete(urn: str, file: Path, hard: bool) -> None:
|
||||
)
|
||||
raise click.Abort()
|
||||
|
||||
graph: DataHubGraph
|
||||
with get_default_graph() as graph:
|
||||
data_product_urn = (
|
||||
urn if urn.startswith("urn:li:dataProduct") else f"urn:li:dataProduct:{urn}"
|
||||
@ -225,9 +225,10 @@ def delete(urn: str, file: Path, hard: bool) -> None:
|
||||
|
||||
if hard:
|
||||
# we only delete references if this is a hard delete
|
||||
delete_references(data_product_urn)
|
||||
graph.delete_references_to_urn(data_product_urn)
|
||||
|
||||
graph.delete_entity(data_product_urn, hard=hard)
|
||||
|
||||
delete_one_urn_cmd(data_product_urn, soft=not hard)
|
||||
click.secho(f"Data Product {data_product_urn} deleted")
|
||||
|
||||
|
||||
|
||||
94 metadata-ingestion/src/datahub/configuration/datetimes.py (new file)
@ -0,0 +1,94 @@
|
||||
import contextlib
|
||||
import logging
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from typing import Any, Optional
|
||||
|
||||
import click
|
||||
import dateutil.parser
|
||||
import humanfriendly
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def parse_user_datetime(input: str) -> datetime:
|
||||
"""Parse absolute and relative time strings into datetime objects.
|
||||
|
||||
This parses strings like "2022-01-01 01:02:03" and "-7 days"
|
||||
and timestamps like "1630440123".
|
||||
|
||||
Args:
|
||||
input: A string representing a datetime or relative time.
|
||||
|
||||
Returns:
|
||||
A timezone-aware datetime object in UTC. If the input specifies a different
|
||||
timezone, it will be converted to UTC.
|
||||
"""
|
||||
|
||||
# Special cases.
|
||||
if input == "now":
|
||||
return datetime.now(tz=timezone.utc)
|
||||
elif input == "min":
|
||||
return datetime.min.replace(tzinfo=timezone.utc)
|
||||
elif input == "max":
|
||||
return datetime.max.replace(tzinfo=timezone.utc)
|
||||
|
||||
# First try parsing as a timestamp.
|
||||
with contextlib.suppress(ValueError):
|
||||
ts = float(input)
|
||||
try:
|
||||
return datetime.fromtimestamp(ts, tz=timezone.utc)
|
||||
except (OverflowError, ValueError):
|
||||
# This is likely a timestamp in milliseconds.
|
||||
return datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
|
||||
|
||||
# Then try parsing as a relative time.
|
||||
with contextlib.suppress(humanfriendly.InvalidTimespan):
|
||||
delta = _parse_relative_timespan(input)
|
||||
return datetime.now(tz=timezone.utc) + delta
|
||||
|
||||
# Finally, try parsing as an absolute time.
|
||||
with contextlib.suppress(dateutil.parser.ParserError):
|
||||
dt = dateutil.parser.parse(input)
|
||||
if dt.tzinfo is None:
|
||||
# Assume that the user meant to specify a time in UTC.
|
||||
dt = dt.replace(tzinfo=timezone.utc)
|
||||
else:
|
||||
# Convert to UTC.
|
||||
dt = dt.astimezone(timezone.utc)
|
||||
return dt
|
||||
|
||||
raise ValueError(f"Could not parse {input} as a datetime or relative time.")
|
||||
|
||||
|
||||
def _parse_relative_timespan(input: str) -> timedelta:
|
||||
neg = False
|
||||
input = input.strip()
|
||||
|
||||
if input.startswith("+"):
|
||||
input = input[1:]
|
||||
elif input.startswith("-"):
|
||||
input = input[1:]
|
||||
neg = True
|
||||
|
||||
seconds = humanfriendly.parse_timespan(input)
|
||||
delta = timedelta(seconds=seconds)
|
||||
if neg:
|
||||
delta = -delta
|
||||
|
||||
logger.debug(f'Parsed "{input}" as {delta}.')
|
||||
return delta
|
||||
|
||||
|
||||
class ClickDatetime(click.ParamType):
|
||||
name = "datetime"
|
||||
|
||||
def convert(
|
||||
self, value: Any, param: Optional[click.Parameter], ctx: Optional[click.Context]
|
||||
) -> datetime:
|
||||
if isinstance(value, datetime):
|
||||
return value
|
||||
|
||||
try:
|
||||
return parse_user_datetime(value)
|
||||
except ValueError as e:
|
||||
self.fail(str(e), param, ctx)
|
||||
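For context, a minimal sketch of how this new module might be wired into a CLI option. The `--start-time` option and `report` command below are illustrative placeholders, not part of this change.

```python
# Illustrative only: shows how ClickDatetime / parse_user_datetime could be used
# from a click command. The command and option names are hypothetical examples.
import click

from datahub.configuration.datetimes import ClickDatetime, parse_user_datetime


@click.command()
@click.option("--start-time", type=ClickDatetime(), required=False)
def report(start_time):
    # Accepts "now", epoch timestamps, absolute dates, or relative spans like "-7 days".
    click.echo(f"start_time resolved to {start_time}")


if __name__ == "__main__":
    print(parse_user_datetime("-7 days"))  # seven days ago, as a UTC datetime
```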
@@ -266,9 +266,14 @@ class DataHubRestEmitter(Closeable):
response.raise_for_status()
except HTTPError as e:
try:
info = response.json()
info: Dict = response.json()
logger.debug(
"Full stack trace from DataHub:\n%s", info.get("stackTrace")
)
info.pop("stackTrace", None)
raise OperationalError(
"Unable to emit metadata to DataHub GMS", info
f"Unable to emit metadata to DataHub GMS: {info.get('message')}",
info,
) from e
except JSONDecodeError:
# If we can't parse the JSON, just raise the original error.
@@ -286,9 +291,7 @@ class DataHubRestEmitter(Closeable):
if self._token
else ""
)
return (
f"DataHubRestEmitter: configured to talk to {self._gms_server}{token_str}"
)
return f"{self.__class__.__name__}: configured to talk to {self._gms_server}{token_str}"

def flush(self) -> None:
# No-op, but present to keep the interface consistent with the Kafka emitter.
@@ -1,17 +1,16 @@
import enum
import json
import logging
import textwrap
import time
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
from json.decoder import JSONDecodeError
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Type, Union
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple, Type

from avro.schema import RecordSchema
from deprecated import deprecated
from requests.adapters import Response
from requests.models import HTTPError
from typing_extensions import Literal

from datahub.cli.cli_utils import get_url_and_token
from datahub.configuration.common import ConfigModel, GraphError, OperationalError
@@ -72,6 +71,19 @@ class DatahubClientConfig(ConfigModel):
DataHubGraphConfig = DatahubClientConfig


class RemovedStatusFilter(enum.Enum):
    """Filter for the status of entities during search."""

    NOT_SOFT_DELETED = "NOT_SOFT_DELETED"
    """Search only entities that have not been marked as deleted."""

    ALL = "ALL"
    """Search all entities, including deleted entities."""

    ONLY_SOFT_DELETED = "ONLY_SOFT_DELETED"
    """Search only soft-deleted entities."""


def _graphql_entity_type(entity_type: str) -> str:
    """Convert the entity types into GraphQL "EntityType" enum values."""
@ -124,9 +136,9 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
self.server_id = "missing"
|
||||
logger.debug(f"Failed to get server id due to {e}")
|
||||
|
||||
def _get_generic(self, url: str, params: Optional[Dict] = None) -> Dict:
|
||||
def _send_restli_request(self, method: str, url: str, **kwargs: Any) -> Dict:
|
||||
try:
|
||||
response = self._session.get(url, params=params)
|
||||
response = self._session.request(method, url, **kwargs)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
except HTTPError as e:
|
||||
@ -141,24 +153,11 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
"Unable to get metadata from DataHub", {"message": str(e)}
|
||||
) from e
|
||||
|
||||
def _get_generic(self, url: str, params: Optional[Dict] = None) -> Dict:
|
||||
return self._send_restli_request("GET", url, params=params)
|
||||
|
||||
def _post_generic(self, url: str, payload_dict: Dict) -> Dict:
|
||||
payload = json.dumps(payload_dict)
|
||||
logger.debug(payload)
|
||||
try:
|
||||
response: Response = self._session.post(url, payload)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
except HTTPError as e:
|
||||
try:
|
||||
info = response.json()
|
||||
raise OperationalError(
|
||||
"Unable to get metadata from DataHub", info
|
||||
) from e
|
||||
except JSONDecodeError:
|
||||
# If we can't parse the JSON, just raise the original error.
|
||||
raise OperationalError(
|
||||
"Unable to get metadata from DataHub", {"message": str(e)}
|
||||
) from e
|
||||
return self._send_restli_request("POST", url, json=payload_dict)
|
||||
|
||||
def get_aspect(
|
||||
self,
|
||||
@ -449,10 +448,6 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
def _aspect_count_endpoint(self):
|
||||
return f"{self.config.server}/aspects?action=getCount"
|
||||
|
||||
@property
|
||||
def _scroll_across_entities_endpoint(self):
|
||||
return f"{self.config.server}/entities?action=scrollAcrossEntities"
|
||||
|
||||
def get_domain_urn_by_name(self, domain_name: str) -> Optional[str]:
|
||||
"""Retrieve a domain urn based on its name. Returns None if there is no match found"""
|
||||
|
||||
@ -487,6 +482,9 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
entities.append(x["entity"])
|
||||
return entities[0] if entities_yielded else None
|
||||
|
||||
@deprecated(
|
||||
reason='Use get_urns_by_filter(entity_types=["container"], ...) instead'
|
||||
)
|
||||
def get_container_urns_by_filter(
|
||||
self,
|
||||
env: Optional[str] = None,
|
||||
@ -536,15 +534,21 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
*,
|
||||
entity_types: Optional[List[str]] = None,
|
||||
platform: Optional[str] = None,
|
||||
env: Optional[str] = None,
|
||||
query: Optional[str] = None,
|
||||
status: RemovedStatusFilter = RemovedStatusFilter.NOT_SOFT_DELETED,
|
||||
batch_size: int = 10000,
|
||||
) -> Iterable[str]:
|
||||
"""Fetch all urns that match the given filters.
|
||||
|
||||
Filters are combined conjunctively. If multiple filters are specified, the results will match all of them.
|
||||
Note that specifying a platform filter will automatically exclude all entity types that do not have a platform.
|
||||
The same goes for the env filter.
|
||||
|
||||
:param entity_types: List of entity types to include. If None, all entity types will be returned.
|
||||
:param platform: Platform to filter on. If None, all platforms will be returned.
|
||||
:param env: Environment (e.g. PROD, DEV) to filter on. If None, all environments will be returned.
|
||||
:param status: Filter on the deletion status of the entity. The default is only return non-soft-deleted entities.
|
||||
"""
|
||||
|
||||
types: Optional[List[str]] = None
|
||||
@ -554,11 +558,13 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
|
||||
types = [_graphql_entity_type(entity_type) for entity_type in entity_types]
|
||||
|
||||
# Does not filter on env, because env is missing in dashboard / chart urns and custom properties
|
||||
# For containers, use { field: "customProperties", values: ["instance=env}"], condition:EQUAL }
|
||||
# For others, use { field: "origin", values: ["env"], condition:EQUAL }
|
||||
# Add the query default of * if no query is specified.
|
||||
query = query or "*"
|
||||
|
||||
andFilters = []
|
||||
FilterRule = Dict[str, Any]
|
||||
andFilters: List[FilterRule] = []
|
||||
|
||||
# Platform filter.
|
||||
if platform:
|
||||
andFilters += [
|
||||
{
|
||||
@ -567,23 +573,90 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
"condition": "EQUAL",
|
||||
}
|
||||
]
|
||||
orFilters = [{"and": andFilters}]
|
||||
|
||||
query = textwrap.dedent(
|
||||
# Status filter.
|
||||
if status == RemovedStatusFilter.NOT_SOFT_DELETED:
|
||||
# Subtle: in some cases (e.g. when the dataset doesn't have a status aspect), the
|
||||
# removed field is simply not present in the ElasticSearch document. Ideally this
|
||||
# would be a "removed" : "false" filter, but that doesn't work. Instead, we need to
|
||||
# use a negated filter.
|
||||
andFilters.append(
|
||||
{
|
||||
"field": "removed",
|
||||
"values": ["true"],
|
||||
"condition": "EQUAL",
|
||||
"negated": True,
|
||||
}
|
||||
)
|
||||
elif status == RemovedStatusFilter.ONLY_SOFT_DELETED:
|
||||
andFilters.append(
|
||||
{
|
||||
"field": "removed",
|
||||
"values": ["true"],
|
||||
"condition": "EQUAL",
|
||||
}
|
||||
)
|
||||
elif status == RemovedStatusFilter.ALL:
|
||||
# We don't need to add a filter for this case.
|
||||
pass
|
||||
else:
|
||||
raise ValueError(f"Invalid status filter: {status}")
|
||||
|
||||
orFilters: List[Dict[str, List[FilterRule]]] = [{"and": andFilters}]
|
||||
|
||||
# Env filter.
|
||||
if env:
|
||||
# The env filter is a bit more tricky since it's not always stored
|
||||
# in the same place in ElasticSearch.
|
||||
|
||||
envOrConditions: List[FilterRule] = [
|
||||
# For most entity types, we look at the origin field.
|
||||
{
|
||||
"field": "origin",
|
||||
"value": env,
|
||||
"condition": "EQUAL",
|
||||
},
|
||||
# For containers, we look at the customProperties field.
|
||||
# For any containers created after https://github.com/datahub-project/datahub/pull/8027,
|
||||
# we look for the "env" property. Otherwise, we use the "instance" property.
|
||||
{
|
||||
"field": "customProperties",
|
||||
"value": f"env={env}",
|
||||
},
|
||||
{
|
||||
"field": "customProperties",
|
||||
"value": f"instance={env}",
|
||||
},
|
||||
# Note that not all entity types have an env (e.g. dashboards / charts).
|
||||
# If the env filter is specified, these will be excluded.
|
||||
]
|
||||
|
||||
# This matches ALL of the andFilters and at least one of the envOrConditions.
|
||||
orFilters = [
|
||||
{"and": andFilters["and"] + [extraCondition]}
|
||||
for extraCondition in envOrConditions
|
||||
for andFilters in orFilters
|
||||
]
|
||||
|
||||
graphql_query = textwrap.dedent(
|
||||
"""
|
||||
query scrollUrnsWithFilters(
|
||||
$types: [EntityType!],
|
||||
$query: String!,
|
||||
$orFilters: [AndFilterInput!],
|
||||
$batchSize: Int!,
|
||||
$scrollId: String) {
|
||||
|
||||
scrollAcrossEntities(input: {
|
||||
query: "*",
|
||||
query: $query,
|
||||
count: $batchSize,
|
||||
scrollId: $scrollId,
|
||||
types: $types,
|
||||
orFilters: $orFilters,
|
||||
searchFlags: { skipHighlighting: true }
|
||||
searchFlags: {
|
||||
skipHighlighting: true
|
||||
skipAggregates: true
|
||||
}
|
||||
}) {
|
||||
nextScrollId
|
||||
searchResults {
|
||||
@ -596,23 +669,32 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
"""
|
||||
)
|
||||
|
||||
# Set scroll_id to False to enter while loop
|
||||
scroll_id: Union[Literal[False], str, None] = False
|
||||
while scroll_id is not None:
|
||||
first_iter = True
|
||||
scroll_id: Optional[str] = None
|
||||
while first_iter or scroll_id:
|
||||
first_iter = False
|
||||
|
||||
variables = {
|
||||
"types": types,
|
||||
"query": query,
|
||||
"orFilters": orFilters,
|
||||
"batchSize": batch_size,
|
||||
"scrollId": scroll_id,
|
||||
}
|
||||
response = self.execute_graphql(
|
||||
query,
|
||||
variables={
|
||||
"types": types,
|
||||
"orFilters": orFilters,
|
||||
"batchSize": batch_size,
|
||||
"scrollId": scroll_id or None,
|
||||
},
|
||||
graphql_query,
|
||||
variables=variables,
|
||||
)
|
||||
data = response["scrollAcrossEntities"]
|
||||
scroll_id = data["nextScrollId"]
|
||||
for entry in data["searchResults"]:
|
||||
yield entry["entity"]["urn"]
|
||||
|
||||
if scroll_id:
|
||||
logger.debug(
|
||||
f"Scrolling to next scrollAcrossEntities page: {scroll_id}"
|
||||
)
|
||||
|
||||
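For context, a minimal sketch of how the `get_urns_by_filter` method added above might be called. The server URL is a placeholder; the filter values follow the signature and `RemovedStatusFilter` enum introduced in this change.

```python
# Illustrative sketch based on the get_urns_by_filter signature in this diff.
from datahub.ingestion.graph.client import (
    DataHubGraph,
    DatahubClientConfig,
    RemovedStatusFilter,
)

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Fetch soft-deleted Snowflake dataset urns in PROD, e.g. to clean them up later.
for urn in graph.get_urns_by_filter(
    entity_types=["dataset"],
    platform="snowflake",
    env="PROD",
    status=RemovedStatusFilter.ONLY_SOFT_DELETED,
):
    print(urn)
```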
def get_latest_pipeline_checkpoint(
|
||||
self, pipeline_name: str, platform: str
|
||||
) -> Optional[Checkpoint["GenericCheckpointState"]]:
|
||||
@ -663,13 +745,18 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
if variables:
|
||||
body["variables"] = variables
|
||||
|
||||
logger.debug(
|
||||
f"Executing graphql query: {query} with variables: {json.dumps(variables)}"
|
||||
)
|
||||
result = self._post_generic(url, body)
|
||||
if result.get("errors"):
|
||||
raise GraphError(f"Error executing graphql query: {result['errors']}")
|
||||
|
||||
return result["data"]
|
||||
|
||||
class RelationshipDirection(str, Enum):
|
||||
class RelationshipDirection(str, enum.Enum):
|
||||
# FIXME: Upgrade to enum.StrEnum when we drop support for Python 3.10
|
||||
|
||||
INCOMING = "INCOMING"
|
||||
OUTGOING = "OUTGOING"
|
||||
|
||||
@ -707,22 +794,6 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
)
|
||||
start = start + response.get("count", 0)
|
||||
|
||||
def soft_delete_urn(
|
||||
self,
|
||||
urn: str,
|
||||
run_id: str = "soft-delete-urns",
|
||||
) -> None:
|
||||
timestamp = int(time.time() * 1000)
|
||||
self.emit_mcp(
|
||||
MetadataChangeProposalWrapper(
|
||||
entityUrn=urn,
|
||||
aspect=StatusClass(removed=True),
|
||||
systemMetadata=SystemMetadataClass(
|
||||
runId=run_id, lastObserved=timestamp
|
||||
),
|
||||
)
|
||||
)
|
||||
|
||||
def exists(self, entity_urn: str) -> bool:
|
||||
entity_urn_parsed: Urn = Urn.create_from_string(entity_urn)
|
||||
try:
|
||||
@ -740,6 +811,143 @@ class DataHubGraph(DatahubRestEmitter):
|
||||
)
|
||||
raise
|
||||
|
||||
def soft_delete_entity(
|
||||
self,
|
||||
urn: str,
|
||||
run_id: str = "__datahub-graph-client",
|
||||
deletion_timestamp: Optional[int] = None,
|
||||
) -> None:
|
||||
"""Soft-delete an entity by urn.
|
||||
|
||||
Args:
|
||||
urn: The urn of the entity to soft-delete.
|
||||
"""
|
||||
|
||||
assert urn
|
||||
|
||||
deletion_timestamp = deletion_timestamp or int(time.time() * 1000)
|
||||
self.emit_mcp(
|
||||
MetadataChangeProposalWrapper(
|
||||
entityUrn=urn,
|
||||
aspect=StatusClass(removed=True),
|
||||
systemMetadata=SystemMetadataClass(
|
||||
runId=run_id, lastObserved=deletion_timestamp
|
||||
),
|
||||
)
|
||||
)
|
||||
|
||||
def hard_delete_entity(
|
||||
self,
|
||||
urn: str,
|
||||
) -> Tuple[int, int]:
|
||||
"""Hard delete an entity by urn.
|
||||
|
||||
Args:
|
||||
urn: The urn of the entity to hard delete.
|
||||
|
||||
Returns:
|
||||
A tuple of (rows_affected, timeseries_rows_affected).
|
||||
"""
|
||||
|
||||
assert urn
|
||||
|
||||
payload_obj: Dict = {"urn": urn}
|
||||
summary = self._post_generic(
|
||||
f"{self._gms_server}/entities?action=delete", payload_obj
|
||||
).get("value", {})
|
||||
|
||||
rows_affected: int = summary.get("rows", 0)
|
||||
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
|
||||
return rows_affected, timeseries_rows_affected
|
||||
|
||||
def delete_entity(self, urn: str, hard: bool = False) -> None:
|
||||
"""Delete an entity by urn.
|
||||
|
||||
Args:
|
||||
urn: The urn of the entity to delete.
|
||||
hard: Whether to hard delete the entity. If False (default), the entity will be soft deleted.
|
||||
"""
|
||||
|
||||
if hard:
|
||||
rows_affected, timeseries_rows_affected = self.hard_delete_entity(urn)
|
||||
logger.debug(
|
||||
f"Hard deleted entity {urn} with {rows_affected} rows affected and {timeseries_rows_affected} timeseries rows affected"
|
||||
)
|
||||
else:
|
||||
self.soft_delete_entity(urn)
|
||||
logger.debug(f"Soft deleted entity {urn}")
|
||||
|
||||
# TODO: Create hard_delete_aspect once we support that in GMS.
|
||||
|
||||
def hard_delete_timeseries_aspect(
|
||||
self,
|
||||
urn: str,
|
||||
aspect_name: str,
|
||||
start_time: Optional[datetime],
|
||||
end_time: Optional[datetime],
|
||||
) -> int:
|
||||
"""Hard delete timeseries aspects of an entity.
|
||||
|
||||
Args:
|
||||
urn: The urn of the entity.
|
||||
aspect_name: The name of the timeseries aspect to delete.
|
||||
start_time: The start time of the timeseries data to delete. If not specified, defaults to the beginning of time.
|
||||
end_time: The end time of the timeseries data to delete. If not specified, defaults to the end of time.
|
||||
|
||||
Returns:
|
||||
The number of timeseries rows affected.
|
||||
"""
|
||||
|
||||
assert urn
|
||||
assert aspect_name in TIMESERIES_ASPECT_MAP, "must be a timeseries aspect"
|
||||
|
||||
payload_obj: Dict = {
|
||||
"urn": urn,
|
||||
"aspectName": aspect_name,
|
||||
}
|
||||
if start_time:
|
||||
payload_obj["startTimeMillis"] = int(start_time.timestamp() * 1000)
|
||||
if end_time:
|
||||
payload_obj["endTimeMillis"] = int(end_time.timestamp() * 1000)
|
||||
|
||||
summary = self._post_generic(
|
||||
f"{self._gms_server}/entities?action=delete", payload_obj
|
||||
).get("value", {})
|
||||
|
||||
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
|
||||
return timeseries_rows_affected
|
||||
|
||||
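For context, a minimal sketch of calling the `hard_delete_timeseries_aspect` method added above. It assumes an existing `graph` client and uses a sample dataset urn; `datasetProfile` is a timeseries aspect, as used elsewhere in this change.

```python
# Illustrative sketch; `graph` is assumed to be a connected DataHubGraph client.
from datetime import datetime, timedelta, timezone

urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
rows = graph.hard_delete_timeseries_aspect(
    urn=urn,
    aspect_name="datasetProfile",
    start_time=datetime.now(tz=timezone.utc) - timedelta(days=30),
    end_time=datetime.now(tz=timezone.utc),
)
print(f"Deleted {rows} timeseries rows")
```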
def delete_references_to_urn(
|
||||
self, urn: str, dry_run: bool = False
|
||||
) -> Tuple[int, List[Dict]]:
|
||||
"""Delete references to a given entity.
|
||||
|
||||
This is useful for cleaning up references to an entity that is about to be deleted.
|
||||
For example, when deleting a tag, you might use this to remove that tag from all other
|
||||
entities that reference it.
|
||||
|
||||
This does not delete the entity itself.
|
||||
|
||||
Args:
|
||||
urn: The urn of the entity to delete references to.
|
||||
dry_run: If True, do not actually delete the references, just return the count of
|
||||
references and the list of related aspects.
|
||||
|
||||
Returns:
|
||||
A tuple of (reference_count, sample of related_aspects).
|
||||
"""
|
||||
|
||||
assert urn
|
||||
|
||||
payload_obj = {"urn": urn, "dryRun": dry_run}
|
||||
|
||||
response = self._post_generic(
|
||||
f"{self._gms_server}/entities?action=deleteReferences", payload_obj
|
||||
).get("value", {})
|
||||
reference_count = response.get("total", 0)
|
||||
related_aspects = response.get("relatedAspects", [])
|
||||
return reference_count, related_aspects
|
||||
|
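For context, a minimal sketch combining the new delete helpers (`delete_references_to_urn` and `delete_entity`) in the order the docstrings above suggest: clean up references first, then remove the entity. The tag urn is an example value.

```python
# Illustrative sketch using the graph client's new delete helpers from this diff.
from datahub.ingestion.graph.client import get_default_graph

tag_urn = "urn:li:tag:NeedsDocs"

with get_default_graph() as graph:
    # Dry-run first to see how many references would be removed.
    count, sample = graph.delete_references_to_urn(tag_urn, dry_run=True)
    print(f"{count} references, sample: {sample[:3]}")

    # Remove the references, then hard delete the tag itself.
    graph.delete_references_to_urn(tag_urn, dry_run=False)
    graph.delete_entity(tag_urn, hard=True)
```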
||||
|
||||
def get_default_graph() -> DataHubGraph:
|
||||
(url, token) = get_url_and_token()
|
||||
|
||||
metadata-ingestion/tests/unit/config/test_datetime_parser.py (new file, 51 lines)
@@ -0,0 +1,51 @@
from datetime import datetime, timezone

import freezegun
import pytest

from datahub.configuration.datetimes import parse_user_datetime


# FIXME: Ideally we'd use tz_offset here to test this code in a non-UTC timezone.
# However, freezegun has a long-standing bug that prevents this from working:
# https://github.com/spulec/freezegun/issues/348.
@freezegun.freeze_time("2021-09-01 10:02:03")
def test_user_time_parser():
    # Absolute times.
    assert parse_user_datetime("2022-01-01 01:02:03 UTC") == datetime(
        2022, 1, 1, 1, 2, 3, tzinfo=timezone.utc
    )
    assert parse_user_datetime("2022-01-01 01:02:03 -02:00") == datetime(
        2022, 1, 1, 3, 2, 3, tzinfo=timezone.utc
    )

    # Times with no timezone are assumed to be in UTC.
    assert parse_user_datetime("2022-01-01 01:02:03") == datetime(
        2022, 1, 1, 1, 2, 3, tzinfo=timezone.utc
    )
    assert parse_user_datetime("2022-02-03") == datetime(
        2022, 2, 3, tzinfo=timezone.utc
    )

    # Timestamps.
    assert parse_user_datetime("1630440123") == datetime(
        2021, 8, 31, 20, 2, 3, tzinfo=timezone.utc
    )
    assert parse_user_datetime("1630440123837.018") == datetime(
        2021, 8, 31, 20, 2, 3, 837018, tzinfo=timezone.utc
    )

    # Relative times.
    assert parse_user_datetime("10m") == datetime(
        2021, 9, 1, 10, 12, 3, tzinfo=timezone.utc
    )
    assert parse_user_datetime("+ 1 day") == datetime(
        2021, 9, 2, 10, 2, 3, tzinfo=timezone.utc
    )
    assert parse_user_datetime("-2 days") == datetime(
        2021, 8, 30, 10, 2, 3, tzinfo=timezone.utc
    )

    # Invalid inputs.
    with pytest.raises(ValueError):
        parse_user_datetime("invalid")
@ -1,4 +1,5 @@
|
||||
import json
|
||||
import logging
|
||||
import tempfile
|
||||
import time
|
||||
import sys
|
||||
@ -16,6 +17,8 @@ from tests.aspect_generators.timeseries.dataset_profile_gen import \
|
||||
from tests.utils import get_strftime_from_timestamp_millis
|
||||
import requests_wrapper as requests
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
test_aspect_name: str = "datasetProfile"
|
||||
test_dataset_urn: str = builder.make_dataset_urn_with_platform_instance(
|
||||
"test_platform",
|
||||
@ -79,6 +82,9 @@ def datahub_delete(params: List[str]) -> None:
|
||||
args.extend(params)
|
||||
args.append("--hard")
|
||||
delete_result: Result = runner.invoke(datahub, args, input="y\ny\n")
|
||||
logger.info(delete_result.stdout)
|
||||
if delete_result.stderr:
|
||||
logger.error(delete_result.stderr)
|
||||
assert delete_result.exit_code == 0
|
||||
|
||||
|
||||
|
||||
@ -4,8 +4,12 @@ import pytest
|
||||
from time import sleep
|
||||
from datahub.cli.cli_utils import get_aspects_for_entity
|
||||
from datahub.cli.ingest_cli import get_session_and_host
|
||||
from datahub.cli.delete_cli import delete_references
|
||||
from tests.utils import ingest_file_via_rest, wait_for_healthcheck_util, delete_urns_from_file
|
||||
from tests.utils import (
|
||||
ingest_file_via_rest,
|
||||
wait_for_healthcheck_util,
|
||||
delete_urns_from_file,
|
||||
get_datahub_graph,
|
||||
)
|
||||
from requests_wrapper import ELASTICSEARCH_REFRESH_INTERVAL_SECONDS
|
||||
|
||||
# Disable telemetry
|
||||
@ -37,24 +41,41 @@ def test_setup():
|
||||
session, gms_host = get_session_and_host()
|
||||
|
||||
try:
|
||||
assert "browsePaths" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
|
||||
assert "editableDatasetProperties" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False)
|
||||
assert "browsePaths" not in get_aspects_for_entity(
|
||||
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
|
||||
)
|
||||
assert "editableDatasetProperties" not in get_aspects_for_entity(
|
||||
entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False
|
||||
)
|
||||
except Exception as e:
|
||||
delete_urns_from_file("tests/delete/cli_test_data.json")
|
||||
raise e
|
||||
|
||||
ingested_dataset_run_id = ingest_file_via_rest("tests/delete/cli_test_data.json").config.run_id
|
||||
ingested_dataset_run_id = ingest_file_via_rest(
|
||||
"tests/delete/cli_test_data.json"
|
||||
).config.run_id
|
||||
|
||||
assert "browsePaths" in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
|
||||
assert "browsePaths" in get_aspects_for_entity(
|
||||
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
|
||||
)
|
||||
|
||||
yield
|
||||
rollback_url = f"{gms_host}/runs?action=rollback"
|
||||
session.post(rollback_url, data=json.dumps({"runId": ingested_dataset_run_id, "dryRun": False, "hardDelete": True}))
|
||||
session.post(
|
||||
rollback_url,
|
||||
data=json.dumps(
|
||||
{"runId": ingested_dataset_run_id, "dryRun": False, "hardDelete": True}
|
||||
),
|
||||
)
|
||||
|
||||
sleep(ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
|
||||
|
||||
assert "browsePaths" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
|
||||
assert "editableDatasetProperties" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False)
|
||||
assert "browsePaths" not in get_aspects_for_entity(
|
||||
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
|
||||
)
|
||||
assert "editableDatasetProperties" not in get_aspects_for_entity(
|
||||
entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.dependency()
|
||||
@ -66,20 +87,24 @@ def test_delete_reference(test_setup, depends=["test_healthchecks"]):
|
||||
dataset_urn = f"urn:li:dataset:({platform},{dataset_name},{env})"
|
||||
tag_urn = "urn:li:tag:NeedsDocs"
|
||||
|
||||
session, gms_host = get_session_and_host()
|
||||
graph = get_datahub_graph()
|
||||
|
||||
# Validate that the ingested tag is being referenced by the dataset
|
||||
references_count, related_aspects = delete_references(tag_urn, dry_run=True, cached_session_host=(session, gms_host))
|
||||
references_count, related_aspects = graph.delete_references_to_urn(
|
||||
tag_urn, dry_run=True
|
||||
)
|
||||
print("reference count: " + str(references_count))
|
||||
print(related_aspects)
|
||||
assert references_count == 1
|
||||
assert related_aspects[0]['entity'] == dataset_urn
|
||||
assert related_aspects[0]["entity"] == dataset_urn
|
||||
|
||||
# Delete references to the tag
|
||||
delete_references(tag_urn, dry_run=False, cached_session_host=(session, gms_host))
|
||||
graph.delete_references_to_urn(tag_urn, dry_run=False)
|
||||
|
||||
sleep(ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
|
||||
|
||||
# Validate that references no longer exist
|
||||
references_count, related_aspects = delete_references(tag_urn, dry_run=True, cached_session_host=(session, gms_host))
|
||||
references_count, related_aspects = graph.delete_references_to_urn(
|
||||
tag_urn, dry_run=True
|
||||
)
|
||||
assert references_count == 0
|
||||
|
||||
@ -1,10 +1,9 @@
|
||||
import json
|
||||
from time import sleep
|
||||
|
||||
from datahub.cli import delete_cli
|
||||
from datahub.cli import timeline_cli
|
||||
from datahub.cli.cli_utils import guess_entity_type, post_entity
|
||||
from tests.utils import ingest_file_via_rest
|
||||
from tests.utils import ingest_file_via_rest, get_datahub_graph
|
||||
from requests_wrapper import ELASTICSEARCH_REFRESH_INTERVAL_SECONDS
|
||||
|
||||
|
||||
@ -22,7 +21,7 @@ def test_all():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["TAG", "DOCUMENTATION", "TECHNICAL_SCHEMA", "GLOSSARY_TERM",
|
||||
"OWNER"], None, None, False)
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
@ -49,7 +48,7 @@ def test_schema():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["TECHNICAL_SCHEMA"], None, None, False)
|
||||
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
assert res_data[0]["semVerChange"] == "MINOR"
|
||||
@ -75,7 +74,7 @@ def test_glossary():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["GLOSSARY_TERM"], None, None, False)
|
||||
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
assert res_data[0]["semVerChange"] == "MINOR"
|
||||
@ -101,7 +100,7 @@ def test_documentation():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["DOCUMENTATION"], None, None, False)
|
||||
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
assert res_data[0]["semVerChange"] == "MINOR"
|
||||
@ -127,7 +126,7 @@ def test_tags():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["TAG"], None, None, False)
|
||||
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
assert res_data[0]["semVerChange"] == "MINOR"
|
||||
@ -153,7 +152,7 @@ def test_ownership():
|
||||
|
||||
res_data = timeline_cli.get_timeline(dataset_urn, ["OWNER"], None, None, False)
|
||||
|
||||
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
|
||||
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
|
||||
assert res_data
|
||||
assert len(res_data) == 3
|
||||
assert res_data[0]["semVerChange"] == "MINOR"
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
import functools
|
||||
import json
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
from datetime import datetime, timedelta, timezone
|
||||
import subprocess
|
||||
import time
|
||||
from typing import Any, Dict, List, Tuple
|
||||
@ -11,11 +12,13 @@ import requests_wrapper as requests
|
||||
import logging
|
||||
from datahub.cli import cli_utils
|
||||
from datahub.cli.cli_utils import get_system_auth
|
||||
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
|
||||
from datahub.ingestion.run.pipeline import Pipeline
|
||||
|
||||
TIME: int = 1581407189000
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def get_frontend_session():
|
||||
session = requests.Session()
|
||||
|
||||
@ -126,14 +129,13 @@ def ingest_file_via_rest(filename: str) -> Pipeline:
|
||||
return pipeline
|
||||
|
||||
|
||||
def delete_urn(urn: str) -> None:
|
||||
payload_obj = {"urn": urn}
|
||||
@functools.lru_cache(maxsize=1)
|
||||
def get_datahub_graph() -> DataHubGraph:
|
||||
return DataHubGraph(DatahubClientConfig(server=get_gms_url()))
|
||||
|
||||
cli_utils.post_delete_endpoint_with_session_and_url(
|
||||
requests.Session(),
|
||||
get_gms_url() + "/entities?action=delete",
|
||||
payload_obj,
|
||||
)
|
||||
|
||||
def delete_urn(urn: str) -> None:
|
||||
get_datahub_graph().hard_delete_entity(urn)
|
||||
|
||||
|
||||
def delete_urns(urns: List[str]) -> None:
|
||||
@ -172,15 +174,18 @@ def delete_urns_from_file(filename: str, shared_data: bool = False) -> None:
|
||||
# Deletes require 60 seconds when run between tests operating on common data, otherwise standard sync wait
|
||||
if shared_data:
|
||||
wait_for_writes_to_sync()
|
||||
# sleep(60)
|
||||
# sleep(60)
|
||||
else:
|
||||
wait_for_writes_to_sync()
|
||||
|
||||
|
||||
# sleep(requests.ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
|
||||
|
||||
|
||||
# Fixed now value
|
||||
NOW: datetime = datetime.now()
|
||||
|
||||
|
||||
def get_timestampmillis_at_start_of_day(relative_day_num: int) -> int:
|
||||
"""
|
||||
Returns the time in milliseconds from epoch at the start of the day
|
||||
@ -201,7 +206,7 @@ def get_timestampmillis_at_start_of_day(relative_day_num: int) -> int:
|
||||
|
||||
|
||||
def get_strftime_from_timestamp_millis(ts_millis: int) -> str:
|
||||
return datetime.fromtimestamp(ts_millis / 1000).strftime("%Y-%m-%d %H:%M:%S")
|
||||
return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).isoformat()
|
||||
|
||||
|
||||
def create_datahub_step_state_aspect(
|
||||
@ -242,19 +247,22 @@ def wait_for_writes_to_sync(max_timeout_in_sec: int = 120) -> None:
|
||||
# get offsets
|
||||
lag_zero = False
|
||||
while not lag_zero and (time.time() - start_time) < max_timeout_in_sec:
|
||||
time.sleep(1) # micro-sleep
|
||||
time.sleep(1) # micro-sleep
|
||||
completed_process = subprocess.run(
|
||||
"docker exec broker /bin/kafka-consumer-groups --bootstrap-server broker:29092 --group generic-mae-consumer-job-client --describe | grep -v LAG | awk '{print $6}'",
|
||||
capture_output=True,
|
||||
shell=True,
|
||||
text=True)
|
||||
|
||||
text=True,
|
||||
)
|
||||
|
||||
result = str(completed_process.stdout)
|
||||
lines = result.splitlines()
|
||||
lag_values = [int(l) for l in lines if l != ""]
|
||||
maximum_lag = max(lag_values)
|
||||
if maximum_lag == 0:
|
||||
lag_zero = True
|
||||
|
||||
|
||||
if not lag_zero:
|
||||
logger.warning(f"Exiting early from waiting for elastic to catch up due to a timeout. Current lag is {lag_values}")
|
||||
logger.warning(
|
||||
f"Exiting early from waiting for elastic to catch up due to a timeout. Current lag is {lag_values}"
|
||||
)
|
||||
|
||||