feat(cli): delete cli v2 (#8068)

Harshal Sheth 2023-05-24 01:13:44 +05:30 committed by GitHub
parent 3c0d720eb6
commit afd65e16fb
17 changed files with 1081 additions and 596 deletions

View File

@ -25,8 +25,8 @@ DataHub Docker Images:
Do not use `latest` or `debug` tags for any of the images, as those are not supported and are present only for legacy reasons. Please use `head` or version-specific tags like `v0.8.40`. For production, we recommend using version-specific tags, not `head`.
* [linkedin/datahub-ingestion](https://hub.docker.com/r/linkedin/datahub-ingestion/) - This contains the Python CLI. If you are looking for docker image for every minor CLI release you can find them under [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/).
* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/).
* [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/)
* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/)
* [linkedin/datahub-frontend-react](https://hub.docker.com/repository/docker/linkedin/datahub-frontend-react/)
* [linkedin/datahub-mae-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mae-consumer/)
* [linkedin/datahub-mce-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mce-consumer/)

View File

@ -138,14 +138,9 @@ The `check` command allows you to check if all plugins are loaded correctly as w
### delete
The `delete` command allows you to delete metadata from DataHub. Read this [guide](./how/delete-metadata.md) to understand how you can delete metadata from DataHub.
:::info
Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Container, it will not automatically delete the children; you will instead need to delete each child by URN in addition to deleting the parent.
:::
The `delete` command allows you to delete metadata from DataHub.
```console
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --soft
```
The [metadata deletion guide](./how/delete-metadata.md) covers the various options for the delete command.
### exists
@ -534,11 +529,11 @@ Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_event
### Using docker
[![Docker Hub](https://img.shields.io/docker/pulls/linkedin/datahub-ingestion?style=plastic)](https://hub.docker.com/r/linkedin/datahub-ingestion)
[![datahub-ingestion docker](https://github.com/datahub-project/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/datahub-project/datahub/actions/workflows/docker-ingestion.yml)
[![Docker Hub](https://img.shields.io/docker/pulls/acryldata/datahub-ingestion?style=plastic)](https://hub.docker.com/r/acryldata/datahub-ingestion)
[![datahub-ingestion docker](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml)
If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.
We have prebuilt images available on [Docker hub](https://hub.docker.com/r/acryldata/datahub-ingestion). All plugins will be installed and enabled automatically.
You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). If you are using Kubernetes, you can start a pod with the `datahub-ingestion` docker image, open a shell on the pod, and you will have access to the DataHub CLI in your Kubernetes cluster.

View File

@ -1,130 +1,236 @@
# Removing Metadata from DataHub
:::tip
To follow this guide, you'll need the [DataHub CLI](../cli.md).
:::
There are two ways to delete metadata from DataHub:
1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of entities
2. Delete metadata created by a single ingestion run
1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of urns (delete CLI).
2. Delete metadata created by a single ingestion run (rollback).
To follow this guide, you need to use the [DataHub CLI](../cli.md).
:::caution Be careful when deleting metadata
Read on to find out how to perform these kinds of deletes.
- Always use `--dry-run` to test your delete command before executing it.
- Prefer reversible soft deletes (`--soft`) over irreversible hard deletes (`--hard`).
_Note: Deleting metadata should only be done with care. Always use `--dry-run` to understand what will be deleted before proceeding. Prefer soft-deletes (`--soft`) unless you really want to nuke metadata rows. Hard deletes will actually delete rows in the primary store and recovering them will require using backups of the primary metadata store. Make sure you understand the implications of issuing soft-deletes versus hard-deletes before proceeding._
:::
## Delete CLI Usage
:::info
Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Domain, it will not delete those children, you will instead need to delete each child by URN in addition to deleting the parent.
Deleting metadata using DataHub's CLI is a simple, systems-level action. If you attempt to delete an entity with children, such as a container, it will not delete those children. Instead, you will need to delete each child by URN in addition to deleting the parent.
:::
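For example, to fully remove a container you would delete each child by urn and then the container itself. A minimal Python sketch using the graph client (the server address and all urns below are illustrative placeholders):

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Assumes a GMS instance reachable at this address.
graph = DataHubGraph(config=DatahubClientConfig(server="http://localhost:8080"))

# Hypothetical urns: a container and the datasets it holds.
container_urn = "urn:li:container:my-container-guid"
child_urns = [
    "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)",
]

# Delete the children first, then the parent. hard=False keeps these as
# reversible soft deletes.
for urn in child_urns:
    graph.delete_entity(urn=urn, hard=False)
graph.delete_entity(urn=container_urn, hard=False)
```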
## Delete By Urn
To delete all the data related to a single entity, run
All the commands below support the following options:
### Soft Delete (the default)
- `-n/--dry-run`: Execute a dry run instead of the actual delete.
- `--force`: Skip confirmation prompts.
This sets the `Status` aspect of the entity to `Removed`, which hides the entity and all of its aspects from the UI.
```
### Selecting entities to delete
You can either provide a single urn to delete, or use filters to select a set of entities to delete.
```shell
# Soft delete a single urn.
datahub delete --urn "<my urn>"
```
or
```
datahub delete --urn "<my urn>" --soft
# Soft delete using a filter.
datahub delete --platform snowflake
# Filters can be combined, which will select entities that match all filters.
datahub delete --platform looker --entity-type chart
datahub delete --platform bigquery --env PROD
```
### Hard Delete
When performing hard deletes, you can optionally add the `--only-soft-deleted` flag to only hard delete entities that were previously soft deleted.
This physically deletes all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
### Performing the delete
#### Soft delete an entity (default)
By default, the delete command will perform a soft delete.
This will set the `status` aspect's `removed` field to `true`, which will hide the entity from the UI. However, you'll still be able to view the entity's metadata in the UI with a direct link.
```shell
# The `--soft` flag is redundant since it's the default.
datahub delete --urn "<urn>" --soft
# or using a filter
datahub delete --platform snowflake --soft
```
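Under the hood, a soft delete just writes a `status` aspect with `removed` set to `true`. A rough sketch of the equivalent operation with the Python emitter (the server address and urn are illustrative):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Emitting status.removed = True is what hides the entity from the UI;
# emitting removed = False again would effectively reverse the soft delete.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)",
        aspect=StatusClass(removed=True),
    )
)
```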
#### Hard delete an entity
This will physically delete all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
```shell
datahub delete --urn "<my urn>" --hard
# or using a filter
datahub delete --platform snowflake --hard
```
As of DataHub v0.8.35, doing a hard delete by urn also provides a way to remove references to the urn being deleted across the metadata graph. This is important if you don't want ghost references in your metadata model and want to save space in the graph database.
For now, this behaviour is opt-in: a prompt will appear for you to manually accept or deny.
As of DataHub v0.10.2.3, hard deleting tags, glossary terms, users, and groups will also remove references to those entities across the metadata graph.
You can optionally add `-n` or `--dry-run` to execute a dry run before issuing the final delete command.
You can optionally add `-f` or `--force` to skip confirmations.
You can optionally add the `--only-soft-deleted` flag to remove soft-deleted items only.
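The reference cleanup is also exposed on its own, both as the `datahub delete references` subcommand and via the Python graph client. A small dry-run sketch (the tag urn is illustrative and assumes a reachable GMS configured for the CLI):

```python
from datahub.ingestion.graph.client import get_default_graph

graph = get_default_graph()

# Count references to an entity without deleting anything.
references_count, related_aspects = graph.delete_references_to_urn(
    urn="urn:li:tag:Legacy",
    dry_run=True,
)
print(f"Found {references_count} references to the tag")
```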
#### Hard delete a timeseries aspect
:::note
It's also possible to delete a range of timeseries aspect data for an entity without deleting the entire entity.
Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
For these deletes, the aspect and time ranges are required. You can delete all data for a timeseries aspect by providing `--start-time min --end-time max`.
```shell
datahub delete --urn "<my urn>" --aspect <aspect name> --start-time '-30 days' --end-time '-7 days'
# or using a filter
datahub delete --platform snowflake --entity-type dataset --aspect datasetProfile --start-time '0' --end-time '2023-01-01'
```
The start and end time fields filter on the `timestampMillis` field of the timeseries aspect. Allowed start and end time formats:
- `YYYY-MM-DD`: a specific date
- `YYYY-MM-DD HH:mm:ss`: a specific timestamp, assumed to be in UTC unless otherwise specified
- `+/-<number> <unit>` (e.g. `-7 days`): a relative time, where `<number>` is an integer and `<unit>` is one of `days`, `hours`, `minutes`, `seconds`
- `ddddddddd` (e.g. `1684384045`): a unix timestamp
- `min`, `max`, `now`: special keywords
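These strings are parsed by the `parse_user_datetime` helper introduced in this change (`datahub.configuration.datetimes`). A quick sketch of how a few of the accepted formats resolve:

```python
from datahub.configuration.datetimes import parse_user_datetime

print(parse_user_datetime("2023-01-01"))    # absolute date, assumed to be UTC
print(parse_user_datetime("-7 days"))       # relative: seven days before now
print(parse_user_datetime("1684384045"))    # unix timestamp in seconds
print(parse_user_datetime("min"))           # special keyword: beginning of time
```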
## Delete CLI Examples
:::note
Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
:::
If you wish to hard delete using a curl request, you can use something like the following. Replace the URN with the URN that you wish to delete.
_Note: All of the commands below support `--dry-run` and `--force` (skips confirmation prompts)._
#### Soft delete a single entity
```shell
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
```
#### Hard delete a single entity
```shell
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --hard
```
#### Delete everything from the Snowflake DEV environment
```shell
datahub delete --platform snowflake --env DEV
```
#### Delete all BigQuery datasets in the PROD environment
```shell
# Note: this will leave BigQuery containers intact.
datahub delete --env PROD --entity-type dataset --platform bigquery
```
#### Delete all pipelines and tasks from Airflow
```shell
datahub delete --platform "airflow"
```
#### Delete all containers for a particular platform
```shell
datahub delete --entity-type container --platform s3
```
#### Delete everything in the DEV environment
```shell
# This is a pretty broad filter, so make sure you know what you're doing!
datahub delete --env DEV
```
#### Delete all Looker dashboards and charts
```shell
datahub delete --platform looker
```
#### Delete all Looker charts (but not dashboards)
```shell
datahub delete --platform looker --entity-type chart
```
#### Clean up old datasetProfiles
```shell
datahub delete --entity-type dataset --aspect datasetProfile --start-time 'min' --end-time '-60 days'
```
#### Delete a tag
```shell
# Soft delete.
datahub delete --urn 'urn:li:tag:Legacy' --soft
# Or, using a hard delete. This will automatically clean up all tag associations.
datahub delete --urn 'urn:li:tag:Legacy' --hard
```
#### Delete all datasets that match a query
```shell
# Note: the query is an advanced feature, but can sometimes select extra entities - use it with caution!
datahub delete --entity-type dataset --query "_tmp"
```
#### Hard delete everything in Snowflake that was previously soft deleted
```shell
datahub delete --platform snowflake --only-soft-deleted --hard
```
## Deletes using the SDK and APIs
The Python SDK's [DataHubGraph](../../python-sdk/clients.md) client supports deletes via the following methods:
- `soft_delete_entity`
- `hard_delete_entity`
- `hard_delete_timeseries_aspect`
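A short sketch of these methods in action (the server address, urn, and time range are illustrative; the calls are independent examples, not a sequence to run together):

```python
from datahub.configuration.datetimes import parse_user_datetime
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(config=DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"

# Reversible soft delete: marks the entity as removed.
graph.soft_delete_entity(urn=dataset_urn)

# Irreversible hard delete: removes all rows for the entity.
graph.hard_delete_entity(urn=dataset_urn)

# Hard delete a window of a timeseries aspect (e.g. old profiles)
# without deleting the entity itself.
graph.hard_delete_timeseries_aspect(
    urn=dataset_urn,
    aspect_name="datasetProfile",
    start_time=parse_user_datetime("min"),
    end_time=parse_user_datetime("-60 days"),
)
```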
Deletes via the REST API are also possible, although we recommend using the SDK instead.
```shell
# hard delete an entity by urn
curl "http://localhost:8080/entities?action=delete" -X POST --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"}'
```
## Delete by filters
_Note: All of the commands below support the soft-delete option (`-s/--soft`) as well as the dry-run option (`-n/--dry-run`)._
### Delete all Datasets from the Snowflake platform
```
datahub delete --entity_type dataset --platform snowflake
```
### Delete all containers for a particular platform
```
datahub delete --entity_type container --platform s3
```
### Delete all datasets in the DEV environment
```
datahub delete --env DEV --entity_type dataset
```
### Delete all Pipelines and Tasks in the DEV environment
```
datahub delete --env DEV --entity_type "dataJob"
datahub delete --env DEV --entity_type "dataFlow"
```
### Delete all bigquery datasets in the PROD environment
```
datahub delete --env PROD --entity_type dataset --platform bigquery
```
### Delete all looker dashboards and charts
```
datahub delete --entity_type dashboard --platform looker
datahub delete --entity_type chart --platform looker
```
### Delete all datasets that match a query
```
datahub delete --entity_type dataset --query "_tmp"
```
## Rollback Ingestion Run
The second way to delete metadata is to identify entities (and the aspects affected) by using an ingestion `run-id`. Whenever you run `datahub ingest -c ...`, all the metadata ingested with that run will have the same run id.
To view the ids of the most recent set of ingestion batches, execute
```
```shell
datahub ingest list-runs
```
That will print out a table of all the runs. Once you have an idea of which run you want to roll back, run
```
```shell
datahub ingest show --run-id <run-id>
```
to see more information about the run.
Alternatively, you can execute a dry-run rollback to achieve the same outcome.
```shell
datahub ingest rollback --dry-run --run-id <run-id>
```
Finally, once you are sure you want to delete this data forever, run
```
```shell
datahub ingest rollback --run-id <run-id>
```
@ -133,10 +239,9 @@ This deletes both the versioned and the timeseries aspects associated with these
### Unsafe Entities and Rollback
> **_NOTE:_** Preservation of unsafe entities has been added in DataHub `0.8.32`. Read on to understand what it means and how it works.
In some cases, entities that were initially ingested by a run might have had further modifications to their metadata (e.g. adding terms, tags, or documentation) through the UI or other means. During a rollback of the ingestion that initially created these entities (technically, if the key aspect for these entities is being rolled back), the ingestion process will analyse the metadata graph for aspects that will be left "dangling" and will:
1. Leave these aspects untouched in the database, and soft-delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
1. Leave these aspects untouched in the database, and soft delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
2. Save information about these unsafe entities as a CSV for operators to later review and decide on next steps (keep or remove).
The rollback command will report how many entities have such aspects and will save the urns of these entities as a CSV under a rollback reports directory. This directory defaults to `rollback_reports` under the current directory where the CLI is run and can be configured with the `--reports-dir` command-line argument.

View File

@ -7,6 +7,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
### Breaking Changes
- #7900: The `catalog_pattern` and `schema_pattern` options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex `^` in your patterns, this should not affect you.
- #8068: In the `datahub delete` CLI, if an `--entity-type` filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset.
- #8068: In the `datahub delete` CLI, the `--start-time` and `--end-time` parameters are now required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use `--start-time min --end-time max`.
### Potential Downtime

View File

@ -1,15 +1,20 @@
import logging
from datahub.cli import delete_cli
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
graph = DataHubGraph(
config=DatahubClientConfig(
server="http://localhost:8080",
)
)
dataset_urn = make_dataset_urn(name="fct_users_created", platform="hive")
delete_cli._delete_one_urn(urn=dataset_urn, soft=True, cached_emitter=rest_emitter)
# Soft-delete the dataset.
graph.delete_entity(urn=dataset_urn, hard=False)
log.info(f"Deleted dataset {dataset_urn}")

View File

@ -10,6 +10,7 @@ from typing import Any, Dict, Iterable, List, Optional, Tuple, Type, Union
import click
import requests
import yaml
from deprecated import deprecated
from pydantic import BaseModel, ValidationError
from requests.models import Response
from requests.sessions import Session
@ -317,50 +318,7 @@ def post_rollback_endpoint(
)
def post_delete_references_endpoint(
payload_obj: dict,
path: str,
cached_session_host: Optional[Tuple[Session, str]] = None,
) -> Tuple[int, List[Dict]]:
session, gms_host = cached_session_host or get_session_and_host()
url = gms_host + path
payload = json.dumps(payload_obj)
response = session.post(url, payload)
summary = parse_run_restli_response(response)
reference_count = summary.get("total", 0)
related_aspects = summary.get("relatedAspects", [])
return reference_count, related_aspects
def post_delete_endpoint(
payload_obj: dict,
path: str,
cached_session_host: Optional[Tuple[Session, str]] = None,
) -> typing.Tuple[str, int, int]:
session, gms_host = cached_session_host or get_session_and_host()
url = gms_host + path
return post_delete_endpoint_with_session_and_url(session, url, payload_obj)
def post_delete_endpoint_with_session_and_url(
session: Session,
url: str,
payload_obj: dict,
) -> typing.Tuple[str, int, int]:
payload = json.dumps(payload_obj)
response = session.post(url, payload)
summary = parse_run_restli_response(response)
urn: str = summary.get("urn", "")
rows_affected: int = summary.get("rows", 0)
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
return urn, rows_affected, timeseries_rows_affected
@deprecated(reason="Use DataHubGraph.get_urns_by_filter instead")
def get_urns_by_filter(
platform: Optional[str],
env: Optional[str] = None,

View File

@ -1,65 +1,99 @@
import logging
import time
from dataclasses import dataclass
from datetime import datetime
from random import choices
from typing import Any, Dict, List, Optional, Tuple
from typing import Dict, List, Optional
import click
import humanfriendly
import progressbar
from click_default_group import DefaultGroup
from requests import sessions
from tabulate import tabulate
from datahub.cli import cli_utils
from datahub.emitter import rest_emitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import StatusClass, SystemMetadataClass
from datahub.configuration.datetimes import ClickDatetime
from datahub.emitter.aspect import ASPECT_MAP, TIMESERIES_ASPECT_MAP
from datahub.ingestion.graph.client import (
DataHubGraph,
RemovedStatusFilter,
get_default_graph,
)
from datahub.telemetry import telemetry
from datahub.upgrade import upgrade
from datahub.utilities.perf_timer import PerfTimer
from datahub.utilities.urns.urn import guess_entity_type
logger = logging.getLogger(__name__)
RUN_TABLE_COLUMNS = ["urn", "aspect name", "created at"]
_RUN_TABLE_COLUMNS = ["urn", "aspect name", "created at"]
_UNKNOWN_NUM_RECORDS = -1
UNKNOWN_NUM_RECORDS = -1
_DELETE_WITH_REFERENCES_TYPES = {
"tag",
"corpuser",
"corpGroup",
"domain",
"glossaryTerm",
"glossaryNode",
}
@click.group(cls=DefaultGroup, default="by-filter")
def delete() -> None:
"""Delete metadata from DataHub."""
"""Delete metadata from DataHub.
See https://datahubproject.io/docs/how/delete-metadata for more detailed docs.
"""
pass
@dataclass
class DeletionResult:
start_time: int = int(time.time() * 1000.0)
end_time: int = 0
num_records: int = 0
num_timeseries_records: int = 0
num_entities: int = 0
sample_records: Optional[List[List[str]]] = None
def start(self) -> None:
self.start_time = int(time.time() * 1000.0)
def end(self) -> None:
self.end_time = int(time.time() * 1000.0)
num_referenced_entities: int = 0
def merge(self, another_result: "DeletionResult") -> None:
self.end_time = another_result.end_time
self.num_records = (
self.num_records + another_result.num_records
if another_result.num_records != UNKNOWN_NUM_RECORDS
else UNKNOWN_NUM_RECORDS
self.num_records = self._sum_handle_unknown(
self.num_records, another_result.num_records
)
self.num_timeseries_records += another_result.num_timeseries_records
self.num_entities += another_result.num_entities
if another_result.sample_records:
if not self.sample_records:
self.sample_records = []
self.sample_records.extend(another_result.sample_records)
self.num_timeseries_records = self._sum_handle_unknown(
self.num_timeseries_records, another_result.num_timeseries_records
)
self.num_entities = self._sum_handle_unknown(
self.num_entities, another_result.num_entities
)
self.num_referenced_entities = self._sum_handle_unknown(
self.num_referenced_entities, another_result.num_referenced_entities
)
def format_message(self, *, dry_run: bool, soft: bool, time_sec: float) -> str:
counters = (
f"{self.num_entities} entities"
f" (impacts {self._value_or_unknown(self.num_records)} versioned rows"
f" and {self._value_or_unknown(self.num_timeseries_records)} timeseries aspect rows)"
)
if self.num_referenced_entities > 0:
counters += (
f" and cleaned up {self.num_referenced_entities} referenced entities"
)
if not dry_run:
delete_type = "Soft deleted" if soft else "Hard deleted"
return f"{delete_type} {counters} in {humanfriendly.format_timespan(time_sec)}."
else:
return f"[Dry-run] Would delete {counters}."
@classmethod
def _value_or_unknown(cls, value: int) -> str:
return str(value) if value != _UNKNOWN_NUM_RECORDS else "an unknown number of"
@classmethod
def _sum_handle_unknown(cls, value1: int, value2: int) -> int:
if value1 == _UNKNOWN_NUM_RECORDS or value2 == _UNKNOWN_NUM_RECORDS:
return _UNKNOWN_NUM_RECORDS
return value1 + value2
@delete.command()
@ -79,7 +113,7 @@ def by_registry(
registry_id: str,
soft: bool,
dry_run: bool,
) -> DeletionResult:
) -> None:
"""
Delete all metadata written using the given registry id and version pair.
"""
@ -89,35 +123,96 @@ def by_registry(
"Soft-deleting with a registry-id is not yet supported. Try --dry-run to see what you will be deleting, before issuing a hard-delete using the --hard flag"
)
deletion_result = DeletionResult()
deletion_result.num_entities = 1
deletion_result.num_records = UNKNOWN_NUM_RECORDS # Default is unknown
registry_delete = {"registryId": registry_id, "dryRun": dry_run, "soft": soft}
(
structured_rows,
entities_affected,
aspects_affected,
unsafe_aspects,
unsafe_entity_count,
unsafe_entities,
) = cli_utils.post_rollback_endpoint(registry_delete, "/entities?action=deleteAll")
deletion_result.num_entities = entities_affected
deletion_result.num_records = aspects_affected
deletion_result.sample_records = structured_rows
deletion_result.end()
return deletion_result
with PerfTimer() as timer:
registry_delete = {"registryId": registry_id, "dryRun": dry_run, "soft": soft}
(
structured_rows,
entities_affected,
aspects_affected,
unsafe_aspects,
unsafe_entity_count,
unsafe_entities,
) = cli_utils.post_rollback_endpoint(
registry_delete, "/entities?action=deleteAll"
)
if not dry_run:
message = "soft delete" if soft else "hard delete"
click.echo(
f"Took {timer.elapsed_seconds()} seconds to {message}"
f" {aspects_affected} versioned rows"
f" for {entities_affected} entities."
)
else:
click.echo(
f"{entities_affected} entities with {aspects_affected} rows will be affected. "
f"Took {timer.elapsed_seconds()} seconds to evaluate."
)
if structured_rows:
click.echo(tabulate(structured_rows, _RUN_TABLE_COLUMNS, tablefmt="grid"))
@delete.command()
@click.option("--urn", required=True, type=str, help="the urn of the entity")
@click.option("-n", "--dry-run", required=False, is_flag=True)
@click.option(
"-f", "--force", required=False, is_flag=True, help="force the delete if set"
)
@telemetry.with_telemetry()
def references(urn: str, dry_run: bool, force: bool) -> None:
"""
Delete all references to an entity (but not the entity itself).
"""
graph = get_default_graph()
logger.info(f"Using graph: {graph}")
references_count, related_aspects = graph.delete_references_to_urn(
urn=urn,
dry_run=True,
)
if references_count == 0:
click.echo(f"No references to {urn} found")
return
click.echo(f"Found {references_count} references to {urn}")
sample_msg = (
"\nSample of references\n"
+ tabulate(
[x.values() for x in related_aspects],
["relationship", "entity", "aspect"],
)
+ "\n"
)
click.echo(sample_msg)
if dry_run:
logger.info(f"[Dry-run] Would remove {references_count} references to {urn}")
else:
if not force:
click.confirm(
f"This will delete {references_count} references to {urn} from DataHub. Do you want to continue?",
abort=True,
)
references_count, _ = graph.delete_references_to_urn(
urn=urn,
dry_run=False,
)
logger.info(f"Deleted {references_count} references to {urn}")
@delete.command()
@click.option("--urn", required=False, type=str, help="the urn of the entity")
@click.option(
"-a",
# option with `_` is inconsistent with rest of CLI but kept for backward compatibility
"--aspect_name",
"--aspect",
# This option is inconsistent with rest of CLI but kept for backward compatibility
"--aspect-name",
required=False,
type=str,
help="the aspect name associated with the entity(only for timeseries aspects)",
help="the aspect name associated with the entity",
)
@click.option(
"-f", "--force", required=False, is_flag=True, help="force the delete if set"
@ -136,40 +231,37 @@ def by_registry(
"-p", "--platform", required=False, type=str, help="the platform of the entity"
)
@click.option(
# option with `_` is inconsistent with rest of CLI but kept for backward compatibility
"--entity_type",
"--entity-type",
required=False,
type=str,
default="dataset",
help="the entity type of the entity",
)
@click.option("--query", required=False, type=str)
@click.option(
"--start-time",
required=False,
type=click.DateTime(),
help="the start time(only for timeseries aspects)",
type=ClickDatetime(),
help="the start time (only for timeseries aspects)",
)
@click.option(
"--end-time",
required=False,
type=click.DateTime(),
help="the end time(only for timeseries aspects)",
type=ClickDatetime(),
help="the end time (only for timeseries aspects)",
)
@click.option("-n", "--dry-run", required=False, is_flag=True)
@click.option("--only-soft-deleted", required=False, is_flag=True, default=False)
@upgrade.check_upgrade
@telemetry.with_telemetry()
def by_filter(
urn: str,
aspect_name: Optional[str],
urn: Optional[str],
aspect: Optional[str],
force: bool,
soft: bool,
env: str,
platform: str,
entity_type: str,
query: str,
env: Optional[str],
platform: Optional[str],
entity_type: Optional[str],
query: Optional[str],
start_time: Optional[datetime],
end_time: Optional[datetime],
dry_run: bool,
@ -177,23 +269,15 @@ def by_filter(
) -> None:
"""Delete metadata from datahub using a single urn or a combination of filters"""
cli_utils.test_connectivity_complain_exit("delete")
# one of these must be provided
if not urn and not platform and not env and not query:
raise click.UsageError(
"You must provide one of urn / platform / env / query in order to delete entities."
)
include_removed: bool
if soft:
# For soft-delete include-removed does not make any sense
include_removed = False
else:
# For hard-delete we always include the soft-deleted items
include_removed = True
# default query is set to "*" if not provided
query = "*" if query is None else query
# Validate the cli arguments.
_validate_user_urn_and_filters(
urn=urn, entity_type=entity_type, platform=platform, env=env, query=query
)
soft_delete_filter = _validate_user_soft_delete_flags(
soft=soft, aspect=aspect, only_soft_deleted=only_soft_deleted
)
_validate_user_aspect_flags(aspect=aspect, start_time=start_time, end_time=end_time)
# TODO: add some validation on entity_type
if not force and not soft and not dry_run:
click.confirm(
@ -201,305 +285,241 @@ def by_filter(
abort=True,
)
graph = get_default_graph()
logger.info(f"Using {graph}")
# Determine which urns to delete.
if urn:
# Single urn based delete
session, host = cli_utils.get_session_and_host()
entity_type = guess_entity_type(urn=urn)
logger.info(f"DataHub configured with {host}")
if not aspect_name:
references_count, related_aspects = delete_references(
urn, dry_run=True, cached_session_host=(session, host)
)
remove_references: bool = False
if (not force) and references_count > 0:
click.echo(
f"This urn was referenced in {references_count} other aspects across your metadata graph:"
)
click.echo(
tabulate(
[x.values() for x in related_aspects],
["relationship", "entity", "aspect"],
tablefmt="grid",
)
)
remove_references = click.confirm(
"Do you want to delete these references?"
)
if force or remove_references:
delete_references(
urn, dry_run=False, cached_session_host=(session, host)
)
deletion_result: DeletionResult = delete_one_urn_cmd(
urn,
aspect_name=aspect_name,
soft=soft,
dry_run=dry_run,
start_time=start_time,
end_time=end_time,
cached_session_host=(session, host),
)
if not dry_run:
if deletion_result.num_records == 0:
click.echo(f"Nothing deleted for {urn}")
else:
click.echo(
f"Successfully deleted {urn}. {deletion_result.num_records} rows deleted"
)
delete_by_urn = True
urns = [urn]
else:
# Filter based delete
deletion_result = delete_with_filters(
env=env,
platform=platform,
dry_run=dry_run,
soft=soft,
entity_type=entity_type,
search_query=query,
force=force,
include_removed=include_removed,
aspect_name=aspect_name,
only_soft_deleted=only_soft_deleted,
)
if not dry_run:
message = "soft delete" if soft else "hard delete"
click.echo(
f"Took {(deletion_result.end_time-deletion_result.start_time)/1000.0} seconds to {message}"
f" {deletion_result.num_records} versioned rows"
f" and {deletion_result.num_timeseries_records} timeseries aspect rows"
f" for {deletion_result.num_entities} entities."
)
else:
click.echo(
f"{deletion_result.num_entities} entities with {deletion_result.num_records if deletion_result.num_records != UNKNOWN_NUM_RECORDS else 'unknown'} rows will be affected. Took {(deletion_result.end_time-deletion_result.start_time)/1000.0} seconds to evaluate."
)
if deletion_result.sample_records:
click.echo(
tabulate(deletion_result.sample_records, RUN_TABLE_COLUMNS, tablefmt="grid")
)
def _get_current_time() -> int:
return int(time.time() * 1000.0)
@telemetry.with_telemetry()
def delete_with_filters(
dry_run: bool,
soft: bool,
force: bool,
include_removed: bool,
aspect_name: Optional[str] = None,
search_query: str = "*",
entity_type: str = "dataset",
env: Optional[str] = None,
platform: Optional[str] = None,
only_soft_deleted: Optional[bool] = False,
) -> DeletionResult:
session, gms_host = cli_utils.get_session_and_host()
token = cli_utils.get_token()
logger.info(f"datahub configured with {gms_host}")
emitter = rest_emitter.DatahubRestEmitter(gms_server=gms_host, token=token)
batch_deletion_result = DeletionResult()
urns: List[str] = []
if not only_soft_deleted:
delete_by_urn = False
urns = list(
cli_utils.get_urns_by_filter(
env=env,
graph.get_urns_by_filter(
entity_types=[entity_type] if entity_type else None,
platform=platform,
search_query=search_query,
entity_type=entity_type,
include_removed=False,
env=env,
query=query,
status=soft_delete_filter,
)
)
soft_deleted_urns: List[str] = []
if include_removed or only_soft_deleted:
soft_deleted_urns = list(
cli_utils.get_urns_by_filter(
env=env,
platform=platform,
search_query=search_query,
entity_type=entity_type,
only_soft_deleted=True,
if len(urns) == 0:
click.echo(
"Found no urns to delete. Maybe you want to change your filters to be something different?"
)
return
urns_by_type: Dict[str, List[str]] = {}
for urn in urns:
entity_type = guess_entity_type(urn)
urns_by_type.setdefault(entity_type, []).append(urn)
if len(urns_by_type) > 1:
# Display a breakdown of urns by entity type if there's multiple.
click.echo("Filter matched urns of multiple entity types")
for entity_type, entity_urns in urns_by_type.items():
click.echo(
f"- {len(entity_urns)} {entity_type} urn(s). Sample: {choices(entity_urns, k=min(5, len(entity_urns)))}"
)
else:
click.echo(
f"Filter matched {len(urns)} {entity_type} urn(s). Sample: {choices(urns, k=min(5, len(urns)))}"
)
if not force and not dry_run:
click.confirm(
f"This will delete {len(urns)} entities from DataHub. Do you want to continue?",
abort=True,
)
urns_iter = urns
if not delete_by_urn and not dry_run:
urns_iter = progressbar.progressbar(urns, redirect_stdout=True)
# Run the deletion.
deletion_result = DeletionResult()
with PerfTimer() as timer:
for urn in urns_iter:
one_result = _delete_one_urn(
graph=graph,
urn=urn,
aspect_name=aspect,
soft=soft,
dry_run=dry_run,
start_time=start_time,
end_time=end_time,
)
deletion_result.merge(one_result)
# Report out a summary of the deletion result.
click.echo(
deletion_result.format_message(
dry_run=dry_run, soft=soft, time_sec=timer.elapsed_seconds()
)
final_message = ""
if len(urns) > 0:
final_message = f"{len(urns)} "
if len(urns) > 0 and len(soft_deleted_urns) > 0:
final_message += "and "
if len(soft_deleted_urns) > 0:
final_message = f"{len(soft_deleted_urns)} (soft-deleted) "
logger.info(
f"Filter matched {final_message} {entity_type} entities of {platform}. Sample: {choices(urns, k=min(5, len(urns)))}"
)
if len(urns) == 0 and len(soft_deleted_urns) == 0:
click.echo(
f"No urns to delete. Maybe you want to change entity_type={entity_type} or platform={platform} to be something different?"
)
return DeletionResult(end_time=int(time.time() * 1000.0))
if not force and not dry_run:
type_delete = "soft" if soft else "permanently"
click.confirm(
f"This will {type_delete} delete {len(urns)} entities. Are you sure?",
abort=True,
def _validate_user_urn_and_filters(
urn: Optional[str],
entity_type: Optional[str],
platform: Optional[str],
env: Optional[str],
query: Optional[str],
) -> None:
# Check urn / filters options.
if urn:
if entity_type or platform or env or query:
raise click.UsageError(
"You cannot provide both an urn and a filter rule (entity-type / platform / env / query)."
)
elif not urn and not (entity_type or platform or env or query):
raise click.UsageError(
"You must provide either an urn or at least one filter (entity-type / platform / env / query) in order to delete entities."
)
elif query:
logger.warning(
"Using --query is an advanced feature and can easily delete unintended entities. Please use with caution."
)
elif env and not (platform or entity_type):
logger.warning(
f"Using --env without other filters will delete all metadata in the {env} environment. Please use with caution."
)
if len(urns) > 0:
for urn in progressbar.progressbar(urns, redirect_stdout=True):
one_result = _delete_one_urn(
urn,
soft=soft,
aspect_name=aspect_name,
dry_run=dry_run,
cached_session_host=(session, gms_host),
cached_emitter=emitter,
)
batch_deletion_result.merge(one_result)
if len(soft_deleted_urns) > 0 and not soft:
click.echo("Starting to delete soft-deleted URNs")
for urn in progressbar.progressbar(soft_deleted_urns, redirect_stdout=True):
one_result = _delete_one_urn(
urn,
soft=soft,
dry_run=dry_run,
cached_session_host=(session, gms_host),
cached_emitter=emitter,
is_soft_deleted=True,
)
batch_deletion_result.merge(one_result)
batch_deletion_result.end()
def _validate_user_soft_delete_flags(
soft: bool, aspect: Optional[str], only_soft_deleted: bool
) -> RemovedStatusFilter:
# Check soft / hard delete flags.
# Note: aspect not None ==> hard delete,
# but aspect is None ==> could be either soft or hard delete
return batch_deletion_result
if soft:
if aspect:
raise click.UsageError(
"You cannot provide an aspect name when performing a soft delete. Use --hard to perform a hard delete."
)
if only_soft_deleted:
raise click.UsageError(
"You cannot provide --only-soft-deleted when performing a soft delete. Use --hard to perform a hard delete."
)
soft_delete_filter = RemovedStatusFilter.NOT_SOFT_DELETED
else:
# For hard deletes, we will always include the soft-deleted entities, and
# can optionally filter to exclude non-soft-deleted entities.
if only_soft_deleted:
soft_delete_filter = RemovedStatusFilter.ONLY_SOFT_DELETED
else:
soft_delete_filter = RemovedStatusFilter.ALL
return soft_delete_filter
def _validate_user_aspect_flags(
aspect: Optional[str],
start_time: Optional[datetime],
end_time: Optional[datetime],
) -> None:
# Check the aspect name.
if aspect and aspect not in ASPECT_MAP:
logger.info(f"Supported aspects: {list(sorted(ASPECT_MAP.keys()))}")
raise click.UsageError(
f"Unknown aspect {aspect}. Ensure the aspect is in the above list."
)
# Check that start/end time are set if and only if the aspect is a timeseries aspect.
if aspect and aspect in TIMESERIES_ASPECT_MAP:
if not start_time or not end_time:
raise click.UsageError(
"You must provide both --start-time and --end-time when deleting a timeseries aspect."
)
elif start_time or end_time:
raise click.UsageError(
"You can only provide --start-time and --end-time when deleting a timeseries aspect."
)
elif aspect:
raise click.UsageError(
"Aspect-specific deletion is only supported for timeseries aspects. Please delete the full entity or use a rollback instead."
)
def _delete_one_urn(
graph: DataHubGraph,
urn: str,
soft: bool = False,
dry_run: bool = False,
aspect_name: Optional[str] = None,
start_time: Optional[datetime] = None,
end_time: Optional[datetime] = None,
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
cached_emitter: Optional[rest_emitter.DatahubRestEmitter] = None,
run_id: str = "delete-run-id",
deletion_timestamp: Optional[int] = None,
is_soft_deleted: Optional[bool] = None,
run_id: str = "__datahub-delete-cli",
) -> DeletionResult:
deletion_timestamp = deletion_timestamp or _get_current_time()
soft_delete_msg: str = ""
if dry_run and is_soft_deleted:
soft_delete_msg = "(soft-deleted)"
deletion_result = DeletionResult()
deletion_result.num_entities = 1
deletion_result.num_records = UNKNOWN_NUM_RECORDS # Default is unknown
rows_affected: int = 0
ts_rows_affected: int = 0
referenced_entities_affected: int = 0
if soft:
if aspect_name:
raise click.UsageError(
"Please provide --hard flag, as aspect values cannot be soft deleted."
)
# Add removed aspect
if cached_emitter:
emitter = cached_emitter
else:
_, gms_host = cli_utils.get_session_and_host()
token = cli_utils.get_token()
emitter = rest_emitter.DatahubRestEmitter(gms_server=gms_host, token=token)
# Soft delete of entity.
assert not aspect_name, "aspects cannot be soft deleted"
if not dry_run:
emitter.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=StatusClass(removed=True),
systemMetadata=SystemMetadataClass(
runId=run_id, lastObserved=deletion_timestamp
),
)
)
graph.soft_delete_entity(urn=urn, run_id=run_id)
else:
logger.info(f"[Dry-run] Would soft-delete {urn}")
elif not dry_run:
payload_obj: Dict[str, Any] = {"urn": urn}
if aspect_name:
payload_obj["aspectName"] = aspect_name
if start_time:
payload_obj["startTimeMillis"] = int(round(start_time.timestamp() * 1000))
if end_time:
payload_obj["endTimeMillis"] = int(round(end_time.timestamp() * 1000))
rows_affected: int
ts_rows_affected: int
urn, rows_affected, ts_rows_affected = cli_utils.post_delete_endpoint(
payload_obj,
"/entities?action=delete",
cached_session_host=cached_session_host,
)
deletion_result.num_records = rows_affected
deletion_result.num_timeseries_records = ts_rows_affected
else:
if aspect_name:
logger.info(
f"[Dry-run] Would hard-delete aspect {aspect_name} of {urn} {soft_delete_msg}"
rows_affected = 1
ts_rows_affected = 0
elif aspect_name and aspect_name in TIMESERIES_ASPECT_MAP:
# Hard delete of timeseries aspect.
if not dry_run:
ts_rows_affected = graph.hard_delete_timeseries_aspect(
urn=urn,
aspect_name=aspect_name,
start_time=start_time,
end_time=end_time,
)
else:
logger.info(f"[Dry-run] Would hard-delete {urn} {soft_delete_msg}")
deletion_result.num_records = (
UNKNOWN_NUM_RECORDS # since we don't know how many rows will be affected
logger.info(
f"[Dry-run] Would hard-delete {urn} timeseries aspect {aspect_name}"
)
ts_rows_affected = _UNKNOWN_NUM_RECORDS
elif aspect_name:
# Hard delete of non-timeseries aspect.
# TODO: The backend doesn't support this yet.
raise NotImplementedError(
"Delete by aspect is not supported yet for non-timeseries aspects. Please delete the full entity or use rollback instead."
)
deletion_result.end()
return deletion_result
else:
# Full entity hard delete.
assert not soft and not aspect_name
if not dry_run:
rows_affected, ts_rows_affected = graph.hard_delete_entity(
urn=urn,
)
else:
logger.info(f"[Dry-run] Would hard-delete {urn}")
rows_affected = _UNKNOWN_NUM_RECORDS
ts_rows_affected = _UNKNOWN_NUM_RECORDS
@telemetry.with_telemetry()
def delete_one_urn_cmd(
urn: str,
aspect_name: Optional[str] = None,
soft: bool = False,
dry_run: bool = False,
start_time: Optional[datetime] = None,
end_time: Optional[datetime] = None,
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
cached_emitter: Optional[rest_emitter.DatahubRestEmitter] = None,
) -> DeletionResult:
"""
Wrapper around delete_one_urn because it is also called in a loop via delete_with_filters.
# For full entity deletes, we also might clean up references to the entity.
if guess_entity_type(urn) in _DELETE_WITH_REFERENCES_TYPES:
referenced_entities_affected, _ = graph.delete_references_to_urn(
urn=urn,
dry_run=dry_run,
)
if dry_run and referenced_entities_affected > 0:
logger.info(
f"[Dry-run] Would remove {referenced_entities_affected} references to {urn}"
)
This is a separate function that is called only when a single URN is deleted via the CLI.
"""
return _delete_one_urn(
urn,
soft,
dry_run,
aspect_name,
start_time,
end_time,
cached_session_host,
cached_emitter,
)
def delete_references(
urn: str,
dry_run: bool = False,
cached_session_host: Optional[Tuple[sessions.Session, str]] = None,
) -> Tuple[int, List[Dict]]:
payload_obj = {"urn": urn, "dryRun": dry_run}
return cli_utils.post_delete_references_endpoint(
payload_obj,
"/entities?action=deleteReferences",
cached_session_host=cached_session_host,
return DeletionResult(
num_entities=1,
num_records=rows_affected,
num_timeseries_records=ts_rows_affected,
num_referenced_entities=referenced_entities_affected,
)

View File

@ -23,6 +23,7 @@ from datahub.emitter.mcp_builder import (
SchemaKey,
)
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import (
ContainerKeyClass,
ContainerPropertiesClass,
@ -141,8 +142,8 @@ def dataplatform2instance_func(
system_metadata = SystemMetadataClass(runId=run_id)
if not dry_run:
rest_emitter = DatahubRestEmitter(
gms_server=cli_utils.get_session_and_host()[1]
graph = DataHubGraph(
config=DataHubGraphConfig(server=cli_utils.get_session_and_host()[1])
)
urns_to_migrate = []
@ -214,11 +215,11 @@ def dataplatform2instance_func(
run_id=run_id,
):
if not dry_run:
rest_emitter.emit_mcp(mcp)
graph.emit_mcp(mcp)
migration_report.on_entity_create(mcp.entityUrn, mcp.aspectName) # type: ignore
if not dry_run:
rest_emitter.emit_mcp(
graph.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=new_urn,
aspect=DataPlatformInstanceClass(
@ -252,14 +253,16 @@ def dataplatform2instance_func(
aspect=aspect,
)
if not dry_run:
rest_emitter.emit_mcp(mcp)
graph.emit_mcp(mcp)
migration_report.on_entity_affected(mcp.entityUrn, mcp.aspectName) # type: ignore
else:
log.debug(f"Didn't find aspect {aspect_name} for urn {target_urn}")
if not dry_run and not keep:
log.info(f"will {'hard' if hard else 'soft'} delete {src_entity_urn}")
delete_cli._delete_one_urn(src_entity_urn, soft=not hard, run_id=run_id)
delete_cli._delete_one_urn(
graph, src_entity_urn, soft=not hard, run_id=run_id
)
migration_report.on_entity_migrated(src_entity_urn, "status") # type: ignore
click.echo(f"{migration_report}")
@ -270,7 +273,7 @@ def dataplatform2instance_func(
instance=instance,
platform=platform,
keep=keep,
rest_emitter=rest_emitter,
rest_emitter=graph,
)
@ -281,7 +284,7 @@ def migrate_containers(
hard: bool,
instance: str,
keep: bool,
rest_emitter: DatahubRestEmitter,
rest_emitter: DataHubGraph,
) -> None:
run_id: str = f"container-migrate-{uuid.uuid4()}"
migration_report = MigrationReport(run_id, dry_run, keep)
@ -369,7 +372,9 @@ def migrate_containers(
if not dry_run and not keep:
log.info(f"will {'hard' if hard else 'soft'} delete {src_urn}")
delete_cli._delete_one_urn(src_urn, soft=not hard, run_id=run_id)
delete_cli._delete_one_urn(
rest_emitter, src_urn, soft=not hard, run_id=run_id
)
migration_report.on_entity_migrated(src_urn, "status") # type: ignore
click.echo(f"{migration_report}")

View File

@ -13,7 +13,6 @@ import click
from click_default_group import DefaultGroup
from datahub.api.entities.dataproduct.dataproduct import DataProduct
from datahub.cli.delete_cli import delete_one_urn_cmd, delete_references
from datahub.cli.specific.file_loader import load_file
from datahub.emitter.mce_builder import make_group_urn, make_user_urn
from datahub.ingestion.graph.client import DataHubGraph, get_default_graph
@ -213,6 +212,7 @@ def delete(urn: str, file: Path, hard: bool) -> None:
)
raise click.Abort()
graph: DataHubGraph
with get_default_graph() as graph:
data_product_urn = (
urn if urn.startswith("urn:li:dataProduct") else f"urn:li:dataProduct:{urn}"
@ -225,9 +225,10 @@ def delete(urn: str, file: Path, hard: bool) -> None:
if hard:
# we only delete references if this is a hard delete
delete_references(data_product_urn)
graph.delete_references_to_urn(data_product_urn)
graph.delete_entity(data_product_urn, hard=hard)
delete_one_urn_cmd(data_product_urn, soft=not hard)
click.secho(f"Data Product {data_product_urn} deleted")

View File

@ -0,0 +1,94 @@
import contextlib
import logging
from datetime import datetime, timedelta, timezone
from typing import Any, Optional
import click
import dateutil.parser
import humanfriendly
logger = logging.getLogger(__name__)
def parse_user_datetime(input: str) -> datetime:
"""Parse absolute and relative time strings into datetime objects.
This parses strings like "2022-01-01 01:02:03" and "-7 days"
and timestamps like "1630440123".
Args:
input: A string representing a datetime or relative time.
Returns:
A timezone-aware datetime object in UTC. If the input specifies a different
timezone, it will be converted to UTC.
"""
# Special cases.
if input == "now":
return datetime.now(tz=timezone.utc)
elif input == "min":
return datetime.min.replace(tzinfo=timezone.utc)
elif input == "max":
return datetime.max.replace(tzinfo=timezone.utc)
# First try parsing as a timestamp.
with contextlib.suppress(ValueError):
ts = float(input)
try:
return datetime.fromtimestamp(ts, tz=timezone.utc)
except (OverflowError, ValueError):
# This is likely a timestamp in milliseconds.
return datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
# Then try parsing as a relative time.
with contextlib.suppress(humanfriendly.InvalidTimespan):
delta = _parse_relative_timespan(input)
return datetime.now(tz=timezone.utc) + delta
# Finally, try parsing as an absolute time.
with contextlib.suppress(dateutil.parser.ParserError):
dt = dateutil.parser.parse(input)
if dt.tzinfo is None:
# Assume that the user meant to specify a time in UTC.
dt = dt.replace(tzinfo=timezone.utc)
else:
# Convert to UTC.
dt = dt.astimezone(timezone.utc)
return dt
raise ValueError(f"Could not parse {input} as a datetime or relative time.")
def _parse_relative_timespan(input: str) -> timedelta:
neg = False
input = input.strip()
if input.startswith("+"):
input = input[1:]
elif input.startswith("-"):
input = input[1:]
neg = True
seconds = humanfriendly.parse_timespan(input)
delta = timedelta(seconds=seconds)
if neg:
delta = -delta
logger.debug(f'Parsed "{input}" as {delta}.')
return delta
class ClickDatetime(click.ParamType):
name = "datetime"
def convert(
self, value: Any, param: Optional[click.Parameter], ctx: Optional[click.Context]
) -> datetime:
if isinstance(value, datetime):
return value
try:
return parse_user_datetime(value)
except ValueError as e:
self.fail(str(e), param, ctx)

View File

@ -266,9 +266,14 @@ class DataHubRestEmitter(Closeable):
response.raise_for_status()
except HTTPError as e:
try:
info = response.json()
info: Dict = response.json()
logger.debug(
"Full stack trace from DataHub:\n%s", info.get("stackTrace")
)
info.pop("stackTrace", None)
raise OperationalError(
"Unable to emit metadata to DataHub GMS", info
f"Unable to emit metadata to DataHub GMS: {info.get('message')}",
info,
) from e
except JSONDecodeError:
# If we can't parse the JSON, just raise the original error.
@ -286,9 +291,7 @@ class DataHubRestEmitter(Closeable):
if self._token
else ""
)
return (
f"DataHubRestEmitter: configured to talk to {self._gms_server}{token_str}"
)
return f"{self.__class__.__name__}: configured to talk to {self._gms_server}{token_str}"
def flush(self) -> None:
# No-op, but present to keep the interface consistent with the Kafka emitter.

View File

@ -1,17 +1,16 @@
import enum
import json
import logging
import textwrap
import time
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
from json.decoder import JSONDecodeError
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Type, Union
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple, Type
from avro.schema import RecordSchema
from deprecated import deprecated
from requests.adapters import Response
from requests.models import HTTPError
from typing_extensions import Literal
from datahub.cli.cli_utils import get_url_and_token
from datahub.configuration.common import ConfigModel, GraphError, OperationalError
@ -72,6 +71,19 @@ class DatahubClientConfig(ConfigModel):
DataHubGraphConfig = DatahubClientConfig
class RemovedStatusFilter(enum.Enum):
"""Filter for the status of entities during search."""
NOT_SOFT_DELETED = "NOT_SOFT_DELETED"
"""Search only entities that have not been marked as deleted."""
ALL = "ALL"
"""Search all entities, including deleted entities."""
ONLY_SOFT_DELETED = "ONLY_SOFT_DELETED"
"""Search only soft-deleted entities."""
def _graphql_entity_type(entity_type: str) -> str:
"""Convert the entity types into GraphQL "EntityType" enum values."""
@ -124,9 +136,9 @@ class DataHubGraph(DatahubRestEmitter):
self.server_id = "missing"
logger.debug(f"Failed to get server id due to {e}")
def _get_generic(self, url: str, params: Optional[Dict] = None) -> Dict:
def _send_restli_request(self, method: str, url: str, **kwargs: Any) -> Dict:
try:
response = self._session.get(url, params=params)
response = self._session.request(method, url, **kwargs)
response.raise_for_status()
return response.json()
except HTTPError as e:
@ -141,24 +153,11 @@ class DataHubGraph(DatahubRestEmitter):
"Unable to get metadata from DataHub", {"message": str(e)}
) from e
def _get_generic(self, url: str, params: Optional[Dict] = None) -> Dict:
return self._send_restli_request("GET", url, params=params)
def _post_generic(self, url: str, payload_dict: Dict) -> Dict:
payload = json.dumps(payload_dict)
logger.debug(payload)
try:
response: Response = self._session.post(url, payload)
response.raise_for_status()
return response.json()
except HTTPError as e:
try:
info = response.json()
raise OperationalError(
"Unable to get metadata from DataHub", info
) from e
except JSONDecodeError:
# If we can't parse the JSON, just raise the original error.
raise OperationalError(
"Unable to get metadata from DataHub", {"message": str(e)}
) from e
return self._send_restli_request("POST", url, json=payload_dict)
def get_aspect(
self,
@ -449,10 +448,6 @@ class DataHubGraph(DatahubRestEmitter):
def _aspect_count_endpoint(self):
return f"{self.config.server}/aspects?action=getCount"
@property
def _scroll_across_entities_endpoint(self):
return f"{self.config.server}/entities?action=scrollAcrossEntities"
def get_domain_urn_by_name(self, domain_name: str) -> Optional[str]:
"""Retrieve a domain urn based on its name. Returns None if there is no match found"""
@ -487,6 +482,9 @@ class DataHubGraph(DatahubRestEmitter):
entities.append(x["entity"])
return entities[0] if entities_yielded else None
@deprecated(
reason='Use get_urns_by_filter(entity_types=["container"], ...) instead'
)
def get_container_urns_by_filter(
self,
env: Optional[str] = None,
@ -536,15 +534,21 @@ class DataHubGraph(DatahubRestEmitter):
*,
entity_types: Optional[List[str]] = None,
platform: Optional[str] = None,
env: Optional[str] = None,
query: Optional[str] = None,
status: RemovedStatusFilter = RemovedStatusFilter.NOT_SOFT_DELETED,
batch_size: int = 10000,
) -> Iterable[str]:
"""Fetch all urns that match the given filters.
Filters are combined conjunctively. If multiple filters are specified, the results will match all of them.
Note that specifying a platform filter will automatically exclude all entity types that do not have a platform.
The same goes for the env filter.
:param entity_types: List of entity types to include. If None, all entity types will be returned.
:param platform: Platform to filter on. If None, all platforms will be returned.
:param env: Environment (e.g. PROD, DEV) to filter on. If None, all environments will be returned.
:param status: Filter on the deletion status of the entity. The default is only return non-soft-deleted entities.
"""
types: Optional[List[str]] = None
@ -554,11 +558,13 @@ class DataHubGraph(DatahubRestEmitter):
types = [_graphql_entity_type(entity_type) for entity_type in entity_types]
# Does not filter on env, because env is missing in dashboard / chart urns and custom properties
# For containers, use { field: "customProperties", values: ["instance=env}"], condition:EQUAL }
# For others, use { field: "origin", values: ["env"], condition:EQUAL }
# Add the query default of * if no query is specified.
query = query or "*"
andFilters = []
FilterRule = Dict[str, Any]
andFilters: List[FilterRule] = []
# Platform filter.
if platform:
andFilters += [
{
@ -567,23 +573,90 @@ class DataHubGraph(DatahubRestEmitter):
"condition": "EQUAL",
}
]
orFilters = [{"and": andFilters}]
query = textwrap.dedent(
# Status filter.
if status == RemovedStatusFilter.NOT_SOFT_DELETED:
# Subtle: in some cases (e.g. when the dataset doesn't have a status aspect), the
# removed field is simply not present in the ElasticSearch document. Ideally this
# would be a "removed" : "false" filter, but that doesn't work. Instead, we need to
# use a negated filter.
andFilters.append(
{
"field": "removed",
"values": ["true"],
"condition": "EQUAL",
"negated": True,
}
)
elif status == RemovedStatusFilter.ONLY_SOFT_DELETED:
andFilters.append(
{
"field": "removed",
"values": ["true"],
"condition": "EQUAL",
}
)
elif status == RemovedStatusFilter.ALL:
# We don't need to add a filter for this case.
pass
else:
raise ValueError(f"Invalid status filter: {status}")
orFilters: List[Dict[str, List[FilterRule]]] = [{"and": andFilters}]
# Env filter.
if env:
# The env filter is a bit more tricky since it's not always stored
# in the same place in ElasticSearch.
envOrConditions: List[FilterRule] = [
# For most entity types, we look at the origin field.
{
"field": "origin",
"value": env,
"condition": "EQUAL",
},
# For containers, we look at the customProperties field.
# For any containers created after https://github.com/datahub-project/datahub/pull/8027,
# we look for the "env" property. Otherwise, we use the "instance" property.
{
"field": "customProperties",
"value": f"env={env}",
},
{
"field": "customProperties",
"value": f"instance={env}",
},
# Note that not all entity types have an env (e.g. dashboards / charts).
# If the env filter is specified, these will be excluded.
]
# This matches ALL of the andFilters and at least one of the envOrConditions.
orFilters = [
{"and": andFilters["and"] + [extraCondition]}
for extraCondition in envOrConditions
for andFilters in orFilters
]
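# Editor's note: an illustrative sketch, not part of this change. With a platform
# filter, the default status filter, and env="PROD", the comprehension above expands
# the single conjunctive branch into three branches, one per env condition
# (platform_rule and status_rule are placeholder names for the rules built earlier):
#
#   orFilters = [
#       {"and": [platform_rule, status_rule, {"field": "origin", "value": "PROD", "condition": "EQUAL"}]},
#       {"and": [platform_rule, status_rule, {"field": "customProperties", "value": "env=PROD"}]},
#       {"and": [platform_rule, status_rule, {"field": "customProperties", "value": "instance=PROD"}]},
#   ]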
graphql_query = textwrap.dedent(
"""
query scrollUrnsWithFilters(
$types: [EntityType!],
$query: String!,
$orFilters: [AndFilterInput!],
$batchSize: Int!,
$scrollId: String) {
scrollAcrossEntities(input: {
query: "*",
query: $query,
count: $batchSize,
scrollId: $scrollId,
types: $types,
orFilters: $orFilters,
searchFlags: { skipHighlighting: true }
searchFlags: {
skipHighlighting: true
skipAggregates: true
}
}) {
nextScrollId
searchResults {
@ -596,23 +669,32 @@ class DataHubGraph(DatahubRestEmitter):
"""
)
# Set scroll_id to False to enter while loop
scroll_id: Union[Literal[False], str, None] = False
while scroll_id is not None:
first_iter = True
scroll_id: Optional[str] = None
while first_iter or scroll_id:
first_iter = False
variables = {
"types": types,
"query": query,
"orFilters": orFilters,
"batchSize": batch_size,
"scrollId": scroll_id,
}
response = self.execute_graphql(
query,
variables={
"types": types,
"orFilters": orFilters,
"batchSize": batch_size,
"scrollId": scroll_id or None,
},
graphql_query,
variables=variables,
)
data = response["scrollAcrossEntities"]
scroll_id = data["nextScrollId"]
for entry in data["searchResults"]:
yield entry["entity"]["urn"]
if scroll_id:
logger.debug(
f"Scrolling to next scrollAcrossEntities page: {scroll_id}"
)
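# Editor's note: a hedged usage sketch, not part of this commit. Assuming a reachable
# GMS instance (the server URL below is a placeholder), the new filter options can be
# exercised like so:
#
#   graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
#   for urn in graph.get_urns_by_filter(
#       entity_types=["dataset"],
#       platform="snowflake",
#       env="PROD",
#       status=RemovedStatusFilter.ONLY_SOFT_DELETED,
#   ):
#       print(urn)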
def get_latest_pipeline_checkpoint(
self, pipeline_name: str, platform: str
) -> Optional[Checkpoint["GenericCheckpointState"]]:
@ -663,13 +745,18 @@ class DataHubGraph(DatahubRestEmitter):
if variables:
body["variables"] = variables
logger.debug(
f"Executing graphql query: {query} with variables: {json.dumps(variables)}"
)
result = self._post_generic(url, body)
if result.get("errors"):
raise GraphError(f"Error executing graphql query: {result['errors']}")
return result["data"]
class RelationshipDirection(str, Enum):
class RelationshipDirection(str, enum.Enum):
# FIXME: Upgrade to enum.StrEnum when we drop support for Python 3.10
INCOMING = "INCOMING"
OUTGOING = "OUTGOING"
@ -707,22 +794,6 @@ class DataHubGraph(DatahubRestEmitter):
)
start = start + response.get("count", 0)
def soft_delete_urn(
self,
urn: str,
run_id: str = "soft-delete-urns",
) -> None:
timestamp = int(time.time() * 1000)
self.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=StatusClass(removed=True),
systemMetadata=SystemMetadataClass(
runId=run_id, lastObserved=timestamp
),
)
)
def exists(self, entity_urn: str) -> bool:
entity_urn_parsed: Urn = Urn.create_from_string(entity_urn)
try:
@ -740,6 +811,143 @@ class DataHubGraph(DatahubRestEmitter):
)
raise
def soft_delete_entity(
self,
urn: str,
run_id: str = "__datahub-graph-client",
deletion_timestamp: Optional[int] = None,
) -> None:
"""Soft-delete an entity by urn.
Args:
urn: The urn of the entity to soft-delete.
"""
assert urn
deletion_timestamp = deletion_timestamp or int(time.time() * 1000)
self.emit_mcp(
MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=StatusClass(removed=True),
systemMetadata=SystemMetadataClass(
runId=run_id, lastObserved=deletion_timestamp
),
)
)
def hard_delete_entity(
self,
urn: str,
) -> Tuple[int, int]:
"""Hard delete an entity by urn.
Args:
urn: The urn of the entity to hard delete.
Returns:
A tuple of (rows_affected, timeseries_rows_affected).
"""
assert urn
payload_obj: Dict = {"urn": urn}
summary = self._post_generic(
f"{self._gms_server}/entities?action=delete", payload_obj
).get("value", {})
rows_affected: int = summary.get("rows", 0)
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
return rows_affected, timeseries_rows_affected
def delete_entity(self, urn: str, hard: bool = False) -> None:
"""Delete an entity by urn.
Args:
urn: The urn of the entity to delete.
hard: Whether to hard delete the entity. If False (default), the entity will be soft deleted.
"""
if hard:
rows_affected, timeseries_rows_affected = self.hard_delete_entity(urn)
logger.debug(
f"Hard deleted entity {urn} with {rows_affected} rows affected and {timeseries_rows_affected} timeseries rows affected"
)
else:
self.soft_delete_entity(urn)
logger.debug(f"Soft deleted entity {urn}")
# TODO: Create hard_delete_aspect once we support that in GMS.
def hard_delete_timeseries_aspect(
self,
urn: str,
aspect_name: str,
start_time: Optional[datetime],
end_time: Optional[datetime],
) -> int:
"""Hard delete timeseries aspects of an entity.
Args:
urn: The urn of the entity.
aspect_name: The name of the timeseries aspect to delete.
start_time: The start time of the timeseries data to delete. If not specified, defaults to the beginning of time.
end_time: The end time of the timeseries data to delete. If not specified, defaults to the end of time.
Returns:
The number of timeseries rows affected.
"""
assert urn
assert aspect_name in TIMESERIES_ASPECT_MAP, "must be a timeseries aspect"
payload_obj: Dict = {
"urn": urn,
"aspectName": aspect_name,
}
if start_time:
payload_obj["startTimeMillis"] = int(start_time.timestamp() * 1000)
if end_time:
payload_obj["endTimeMillis"] = int(end_time.timestamp() * 1000)
summary = self._post_generic(
f"{self._gms_server}/entities?action=delete", payload_obj
).get("value", {})
timeseries_rows_affected: int = summary.get("timeseriesRows", 0)
return timeseries_rows_affected
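# Editor's note: a hedged usage sketch, not part of this commit. It deletes
# datasetProfile rows older than a cutoff; the urn and cutoff are example values.
#
#   from datetime import datetime, timezone
#   graph.hard_delete_timeseries_aspect(
#       urn="urn:li:dataset:(urn:li:dataPlatform:test_platform,example.table,PROD)",
#       aspect_name="datasetProfile",
#       start_time=None,
#       end_time=datetime(2023, 1, 1, tzinfo=timezone.utc),
#   )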
def delete_references_to_urn(
self, urn: str, dry_run: bool = False
) -> Tuple[int, List[Dict]]:
"""Delete references to a given entity.
This is useful for cleaning up references to an entity that is about to be deleted.
For example, when deleting a tag, you might use this to remove that tag from all other
entities that reference it.
This does not delete the entity itself.
Args:
urn: The urn of the entity to delete references to.
dry_run: If True, do not actually delete the references, just return the count of
references and the list of related aspects.
Returns:
A tuple of (reference_count, related_aspects), where related_aspects is a sample of the aspects that reference the entity.
"""
assert urn
payload_obj = {"urn": urn, "dryRun": dry_run}
response = self._post_generic(
f"{self._gms_server}/entities?action=deleteReferences", payload_obj
).get("value", {})
reference_count = response.get("total", 0)
related_aspects = response.get("relatedAspects", [])
return reference_count, related_aspects
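# Editor's note: an illustrative sketch, not part of this commit, mirroring the
# smoke test later in this diff: count references to a tag with a dry run, then
# remove them.
#
#   tag_urn = "urn:li:tag:NeedsDocs"
#   count, sample = graph.delete_references_to_urn(tag_urn, dry_run=True)
#   if count:
#       graph.delete_references_to_urn(tag_urn, dry_run=False)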
def get_default_graph() -> DataHubGraph:
(url, token) = get_url_and_token()

View File

@ -0,0 +1,51 @@
from datetime import datetime, timezone
import freezegun
import pytest
from datahub.configuration.datetimes import parse_user_datetime
# FIXME: Ideally we'd use tz_offset here to test this code in a non-UTC timezone.
# However, freezegun has a long-standing bug that prevents this from working:
# https://github.com/spulec/freezegun/issues/348.
@freezegun.freeze_time("2021-09-01 10:02:03")
def test_user_time_parser():
# Absolute times.
assert parse_user_datetime("2022-01-01 01:02:03 UTC") == datetime(
2022, 1, 1, 1, 2, 3, tzinfo=timezone.utc
)
assert parse_user_datetime("2022-01-01 01:02:03 -02:00") == datetime(
2022, 1, 1, 3, 2, 3, tzinfo=timezone.utc
)
# Times with no timezone are assumed to be in UTC.
assert parse_user_datetime("2022-01-01 01:02:03") == datetime(
2022, 1, 1, 1, 2, 3, tzinfo=timezone.utc
)
assert parse_user_datetime("2022-02-03") == datetime(
2022, 2, 3, tzinfo=timezone.utc
)
# Timestamps.
assert parse_user_datetime("1630440123") == datetime(
2021, 8, 31, 20, 2, 3, tzinfo=timezone.utc
)
assert parse_user_datetime("1630440123837.018") == datetime(
2021, 8, 31, 20, 2, 3, 837018, tzinfo=timezone.utc
)
# Relative times.
assert parse_user_datetime("10m") == datetime(
2021, 9, 1, 10, 12, 3, tzinfo=timezone.utc
)
assert parse_user_datetime("+ 1 day") == datetime(
2021, 9, 2, 10, 2, 3, tzinfo=timezone.utc
)
assert parse_user_datetime("-2 days") == datetime(
2021, 8, 30, 10, 2, 3, tzinfo=timezone.utc
)
# Invalid inputs.
with pytest.raises(ValueError):
parse_user_datetime("invalid")

View File

@ -1,4 +1,5 @@
import json
import logging
import tempfile
import time
import sys
@ -16,6 +17,8 @@ from tests.aspect_generators.timeseries.dataset_profile_gen import \
from tests.utils import get_strftime_from_timestamp_millis
import requests_wrapper as requests
logger = logging.getLogger(__name__)
test_aspect_name: str = "datasetProfile"
test_dataset_urn: str = builder.make_dataset_urn_with_platform_instance(
"test_platform",
@ -79,6 +82,9 @@ def datahub_delete(params: List[str]) -> None:
args.extend(params)
args.append("--hard")
delete_result: Result = runner.invoke(datahub, args, input="y\ny\n")
logger.info(delete_result.stdout)
if delete_result.stderr:
logger.error(delete_result.stderr)
assert delete_result.exit_code == 0
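# Editor's note: a hedged sketch of how this helper might be invoked from a test,
# using the module-level fixtures above; the exact flag set passed by the real tests
# may differ.
#
#   datahub_delete(["--urn", test_dataset_urn, "--aspect", test_aspect_name])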

View File

@ -4,8 +4,12 @@ import pytest
from time import sleep
from datahub.cli.cli_utils import get_aspects_for_entity
from datahub.cli.ingest_cli import get_session_and_host
from datahub.cli.delete_cli import delete_references
from tests.utils import ingest_file_via_rest, wait_for_healthcheck_util, delete_urns_from_file
from tests.utils import (
ingest_file_via_rest,
wait_for_healthcheck_util,
delete_urns_from_file,
get_datahub_graph,
)
from requests_wrapper import ELASTICSEARCH_REFRESH_INTERVAL_SECONDS
# Disable telemetry
@ -37,24 +41,41 @@ def test_setup():
session, gms_host = get_session_and_host()
try:
assert "browsePaths" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
assert "editableDatasetProperties" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False)
assert "browsePaths" not in get_aspects_for_entity(
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
)
assert "editableDatasetProperties" not in get_aspects_for_entity(
entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False
)
except Exception as e:
delete_urns_from_file("tests/delete/cli_test_data.json")
raise e
ingested_dataset_run_id = ingest_file_via_rest("tests/delete/cli_test_data.json").config.run_id
ingested_dataset_run_id = ingest_file_via_rest(
"tests/delete/cli_test_data.json"
).config.run_id
assert "browsePaths" in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
assert "browsePaths" in get_aspects_for_entity(
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
)
yield
rollback_url = f"{gms_host}/runs?action=rollback"
session.post(rollback_url, data=json.dumps({"runId": ingested_dataset_run_id, "dryRun": False, "hardDelete": True}))
session.post(
rollback_url,
data=json.dumps(
{"runId": ingested_dataset_run_id, "dryRun": False, "hardDelete": True}
),
)
sleep(ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
assert "browsePaths" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["browsePaths"], typed=False)
assert "editableDatasetProperties" not in get_aspects_for_entity(entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False)
assert "browsePaths" not in get_aspects_for_entity(
entity_urn=dataset_urn, aspects=["browsePaths"], typed=False
)
assert "editableDatasetProperties" not in get_aspects_for_entity(
entity_urn=dataset_urn, aspects=["editableDatasetProperties"], typed=False
)
@pytest.mark.dependency()
@ -66,20 +87,24 @@ def test_delete_reference(test_setup, depends=["test_healthchecks"]):
dataset_urn = f"urn:li:dataset:({platform},{dataset_name},{env})"
tag_urn = "urn:li:tag:NeedsDocs"
session, gms_host = get_session_and_host()
graph = get_datahub_graph()
# Validate that the ingested tag is being referenced by the dataset
references_count, related_aspects = delete_references(tag_urn, dry_run=True, cached_session_host=(session, gms_host))
references_count, related_aspects = graph.delete_references_to_urn(
tag_urn, dry_run=True
)
print("reference count: " + str(references_count))
print(related_aspects)
assert references_count == 1
assert related_aspects[0]['entity'] == dataset_urn
assert related_aspects[0]["entity"] == dataset_urn
# Delete references to the tag
delete_references(tag_urn, dry_run=False, cached_session_host=(session, gms_host))
graph.delete_references_to_urn(tag_urn, dry_run=False)
sleep(ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
# Validate that references no longer exist
references_count, related_aspects = delete_references(tag_urn, dry_run=True, cached_session_host=(session, gms_host))
references_count, related_aspects = graph.delete_references_to_urn(
tag_urn, dry_run=True
)
assert references_count == 0

View File

@ -1,10 +1,9 @@
import json
from time import sleep
from datahub.cli import delete_cli
from datahub.cli import timeline_cli
from datahub.cli.cli_utils import guess_entity_type, post_entity
from tests.utils import ingest_file_via_rest
from tests.utils import ingest_file_via_rest, get_datahub_graph
from requests_wrapper import ELASTICSEARCH_REFRESH_INTERVAL_SECONDS
@ -22,7 +21,7 @@ def test_all():
res_data = timeline_cli.get_timeline(dataset_urn, ["TAG", "DOCUMENTATION", "TECHNICAL_SCHEMA", "GLOSSARY_TERM",
"OWNER"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
@ -49,7 +48,7 @@ def test_schema():
res_data = timeline_cli.get_timeline(dataset_urn, ["TECHNICAL_SCHEMA"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
assert res_data[0]["semVerChange"] == "MINOR"
@ -75,7 +74,7 @@ def test_glossary():
res_data = timeline_cli.get_timeline(dataset_urn, ["GLOSSARY_TERM"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
assert res_data[0]["semVerChange"] == "MINOR"
@ -101,7 +100,7 @@ def test_documentation():
res_data = timeline_cli.get_timeline(dataset_urn, ["DOCUMENTATION"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
assert res_data[0]["semVerChange"] == "MINOR"
@ -127,7 +126,7 @@ def test_tags():
res_data = timeline_cli.get_timeline(dataset_urn, ["TAG"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
assert res_data[0]["semVerChange"] == "MINOR"
@ -153,7 +152,7 @@ def test_ownership():
res_data = timeline_cli.get_timeline(dataset_urn, ["OWNER"], None, None, False)
delete_cli.delete_one_urn_cmd(urn=dataset_urn)
get_datahub_graph().hard_delete_entity(urn=dataset_urn)
assert res_data
assert len(res_data) == 3
assert res_data[0]["semVerChange"] == "MINOR"

View File

@ -1,6 +1,7 @@
import functools
import json
import os
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
import subprocess
import time
from typing import Any, Dict, List, Tuple
@ -11,11 +12,13 @@ import requests_wrapper as requests
import logging
from datahub.cli import cli_utils
from datahub.cli.cli_utils import get_system_auth
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.ingestion.run.pipeline import Pipeline
TIME: int = 1581407189000
logger = logging.getLogger(__name__)
def get_frontend_session():
session = requests.Session()
@ -126,14 +129,13 @@ def ingest_file_via_rest(filename: str) -> Pipeline:
return pipeline
def delete_urn(urn: str) -> None:
payload_obj = {"urn": urn}
@functools.lru_cache(maxsize=1)
def get_datahub_graph() -> DataHubGraph:
return DataHubGraph(DatahubClientConfig(server=get_gms_url()))
cli_utils.post_delete_endpoint_with_session_and_url(
requests.Session(),
get_gms_url() + "/entities?action=delete",
payload_obj,
)
def delete_urn(urn: str) -> None:
get_datahub_graph().hard_delete_entity(urn)
def delete_urns(urns: List[str]) -> None:
@ -172,15 +174,18 @@ def delete_urns_from_file(filename: str, shared_data: bool = False) -> None:
# Deletes require 60 seconds when run between tests operating on common data, otherwise standard sync wait
if shared_data:
wait_for_writes_to_sync()
# sleep(60)
# sleep(60)
else:
wait_for_writes_to_sync()
# sleep(requests.ELASTICSEARCH_REFRESH_INTERVAL_SECONDS)
# Fixed now value
NOW: datetime = datetime.now()
def get_timestampmillis_at_start_of_day(relative_day_num: int) -> int:
"""
Returns the time in milliseconds from epoch at the start of the day
@ -201,7 +206,7 @@ def get_timestampmillis_at_start_of_day(relative_day_num: int) -> int:
def get_strftime_from_timestamp_millis(ts_millis: int) -> str:
return datetime.fromtimestamp(ts_millis / 1000).strftime("%Y-%m-%d %H:%M:%S")
return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).isoformat()
def create_datahub_step_state_aspect(
@ -242,19 +247,22 @@ def wait_for_writes_to_sync(max_timeout_in_sec: int = 120) -> None:
# get offsets
lag_zero = False
while not lag_zero and (time.time() - start_time) < max_timeout_in_sec:
time.sleep(1) # micro-sleep
time.sleep(1) # micro-sleep
completed_process = subprocess.run(
"docker exec broker /bin/kafka-consumer-groups --bootstrap-server broker:29092 --group generic-mae-consumer-job-client --describe | grep -v LAG | awk '{print $6}'",
capture_output=True,
shell=True,
text=True)
text=True,
)
result = str(completed_process.stdout)
lines = result.splitlines()
lag_values = [int(l) for l in lines if l != ""]
maximum_lag = max(lag_values)
if maximum_lag == 0:
lag_zero = True
if not lag_zero:
logger.warning(f"Exiting early from waiting for elastic to catch up due to a timeout. Current lag is {lag_values}")
logger.warning(
f"Exiting early from waiting for elastic to catch up due to a timeout. Current lag is {lag_values}"
)