mirror of
https://github.com/datahub-project/datahub.git
synced 2025-06-27 05:03:31 +00:00
300 lines
10 KiB
Markdown
# DataHub Dataset Command

The `dataset` command allows you to interact with Dataset entities in DataHub. This includes creating, updating, retrieving, validating, and synchronizing Dataset metadata.

## Commands

### sync

Synchronize Dataset metadata between YAML files and DataHub.

```shell
datahub dataset sync -f PATH_TO_YAML_FILE --to-datahub|--from-datahub
```

**Options:**

- `-f, --file` - Path to the YAML file (required)
- `--to-datahub` - Push metadata from YAML file to DataHub
- `--from-datahub` - Pull metadata from DataHub to YAML file

**Example:**

```shell
# Push to DataHub
datahub dataset sync -f dataset.yaml --to-datahub

# Pull from DataHub
datahub dataset sync -f dataset.yaml --from-datahub
```

The `sync` command offers bidirectional synchronization, allowing you to keep your local YAML files in sync with the DataHub platform. The `upsert` command actually uses `sync` with the `--to-datahub` flag internally.

For details on the supported YAML format, see the [Dataset YAML Format](#dataset-yaml-format) section.

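For scripted use, the two sync directions can be wrapped in a small helper. The following is a minimal sketch, assuming the `datahub` CLI is installed and on `PATH`; `build_sync_command` and `run_sync` are hypothetical helper names, not part of DataHub, and they use only the flags documented above.

```python
import subprocess


def build_sync_command(yaml_path: str, to_datahub: bool) -> list[str]:
    """Build the argv for `datahub dataset sync` (hypothetical helper).

    Exactly one direction flag is passed, mirroring the usage line above.
    """
    direction = "--to-datahub" if to_datahub else "--from-datahub"
    return ["datahub", "dataset", "sync", "-f", yaml_path, direction]


def run_sync(yaml_path: str, to_datahub: bool) -> None:
    """Run the sync; assumes the `datahub` executable is available."""
    # check=True raises CalledProcessError if the CLI exits non-zero.
    subprocess.run(build_sync_command(yaml_path, to_datahub), check=True)
```

Building the argument list separately from running it keeps the direction choice explicit and lets the command be inspected or logged before anything is pushed to or pulled from DataHub.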
### file

Operate on a Dataset YAML file for validation or linting.

```shell
datahub dataset file [--lintCheck] [--lintFix] PATH_TO_YAML_FILE
```

**Options:**

- `--lintCheck` - Check the YAML file for formatting issues (optional)
- `--lintFix` - Fix formatting issues in the YAML file (optional)

**Example:**

```shell
# Check for linting issues
datahub dataset file --lintCheck dataset.yaml

# Fix linting issues
datahub dataset file --lintFix dataset.yaml
```

This command helps maintain consistent formatting of your Dataset YAML files. For more information on the expected format, refer to the [Dataset YAML Format](#dataset-yaml-format) section.

### upsert

Create or update Dataset metadata in DataHub.

```shell
datahub dataset upsert -f PATH_TO_YAML_FILE
```

**Options:**

- `-f, --file` - Path to the YAML file containing Dataset metadata (required)

**Example:**

```shell
datahub dataset upsert -f dataset.yaml
```

This command will parse the YAML file, validate that any entity references exist in DataHub, and then emit the corresponding metadata change proposals to update or create the Dataset.

For details on the required structure of your YAML file, see the [Dataset YAML Format](#dataset-yaml-format) section.

### get

Retrieve Dataset metadata from DataHub and optionally write it to a file.

```shell
datahub dataset get --urn DATASET_URN [--to-file OUTPUT_FILE]
```

**Options:**

- `--urn` - The Dataset URN to retrieve (required)
- `--to-file` - Path to write the Dataset metadata as YAML (optional)

**Example:**

```shell
datahub dataset get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --to-file my_dataset.yaml
```

If the URN does not start with `urn:li:dataset:`, it will be automatically prefixed.

The output file will be formatted according to the [Dataset YAML Format](#dataset-yaml-format) section.

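The automatic prefixing described above can be pictured as follows. This is a minimal sketch of the stated rule only, not DataHub's actual code; `normalize_dataset_urn` is a hypothetical name used for illustration.

```python
DATASET_URN_PREFIX = "urn:li:dataset:"


def normalize_dataset_urn(urn: str) -> str:
    """Prepend the dataset URN prefix when it is missing (hypothetical helper)."""
    if urn.startswith(DATASET_URN_PREFIX):
        return urn
    return DATASET_URN_PREFIX + urn
```

Under this rule, passing `(urn:li:dataPlatform:hive,example_table,PROD)` and passing the full `urn:li:dataset:(...)` form resolve to the same Dataset.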
### add_sibling

Add sibling relationships between Datasets.

```shell
datahub dataset add_sibling --urn PRIMARY_URN --sibling-urns SECONDARY_URN [--sibling-urns ANOTHER_URN ...]
```

**Options:**

- `--urn` - URN of the primary Dataset (required)
- `--sibling-urns` - URNs of secondary sibling Datasets (required, multiple allowed)

**Example:**

```shell
datahub dataset add_sibling --urn "urn:li:dataset:(urn:li:dataPlatform:hive,example_table,PROD)" --sibling-urns "urn:li:dataset:(urn:li:dataPlatform:snowflake,example_table,PROD)"
```

Siblings are semantically equivalent datasets, typically representing the same data across different platforms or environments.

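Because `--sibling-urns` is repeated once per sibling, a script that links several Datasets needs to expand a list into repeated options. A minimal sketch, assuming the usage line above; `build_add_sibling_command` is a hypothetical helper, not part of DataHub.

```python
def build_add_sibling_command(primary_urn: str, sibling_urns: list[str]) -> list[str]:
    """Build the argv for `datahub dataset add_sibling` (hypothetical helper)."""
    if not sibling_urns:
        # The option is documented as required, so refuse an empty list.
        raise ValueError("at least one sibling URN is required")
    command = ["datahub", "dataset", "add_sibling", "--urn", primary_urn]
    for sibling in sibling_urns:
        # Repeat the option once per sibling, as in the usage line.
        command += ["--sibling-urns", sibling]
    return command
```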
## Dataset YAML Format

The Dataset YAML file follows a structured format with various supported fields:

```yaml
# Basic identification (required)
id: "example_table" # Dataset identifier
platform: "hive" # Platform name
env: "PROD" # Environment (PROD by default)

# Metadata (optional)
name: "Example Table" # Display name (defaults to id if not specified)
description: "This is an example table"

# Schema definition (optional)
schema:
  fields:
    - id: "field1" # Field identifier
      type: "string" # Data type
      description: "First field" # Field description
      doc: "First field" # Alias for description
      nativeDataType: "VARCHAR" # Native platform type (defaults to type if not specified)
      nullable: false # Whether field can be null (default: false)
      label: "Field One" # Display label (optional business label for the field)
      isPartOfKey: true # Whether field is part of primary key
      isPartitioningKey: false # Whether field is a partitioning key
      jsonProps: { "customProp": "value" } # Custom JSON properties

    - id: "field2"
      type: "number"
      description: "Second field"
      nullable: true
      globalTags: ["PII", "Sensitive"]
      glossaryTerms: ["urn:li:glossaryTerm:Revenue"]
      structured_properties:
        property1: "value1"
        property2: 42
  file: example.schema.avsc # Optional schema file (required if defining tables with nested fields)

# Additional metadata (all optional)
properties: # Custom properties as key-value pairs
  origin: "external"
  pipeline: "etl_daily"

subtype: "View" # Dataset subtype
subtypes: ["View", "Materialized"] # Multiple subtypes (if only one, use subtype field instead)

downstreams: # Downstream Dataset URNs
  - "urn:li:dataset:(urn:li:dataPlatform:hive,downstream_table,PROD)"

tags: # Tags
  - "Tier1"
  - "Verified"

glossary_terms: # Associated glossary terms
  - "urn:li:glossaryTerm:Revenue"

owners: # Dataset owners
  - "jdoe" # Simple format (defaults to TECHNICAL_OWNER)
  - id: "alice" # Extended format with ownership type
    type: "BUSINESS_OWNER"

structured_properties: # Structured properties
  priority: "P1"
  cost_center: 123

external_url: "https://example.com/datasets/example_table"
```

You can also define multiple datasets in a single YAML file by using a list format:

```yaml
- id: "dataset1"
  platform: "hive"
  description: "First dataset"
  # other properties...

- id: "dataset2"
  platform: "snowflake"
  description: "Second dataset"
  # other properties...
```

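Since a file may hold either a single Dataset mapping or a list of them, any tooling that reads these files has to accept both shapes. The following is a minimal sketch, assuming the YAML has already been parsed into plain Python objects by a YAML library of your choice; `as_dataset_list` is a hypothetical helper and does not reflect how DataHub itself loads the file.

```python
from typing import Any


def as_dataset_list(parsed: Any) -> list[dict]:
    """Normalize parsed YAML content to a list of dataset mappings."""
    if isinstance(parsed, dict):
        # Single-dataset file: wrap the mapping in a list.
        return [parsed]
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        # Multi-dataset file: already in the list format.
        return parsed
    raise TypeError("expected a dataset mapping or a list of dataset mappings")
```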
### Schema Definition

You can define Dataset schema in two ways:

1. **Direct field definitions** as shown above

   > **Important limitation**: When using inline schema field definitions, only non-nested (flat) fields are currently supported. For nested or complex schemas, you must use the Avro file approach described below.

2. **Reference to an Avro schema file**:
   ```yaml
   schema:
     file: "path/to/schema.avsc"
   ```

Even when using the Avro file approach for the basic schema structure, you can still use the `fields` section to provide additional metadata like structured properties, tags, and glossary terms for your schema fields.

#### Schema Field Properties

The Schema Field object supports the following properties:

| Property                | Type    | Description                                                                   |
| ----------------------- | ------- | ----------------------------------------------------------------------------- |
| `id`                    | string  | Field identifier/path (required if `urn` not provided)                        |
| `urn`                   | string  | URN of the schema field (required if `id` not provided)                       |
| `type`                  | string  | Data type (one of the supported [Field Types](#field-types))                  |
| `nativeDataType`        | string  | Native data type in the source platform (defaults to `type` if not specified) |
| `description`           | string  | Field description                                                             |
| `doc`                   | string  | Alias for description                                                         |
| `nullable`              | boolean | Whether the field can be null (default: false)                                |
| `label`                 | string  | Display label for the field                                                   |
| `recursive`             | boolean | Whether the field is recursive (default: false)                               |
| `isPartOfKey`           | boolean | Whether the field is part of the primary key                                  |
| `isPartitioningKey`     | boolean | Whether the field is a partitioning key                                       |
| `jsonProps`             | object  | Custom JSON properties                                                        |
| `globalTags`            | array   | List of tags associated with the field                                        |
| `glossaryTerms`         | array   | List of glossary terms associated with the field                              |
| `structured_properties` | object  | Structured properties for the field                                           |

**Important Note on Schema Field Types**:
When specifying fields in the YAML file, you must follow an all-or-nothing approach with the `type` field:

- If you want the command to generate the schema for you, specify the `type` field for ALL fields.
- If you only want to add field-level metadata (like tags, glossary terms, or structured properties), do NOT specify the `type` field for ANY field.

Example of fields with only metadata (no types):

```yaml
schema:
  fields:
    - id: "field1" # Field identifier
      structured_properties:
        prop1: prop_value
    - id: "field2"
      structured_properties:
        prop1: prop_value
```

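The all-or-nothing rule for `type` can be checked before a file is submitted. This is a minimal sketch of the rule as stated above, assuming the `fields` list has already been parsed into Python dictionaries; `schema_mode` and its return labels are hypothetical and are not DataHub's own validation.

```python
def schema_mode(fields: list[dict]) -> str:
    """Classify a `fields` list under the all-or-nothing `type` rule.

    Returns "generate-schema" when every field has a type and
    "metadata-only" when none does; raises ValueError for a mix.
    """
    typed = [field for field in fields if "type" in field]
    if not typed:
        return "metadata-only"
    if len(typed) == len(fields):
        return "generate-schema"
    # Mixed case: report which fields lack a type.
    missing = [
        field.get("id", field.get("urn", "<unknown>"))
        for field in fields
        if "type" not in field
    ]
    raise ValueError(f"`type` must be set on all fields or none; missing on: {missing}")
```

For example, the metadata-only YAML above classifies as `"metadata-only"`, while adding `type` to `field1` alone would raise an error naming `field2`.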
### Ownership Types

When specifying owners, the following ownership types are supported:

- `TECHNICAL_OWNER` (default)
- `BUSINESS_OWNER`
- `DATA_STEWARD`

Custom ownership types can be specified using the URN format.

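The two owner formats from the YAML example (a bare string, or a mapping with `id` and `type`) can be brought to one shape as sketched below. This is an illustration only: `normalize_owner` is a hypothetical helper, and applying the `TECHNICAL_OWNER` default to a mapping that omits `type` is an assumption, since the documentation states that default only for the simple format.

```python
DEFAULT_OWNERSHIP_TYPE = "TECHNICAL_OWNER"


def normalize_owner(owner) -> dict:
    """Return an owner as a mapping with explicit `id` and `type` keys."""
    if isinstance(owner, str):
        # Simple format: just the owner id, default ownership type.
        return {"id": owner, "type": DEFAULT_OWNERSHIP_TYPE}
    # Extended format; falling back to the default type here is an assumption.
    return {"id": owner["id"], "type": owner.get("type", DEFAULT_OWNERSHIP_TYPE)}
```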
### Field Types

When defining schema fields, the following primitive types are supported:

- `string`
- `number`
- `int`
- `long`
- `float`
- `double`
- `boolean`
- `bytes`
- `fixed`

## Implementation Notes

- URNs are generated automatically if not provided, based on the platform, id, and env values
- The command performs validation to ensure referenced entities (like structured properties) exist
- When updating schema fields, changes are propagated correctly to maintain consistent metadata
- The Dataset object will check for existence of entity references and will skip datasets with missing references
- When using the `sync` command with `--from-datahub`, existing YAML files will be updated with metadata from DataHub while preserving comments and structure
- For structured properties, single values are simplified (not wrapped in lists) when appropriate
- Field paths are simplified for better readability
- When specifying field types, all fields must have type information or none of them should
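
Two of the notes above lend themselves to a short illustration: URN generation from `platform`, `id`, and `env`, and the unwrapping of single structured property values. The following is a minimal sketch, assuming the URN layout seen in this document's examples; `dataset_urn_from_yaml` and `simplify_property_value` are hypothetical names and are not DataHub functions.

```python
def dataset_urn_from_yaml(platform: str, dataset_id: str, env: str = "PROD") -> str:
    """Compose a Dataset URN in the layout used by the examples in this document."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{dataset_id},{env})"


def simplify_property_value(values: list):
    """Unwrap a single-element list; leave any other list unchanged."""
    return values[0] if len(values) == 1 else values
```

With the defaults from the YAML format section, `platform: "hive"` and `id: "example_table"` yield the same URN that the `get` example passes explicitly.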