# Creating Datasets
## Why Would You Create Datasets?
The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data: typically Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS).
For more information about datasets, refer to [Dataset](/docs/generated/metamodel/entities/dataset.md).
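Every dataset in DataHub is addressed by a URN that encodes the data platform, the dataset name, and the environment. As a quick, minimal sketch (using the Python SDK's `make_dataset_urn` helper and the example dataset from this guide), the URN looks like this:

```python
from datahub.emitter.mce_builder import make_dataset_urn

# A dataset URN encodes the platform, the dataset name, and the environment (fabric)
urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD)
```

This is the same URN that the Python SDK example further below emits metadata against.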
### Goal Of This Guide
This guide will show you how to create a dataset named `realestate_db.sales` with three columns.
## Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).
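Before emitting any metadata, you can sanity-check that the Python SDK can reach your local DataHub instance. This is a minimal sketch assuming the default quickstart GMS address of `http://localhost:8080` and that your installed SDK version provides `DatahubRestEmitter.test_connection()`:

```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Default GMS address used by the DataHub quickstart deployment
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Raises an exception if the DataHub server cannot be reached
emitter.test_connection()
print("Connected to DataHub")
```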
## Create Datasets With GraphQL (Not Supported)
> 🚫 Creating a dataset via `graphql` is currently not supported.
> Please check out the [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information.
## Create Datasets With Python SDK
The following code creates a Hive dataset named `realestate_db.sales` with three fields.
You can refer to the complete code in [dataset_schema.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_schema.py).
```python
# inlined from metadata-ingestion/examples/library/dataset_schema.py
# Imports for urn construction utility methods
from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DateTypeClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD"),
    aspect=SchemaMetadataClass(
        schemaName="customer",  # not used
        platform=make_data_platform_urn("hive"),  # important <- platform must be an urn
        version=0,  # when the source system has a notion of versioning of schemas, insert this in, otherwise leave as 0
        hash="",  # when the source system has a notion of unique schemas identified via hash, include a hash, else leave it as empty string
        platformSchema=OtherSchemaClass(rawSchema="__insert raw schema here__"),
        lastModified=AuditStampClass(
            time=1640692800000, actor="urn:li:corpuser:ingestion"
        ),
        fields=[
            SchemaFieldClass(
                fieldPath="address.zipcode",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(50)",  # use this to provide the type of the field in the source system's vernacular
                description="This is the zipcode of the address. Specified using extended form and limited to addresses in the United States",
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
            SchemaFieldClass(
                fieldPath="address.street",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(100)",
                description="Street corresponding to the address",
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
            SchemaFieldClass(
                fieldPath="last_sold_date",
                type=SchemaFieldDataTypeClass(type=DateTypeClass()),
                nativeDataType="Date",
                description="Date of the last sale date for this property",
                created=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
        ],
    ),
)

# Create rest emitter
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)
```
We're using the `MetadataChangeProposalWrapper` to change entities in this example.
For more information about the `MetadataChangeProposal`, please refer to [MetadataChangeProposal & MetadataChangeLog Events](/docs/advanced/mcp-mcl.md).
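Each `MetadataChangeProposalWrapper` carries exactly one aspect for one entity URN, so the same pattern works for any other aspect you want to attach to the dataset. As a hypothetical follow-up (not part of the original example), the sketch below emits a `DatasetProperties` aspect with a description onto the same dataset:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# One MCP carries exactly one aspect for one entity URN
properties_event = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD"),
    aspect=DatasetPropertiesClass(
        description="Table of real estate sales records",  # example description
        customProperties={"source_team": "real-estate"},  # example custom property
    ),
)

rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(properties_event)
```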
## Expected Outcomes
You can now see that the `realestate_db.sales` dataset has been created.
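You can also verify the result programmatically. The sketch below (assuming the default quickstart address) reads the `schemaMetadata` aspect back with the SDK's `DataHubGraph` client and prints the field paths that were just emitted:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import SchemaMetadataClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

# Fetch the schemaMetadata aspect we emitted earlier; returns None if it does not exist
schema_metadata = graph.get_aspect(entity_urn=dataset_urn, aspect_type=SchemaMetadataClass)

if schema_metadata is not None:
    for field in schema_metadata.fields:
        print(field.fieldPath)
```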
## What's Next?
Now that you have created a dataset, how about enriching it? Here are some guides you can check out.
- [How to add a tag to a dataset](/docs/api/tutorials/adding-tags.md)
- [How to add a term to a dataset](/docs/api/tutorials/adding-terms.md)
- [How to add an owner to a dataset](/docs/api/tutorials/adding-ownerships.md)
- [How to add lineage to a dataset](/docs/api/tutorials/adding-lineage.md)