# Creating Datasets

## Why Would You Create Datasets?

The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift, etc.), Streams in a stream-processing environment (Kafka, Pulsar, etc.), or bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.).
For more information about datasets, refer to [Dataset](/docs/generated/metamodel/entities/dataset.md).

### Goal Of This Guide

This guide will show you how to create a dataset named `realestate_db.sales` with three columns.

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to [DataHub Quickstart Guide](/docs/quickstart.md).

## Create Datasets With GraphQL (Not Supported)

> 🚫 Creating a dataset via `graphql` is currently not supported.
> Please check out the [API feature comparison table](/docs/api/datahub-apis.md#datahub-api-comparison) for more information.

## Create Datasets With Python SDK

The following code creates a Hive dataset named `realestate_db.sales` with three fields and a URN of `urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD)`:

```python
{{ inline /metadata-ingestion/examples/library/dataset_schema.py show_path_as_comment }}
```

Note that the `name` argument of `make_dataset_urn` sets the display name of the dataset.
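To make the URN layout above concrete, here is a minimal sketch that assembles the same string by hand. `build_dataset_urn` is a hypothetical helper written only for illustration, not part of the DataHub SDK, and its `PROD` default is an assumption taken from the example URN in this guide. In real code, prefer `make_dataset_urn`.

```python
def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: assemble a dataset URN string by hand.

    The layout follows the example URN in this guide:
    urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Platform and dataset name match the dataset created in this guide.
urn = build_dataset_urn("hive", "realestate_db.sales")
print(urn)
```

The second component of the tuple is the dataset name, which is the value the note above refers to.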

After creating the dataset, you can perform various manipulations, such as adding lineage and custom properties.
Here are some steps to start with, but for more detailed guidance, please refer to the [What's Next](/docs/api/tutorials/creating-datasets.md#whats-next) section.

### Add Lineage

The following code creates a lineage from `fct_users_deleted` to `realestate_db.sales`:

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Construct a lineage object.
lineage_mce = builder.make_lineage_mce(
    [
        builder.make_dataset_urn("hive", "fct_users_deleted"), # Upstream
    ],
    builder.make_dataset_urn("hive", "realestate_db.sales"), # Downstream
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mce(lineage_mce)
```

For more information on adding lineage, please refer to [how to add lineage on a dataset using Python SDK](/docs/api/tutorials/adding-lineage.md#add-lineage-with-python-sdk).

### Add Custom Properties

You can also set custom properties using the following code:

```python
{{ inline /metadata-ingestion/examples/library/dataset_add_properties.py show_path_as_comment }}
```

For more information on adding custom properties, please refer to [Modifying Custom Properties on Datasets](/docs/api/tutorials/modifying-dataset-custom-properties.md).

We're using the `MetadataChangeProposalWrapper` to change entities in this example.
For more information about the `MetadataChangeProposal`, please refer to [MetadataChangeProposal & MetadataChangeLog Events](/docs/advanced/mcp-mcl.md).

## Expected Outcomes

You can now see that the `realestate_db.sales` dataset has been created.

![dataset-created](../../imgs/apis/tutorials/dataset-created.png)

## What's Next?

Now that you've created a dataset, how about enriching it? Here are some guides that you can check out.

- [how to add a tag on a dataset](/docs/api/tutorials/adding-tags.md).
- [how to add a term on a dataset](/docs/api/tutorials/adding-terms.md).
- [how to add an owner on a dataset](/docs/api/tutorials/adding-ownerships.md).
- [how to add lineage on a dataset](/docs/api/tutorials/adding-lineage.md).