datahub/docs/api/tutorials/creating-datasets.md
Hyejin Yoon ea4036c1c8
feat: enriching guide on creating dataset (#7777)
Co-authored-by: Hyejin Yoon <hyejin.yoon@acryl.io>
Co-authored-by: socar-dini <dini@socar.kr>
2023-04-19 12:58:03 +09:00


Creating Datasets

Why Would You Create Datasets?

The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data that typically appear as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data stored as Files or Folders in data lake systems (e.g. S3, ADLS). For more information about datasets, refer to Dataset.

Goal Of This Guide

This guide will show you how to create a dataset named realestate_db.sales with three columns.

Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. For detailed steps, please refer to the DataHub Quickstart Guide.

Create Datasets With GraphQL (Not Supported)

🚫 Creating a dataset via GraphQL is currently not supported. Please check out the API feature comparison table for more information.

Create Datasets With Python SDK

The following code creates a Hive dataset named realestate_db.sales with three fields and a URN of urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD):

{{ inline /metadata-ingestion/examples/library/dataset_schema.py show_path_as_comment }}

Note that the name argument of make_dataset_urn sets the display name of the dataset.
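As a rough illustration of how the URN shown above is assembled, the following library-free sketch rebuilds the same string from its platform, name, and environment parts. The helper names `platform_urn_sketch` and `dataset_urn_sketch` are hypothetical and are not part of the DataHub SDK. In real code you would call make_dataset_urn instead. The default environment of PROD is assumed from the URN given in this guide.

```python
def platform_urn_sketch(platform: str) -> str:
    """Hypothetical helper: wrap a platform name such as 'hive' in a platform URN."""
    prefix = "urn:li:dataPlatform:"
    # Leave the value alone if it is already a platform URN.
    if platform.startswith(prefix):
        return platform
    return f"{prefix}{platform}"


def dataset_urn_sketch(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: compose a dataset URN from platform, name, and environment."""
    return f"urn:li:dataset:({platform_urn_sketch(platform)},{name},{env})"


urn = dataset_urn_sketch("hive", "realestate_db.sales")
print(urn)
# urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD)
```

Because the name is embedded in the URN, changing it yields a different dataset rather than renaming an existing one.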

After creating the dataset, you can perform various manipulations, such as adding lineage and custom properties. Here are some steps to start with, but for more detailed guidance, please refer to the What's Next section.

Add Lineage

The following code creates lineage from fct_users_deleted (upstream) to realestate_db.sales (downstream):

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Construct a lineage object.
lineage_mce = builder.make_lineage_mce(
    [
        builder.make_dataset_urn("hive", "fct_users_deleted"), # Upstream
    ],
    builder.make_dataset_urn("hive", "realestate_db.sales"), # Downstream
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mce(lineage_mce)

For more information on adding lineage, please refer to how to add lineage on a dataset using the Python SDK.
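To make the direction of the lineage edge explicit, here is a library-free sketch of the data the lineage code above describes: the lineage record is attached to the downstream dataset and lists its upstream URNs. The class and function names (`UpstreamSketch`, `LineageSketch`, `lineage_sketch`) are hypothetical, and the "TRANSFORMED" type is an assumption about the SDK's default. This is not the SDK's own data model, only an illustration of its shape.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UpstreamSketch:
    """Hypothetical stand-in for one upstream entry of a lineage record."""
    dataset: str                 # URN of the upstream dataset
    type: str = "TRANSFORMED"    # assumed default lineage type


@dataclass
class LineageSketch:
    """Hypothetical stand-in for lineage attached to a downstream dataset."""
    downstream_urn: str
    upstreams: List[UpstreamSketch] = field(default_factory=list)


def lineage_sketch(upstream_urns: List[str], downstream_urn: str) -> LineageSketch:
    # One upstream entry per URN, all hanging off the single downstream dataset.
    return LineageSketch(
        downstream_urn=downstream_urn,
        upstreams=[UpstreamSketch(dataset=u) for u in upstream_urns],
    )


lineage = lineage_sketch(
    ["urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"],
    "urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD)",
)
print(lineage.downstream_urn, "<-", [u.dataset for u in lineage.upstreams])
```

Note the argument order: the list of upstreams comes first and the single downstream comes second, matching the call in the example above.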

Add Custom Properties

You can also set custom properties using the following code:

{{ inline /metadata-ingestion/examples/library/dataset_add_properties.py show_path_as_comment }}

For more information on adding custom properties, please refer to Modifying Custom Properties on Datasets.

We're using the MetadataChangeProposalWrapper to change entities in this example. For more information about the MetadataChangeProposal, please refer to MetadataChangeProposal & MetadataChangeLog Events.

Expected Outcomes

You can now see that the realestate_db.sales dataset has been created.

dataset-created

What's Next?

Now that you have created a dataset, how about enriching it? Here are some guides that you can check out.