
Co-authored-by: Hyejin Yoon <hyejin.yoon@acryl.io> Co-authored-by: socar-dini <dini@socar.kr>
Creating Datasets
Why Would You Create Datasets?
The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data, typically Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS). For more information about datasets, refer to Dataset.
Goal Of This Guide
This guide will show you how to create a dataset named realestate_db.sales with three columns.
Prerequisites
For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. For detailed steps, please refer to DataHub Quickstart Guide.
Create Datasets With GraphQL (Not Supported)
🚫 Creating a dataset via graphql is currently not supported. Please check out the API feature comparison table for more information.
Create Datasets With Python SDK
The following code creates a Hive dataset named realestate_db.sales with three fields and a URN of urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD):
{{ inline /metadata-ingestion/examples/library/dataset_schema.py show_path_as_comment }}
Note that the name property of make_dataset_urn sets the display name of the dataset.
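To make the URN format concrete, here is a minimal sketch that assembles the same string by hand. The helper build_dataset_urn is hypothetical and only mirrors the URN shape shown above; it assumes PROD as the default environment, which matches the URN in this guide. In real code, prefer make_dataset_urn from datahub.emitter.mce_builder.

```python
def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: assemble a dataset URN in the format used in this guide.

    This is an illustration only, not part of the DataHub SDK.
    """
    # The platform is itself expressed as a URN nested inside the dataset URN.
    platform_urn = f"urn:li:dataPlatform:{platform}"
    return f"urn:li:dataset:({platform_urn},{name},{env})"


# Build the URN for the dataset created in this guide.
urn = build_dataset_urn("hive", "realestate_db.sales")
print(urn)
```

The three parts in the parentheses are the platform, the dataset name, and the environment, so changing any of them yields a different dataset entity.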
After creating the dataset, you can perform various manipulations, such as adding lineage and custom properties. Here are some steps to start with, but for more detailed guidance, please refer to the What's Next section.
Add Lineage
The following code creates a lineage from fct_users_deleted to realestate_db.sales:
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Construct a lineage object.
lineage_mce = builder.make_lineage_mce(
    [
        builder.make_dataset_urn("hive", "fct_users_deleted"),  # Upstream
    ],
    builder.make_dataset_urn("hive", "realestate_db.sales"),  # Downstream
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mce(lineage_mce)
For more information on adding lineage, please refer to how to add lineage on a dataset using Python SDK.
Add Custom Properties
You can also set custom properties using the following code:
{{ inline /metadata-ingestion/examples/library/dataset_add_properties.py show_path_as_comment }}
For more information on adding custom properties, please refer to Modifying Custom Properties on Datasets.
We're using the MetadataChangeProposalWrapper to change entities in this example.
For more information about the MetadataChangeProposal, please refer to MetadataChangeProposal & MetadataChangeLog Events.
Expected Outcomes
You can now see that the realestate_db.sales dataset has been created.
What's Next?
Now that you have created a dataset, how about enriching it? Here are some guides that you can check out.