The dataset entity is one the most important entities in the metadata model. They represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift etc.), Streams in a stream-processing environment (Kafka, Pulsar etc.), bundles of data found as Files or Folders in data lake systems (S3, ADLS, etc.).
For more information about datasets, refer to [Dataset](/docs/generated/metamodel/entities/dataset.md).
The following code creates a Hive dataset named realestate_db.sales with three fields and a URN of urn:li:dataset:(urn:li:dataPlatform:hive,realestate_db.sales,PROD):
Note that the `name` property of `make_dataset_urn` sets the display name of the dataset.
After creating the dataset, you can perform various manipulations, such as adding lineage and custom properties.
Here are some steps to start with, but for more detailed guidance, please refer to the [What's Next](/docs/api/tutorials/creating-datasets.md#whats-next) section.
### Add Lineage
The following code creates a lineage from `fct_users_deleted` to `realestate_db.sales`:
For more information on adding lineages, please refer to [how to add lineage on a dataset using PythonSDK](/docs/api/tutorials/adding-lineage.md#add-lineage-with-python-sdk).
For more information on adding custom properties, please refer to [Modifying Custom Properties on Datasets](/docs/api/tutorials/modifying-dataset-custom-properties.md)
For more information about the `MetadataChangeProposal`, please refer to [MetadataChangeProposal & MetadataChangeLog Events](/docs/advanced/mcp-mcl.md).