# DataHub Concepts
Explore key concepts of DataHub to take full advantage of its capabilities in managing your data.
## General Concepts
### URN (Uniform Resource Name)
A URN (Uniform Resource Name) is the URI scheme that DataHub uses to uniquely identify any resource. It has the following form:
```
urn:<Namespace>:<Entity Type>:<ID>
```
Examples include `urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)` and `urn:li:corpuser:jdoe`.
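
As a rough illustration of the form above, the following sketch assembles and splits URN strings with plain string handling. The helper names (`make_urn`, `make_dataset_id`, `split_urn`) are hypothetical and are not part of DataHub; the default namespace `li` is taken from the examples above.

```python
from typing import Tuple


def make_urn(entity_type: str, entity_id: str, namespace: str = "li") -> str:
    """Assemble a string of the form urn:<Namespace>:<Entity Type>:<ID>."""
    return f"urn:{namespace}:{entity_type}:{entity_id}"


def make_dataset_id(platform: str, name: str, env: str) -> str:
    """Build the parenthesized ID used by the dataset example above."""
    platform_urn = make_urn("dataPlatform", platform)
    return f"({platform_urn},{name},{env})"


def split_urn(urn: str) -> Tuple[str, str, str]:
    """Split a URN into (namespace, entity type, ID).

    Only the first three colons are treated as separators, because the ID
    itself may contain a nested URN.
    """
    prefix, namespace, entity_type, entity_id = urn.split(":", 3)
    if prefix != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    return namespace, entity_type, entity_id


user_urn = make_urn("corpuser", "jdoe")
dataset_urn = make_urn(
    "dataset", make_dataset_id("hive", "fct_users_created", "PROD")
)
```

Here `user_urn` and `dataset_urn` reproduce the two example URNs given above.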
> - [What is URN?](/docs/what/urn.md)
### Policy
Access policies in DataHub define who can do what to which resources.
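
The "who can do what to which resources" idea can be pictured with a toy model. This is only a conceptual sketch, not DataHub's policy engine or its policy schema: the `Policy` class, its field names, and the privilege string `EDIT_TAGS` are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    actors: FrozenSet[str]      # who
    privileges: FrozenSet[str]  # can do what
    resources: FrozenSet[str]   # to which resources

    def allows(self, actor: str, privilege: str, resource: str) -> bool:
        """Grant access only when all three parts of the policy match."""
        return (
            actor in self.actors
            and privilege in self.privileges
            and resource in self.resources
        )


DATASET = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"

policy = Policy(
    actors=frozenset({"urn:li:corpuser:jdoe"}),
    privileges=frozenset({"EDIT_TAGS"}),  # hypothetical privilege name
    resources=frozenset({DATASET}),
)

can_jdoe_edit = policy.allows("urn:li:corpuser:jdoe", "EDIT_TAGS", DATASET)
can_other_edit = policy.allows("urn:li:corpuser:other", "EDIT_TAGS", DATASET)
```

In this sketch `can_jdoe_edit` is true and `can_other_edit` is false, because only `jdoe` is listed as an actor.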
> - [Authorization: Policies Guide](/docs/authorization/policies.md)
> - [Developer Guides: DataHubPolicy](/docs/generated/metamodel/entities/dataHubPolicy.md)
> - [Feature Guides: About DataHub Access Policies](/docs/authorization/access-policies-guide.md)
### Role
DataHub provides the ability to use Roles to manage permissions.
> - [Authorization: About DataHub Roles](/docs/authorization/roles.md)
> - [Developer Guides: DataHubRole](/docs/generated/metamodel/entities/dataHubRole.md)
### Access Token (Personal Access Token)
Personal Access Tokens, or PATs for short, allow users to represent themselves in code and programmatically use DataHub's APIs in deployments where security is a concern.
Used alongside the [authentication-enabled metadata service](/docs/authentication/introducing-metadata-service-authentication.md), PATs add a layer of protection to DataHub where only authorized users are able to perform actions in an automated way.
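
A minimal sketch of presenting a token from code, assuming the token is sent as a bearer token in the `Authorization` header. The URL and token value below are placeholders chosen for illustration, and `build_authorized_request` is a hypothetical helper; the request is only constructed here, not sent.

```python
import urllib.request


def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Create a request that carries the access token as a bearer token."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder endpoint and token; substitute the values for your deployment.
request = build_authorized_request(
    "http://localhost:8080/api/graphql",
    "<my-access-token>",
)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so that the sketch does not depend on a running DataHub instance.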
> - [Authentication: About DataHub Personal Access Tokens](/docs/authentication/personal-access-tokens.md)
> - [Developer Guides: DataHubAccessToken](/docs/generated/metamodel/entities/dataHubAccessToken.md)
### View
Views allow you to save and share sets of filters for reuse when browsing DataHub. A view can either be public or personal.
> - [DataHubView](/docs/generated/metamodel/entities/dataHubView.md)
### Deprecation
Deprecation is an aspect that indicates the deprecation status of an entity. Typically it is expressed as a Boolean value.
> - [Deprecation of a dataset](/docs/generated/metamodel/entities/dataset.md#deprecation)
### Ingestion Source
Ingestion sources refer to the data systems that we are extracting metadata from. For example, we have sources for BigQuery, Looker, Tableau and many others.
> - [Sources](/metadata-ingestion/README.md#sources)
> - [DataHub Integrations](https://datahubproject.io/integrations)
### Container
A container of related data assets.
> - [Developer Guides: Container](/docs/generated/metamodel/entities/container.md)
### Data Platform
Data Platforms are systems or tools that contain Datasets, Dashboards, Charts, and all other kinds of data assets modeled in the metadata graph.
<details><summary>
List of Data Platforms
</summary>

- Azure Data Lake (Gen 1)
- Azure Data Lake (Gen 2)
- Airflow
- Ambry
- ClickHouse
- Couchbase
- External Source
- HDFS
- SAP HANA
- Hive
- Iceberg
- AWS S3
- Kafka
- Kafka Connect
- Kusto
- Mode
- MongoDB
- MySQL
- MariaDB
- OpenAPI
- Oracle
- Pinot
- PostgreSQL
- Presto
- Tableau
- Vertica

Reference: [data_platforms.yaml](https://github.com/datahub-project/datahub/blob/master/metadata-service/configuration/src/main/resources/bootstrap_mcps/data-platforms.yaml)

</details>
> - [Developer Guides: Data Platform](/docs/generated/metamodel/entities/dataPlatform.md)
### Dataset
Datasets represent collections of data, typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS).
> - [Developer Guides: Dataset](/docs/generated/metamodel/entities/dataset.md)
### Chart
A single data visualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart.
> - [Developer Guides: Chart](/docs/generated/metamodel/entities/chart.md)
### Dashboard
A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard.
> - [Developer Guides: Dashboard](/docs/generated/metamodel/entities/dashboard.md)
### Data Job
An executable job that processes data assets, where "processing" implies consuming data, producing data, or both.
In orchestration systems, this is sometimes referred to as an individual "Task" within a "DAG". Examples include an Airflow Task.
> - [Developer Guides: Data Job](/docs/generated/metamodel/entities/dataJob.md)
### Data Flow
An executable collection of Data Jobs with dependencies among them, or a DAG.
Sometimes referred to as a "Pipeline". Examples include an Airflow DAG.
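
To make the relationship between a Data Flow and its Data Jobs concrete, here is a small sketch that treats a flow as a mapping from each job to the jobs it depends on, and derives a valid run order. The job names are hypothetical, and this uses Python's standard `graphlib` module purely for illustration; it is not how DataHub stores or schedules flows.

```python
from graphlib import TopologicalSorter

# A hypothetical Data Flow: each key is a Data Job (a "Task"),
# and each value is the set of jobs that must finish before it starts.
flow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Because the flow is a DAG, its jobs can be ordered so that every job
# runs after all of its dependencies.
run_order = list(TopologicalSorter(flow).static_order())
```

For this linear flow, `run_order` lists `extract`, then `transform`, then `load`.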
> - [Developer Guides: Data Flow](/docs/generated/metamodel/entities/dataFlow.md)
### Glossary Term
Shared vocabulary within the data ecosystem.
> - [Feature Guides: Glossary](/docs/glossary/business-glossary.md)
> - [Developer Guides: GlossaryTerm](/docs/generated/metamodel/entities/glossaryTerm.md)
### Glossary Term Group
Glossary Term Group is similar to a folder, containing Terms and even other Term Groups to allow for a nested structure.
> - [Feature Guides: Term & Term Group](/docs/glossary/business-glossary.md#terms--term-groups)
### Tag
Tags are informal, loosely controlled labels that help with search and discovery. They can be added to datasets, dataset schemas, or containers as an easy way to label or categorize entities, without having to associate them with a broader business glossary or vocabulary.
> - [Feature Guides: About DataHub Tags](/docs/tags.md)
> - [Developer Guides: Tags](/docs/generated/metamodel/entities/tag.md)
### Domain
Domains are curated, top-level folders or categories where related assets can be explicitly grouped.
> - [Feature Guides: About DataHub Domains](/docs/domains.md)
> - [Developer Guides: Domain](/docs/generated/metamodel/entities/domain.md)
### Owner
Owner refers to the users or groups that have ownership rights over entities. For example, an owner can be assigned to a dataset or to a column of a dataset.
> - [Getting Started: Adding Owners On Datasets/Columns](/docs/api/tutorials/owners.md#add-owners)
### Users (CorpUser)
CorpUser represents an identity of a person (or an account) in the enterprise.
> - [Developer Guides: CorpUser](/docs/generated/metamodel/entities/corpuser.md)
### Groups (CorpGroup)
CorpGroup represents an identity of a group of users in the enterprise.
> - [Developer Guides: CorpGroup](/docs/generated/metamodel/entities/corpGroup.md)
## Metadata Model
### Entity
An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity.
> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)
### Aspect
An aspect is a collection of attributes that describes a particular facet of an entity.
Aspects can be shared across entities. For example, "Ownership" is an aspect that is reused across all the Entities that have owners.
> - [What is a metadata aspect?](/docs/what/aspect.md)
> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)
### Relationships
A relationship represents a named edge between two entities. Relationships are declared via foreign key attributes within Aspects, along with a custom annotation (`@Relationship`).
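
The way entities, aspects, and relationships fit together can be sketched with a toy in-memory model. This is a conceptual illustration only: the `Entity` class, the dictionary layout of the aspect, and the edge name `OwnedBy` are assumptions made for this example and do not reproduce DataHub's actual schema definitions or annotation processing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    """A node in the metadata graph, identified by its URN."""
    urn: str
    # Aspect name -> attributes describing one facet of the entity.
    aspects: Dict[str, dict] = field(default_factory=dict)


def owned_by_edges(entity: Entity) -> List[Tuple[str, str, str]]:
    """Derive named edges from the URN-valued ("foreign key") fields
    of the entity's ownership aspect."""
    owners = entity.aspects.get("ownership", {}).get("owners", [])
    return [(entity.urn, "OwnedBy", owner["owner"]) for owner in owners]


dataset = Entity(
    urn="urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    aspects={"ownership": {"owners": [{"owner": "urn:li:corpuser:jdoe"}]}},
)

edges = owned_by_edges(dataset)
```

Each element of `edges` is a `(source URN, edge name, destination URN)` triple, so the single owner above yields one edge from the dataset to `urn:li:corpuser:jdoe`.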
> - [What is a relationship?](/docs/what/relationship.md)
> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)