Mirror of https://github.com/datahub-project/datahub.git, synced 2025-10-24 23:48:23 +00:00 (207 lines, 7.5 KiB, Markdown)
# DataHub Concepts

Explore key concepts of DataHub to take full advantage of its capabilities in managing your data.

## General Concepts

### URN (Uniform Resource Name)

URN (Uniform Resource Name) is the URI scheme DataHub uses to uniquely identify any resource. It has the following form:

```
urn:<Namespace>:<Entity Type>:<ID>
```

Examples include `urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)` and `urn:li:corpuser:jdoe`.

> - [What is URN?](/docs/what/urn.md)
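To make the URN layout concrete, here is a minimal Python sketch that splits a URN into its namespace, entity type, and ID. The `Urn` tuple and `parse_urn` helper are hypothetical names invented for this illustration and are not part of DataHub. The sketch assumes only the `urn:<Namespace>:<Entity Type>:<ID>` form shown above, and it keeps the ID as an opaque string even when it contains nested URNs.

```python
from typing import NamedTuple


class Urn(NamedTuple):
    """The three variable parts of urn:<Namespace>:<Entity Type>:<ID>."""

    namespace: str
    entity_type: str
    entity_id: str


def parse_urn(urn: str) -> Urn:
    """Split a URN string into its parts (hypothetical helper, not a DataHub API).

    Only the first three colons are treated as separators, so an ID that
    itself contains colons (such as a nested URN) is kept intact.
    """
    parts = urn.split(":", 3)
    if len(parts) != 4 or parts[0] != "urn" or not all(parts[1:]):
        raise ValueError(f"not a valid URN: {urn!r}")
    return Urn(parts[1], parts[2], parts[3])


user = parse_urn("urn:li:corpuser:jdoe")
dataset = parse_urn(
    "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
)
print(user.entity_type, user.entity_id)
print(dataset.entity_type, dataset.entity_id)
```

Limiting the split to three separators is the design choice that matters here: the dataset ID embeds a data platform URN, so splitting on every colon would break it apart.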

### Policy

Access policies in DataHub define who can do what to which resources.

> - [Authorization: Policies Guide](/docs/authorization/policies.md)
> - [Developer Guides: DataHubPolicy](/docs/generated/metamodel/entities/dataHubPolicy.md)
> - [Feature Guides: About DataHub Access Policies](/docs/authorization/access-policies-guide.md)

### Role

DataHub provides the ability to use Roles to manage permissions.

> - [Authorization: About DataHub Roles](/docs/authorization/roles.md)
> - [Developer Guides: DataHubRole](/docs/generated/metamodel/entities/dataHubRole.md)

### Access Token (Personal Access Token)

Personal Access Tokens, or PATs for short, allow users to represent themselves in code and programmatically use DataHub's APIs in deployments where security is a concern.
Used alongside an [authentication-enabled metadata service](/docs/authentication/introducing-metadata-service-authentication.md), PATs add a layer of protection to DataHub so that only authorized users can perform actions in an automated way.

> - [Authentication: About DataHub Personal Access Tokens](/docs/authentication/personal-access-tokens.md)
> - [Developer Guides: DataHubAccessToken](/docs/generated/metamodel/entities/dataHubAccessToken.md)

### View

Views allow you to save and share sets of filters for reuse when browsing DataHub. A view can either be public or personal.

> - [DataHubView](/docs/generated/metamodel/entities/dataHubView.md)

### Deprecation

Deprecation is an aspect that indicates the deprecation status of an entity. Typically it is expressed as a Boolean value.

> - [Deprecation of a dataset](/docs/generated/metamodel/entities/dataset.md#deprecation)

### Ingestion Source

Ingestion sources refer to the data systems that we are extracting metadata from. For example, we have sources for BigQuery, Looker, Tableau and many others.

> - [Sources](/metadata-ingestion/README.md#sources)
> - [DataHub Integrations](https://datahubproject.io/integrations)

### Container

A container of related data assets.

> - [Developer Guides: Container](/docs/generated/metamodel/entities/container.md)

### Data Platform

Data Platforms are systems or tools that contain Datasets, Dashboards, Charts, and all other kinds of data assets modeled in the metadata graph.

<details><summary>
List of Data Platforms
</summary>

- Azure Data Lake (Gen 1)
- Azure Data Lake (Gen 2)
- Airflow
- Ambry
- ClickHouse
- Couchbase
- External Source
- HDFS
- SAP HANA
- Hive
- Iceberg
- AWS S3
- Kafka
- Kafka Connect
- Kusto
- Mode
- MongoDB
- MySQL
- MariaDB
- OpenAPI
- Oracle
- Pinot
- PostgreSQL
- Presto
- Tableau
- Vertica

Reference: [data_platforms.json](https://github.com/datahub-project/datahub/blob/master/metadata-service/war/src/main/resources/boot/data_platforms.json)

</details>

> - [Developer Guides: Data Platform](/docs/generated/metamodel/entities/dataPlatform.md)

### Dataset

Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS).

> - [Developer Guides: Dataset](/docs/generated/metamodel/entities/dataset.md)

### Chart

A single data visualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart.

> - [Developer Guides: Chart](/docs/generated/metamodel/entities/chart.md)

### Dashboard

A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard.

> - [Developer Guides: Dashboard](/docs/generated/metamodel/entities/dashboard.md)

### Data Job

An executable job that processes data assets, where "processing" implies consuming data, producing data, or both.
In orchestration systems, this is sometimes referred to as an individual "Task" within a "DAG". Examples include an Airflow Task.

> - [Developer Guides: Data Job](/docs/generated/metamodel/entities/dataJob.md)

### Data Flow

An executable collection of Data Jobs with dependencies among them, or a DAG.
Sometimes referred to as a "Pipeline". Examples include an Airflow DAG.

> - [Developer Guides: Data Flow](/docs/generated/metamodel/entities/dataFlow.md)
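The relationship between a Data Flow and its Data Jobs can be pictured with a small Python sketch. This is a conceptual model only, not DataHub's API: the job names and the `execution_order` helper are invented for illustration, and the flow is represented as a plain dictionary mapping each job to the jobs it depends on. It relies on `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each job name maps to the set of jobs it depends on.
flow = {
    "extract_users": set(),
    "clean_users": {"extract_users"},
    "load_users": {"clean_users"},
}


def execution_order(jobs: dict) -> list:
    """Return one order in which the jobs can run so that every job
    comes after all of its dependencies."""
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(flow)
print(order)  # ['extract_users', 'clean_users', 'load_users']
```

If the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which reflects the "acyclic" part of a DAG.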

### Glossary Term

Shared vocabulary within the data ecosystem.

> - [Feature Guides: Glossary](/docs/glossary/business-glossary.md)
> - [Developer Guides: GlossaryTerm](/docs/generated/metamodel/entities/glossaryTerm.md)

### Glossary Term Group

Glossary Term Group is similar to a folder, containing Terms and even other Term Groups to allow for a nested structure.

> - [Feature Guides: Term & Term Group](/docs/glossary/business-glossary.md#terms--term-groups)

### Tag

Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary.

> - [Feature Guides: About DataHub Tags](/docs/tags.md)
> - [Developer Guides: Tags](/docs/generated/metamodel/entities/tag.md)

### Domain

Domains are curated, top-level folders or categories where related assets can be explicitly grouped.

> - [Feature Guides: About DataHub Domains](/docs/domains.md)
> - [Developer Guides: Domain](/docs/generated/metamodel/entities/domain.md)

### Owner

Owners are the users or groups that have ownership rights over entities. For example, an owner can be assigned to a dataset or to a column of a dataset.

> - [Getting Started: Adding Owners On Datasets/Columns](/docs/api/tutorials/owners.md#add-owners)

### Users (CorpUser)

CorpUser represents an identity of a person (or an account) in the enterprise.

> - [Developer Guides: CorpUser](/docs/generated/metamodel/entities/corpuser.md)

### Groups (CorpGroup)

CorpGroup represents an identity of a group of users in the enterprise.

> - [Developer Guides: CorpGroup](/docs/generated/metamodel/entities/corpGroup.md)

## Metadata Model

### Entity

An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity.

> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)

### Aspect

An aspect is a collection of attributes that describes a particular facet of an entity.
Aspects can be shared across entities. For example, "Ownership" is an aspect that is reused across all the Entities that have owners.

> - [What is a metadata aspect?](/docs/what/aspect.md)
> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)
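The idea that one aspect type is reused across entities can be sketched in Python as follows. This is a simplified mental model, not DataHub's actual metadata classes: the `Entity` and `Ownership` classes, their fields, and the chart URN are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ownership:
    """A hypothetical 'Ownership' aspect: the URNs of an entity's owners."""

    owners: tuple


@dataclass
class Entity:
    """A node in the metadata graph, identified by its URN and described
    by a set of aspects keyed by aspect name."""

    urn: str
    aspects: dict = field(default_factory=dict)

    def set_aspect(self, aspect) -> None:
        # Key the aspect by its class name, e.g. "Ownership".
        self.aspects[type(aspect).__name__] = aspect


# The same aspect type describes two different kinds of entity.
dataset = Entity("urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)")
chart = Entity("urn:li:chart:(looker,example_chart)")  # illustrative URN
ownership = Ownership(owners=("urn:li:corpuser:jdoe",))
dataset.set_aspect(ownership)
chart.set_aspect(ownership)
print(sorted(dataset.aspects), sorted(chart.aspects))
```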

### Relationships

A relationship represents a named edge between two entities. Relationships are declared via foreign key attributes within Aspects along with a custom annotation (`@Relationship`).

> - [What is a relationship?](/docs/what/relationship.md)
> - [How does DataHub model metadata?](/docs/modeling/metadata-model.md)
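The following Python sketch mimics how an annotated foreign key attribute can be turned into a named edge. It is an analogy rather than DataHub's implementation: dataclass field metadata stands in for the `@Relationship` annotation, and the `Ownership` class, the `extract_edges` helper, and the edge name `OwnedBy` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Ownership:
    # Foreign key attribute: holds the URN of another entity. The metadata
    # entry plays the role of the relationship annotation in this sketch.
    owner: str = field(metadata={"relationship": "OwnedBy"})


def extract_edges(source_urn: str, aspect) -> list:
    """Return (source, relationship name, destination) triples for every
    field of the aspect that is marked as a relationship."""
    edges = []
    for f in fields(aspect):
        name = f.metadata.get("relationship")
        if name is not None:
            edges.append((source_urn, name, getattr(aspect, f.name)))
    return edges


edges = extract_edges(
    "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    Ownership(owner="urn:li:corpuser:jdoe"),
)
print(edges)
```

Declaring the edge on the attribute itself keeps the graph structure next to the data that defines it, so no separate edge list has to be maintained.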
