mirror of
https://github.com/datahub-project/datahub.git
synced 2025-07-17 22:30:30 +00:00
275 lines
11 KiB
Markdown
275 lines
11 KiB
Markdown
### Business Glossary File Format
|
|
|
|
The business glossary source file should be a .yml file with the following top-level keys:
|
|
|
|
**Glossary**: the top level keys of the business glossary file
|
|
|
|
Example **Glossary**:
|
|
|
|
```yaml
|
|
version: "1" # the version of business glossary file config the config conforms to. Currently the only version released is `1`.
|
|
source: DataHub # the source format of the terms. Currently only supports `DataHub`
|
|
owners: # owners contains two nested fields
|
|
users: # (optional) a list of user IDs
|
|
- njones
|
|
groups: # (optional) a list of group IDs
|
|
- logistics
|
|
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
|
|
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
|
|
...
|
|
```
|
|
|
|
**GlossaryNode**: a container of **GlossaryNode** and **GlossaryTerm** objects
|
|
|
|
Example **GlossaryNode**:
|
|
|
|
```yaml
|
|
- name: "Shipping" # name of the node
|
|
id: "Shipping-Logistics" # (optional) custom identifier for the node
|
|
description: Provides terms related to the shipping domain # description of the node
|
|
owners: # (optional) owners contains 2 nested fields
|
|
users: # (optional) a list of user IDs
|
|
- njones
|
|
groups: # (optional) a list of group IDs
|
|
- logistics
|
|
nodes: # list of child **GlossaryNode** objects
|
|
...
|
|
knowledge_links: # (optional) list of **KnowledgeCard** objects
|
|
- label: Wiki link for shipping
|
|
url: "https://en.wikipedia.org/wiki/Freight_transport"
|
|
```
|
|
|
|
**GlossaryTerm**: a term in your business glossary
|
|
|
|
Example **GlossaryTerm**:
|
|
|
|
```yaml
|
|
- name: "Full Address" # name of the term
|
|
id: "Full-Address-Details" # (optional) custom identifier for the term
|
|
description: A collection of information to give the location of a building or plot of land. # description of the term
|
|
owners: # (optional) owners contains 2 nested fields
|
|
users: # (optional) a list of user IDs
|
|
- njones
|
|
groups: # (optional) a list of group IDs
|
|
- logistics
|
|
term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
|
|
source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from?
|
|
source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition?
|
|
inherits: # (optional) list of **GlossaryTerm** that this term inherits from
|
|
- Privacy.PII
|
|
contains: # (optional) a list of **GlossaryTerm** that this term contains
|
|
- Shipping.ZipCode
|
|
- Shipping.CountryCode
|
|
- Shipping.StreetAddress
|
|
custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
|
|
- is_used_for_compliance_tracking: "true"
|
|
knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary node's page
|
|
- url: "https://en.wikipedia.org/wiki/Address"
|
|
label: Wiki link
|
|
domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
|
|
```
|
|
|
|
## ID Management and URL Generation
|
|
|
|
The business glossary provides two primary ways to manage term and node identifiers:
|
|
|
|
1. **Custom IDs**: You can explicitly specify an ID for any term or node using the `id` field. This is recommended for terms that need stable, predictable identifiers:
|
|
|
|
```yaml
|
|
terms:
|
|
- name: "Response Time"
|
|
id: "support-response-time" # Explicit ID
|
|
description: "Target time to respond to customer inquiries"
|
|
```
|
|
|
|
2. **Automatic ID Generation**: When no ID is specified, the system will generate one based on the `enable_auto_id` setting:
|
|
|
|
- With `enable_auto_id: false` (default):
|
|
|
|
- Node and term names are converted to URL-friendly format
|
|
- Spaces within names are replaced with hyphens
|
|
- Special characters are removed (except hyphens)
|
|
- Case is preserved
|
|
- Multiple hyphens are collapsed to single ones
|
|
- Path components (node/term hierarchy) are joined with periods
|
|
- Example: Node "Customer Support" with term "Response Time" → "Customer-Support.Response-Time"
|
|
|
|
- With `enable_auto_id: true`:
|
|
- Generates GUID-based IDs
|
|
- Recommended for guaranteed uniqueness
|
|
- Required for terms with non-ASCII characters
|
|
|
|
Here's how path-based ID generation works:
|
|
|
|
```yaml
|
|
nodes:
|
|
- name: "Customer Support" # Node ID: Customer-Support
|
|
terms:
|
|
- name: "Response Time" # Term ID: Customer-Support.Response-Time
|
|
description: "Response SLA"
|
|
|
|
- name: "First Reply" # Term ID: Customer-Support.First-Reply
|
|
description: "Initial response"
|
|
|
|
- name: "Product Feedback" # Node ID: Product-Feedback
|
|
terms:
|
|
- name: "Response Time" # Term ID: Product-Feedback.Response-Time
|
|
description: "Feedback response"
|
|
```
|
|
|
|
**Important Notes**:
|
|
|
|
- Periods (.) are used exclusively as path separators between nodes and terms
|
|
- Periods in term or node names themselves will be removed
|
|
- Each component of the path (node names, term names) is cleaned independently:
|
|
- Spaces to hyphens
|
|
- Special characters removed
|
|
- Case preserved
|
|
- The cleaned components are then joined with periods to form the full path
|
|
- Non-ASCII characters in any component trigger automatic GUID generation
|
|
- Once an ID is created (either manually or automatically), it cannot be easily changed
|
|
- All references to a term (in `inherits`, `contains`, etc.) must use its correct ID
|
|
- Moving terms in the hierarchy does NOT update their IDs:
|
|
- The ID retains its original path components even after moving
|
|
- This can lead to IDs that don't match the current location
|
|
- Consider using `enable_auto_id: true` if you plan to reorganize your glossary
|
|
- For terms that other terms will reference, consider using explicit IDs or enable auto_id
|
|
|
|
Example of how different names are handled:
|
|
|
|
```yaml
|
|
nodes:
|
|
- name: "Data Services" # Node ID: Data-Services
|
|
terms:
|
|
# Basic term name
|
|
- name: "Response Time" # Term ID: Data-Services.Response-Time
|
|
description: "SLA metrics"
|
|
|
|
# Term name with special characters
|
|
- name: "API @ Response" # Term ID: Data-Services.API-Response
|
|
description: "API metrics"
|
|
|
|
# Term with non-ASCII (triggers GUID)
|
|
- name: "パフォーマンス" # Term ID will be a 32-character GUID
|
|
description: "Performance"
|
|
```
|
|
|
|
To see how these all work together, check out this comprehensive example business glossary file below:
|
|
|
|
```yaml
|
|
version: "1"
|
|
source: DataHub
|
|
owners:
|
|
users:
|
|
- mjames
|
|
url: "https://github.com/datahub-project/datahub/"
|
|
nodes:
|
|
- name: "Data Classification"
|
|
id: "Data-Classification" # Custom ID for stable references
|
|
description: A set of terms related to Data Classification
|
|
knowledge_links:
|
|
- label: Wiki link for classification
|
|
url: "https://en.wikipedia.org/wiki/Classification"
|
|
terms:
|
|
- name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
|
|
description: Sensitive Data
|
|
custom_properties:
|
|
is_confidential: "false"
|
|
- name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
|
|
description: Confidential Data
|
|
custom_properties:
|
|
is_confidential: "true"
|
|
- name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
|
|
description: Highly Confidential Data
|
|
custom_properties:
|
|
is_confidential: "true"
|
|
domain: Marketing
|
|
|
|
- name: "Personal Information"
|
|
description: All terms related to personal information
|
|
owners:
|
|
users:
|
|
- mjames
|
|
terms:
|
|
- name: "Email" # Will generate: Personal-Information.Email
|
|
description: An individual's email address
|
|
inherits:
|
|
- Data-Classification.Confidential # References parent node path
|
|
owners:
|
|
groups:
|
|
- Trust and Safety
|
|
- name: "Address" # Will generate: Personal-Information.Address
|
|
description: A physical address
|
|
- name: "Gender" # Will generate: Personal-Information.Gender
|
|
description: The gender identity of the individual
|
|
inherits:
|
|
- Data-Classification.Sensitive # References parent node path
|
|
|
|
- name: "Clients And Accounts"
|
|
description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
|
|
owners:
|
|
groups:
|
|
- finance
|
|
type: DATAOWNER
|
|
terms:
|
|
- name: "Account" # Will generate: Clients-And-Accounts.Account
|
|
description: Container for records associated with a business arrangement for regular transactions and services
|
|
term_source: "EXTERNAL"
|
|
source_ref: FIBO
|
|
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
|
|
inherits:
|
|
- Data-Classification.Highly-Confidential # References parent node path
|
|
contains:
|
|
- Clients-And-Accounts.Balance # References term in same node
|
|
- name: "Balance" # Will generate: Clients-And-Accounts.Balance
|
|
description: Amount of money available or owed
|
|
term_source: "EXTERNAL"
|
|
source_ref: FIBO
|
|
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
|
|
|
|
- name: "KPIs"
|
|
description: Common Business KPIs
|
|
terms:
|
|
- name: "CSAT %" # Will generate: KPIs.CSAT
|
|
description: Customer Satisfaction Score
|
|
```
|
|
|
|
## Custom ID Specification
|
|
|
|
Custom IDs can be specified in two ways, both of which are fully supported and acceptable:
|
|
|
|
1. Just the ID portion (simpler approach):
|
|
|
|
```yaml
|
|
terms:
|
|
- name: "Email"
|
|
id: "company-email" # Will become urn:li:glossaryTerm:company-email
|
|
description: "Company email address"
|
|
```
|
|
|
|
2. Full URN format:
|
|
|
|
```yaml
|
|
terms:
|
|
- name: "Email"
|
|
id: "urn:li:glossaryTerm:company-email"
|
|
description: "Company email address"
|
|
```
|
|
|
|
Both methods are valid and will work correctly. The system will automatically handle the URN prefix if you specify just the ID portion.
|
|
|
|
The same applies for nodes:
|
|
|
|
```yaml
|
|
nodes:
|
|
- name: "Communications"
|
|
id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
|
|
description: "Internal communication methods"
|
|
```
|
|
|
|
Note: Once you select a custom ID, it cannot be easily changed.
|
|
|
|
## Compatibility
|
|
|
|
Compatible with version 1 of business glossary format. The source will be evolved as newer versions of this format are published.
|