### Business Glossary File Format

The business glossary source file should be a `.yml` file with the following top-level keys:

**Glossary**: the top-level keys of the business glossary file

Example **Glossary**:
```yaml
version: "1" # the version of the business glossary file config that this file conforms to. Currently the only released version is `1`.
source: DataHub # the source format of the terms. Currently only `DataHub` is supported.
owners: # owners contains two nested fields
  users: # (optional) a list of user IDs
    - njones
  groups: # (optional) a list of group IDs
    - logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
  ...
```
**GlossaryNode**: a container of **GlossaryNode** and **GlossaryTerm** objects

Example **GlossaryNode**:
```yaml
- name: "Shipping" # name of the node
  id: "Shipping-Logistics" # (optional) custom identifier for the node
  description: Provides terms related to the shipping domain # description of the node
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  nodes: # list of child **GlossaryNode** objects
    ...
  knowledge_links: # (optional) list of **KnowledgeCard** objects
    - label: Wiki link for shipping
      url: "https://en.wikipedia.org/wiki/Freight_transport"
```
**GlossaryTerm**: a term in your business glossary

Example **GlossaryTerm**:
```yaml
- name: "Full Address" # name of the term
  id: "Full-Address-Details" # (optional) custom identifier for the term
  description: A collection of information to give the location of a building or plot of land. # description of the term
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term comes from an external glossary or one defined in your organization.
  source_ref: FIBO # (optional) if external, the name of the source the glossary term comes from
  source_url: "https://www.google.com" # (optional) if external, the url of the source definition
  inherits: # (optional) a list of **GlossaryTerm** that this term inherits from
    - Privacy.PII
  contains: # (optional) a list of **GlossaryTerm** that this term contains
    - Shipping.ZipCode
    - Shipping.CountryCode
    - Shipping.StreetAddress
  custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
    is_used_for_compliance_tracking: "true"
  knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary term's page
    - url: "https://en.wikipedia.org/wiki/Address"
      label: Wiki link
  domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
```
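As a reading aid, the fields shown in the three examples above can be summarized as Python dataclasses. This is only a sketch of the file's shape and not DataHub's actual config model: the class names mirror the object names used in this document, a `terms` list is included on nodes because later examples use it, and treating `term_source` as optional is an assumption based on examples that omit it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Owners:
    users: List[str] = field(default_factory=list)   # (optional) user IDs
    groups: List[str] = field(default_factory=list)  # (optional) group IDs


@dataclass
class KnowledgeCard:
    label: str
    url: str


@dataclass
class GlossaryTerm:
    name: str
    description: str
    id: Optional[str] = None            # custom identifier
    owners: Optional[Owners] = None
    term_source: Optional[str] = None   # "EXTERNAL" or "INTERNAL"; optionality is an assumption
    source_ref: Optional[str] = None
    source_url: Optional[str] = None
    inherits: List[str] = field(default_factory=list)   # IDs of inherited terms
    contains: List[str] = field(default_factory=list)   # IDs of contained terms
    custom_properties: Dict[str, str] = field(default_factory=dict)
    knowledge_links: List[KnowledgeCard] = field(default_factory=list)
    domain: Optional[str] = None        # domain name or domain urn


@dataclass
class GlossaryNode:
    name: str
    description: str
    id: Optional[str] = None
    owners: Optional[Owners] = None
    terms: List[GlossaryTerm] = field(default_factory=list)
    nodes: List["GlossaryNode"] = field(default_factory=list)
    knowledge_links: List[KnowledgeCard] = field(default_factory=list)


# The "Full Address" term from the example above, trimmed to a few fields.
full_address = GlossaryTerm(
    name="Full Address",
    description="A collection of information to give the location of a building or plot of land.",
    id="Full-Address-Details",
    term_source="EXTERNAL",
    inherits=["Privacy.PII"],
    custom_properties={"is_used_for_compliance_tracking": "true"},
)
shipping = GlossaryNode(
    name="Shipping",
    description="Provides terms related to the shipping domain",
    terms=[full_address],
)
```

Fields with defaults correspond to the keys marked `(optional)` in the YAML comments, apart from the assumptions noted above.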
## ID Management and URL Generation

The business glossary provides two primary ways to manage term and node identifiers:

1. **Custom IDs**: You can explicitly specify an ID for any term or node using the `id` field. This is recommended for terms that need stable, predictable identifiers:

   ```yaml
   terms:
     - name: "Response Time"
       id: "support-response-time" # Explicit ID
       description: "Target time to respond to customer inquiries"
   ```
2. **Automatic ID Generation**: When no ID is specified, the system generates one based on the `enable_auto_id` setting:

   - With `enable_auto_id: false` (default):
     - Node and term names are converted to a URL-friendly format
     - Spaces within names are replaced with hyphens
     - Special characters are removed (except hyphens)
     - Case is preserved
     - Multiple hyphens are collapsed to single ones
     - Path components (node/term hierarchy) are joined with periods
     - Example: Node "Customer Support" with term "Response Time" → "Customer-Support.Response-Time"
   - With `enable_auto_id: true`:
     - Generates GUID-based IDs
     - Recommended for guaranteed uniqueness
     - Required for terms with non-ASCII characters

Here's how path-based ID generation works:

```yaml
nodes:
  - name: "Customer Support" # Node ID: Customer-Support
    terms:
      - name: "Response Time" # Term ID: Customer-Support.Response-Time
        description: "Response SLA"
      - name: "First Reply" # Term ID: Customer-Support.First-Reply
        description: "Initial response"
  - name: "Product Feedback" # Node ID: Product-Feedback
    terms:
      - name: "Response Time" # Term ID: Product-Feedback.Response-Time
        description: "Feedback response"
```
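The cleaning rules above can be sketched in Python. This is a minimal illustration, not DataHub's implementation: `clean_component` and `make_id` are hypothetical names, stripping leading and trailing hyphens is an assumption, and the MD5 hash is only a stand-in for the GUID scheme, which this document does not specify.

```python
import hashlib
import re
from typing import List


def clean_component(name: str) -> str:
    """Clean a single node or term name into a URL-friendly path component."""
    text = name.replace(" ", "-")               # spaces become hyphens
    text = re.sub(r"[^A-Za-z0-9-]", "", text)   # drop special characters, including periods
    text = re.sub(r"-+", "-", text)             # collapse runs of hyphens
    return text.strip("-")                      # assumption: no leading/trailing hyphens


def make_id(path: List[str], enable_auto_id: bool = False) -> str:
    """Build an ID from the node and term names along the hierarchy path."""
    needs_guid = enable_auto_id or any(not part.isascii() for part in path)
    if needs_guid:
        # Stand-in only: any stable 32-character hex digest works for this sketch.
        return hashlib.md5("/".join(path).encode("utf-8")).hexdigest()
    # Each component is cleaned independently, then joined with periods.
    return ".".join(clean_component(part) for part in path)


print(make_id(["Customer Support", "Response Time"]))  # Customer-Support.Response-Time
print(make_id(["Data Services", "API @ Response"]))    # Data-Services.API-Response
print(len(make_id(["Data Services", "パフォーマンス"])))   # 32
```

Case is preserved because the function never lowercases its input, and periods inside a name are dropped so that they only ever appear as path separators.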
**Important Notes**:

- Periods (.) are used exclusively as path separators between nodes and terms
- Periods in term or node names themselves will be removed
- Each component of the path (node names, term names) is cleaned independently:
  - Spaces to hyphens
  - Special characters removed
  - Case preserved
- The cleaned components are then joined with periods to form the full path
- Non-ASCII characters in any component trigger automatic GUID generation
- Once an ID is created (either manually or automatically), it cannot be easily changed
- All references to a term (in `inherits`, `contains`, etc.) must use its correct ID
- Moving terms in the hierarchy does NOT update their IDs:
  - The ID retains its original path components even after moving
  - This can lead to IDs that don't match the current location
  - Consider using `enable_auto_id: true` if you plan to reorganize your glossary
- For terms that other terms will reference, consider using explicit IDs or setting `enable_auto_id: true`

Example of how different names are handled:

```yaml
nodes:
  - name: "Data Services" # Node ID: Data-Services
    terms:
      # Basic term name
      - name: "Response Time" # Term ID: Data-Services.Response-Time
        description: "SLA metrics"

      # Term name with special characters
      - name: "API @ Response" # Term ID: Data-Services.API-Response
        description: "API metrics"

      # Term with non-ASCII (triggers GUID)
      - name: "パフォーマンス" # Term ID will be a 32-character GUID
        description: "Performance"
```
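Because every `inherits` and `contains` entry must match the target term's ID exactly, a quick check for dangling references before ingestion can help. The sketch below assumes the glossary has already been flattened into a dict keyed by term ID; `find_dangling_references` is a hypothetical helper, not part of DataHub.

```python
from typing import Dict, List, Set


def find_dangling_references(terms: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each term ID to the referenced IDs that are not defined in `terms`."""
    known: Set[str] = set(terms)
    dangling: Dict[str, List[str]] = {}
    for term_id, term in terms.items():
        refs = list(term.get("inherits", [])) + list(term.get("contains", []))
        missing = [ref for ref in refs if ref not in known]
        if missing:
            dangling[term_id] = missing
    return dangling


# Hypothetical flattened glossary, keyed by path-based term IDs.
terms = {
    "Customer-Support.Response-Time": {"contains": ["Customer-Support.First-Reply"]},
    "Customer-Support.First-Reply": {},
    # The reference below drops a hyphen, so it does not match any known ID.
    "Product-Feedback.Response-Time": {"inherits": ["Customer-Support.ResponseTime"]},
}
result = find_dangling_references(terms)
print(result)
```

Only the misspelled reference is reported; the `contains` entry that matches an existing ID passes.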
To see how these all work together, check out the comprehensive example business glossary file below:

```yaml
version: "1"
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: "Data Classification"
    id: "Data-Classification" # Custom ID for stable references
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
        description: Sensitive Data
        custom_properties:
          is_confidential: "false"
      - name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
        description: Confidential Data
        custom_properties:
          is_confidential: "true"
      - name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: "true"
        domain: Marketing
  - name: "Personal Information"
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: "Email" # Will generate: Personal-Information.Email
        description: An individual's email address
        inherits:
          - Data-Classification.Confidential-Information # References a term in another node
        owners:
          groups:
            - Trust and Safety
      - name: "Address" # Will generate: Personal-Information.Address
        description: A physical address
      - name: "Gender" # Will generate: Personal-Information.Gender
        description: The gender identity of the individual
        inherits:
          - Data-Classification.Sensitive-Data # References a term in another node
  - name: "Clients And Accounts"
    description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
    owners:
      groups:
        - finance
      type: DATAOWNER
    terms:
      - name: "Account" # Will generate: Clients-And-Accounts.Account
        description: Container for records associated with a business arrangement for regular transactions and services
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        inherits:
          - Data-Classification.Highly-Confidential # References a term in another node
        contains:
          - Clients-And-Accounts.Balance # References a term in the same node
      - name: "Balance" # Will generate: Clients-And-Accounts.Balance
        description: Amount of money available or owed
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
  - name: "KPIs"
    description: Common Business KPIs
    terms:
      - name: "CSAT %" # Will generate: KPIs.CSAT
        description: Customer Satisfaction Score
```
## Custom ID Specification

Custom IDs can be specified in two ways, both of which are fully supported:

1. Just the ID portion (simpler approach):

   ```yaml
   terms:
     - name: "Email"
       id: "company-email" # Will become urn:li:glossaryTerm:company-email
       description: "Company email address"
   ```

2. Full URN format:

   ```yaml
   terms:
     - name: "Email"
       id: "urn:li:glossaryTerm:company-email"
       description: "Company email address"
   ```

Both forms resolve to the same term. If you specify just the ID portion, the system adds the URN prefix automatically.

The same applies for nodes:

```yaml
nodes:
  - name: "Communications"
    id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
    description: "Internal communication methods"
```
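The prefix handling described above amounts to adding `urn:li:glossaryTerm:` or `urn:li:glossaryNode:` only when it is missing. A minimal Python sketch of that behavior follows; `to_urn` is a hypothetical helper written for illustration, and the real ingestion code may differ.

```python
def to_urn(identifier: str, entity_type: str) -> str:
    """Return a full URN, adding the `urn:li:<entity_type>:` prefix only if it is missing."""
    prefix = f"urn:li:{entity_type}:"
    return identifier if identifier.startswith(prefix) else prefix + identifier


print(to_urn("company-email", "glossaryTerm"))                      # urn:li:glossaryTerm:company-email
print(to_urn("urn:li:glossaryTerm:company-email", "glossaryTerm"))  # urn:li:glossaryTerm:company-email
print(to_urn("internal-comms", "glossaryNode"))                     # urn:li:glossaryNode:internal-comms
```

Either spelling of the `id` field therefore ends up as the same URN.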

Note: Once you select a custom ID, it cannot be easily changed.

## Compatibility

Compatible with version 1 of the business glossary format. The source will evolve as newer versions of this format are published.