
### Business Glossary File Format
The business glossary source file should be a `.yml` file with the following top-level keys:
**Glossary**: the top-level keys of the business glossary file
Example **Glossary**:
```yaml
version: "1" # the version of the business glossary file format this config conforms to. Currently the only released version is `1`.
source: DataHub # the source format of the terms. Currently only `DataHub` is supported
owners: # owners contains two nested fields
  users: # (optional) a list of user IDs
    - njones
  groups: # (optional) a list of group IDs
    - logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external URL pointing to where the glossary is defined, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
  ...
```
**GlossaryNode**: a container of **GlossaryNode** and **GlossaryTerm** objects
Example **GlossaryNode**:
```yaml
- name: "Shipping" # name of the node
  id: "Shipping-Logistics" # (optional) custom identifier for the node
  description: Provides terms related to the shipping domain # description of the node
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  nodes: # list of child **GlossaryNode** objects
    ...
  knowledge_links: # (optional) list of **KnowledgeCard** objects
    - label: Wiki link for shipping
      url: "https://en.wikipedia.org/wiki/Freight_transport"
```
**GlossaryTerm**: a term in your business glossary
Example **GlossaryTerm**:
```yaml
- name: "Full Address" # name of the term
  id: "Full-Address-Details" # (optional) custom identifier for the term
  description: A collection of information to give the location of a building or plot of land. # description of the term
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term comes from an external glossary or one defined in your organization.
  source_ref: FIBO # (optional) if external, the name of the source the glossary term comes from
  source_url: "https://www.google.com" # (optional) if external, the URL of the source definition
  inherits: # (optional) list of **GlossaryTerm** that this term inherits from
    - Privacy.PII
  contains: # (optional) a list of **GlossaryTerm** that this term contains
    - Shipping.ZipCode
    - Shipping.CountryCode
    - Shipping.StreetAddress
  custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
    is_used_for_compliance_tracking: "true"
  knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary term's page
    - url: "https://en.wikipedia.org/wiki/Address"
      label: Wiki link
  domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
```
## ID Management and URL Generation
The business glossary provides two primary ways to manage term and node identifiers:
1. **Custom IDs**: You can explicitly specify an ID for any term or node using the `id` field. This is recommended for terms that need stable, predictable identifiers:
   ```yaml
   terms:
     - name: "Response Time"
       id: "support-response-time" # Explicit ID
       description: "Target time to respond to customer inquiries"
   ```
2. **Automatic ID Generation**: When no ID is specified, the system will generate one based on the `enable_auto_id` setting:
   - With `enable_auto_id: false` (default):
     - Node and term names are converted to URL-friendly format
     - Spaces within names are replaced with hyphens
     - Special characters are removed (except hyphens)
     - Case is preserved
     - Multiple hyphens are collapsed to single ones
     - Path components (node/term hierarchy) are joined with periods
     - Example: Node "Customer Support" with term "Response Time" → "Customer-Support.Response-Time"
   - With `enable_auto_id: true`:
     - Generates GUID-based IDs
     - Recommended for guaranteed uniqueness
     - Required for terms with non-ASCII characters
Here's how path-based ID generation works:
```yaml
nodes:
  - name: "Customer Support" # Node ID: Customer-Support
    terms:
      - name: "Response Time" # Term ID: Customer-Support.Response-Time
        description: "Response SLA"
      - name: "First Reply" # Term ID: Customer-Support.First-Reply
        description: "Initial response"
  - name: "Product Feedback" # Node ID: Product-Feedback
    terms:
      - name: "Response Time" # Term ID: Product-Feedback.Response-Time
        description: "Feedback response"
```
**Important Notes**:
- Periods (.) are used exclusively as path separators between nodes and terms
- Periods in term or node names themselves will be removed
- Each component of the path (node names, term names) is cleaned independently:
  - Spaces are converted to hyphens
  - Special characters are removed
  - Case is preserved
- The cleaned components are then joined with periods to form the full path
- Non-ASCII characters in any component trigger automatic GUID generation
- Once an ID is created (either manually or automatically), it cannot be easily changed
- All references to a term (in `inherits`, `contains`, etc.) must use its correct ID
- Moving terms in the hierarchy does NOT update their IDs:
  - The ID retains its original path components even after moving
  - This can lead to IDs that don't match the current location
  - Consider using `enable_auto_id: true` if you plan to reorganize your glossary
- For terms that other terms will reference, consider using explicit IDs or setting `enable_auto_id: true`
Example of how different names are handled:
```yaml
nodes:
  - name: "Data Services" # Node ID: Data-Services
    terms:
      # Basic term name
      - name: "Response Time" # Term ID: Data-Services.Response-Time
        description: "SLA metrics"
      # Term name with special characters
      - name: "API @ Response" # Term ID: Data-Services.API-Response
        description: "API metrics"
      # Term with non-ASCII (triggers GUID)
      - name: "パフォーマンス" # Term ID will be a 32-character GUID
        description: "Performance"
```
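The name-handling rules above can be approximated in a few lines of Python. This is a minimal sketch rather than the source's actual implementation: `clean_component` and `path_id` are hypothetical names, stripping leading and trailing hyphens is an assumption (it keeps a name that ends in a special character from producing a dangling hyphen), and the MD5 hex digest only stands in for whatever 32-character GUID scheme is really used.

```python
import hashlib
import re
from typing import Sequence


def clean_component(name: str) -> str:
    """Clean one node or term name (hypothetical helper)."""
    text = name.replace(" ", "-")              # spaces become hyphens
    text = re.sub(r"[^A-Za-z0-9-]", "", text)  # drop special characters, including periods
    text = re.sub(r"-{2,}", "-", text)         # collapse repeated hyphens
    return text.strip("-")                     # assumption: no leading/trailing hyphens


def path_id(components: Sequence[str]) -> str:
    """Build an ID from the node/term name path (hypothetical helper)."""
    if any(not part.isascii() for part in components):
        # Assumption: a deterministic 32-character hex digest stands in for the GUID.
        return hashlib.md5(".".join(components).encode("utf-8")).hexdigest()
    # Each component is cleaned independently, then joined with periods.
    return ".".join(clean_component(part) for part in components)


print(path_id(["Customer Support", "Response Time"]))  # Customer-Support.Response-Time
print(path_id(["Data Services", "API @ Response"]))    # Data-Services.API-Response
print(len(path_id(["Data Services", "パフォーマンス"])))  # 32
```

Because case is preserved and periods are reserved as separators, two names that differ only in special characters (for example "API @ Response" and "API Response") would map to the same ID under these rules, which is one reason to prefer explicit IDs for frequently referenced terms.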
To see how these all work together, check out this comprehensive example business glossary file below:
```yaml
version: "1"
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: "Data Classification"
    id: "Data-Classification" # Custom ID for stable references
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
        description: Sensitive Data
        custom_properties:
          is_confidential: "false"
      - name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
        description: Confidential Data
        custom_properties:
          is_confidential: "true"
      - name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: "true"
        domain: Marketing
  - name: "Personal Information"
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: "Email" # Will generate: Personal-Information.Email
        description: An individual's email address
        inherits:
          - Data-Classification.Confidential-Information # References a term in another node
        owners:
          groups:
            - Trust and Safety
      - name: "Address" # Will generate: Personal-Information.Address
        description: A physical address
      - name: "Gender" # Will generate: Personal-Information.Gender
        description: The gender identity of the individual
        inherits:
          - Data-Classification.Sensitive-Data # References a term in another node
  - name: "Clients And Accounts"
    description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
    owners:
      groups:
        - finance
      type: DATAOWNER
    terms:
      - name: "Account" # Will generate: Clients-And-Accounts.Account
        description: Container for records associated with a business arrangement for regular transactions and services
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        inherits:
          - Data-Classification.Highly-Confidential # References a term in another node
        contains:
          - Clients-And-Accounts.Balance # References term in same node
      - name: "Balance" # Will generate: Clients-And-Accounts.Balance
        description: Amount of money available or owed
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
  - name: "KPIs"
    description: Common Business KPIs
    terms:
      - name: "CSAT %" # Will generate: KPIs.CSAT
        description: Customer Satisfaction Score
```
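Since every entry in `inherits` and `contains` must match a term's exact ID, it can help to check a glossary for dangling references before ingesting it. The sketch below is one possible approach, not part of the source itself: `walk_terms` and `dangling_references` are hypothetical helpers, the glossary is assumed to be already loaded into plain Python dictionaries, only ASCII names with path-based IDs (or an explicit `id`) are handled, and any reference to a term defined outside the file is reported as dangling.

```python
import re
from typing import Iterator, List, Tuple


def _clean(name: str) -> str:
    # Spaces to hyphens, special characters removed, repeated hyphens collapsed.
    text = re.sub(r"[^A-Za-z0-9-]", "", name.replace(" ", "-"))
    return re.sub(r"-{2,}", "-", text).strip("-")


def walk_terms(nodes: list, parents: Tuple[str, ...] = ()) -> Iterator[Tuple[str, dict]]:
    """Yield (term_id, term) for every term, descending into child nodes."""
    for node in nodes:
        path = parents + (_clean(node["name"]),)
        for term in node.get("terms", []):
            # An explicit `id` wins; otherwise fall back to the cleaned name path.
            term_id = term.get("id") or ".".join(path + (_clean(term["name"]),))
            yield term_id, term
        yield from walk_terms(node.get("nodes", []), path)


def dangling_references(glossary: dict) -> List[Tuple[str, str]]:
    """Return (term_id, reference) pairs whose reference matches no term in the file."""
    terms = list(walk_terms(glossary.get("nodes", [])))
    known = {term_id for term_id, _ in terms}
    return [
        (term_id, ref)
        for term_id, term in terms
        for key in ("inherits", "contains")
        for ref in term.get(key, [])
        if ref not in known
    ]


glossary = {
    "nodes": [
        {"name": "Data Classification", "terms": [{"name": "Sensitive Data"}]},
        {"name": "Personal Information", "terms": [
            {"name": "Gender", "inherits": ["Data-Classification.Sensitive"]},
            {"name": "Email", "inherits": ["Data-Classification.Sensitive-Data"]},
        ]},
    ]
}
print(dangling_references(glossary))
# [('Personal-Information.Gender', 'Data-Classification.Sensitive')]
```

The abbreviated reference `Data-Classification.Sensitive` is flagged because the generated ID includes the full cleaned term name, `Sensitive-Data`.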
## Custom ID Specification
Custom IDs can be specified in two ways, both of which are fully supported and acceptable:
1. Just the ID portion (simpler approach):
   ```yaml
   terms:
     - name: "Email"
       id: "company-email" # Will become urn:li:glossaryTerm:company-email
       description: "Company email address"
   ```
2. Full URN format:
   ```yaml
   terms:
     - name: "Email"
       id: "urn:li:glossaryTerm:company-email"
       description: "Company email address"
   ```
Both methods are valid. If you specify just the ID portion, the system adds the URN prefix automatically.
The same applies for nodes:
```yaml
nodes:
  - name: "Communications"
    id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
    description: "Internal communication methods"
```
Note: Once you select a custom ID, it cannot be easily changed.
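The prefix handling described in this section amounts to a small normalization step. The following is a minimal sketch under that reading; `to_urn` is a hypothetical helper, not a function exposed by the source.

```python
def to_urn(custom_id: str, entity_type: str) -> str:
    """Return a full URN for a custom ID, adding the prefix only when it is missing."""
    prefix = f"urn:li:{entity_type}:"
    return custom_id if custom_id.startswith(prefix) else prefix + custom_id


print(to_urn("company-email", "glossaryTerm"))                      # urn:li:glossaryTerm:company-email
print(to_urn("urn:li:glossaryTerm:company-email", "glossaryTerm"))  # urn:li:glossaryTerm:company-email
print(to_urn("internal-comms", "glossaryNode"))                     # urn:li:glossaryNode:internal-comms
```

Either form of the `id` field therefore resolves to the same URN, so references elsewhere in the glossary stay valid whichever form you choose.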
## Compatibility
This source is compatible with version 1 of the business glossary format. It will evolve as newer versions of the format are published.