# GlossaryNode

A GlossaryNode represents a hierarchical grouping or category within DataHub's Business Glossary. GlossaryNodes act as folders or containers that organize GlossaryTerms into a logical structure, making it easier to navigate and manage large business glossaries.

In practice, GlossaryNodes allow you to:

- Create hierarchical categories for organizing business terminology
- Build multi-level taxonomies (e.g., Finance > Revenue > Recurring Revenue)
- Establish ownership and governance over specific glossary sections
- Apply metadata consistently across related terms within a category
- Manage permissions at the category level

For example, you might create a GlossaryNode called "Finance" containing terms like "Revenue", "Profit", and "EBITDA", with a nested GlossaryNode "Compliance" underneath containing "SOX", "GDPR", and "CCPA" terms.

## Identity

GlossaryNodes are uniquely identified by a single field: their **name**. This name serves as the persistent identifier for the node throughout its lifecycle.

### URN Structure

The URN (Uniform Resource Name) for a GlossaryNode follows this pattern:

```
urn:li:glossaryNode:<node_name>
```

Where:

- `<node_name>`: A unique string identifier for the node. This can be human-readable (e.g., "Finance") or a generated ID (e.g., "fin-category-001" or a UUID).

### Examples

```
# Simple node name
urn:li:glossaryNode:Finance

# Hierarchical naming convention (common pattern)
urn:li:glossaryNode:Finance.Revenue
urn:li:glossaryNode:Classification
urn:li:glossaryNode:Classification.DataSensitivity

# UUID-based identifier
urn:li:glossaryNode:41516e31-0acb-fd90-76ff-fc2c98d2d1a3

# Descriptive identifier
urn:li:glossaryNode:PersonalInformation
```

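
The URN pattern above is a fixed prefix followed by the node name, so it can be built and taken apart with plain string handling. The sketch below assumes nothing beyond that pattern; `make_glossary_node_urn` and `parse_glossary_node_urn` are hypothetical helper names for illustration, not functions from the DataHub SDK.

```python
# Hypothetical helpers for illustration; not part of the DataHub SDK.
GLOSSARY_NODE_PREFIX = "urn:li:glossaryNode:"


def make_glossary_node_urn(node_name: str) -> str:
    """Build a GlossaryNode URN from the node name (the identifying field)."""
    if not node_name:
        raise ValueError("node name must be a non-empty string")
    return GLOSSARY_NODE_PREFIX + node_name


def parse_glossary_node_urn(urn: str) -> str:
    """Return the node name embedded in a GlossaryNode URN."""
    if not urn.startswith(GLOSSARY_NODE_PREFIX):
        raise ValueError(f"not a GlossaryNode URN: {urn!r}")
    name = urn[len(GLOSSARY_NODE_PREFIX):]
    if not name:
        raise ValueError("URN has an empty node name")
    return name


print(make_glossary_node_urn("Finance.Revenue"))
print(parse_glossary_node_urn("urn:li:glossaryNode:PersonalInformation"))
```

Note that a dotted name such as `Finance.Revenue` is still a single flat string; the dot only signals the intended hierarchy to human readers.
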
### Best Practices for Node Names

1. **Use hierarchical notation**: Prefix nodes with their parent category (e.g., `Finance.Revenue`, `Classification.PII`) to indicate structure even though the name is flat.
2. **Be consistent**: Choose a naming convention (camelCase, dot notation, etc.) and apply it uniformly across your glossary.
3. **Keep it permanent**: The node name is the identifier and should not change. Use the `name` field in `glossaryNodeInfo` for the display name.
4. **Consider depth**: While nesting is supported, keep hierarchies manageable (typically 2-4 levels deep) for usability.

## Important Capabilities

### Core Node Information (glossaryNodeInfo)

The `glossaryNodeInfo` aspect contains the essential information about a glossary node:

- **definition** (required): A description of what this node/category represents. This helps users understand the purpose and scope of terms within this node.
- **name**: The display name shown in the UI. This can be more human-friendly than the URN identifier (e.g., "Financial Metrics" vs. "FinancialMetrics").
- **parentNode**: A reference to another GlossaryNode that acts as the parent in the hierarchy. This creates the tree structure visible in the UI.
- **id**: An optional identifier field that can store an external reference or alternate ID.
- **customProperties**: Key-value pairs for additional metadata specific to your organization.

Example:

```python
{
  "name": "Financial Metrics",
  "definition": "Category for all financial and accounting-related business terms including revenue, costs, and profitability measures.",
  "parentNode": "urn:li:glossaryNode:Finance"
}
```

### Hierarchical Structure

GlossaryNodes support arbitrary nesting through the `parentNode` field, creating tree structures:

```
GlossaryNode: DataGovernance
├── GlossaryNode: Classification
│   ├── GlossaryTerm: Public
│   ├── GlossaryTerm: Internal
│   └── GlossaryTerm: Confidential
│
├── GlossaryNode: PersonalInformation
│   ├── GlossaryNode: DirectIdentifiers
│   │   ├── GlossaryTerm: Email
│   │   └── GlossaryTerm: SSN
│   └── GlossaryNode: IndirectIdentifiers
│       ├── GlossaryTerm: IPAddress
│       └── GlossaryTerm: DeviceID
│
└── GlossaryNode: Compliance
    ├── GlossaryTerm: GDPR
    └── GlossaryTerm: CCPA
```

Key characteristics:

- A GlossaryNode can have at most one parent node (single inheritance)
- A GlossaryNode can contain both GlossaryTerms and child GlossaryNodes
- Nodes at the root level (no parent) appear at the top of the glossary hierarchy
- Moving a node automatically moves all its descendants

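
One way to see why moving a node carries its descendants along is to model the hierarchy as a map from each entity to its single parent. The sketch below is a toy in-memory model, not DataHub code; the `Archive` node and the `path_to` helper are made up for illustration.

```python
from typing import Dict, List, Optional

# Toy model: each name maps to its parent node (None = root level).
# "Archive" is a hypothetical second root used only for this example.
parent_of: Dict[str, Optional[str]] = {
    "DataGovernance": None,
    "Archive": None,
    "PersonalInformation": "DataGovernance",
    "DirectIdentifiers": "PersonalInformation",
    "Email": "DirectIdentifiers",
}


def path_to(name: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of names from the root down to `name`."""
    path: List[str] = []
    current: Optional[str] = name
    while current is not None:
        path.append(current)
        current = parents[current]
    return list(reversed(path))


before = path_to("Email", parent_of)

# Move one node: only its own parent pointer changes.
parent_of["PersonalInformation"] = "Archive"

after = path_to("Email", parent_of)

print(" > ".join(before))
print(" > ".join(after))
```

Because every descendant reaches the root through the moved node, a single pointer update relocates the whole subtree; no entry for `DirectIdentifiers` or `Email` is touched.
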
### Ownership and Governance

GlossaryNodes support standard ownership metadata through the `ownership` aspect. Ownership at the node level can represent:

- Stewardship responsibility for maintaining the category and its terms
- Subject matter expertise for the business domain
- Accountability for term quality and accuracy within the category

Ownership is particularly powerful for GlossaryNodes because:

- Owners can be granted special permissions (Manage Direct Glossary Children, Manage All Glossary Children)
- Ownership can cascade to terms within the node
- It establishes clear accountability for glossary sections

### Documentation and Links

GlossaryNodes support the `institutionalMemory` aspect, allowing you to:

- Link to external documentation (Confluence pages, wikis, etc.)
- Reference governance policies or standards
- Point to training materials or style guides
- Maintain a history of important links related to the category

This is especially useful for top-level nodes representing major domains or initiatives.

## Code Examples

### Creating a GlossaryNode

<details>
<summary>Python SDK: Create a root-level GlossaryNode</summary>

```python
{{ inline /metadata-ingestion/examples/library/glossary_node_create.py show_path_as_comment }}
```

</details>

<details>
<summary>Python SDK: Create a nested GlossaryNode with parent</summary>

```python
{{ inline /metadata-ingestion/examples/library/glossary_node_create_nested.py show_path_as_comment }}
```

</details>

### Managing Hierarchy

<details>
<summary>Python SDK: Build a multi-level glossary hierarchy</summary>

```python
{{ inline /metadata-ingestion/examples/library/glossary_term_create_hierarchy.py show_path_as_comment }}
```

</details>

### Adding Ownership

<details>
<summary>Python SDK: Add an owner to a GlossaryNode</summary>

```python
{{ inline /metadata-ingestion/examples/library/glossary_node_add_owner.py show_path_as_comment }}
```

</details>

### Querying GlossaryNodes

<details>
<summary>REST API: Get a GlossaryNode by URN</summary>

```bash
# Fetch a GlossaryNode entity
curl -X GET 'http://localhost:8080/entities/urn%3Ali%3AglossaryNode%3AFinance' \
  -H 'Authorization: Bearer <token>'

# Response includes all aspects:
# - glossaryNodeKey (identity)
# - glossaryNodeInfo (definition, name, parentNode, etc.)
# - ownership (who owns this node)
# - institutionalMemory (links to documentation)
# - etc.
```

</details>

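
The REST path in the curl example carries the URN percent-encoded, with every `:` written as `%3A`. A minimal sketch of producing that path with the Python standard library follows; the `entity_url` helper is a hypothetical name, and the base URL is simply the `localhost:8080` address from the curl example.

```python
from urllib.parse import quote

# Base URL taken from the curl example; adjust for your deployment.
GMS_URL = "http://localhost:8080"


def entity_url(urn: str, base_url: str = GMS_URL) -> str:
    """Return the /entities/ URL for a URN (hypothetical helper)."""
    # safe="" makes quote() encode every reserved character, including ":".
    encoded = quote(urn, safe="")
    return f"{base_url}/entities/{encoded}"


url = entity_url("urn:li:glossaryNode:Finance")
print(url)
```

Encoding matters most for node names that contain characters such as spaces or slashes, which would otherwise change the meaning of the path.
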
<details>
<summary>GraphQL: Query root-level GlossaryNodes</summary>

```graphql
query GetRootGlossaryNodes {
  getRootGlossaryNodes {
    nodes {
      urn
      properties {
        name
        definition
      }
      ownership {
        owners {
          owner {
            ... on CorpUser {
              urn
              username
            }
          }
        }
      }
    }
  }
}
```

</details>

<details>
<summary>GraphQL: Query children of a GlossaryNode</summary>

```graphql
query GetGlossaryNodeChildren {
  glossaryNode(urn: "urn:li:glossaryNode:Finance") {
    urn
    properties {
      name
      definition
    }
    children {
      count
      relationships {
        entity {
          ... on GlossaryNode {
            urn
            properties {
              name
            }
          }
          ... on GlossaryTerm {
            urn
            properties {
              name
              definition
            }
          }
        }
      }
    }
  }
}
```

</details>

### Bulk Operations

<details>
<summary>YAML Ingestion: Create node hierarchy from Business Glossary file</summary>

```yaml
# business_glossary.yml
version: "1"
source: MyOrganization
owners:
  users:
    - datahub
nodes:
  - name: DataGovernance
    description: Top-level governance structure
    nodes:
      - name: Classification
        description: Data classification categories
        terms:
          - name: Public
            description: Publicly available data
          - name: Internal
            description: Internal use only
          - name: Confidential
            description: Restricted access data

      - name: PersonalInformation
        description: Personal and sensitive data categories
        nodes:
          - name: DirectIdentifiers
            description: Direct personal identifiers
            terms:
              - name: Email
                description: Email addresses
              - name: SSN
                description: Social Security Numbers

          - name: IndirectIdentifiers
            description: Indirect identifiers
            terms:
              - name: IPAddress
                description: Internet Protocol addresses
              - name: DeviceID
                description: Device identifiers

# Ingest with the DataHub CLI through an ingestion recipe whose source type is
# datahub-business-glossary and whose config.file points at this file:
# datahub ingest -c <recipe>.yml
```

See the [Business Glossary Source](../../../generated/ingestion/sources/business-glossary.md) documentation for the full YAML format specification.

</details>

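
To make the nesting in a glossary file concrete, the sketch below walks a nested `nodes`/`terms` structure and lists every entity with its dotted path. It is an illustration only: the `glossary` dictionary is a hand-written stand-in for part of the parsed YAML, `walk` is a hypothetical helper, and the dotted paths follow the hierarchical naming convention described under Identity rather than claiming to reproduce how the ingestion source generates URNs.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hand-written stand-in for part of a parsed business glossary file.
glossary: Dict[str, Any] = {
    "nodes": [
        {
            "name": "DataGovernance",
            "nodes": [
                {
                    "name": "Classification",
                    "terms": [{"name": "Public"}, {"name": "Internal"}],
                },
                {
                    "name": "PersonalInformation",
                    "nodes": [
                        {"name": "DirectIdentifiers", "terms": [{"name": "Email"}]},
                    ],
                },
            ],
        }
    ]
}


def walk(
    container: Dict[str, Any], prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, dotted_path) for every node and term, parents first."""
    for node in container.get("nodes", []):
        path = prefix + (node["name"],)
        yield "node", ".".join(path)
        for term in node.get("terms", []):
            yield "term", ".".join(path + (term["name"],))
        # Recurse into child nodes, carrying the path accumulated so far.
        yield from walk(node, path)


entries: List[Tuple[str, str]] = list(walk(glossary))
for kind, dotted in entries:
    print(f"{kind:4} {dotted}")
```

Emitting parents before children mirrors the order in which a hierarchy has to be created when each child must reference an existing parent.
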
## Integration Points

### Relationship with GlossaryTerm

GlossaryNodes provide organizational structure for GlossaryTerms. The relationship is established through:

- **GlossaryTerm → GlossaryNode**: A term's `glossaryTermInfo.parentNode` field references its containing node
- **Navigation**: The UI renders this as a browsable hierarchy where users can expand nodes to see contained terms
- **Search**: Users can filter by glossary node to find all terms within a category

Think of this relationship as:

- **GlossaryNode**: Folder/directory (can contain terms and other nodes)
- **GlossaryTerm**: File (the actual business definition)

### Parent-Child Relationships

GlossaryNodes form a tree structure through self-referential parent-child relationships:

- A child node references its parent via `glossaryNodeInfo.parentNode`
- A parent node can have many children (both nodes and terms)
- The DataHub UI displays this as an expandable tree in the glossary browser
- GraphQL resolvers provide specialized queries for traversing the hierarchy

**Key operations:**

- `getRootGlossaryNodes`: Fetch all top-level nodes (no parent)
- `parentNodes`: Navigate upward to find all ancestors
- `children`: Navigate downward to find immediate children
- Moving a node updates its `parentNode` reference and affects the entire subtree

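
The three traversal operations above can be pictured as simple lookups over the `parentNode` references. The following sketch runs against a toy in-memory parent map, not against DataHub; the Python function names merely echo the GraphQL operation names and are not SDK calls.

```python
from typing import Dict, List, Optional

# Toy parent map: name -> parent node name (None = no parent).
parent_of: Dict[str, Optional[str]] = {
    "DataGovernance": None,
    "Classification": "DataGovernance",
    "PersonalInformation": "DataGovernance",
    "DirectIdentifiers": "PersonalInformation",
}


def root_nodes(parents: Dict[str, Optional[str]]) -> List[str]:
    """Entities without a parent, as getRootGlossaryNodes would list them."""
    return sorted(name for name, parent in parents.items() if parent is None)


def parent_nodes(name: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Ancestors of `name`, nearest parent first, ending at the root."""
    ancestors: List[str] = []
    current = parents[name]
    while current is not None:
        ancestors.append(current)
        current = parents[current]
    return ancestors


def children(name: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Immediate children of `name` only, not deeper descendants."""
    return sorted(child for child, parent in parents.items() if parent == name)


print(root_nodes(parent_of))
print(parent_nodes("DirectIdentifiers", parent_of))
print(children("DataGovernance", parent_of))
```

Storing only the upward reference keeps each entity's record small; the downward view (`children`) is derived by scanning or indexing those references.
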
### GraphQL API

The GraphQL API provides specialized operations for GlossaryNodes:

**Queries:**

- `glossaryNode(urn)`: Fetch a specific node with children
- `getRootGlossaryNodes`: Get all root-level nodes
- `search(entity: "glossaryNode")`: Search nodes by name/definition

**Mutations:**

- `createGlossaryNode`: Create a new node with optional parent
- `updateParentNode`: Move a node to a different parent
- `updateName`: Update the display name
- `updateDescription`: Update the definition

**Resolvers:**

- `children`: Fetch immediate children (nodes and terms)
- `childrenCount`: Count of children under this node
- `parentNodes`: Fetch ancestor path from node to root

See the [Business Glossary documentation](../../../glossary/business-glossary.md) for UI operations.

### Access Control and Permissions

GlossaryNodes support fine-grained access control through glossary-specific privileges:

#### Manage Direct Glossary Children

Users with this privilege on a node can:

- Create new terms and nodes directly under this node
- Edit terms and nodes directly under this node
- Delete terms and nodes directly under this node

The privilege does not extend to grandchildren or deeper descendants.

**Use case**: Department leads managing their immediate category structure

#### Manage All Glossary Children

Users with this privilege on a node can:

- Create, edit, and delete any term or node in the entire subtree
- Manage nested hierarchies of any depth
- Exercise full control over the category and all descendants

**Use case**: Data governance team managing an entire domain (e.g., all PII-related terms)

#### Global Privilege: Manage Glossaries

Users with this platform-level privilege can:

- Manage any node or term across the entire glossary
- Create root-level nodes
- Exercise full administrative control

These privileges are checked hierarchically: a privilege held on a parent node extends to entities beneath it, and how far it reaches depends on the privilege type (immediate children only for Manage Direct Glossary Children, the whole subtree for Manage All Glossary Children).

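
The difference in reach between the two node-level privileges can be sketched as a walk up the parent chain. This is a simplified model under stated assumptions, not DataHub's authorization code: the privilege labels, the `grants` set of `(privilege, node)` pairs, and `can_manage` are all hypothetical, and the platform-level Manage Glossaries privilege is left out.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical labels for the two node-level privileges.
DIRECT = "MANAGE_DIRECT_CHILDREN"
ALL = "MANAGE_ALL_CHILDREN"

parent_of: Dict[str, Optional[str]] = {
    "PersonalInformation": None,
    "DirectIdentifiers": "PersonalInformation",
    "Email": "DirectIdentifiers",
}


def can_manage(
    target: str,
    grants: Set[Tuple[str, str]],
    parents: Dict[str, Optional[str]],
) -> bool:
    """Return True if some grant on an ancestor covers `target`."""
    parent = parents.get(target)
    if parent is None:
        # Root-level entities have no parent node to hold a privilege on.
        return False
    # Either privilege on the immediate parent covers a direct child.
    if (DIRECT, parent) in grants or (ALL, parent) in grants:
        return True
    # Further up the chain, only the "all children" privilege still applies.
    ancestor = parents.get(parent)
    while ancestor is not None:
        if (ALL, ancestor) in grants:
            return True
        ancestor = parents.get(ancestor)
    return False


direct_only = {(DIRECT, "PersonalInformation")}
whole_subtree = {(ALL, "PersonalInformation")}

print(can_manage("DirectIdentifiers", direct_only, parent_of))
print(can_manage("Email", direct_only, parent_of))
print(can_manage("Email", whole_subtree, parent_of))
```

In this model a direct-children grant on `PersonalInformation` covers `DirectIdentifiers` but stops short of `Email`, while the all-children grant covers both.
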
### Integration with Search and Discovery

While GlossaryNodes don't get applied to data assets directly (that's the role of GlossaryTerms), they enhance discoverability by:

1. **Faceted Navigation**: Users can browse the glossary hierarchy to find relevant terms
2. **Context**: The node structure provides semantic grouping that helps users understand term relationships
3. **Filtering**: Search interfaces can filter terms by their containing node
4. **Autocomplete**: Node structure influences term suggestions and grouping

## Notable Exceptions

### Node Name vs Display Name

Similar to GlossaryTerms, the URN identifier (`name` in `glossaryNodeKey`) is separate from the display name (`name` in `glossaryNodeInfo`):

- **URN name**: Use a stable, unchanging identifier (e.g., "finance-001", "DataGovernance")
- **Display name**: Use a human-friendly label that can be updated (e.g., "Financial Metrics", "Data Governance")

This separation allows you to rename nodes in the UI without breaking references.

### Circular References Not Allowed

The hierarchy must be a tree: every node has at most one parent, and the parent links never form a cycle:

- A node cannot be its own ancestor
- Moving a node under one of its descendants is prevented
- DataHub validates the hierarchy to prevent cycles

If you attempt to create a circular reference, the operation will fail with a validation error.

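
A cycle check of this kind amounts to walking upward from the proposed new parent and seeing whether the node being moved appears on the way. The sketch below shows the idea on a toy parent map; `would_create_cycle` is a hypothetical helper and does not claim to be how DataHub implements its validation.

```python
from typing import Dict, Optional

parent_of: Dict[str, Optional[str]] = {
    "DataGovernance": None,
    "Classification": "DataGovernance",
    "PersonalInformation": "DataGovernance",
    "DirectIdentifiers": "PersonalInformation",
}


def would_create_cycle(
    node: str, new_parent: str, parents: Dict[str, Optional[str]]
) -> bool:
    """Return True if making `new_parent` the parent of `node` closes a loop."""
    current: Optional[str] = new_parent
    while current is not None:
        if current == node:
            # `node` is the new parent itself or one of its ancestors.
            return True
        current = parents.get(current)
    return False


# Moving a node under its own descendant is rejected ...
print(would_create_cycle("PersonalInformation", "DirectIdentifiers", parent_of))
# ... while moving it under an unrelated node is fine.
print(would_create_cycle("DirectIdentifiers", "Classification", parent_of))
```

The walk visits at most one entity per level, so the cost of the check grows with the depth of the hierarchy rather than with the total number of nodes.
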
### Root-Level Nodes

Nodes with no parent (`parentNode` is null or not set) appear at the root level of the glossary:

- These represent top-level categories
- Creating root-level nodes may require higher privileges
- Root nodes typically represent major domains or organizational divisions

### Deleting Nodes with Children

Current behavior (subject to change):

- **DataHub may require nodes to be empty before deletion**
- You must first delete or move all child nodes and terms
- This prevents accidental loss of large glossary sections

Best practice: Always move or reassign children before deleting a node, or use bulk operations that handle the entire subtree.

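
If nodes do have to be empty before they can be deleted, a whole subtree has to be removed bottom-up. The sketch below computes such an order (children before parents) on a toy parent map; `deletion_order` is a hypothetical helper that only produces the ordering and performs no deletion against DataHub.

```python
from collections import defaultdict
from typing import Dict, List, Optional

parent_of: Dict[str, Optional[str]] = {
    "PersonalInformation": None,
    "DirectIdentifiers": "PersonalInformation",
    "IndirectIdentifiers": "PersonalInformation",
    "Email": "DirectIdentifiers",
    "SSN": "DirectIdentifiers",
    "IPAddress": "IndirectIdentifiers",
}


def deletion_order(root: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """List `root` and its descendants so that children precede parents."""
    kids: Dict[str, List[str]] = defaultdict(list)
    for child, parent in parents.items():
        if parent is not None:
            kids[parent].append(child)

    order: List[str] = []

    def visit(name: str) -> None:
        # Post-order traversal: handle the whole subtree, then the node itself.
        for child in sorted(kids[name]):
            visit(child)
        order.append(name)

    visit(root)
    return order


order = deletion_order("PersonalInformation", parent_of)
print(order)
```

Every entity appears after all of its descendants, so each node is already empty by the time its turn comes.
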
### Display Properties

GlossaryNodes support the `displayProperties` aspect (added in newer versions), which provides additional UI customization:

- Custom icons or colors for the node
- Display order hints
- UI-specific rendering preferences

This is an optional enhancement for organizations that want more visual control over their glossary.

### No Direct Application to Assets

Unlike GlossaryTerms, GlossaryNodes are **not** directly applied to data assets:

- You cannot tag a dataset with a GlossaryNode
- Only GlossaryTerms can be applied to datasets, columns, dashboards, etc.
- Nodes exist solely for organizational purposes within the glossary itself

If you need to tag assets with a category, create a GlossaryTerm within that node and apply the term.

### Moving Nodes Affects All Descendants

When you move a node to a new parent:

- All child nodes and terms move with it
- The entire subtree is relocated
- References from terms to their parent node are automatically maintained
- No manual updates to individual terms are needed

This makes reorganization efficient but requires care to avoid unintended moves.