# Container
The container entity is a core entity in the metadata model that represents a grouping of related data assets. Containers provide hierarchical organization for datasets, charts, dashboards, and other containers, enabling navigation and structure discovery within data platforms.
## Identity
Containers are uniquely identified by a GUID (Globally Unique Identifier) that is typically derived from a combination of attributes specific to the container type. Unlike datasets, which are identified by platform, name, and environment, containers use a more flexible identification scheme based on their hierarchical properties.
The URN structure for a container is: `urn:li:container:{guid}`
The GUID is typically computed from container-specific properties such as:
- **Database containers**: platform + instance + database name
- **Schema containers**: platform + instance + database + schema name
- **Project containers**: platform + instance + project_id
- **Folder containers**: platform + instance + folder_abs_path
- **Bucket containers**: platform + instance + bucket_name
### URN Examples
```
urn:li:container:b5e95fce839e7d78151ed7e0a7420d84
```
The GUID is generated by passing a dictionary of identifying properties to the `datahub_guid()` function. For example, a Snowflake schema container would be identified by:
```python
{
    "platform": "snowflake",
    "instance": "prod_instance",
    "database": "analytics",
    "schema": "reporting",
}
```
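To make the idea of hashing a property dictionary concrete, here is a minimal standard-library sketch. It assumes the GUID is an MD5 digest of a canonical (sorted-key) JSON serialization; the helper name `sketch_guid` is hypothetical, and the real `datahub_guid()` may differ in detail, so the value printed here should not be expected to match GUIDs produced by an actual ingestion run.

```python
import hashlib
import json


def sketch_guid(props: dict) -> str:
    """Hash a property dictionary into a 32-character hex string (illustrative only)."""
    # Sort keys and use compact separators so that equal dictionaries always
    # serialize to the same string, regardless of insertion order.
    serialized = json.dumps(props, separators=(",", ":"), sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


schema_props = {
    "platform": "snowflake",
    "instance": "prod_instance",
    "database": "analytics",
    "schema": "reporting",
}

guid = sketch_guid(schema_props)
urn = f"urn:li:container:{guid}"
print(urn)
```

Because the keys are sorted before hashing, the order in which properties are supplied does not affect the result; only the keys and values themselves do.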
## Real-World Concepts
Containers represent various hierarchical structures in data platforms:
- **Databases**: Top-level organizational units in relational systems (MySQL, PostgreSQL, Snowflake)
- **Schemas**: Logical groupings within databases (Snowflake schemas, PostgreSQL schemas)
- **Projects**: Organizational units in cloud platforms (BigQuery projects)
- **Datasets**: Logical groupings in cloud platforms (BigQuery datasets)
- **Folders**: Directory structures in file systems and data lakes (S3 folders, ADLS directories)
- **Buckets**: Top-level storage containers in cloud object stores (S3 buckets, GCS buckets)
- **Workspaces**: Organizational units in BI platforms (Power BI workspaces, Tableau sites)
- **Catalogs**: Top-level organizational units in data catalogs (Unity Catalog, Iceberg catalogs)
- **Metastores**: Storage metadata repositories (Hive metastore, Unity metastore)
## Important Capabilities
### Container Properties
The `containerProperties` aspect contains metadata inherited from the source system:
- **name**: Display name of the container (required)
- **qualifiedName**: Fully-qualified name (optional, e.g., "prod.analytics.reporting")
- **description**: Description from the source system
- **env**: Environment indicator (PROD, DEV, QA, etc.)
- **customProperties**: Additional key-value properties from the source system
- **externalUrl**: Link to the container in the source system
- **created**: Timestamp when the container was created in the source system
- **lastModified**: Timestamp when the container was last modified in the source system
### Editable Container Properties
The `editableContainerProperties` aspect allows users to override or add information via the UI:
- **description**: User-provided description that supplements or overrides the source system description
This separation ensures that metadata from source systems doesn't conflict with user-provided annotations.
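
The following sketch illustrates one way a consumer might combine the two aspects when choosing which description to display. It assumes that a non-empty user-provided description takes precedence over the source system description; that precedence rule and the function name `resolve_description` are assumptions for illustration, not a statement of how the UI is implemented.

```python
from typing import Optional


def resolve_description(source: Optional[str], edited: Optional[str]) -> Optional[str]:
    """Prefer the user-edited description; fall back to the source system's."""
    # Treat a missing, empty, or whitespace-only edit as "no override".
    if edited is not None and edited.strip():
        return edited
    return source


# Source description only: shown as-is.
print(resolve_description("Reporting schema (from Snowflake)", None))
# A user edit is present: it wins, and the source value stays untouched.
print(resolve_description("Reporting schema (from Snowflake)", "Curated finance reports"))
```

Keeping the two values in separate aspects means a later ingestion run can rewrite the source description without discarding the user's edit.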
### Hierarchical Relationships
Containers support nested hierarchies through the `container` aspect, which links a container to its parent container. This enables multi-level organizational structures:
```
Platform (implicit)
└── Database Container
    └── Schema Container
        └── Dataset
```
For example, in Snowflake:
```
Snowflake Platform
└── ANALYTICS_DB (Database Container)
    └── REPORTING (Schema Container)
        ├── SALES_METRICS (Dataset)
        └── REVENUE_TABLE (Dataset)
```
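Since each entity points only at its direct parent, the full hierarchy is recovered by following parent links upward. The sketch below models the Snowflake example with a plain dictionary; the `PARENT_OF` mapping and the `parent_chain` helper are hypothetical stand-ins for reading the `container` aspect of each entity.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent mapping mirroring the Snowflake example.
PARENT_OF: Dict[str, Optional[str]] = {
    "REVENUE_TABLE": "REPORTING",
    "SALES_METRICS": "REPORTING",
    "REPORTING": "ANALYTICS_DB",
    "ANALYTICS_DB": None,  # top-level container; the platform is implicit
}


def parent_chain(entity: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the ancestors of `entity`, ordered from outermost to innermost."""
    chain: List[str] = []
    current = parent_of.get(entity)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    chain.reverse()
    return chain


print(" > ".join(parent_chain("REVENUE_TABLE", PARENT_OF)))
# prints "ANALYTICS_DB > REPORTING"
```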
### Subtypes
The `subTypes` aspect specifies the type of container, which helps the UI render appropriate icons and behaviors. Common subtypes include:
- **Database**: Relational database containers
- **Schema**: Schema-level containers within databases
- **Project**: Cloud project containers (GCP, Azure)
- **Dataset**: BigQuery dataset containers
- **Folder**: File system folders
- **Bucket**: Object storage buckets
- **Workspace**: BI platform workspaces
- **Catalog**: Data catalog containers
- **Metastore**: Metadata storage containers
- **MLflow Experiment** (`MLAssetSubTypes.MLFLOW_EXPERIMENT`): ML experiment containers that organize training runs
### ML Experiments as Containers
Machine learning experiments are modeled as containers with the `MLFLOW_EXPERIMENT` subtype. This pattern enables organizing related training runs (which are `dataProcessInstance` entities) into logical groups for comparison and tracking:
```
ML Experiment (Container)
├── Training Run 1 (DataProcessInstance)
├── Training Run 2 (DataProcessInstance)
└── Training Run 3 (DataProcessInstance)
```
Training runs belong to experiments through the `container` aspect. This structure mirrors common ML platform patterns (like MLflow) and enables:
- Comparing metrics across multiple training attempts
- Tracking the evolution of a model through iterations
- Organizing training work by project or objective
For more information on ML experiments and training runs, see:
- [ML Model entity documentation](mlModel.md#training-runs-and-experiments)
- [DataProcessInstance documentation for training runs](dataProcessInstance.md#tracking-ml-training-run-in-a-container)
### Containable Entities
The following entity types can be contained within a container:
- Datasets
- Charts
- Dashboards
- DataProcessInstances (e.g., training runs in ML experiments)
- Other Containers (for nested hierarchies)
## Code Examples
### Create a Database Container
<details>
<summary>Python SDK: Create a database container</summary>

```python
{{ inline /metadata-ingestion/examples/library/container_create_database.py show_path_as_comment }}
```

</details>
### Create a Schema Container with Parent
<details>
<summary>Python SDK: Create a schema container with parent database</summary>

```python
{{ inline /metadata-ingestion/examples/library/container_create_schema.py show_path_as_comment }}
```

</details>
### Add Metadata to a Container
<details>
<summary>Python SDK: Add tags, terms, and ownership to a container</summary>

```python
{{ inline /metadata-ingestion/examples/library/container_add_metadata.py show_path_as_comment }}
```

</details>
### Query Container via REST API
Containers can be retrieved using the standard entity retrieval APIs:
<details>
<summary>Fetch container entity including all aspects</summary>

```bash
curl 'http://localhost:8080/entities/urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84'
```

The response will include all aspects associated with the container, including properties, ownership, tags, terms, etc.

</details>
To find all entities within a container, use the relationships API:
<details>
<summary>Find all entities contained within a container</summary>

```bash
curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84&types=IsPartOf'
```

This returns all entities (datasets, charts, dashboards, sub-containers) that have this container as their parent.

</details>
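
Both requests require the container URN to be percent-encoded. As a small illustration, the sketch below builds the same two URLs with Python's standard library instead of encoding them by hand; the `GMS_BASE` constant simply reuses the `localhost:8080` host from the curl commands and would need to be adjusted for a real deployment.

```python
from urllib.parse import quote, urlencode

GMS_BASE = "http://localhost:8080"  # same host as the curl examples
container_urn = "urn:li:container:b5e95fce839e7d78151ed7e0a7420d84"

# Path segment: encode every reserved character, including ":".
entity_url = f"{GMS_BASE}/entities/{quote(container_urn, safe='')}"

# Query string: urlencode percent-encodes the URN value for us.
relationships_url = f"{GMS_BASE}/relationships?" + urlencode(
    {"direction": "INCOMING", "urn": container_urn, "types": "IsPartOf"}
)

print(entity_url)
print(relationships_url)
```

The resulting strings can be passed to any HTTP client, together with whatever authentication the deployment requires.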
## Integration Points
### Relationship with Datasets
Datasets are the most common entities contained within containers. The relationship is established through the `container` aspect on the dataset, which points to the container URN.
```python
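from datahub.emitter.mcp_builder import SchemaKey
from datahub.sdk import Dataset

# Key identifying the parent schema container
schema_key = SchemaKey(platform="snowflake", database="analytics_db", schema="reporting")

# Dataset links to its parent container (schema)
dataset = Dataset(
    platform="snowflake",
    name="analytics_db.reporting.sales_table",
    env="PROD",
    parent_container=schema_key,  # Links to schema container
)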
```
### Hierarchical Navigation
Containers enable hierarchical navigation in the DataHub UI through parent-child relationships:
1. **Top-down browsing**: Users can navigate from databases to schemas to tables
2. **Bottom-up breadcrumbs**: Datasets show their parent containers in breadcrumb trails
3. **Browse paths**: Containers are used to generate browse paths automatically
### GraphQL Resolvers
The container entity has specialized GraphQL resolvers:
- **ContainerEntitiesResolver**: Retrieves all entities (datasets, charts, dashboards, sub-containers) within a container
- **ParentContainersResolver**: Retrieves the full hierarchy of parent containers for any entity
These resolvers power the UI's hierarchical navigation and container overview pages.
### Common Usage Patterns
1. **Database/Schema Hierarchy**: Relational databases use Database and Schema containers
2. **Project/Dataset Hierarchy**: BigQuery uses Project and Dataset containers
3. **Workspace/Folder Hierarchy**: BI tools use Workspace containers, optionally with nested Folder containers, for organization
4. **Bucket/Folder Hierarchy**: Data lakes use Bucket and Folder containers
5. **Catalog/Schema Hierarchy**: Modern catalogs (Unity, Iceberg) use Catalog and Schema containers
## Notable Exceptions
### GUID Stability
Container GUIDs must remain stable across ingestion runs. Since containers are identified by GUID rather than explicit properties in the URN, changing the GUID computation will create a new container entity instead of updating the existing one.
When creating custom containers, ensure that the properties used to generate the GUID are:
- Stable across time
- Unique within the platform
- Derived from immutable source system identifiers
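
The effect of an unstable key can be shown with a small sketch. It uses a hypothetical stand-in for the GUID function (MD5 over sorted-key JSON, an assumption rather than the documented algorithm) and a made-up `owner` property to represent a value that changes over time.

```python
import hashlib
import json


def sketch_guid(props: dict) -> str:
    """Illustrative stand-in for a GUID function: MD5 of sorted-key JSON."""
    payload = json.dumps(props, separators=(",", ":"), sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


stable_key = {"platform": "snowflake", "database": "analytics"}

# Same identifying properties on every run -> same GUID, same container.
run_1 = sketch_guid(stable_key)
run_2 = sketch_guid(dict(stable_key))

# Adding a mutable attribute (here, a hypothetical owner field) changes the
# hash, so the next ingestion run would produce a different container URN.
unstable_key = {**stable_key, "owner": "data-team"}
run_3 = sketch_guid(unstable_key)

print(run_1 == run_2, run_1 == run_3)
# prints "True False"
```

In practice this means that descriptive or mutable attributes belong in aspects such as `containerProperties`, not in the dictionary used to compute the GUID.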
### Self-Referential Containers
While containers can contain other containers, be careful not to create circular references. The parent-child relationship should form a directed acyclic graph (DAG), not a cycle.
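
A simple guard against circular references is to follow parent links from a container and check whether any node is visited twice. The sketch below works on a plain child-to-parent dictionary; the `has_cycle` helper and the sample mappings are hypothetical and are not part of any SDK.

```python
from typing import Dict, Optional


def has_cycle(start: str, parent_of: Dict[str, Optional[str]]) -> bool:
    """Return True if following parent links from `start` revisits a node."""
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


valid = {"schema": "database", "database": None}
circular = {"folder_a": "folder_b", "folder_b": "folder_a"}

print(has_cycle("schema", valid))       # prints "False"
print(has_cycle("folder_a", circular))  # prints "True"
```

Running a check of this kind before emitting parent links in a custom source is one way to catch a misconfigured hierarchy early.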
### Environment Handling
The `env` field in `ContainerKey` has special handling for backwards compatibility. In some sources, the platform instance was incorrectly set to the environment value. The `backcompat_env_as_instance` flag handles this case.
When using the `env` field:
- Set it to a valid `FabricType` (PROD, DEV, QA, etc.)
- Don't use it for platform instance identification
- Use the separate `instance` field for multi-instance deployments
### Platform Instance Association
Unlike datasets, which embed the platform instance in their URN, containers are associated with a platform instance through the `dataPlatformInstance` aspect. This allows a container to be tied to a specific instance of a data platform while maintaining a stable GUID.
### Access Control
Containers support the `access` aspect, which can be used to model access control policies at the container level. This is particularly useful for:
- Database-level permissions
- Schema-level access control
- Project-level authorization
- Workspace-level security
Access controls set on containers can be inherited by contained entities, though this behavior depends on the specific platform's implementation.