# Container The container entity is a core entity in the metadata model that represents a grouping of related data assets. Containers provide hierarchical organization for datasets, charts, dashboards, and other containers, enabling navigation and structure discovery within data platforms. ## Identity Containers are uniquely identified by a GUID (Globally Unique Identifier) that is typically derived from a combination of attributes specific to the container type. Unlike datasets which use platform, name, and environment, containers use a more flexible identification scheme based on their hierarchical properties. The URN structure for a container is: `urn:li:container:{guid}` The GUID is typically computed from container-specific properties such as: - **Database containers**: platform + instance + database name - **Schema containers**: platform + instance + database + schema name - **Project containers**: platform + instance + project_id - **Folder containers**: platform + instance + folder_abs_path - **Bucket containers**: platform + instance + bucket_name ### URN Examples ``` urn:li:container:b5e95fce839e7d78151ed7e0a7420d84 ``` The GUID is generated using the `datahub_guid()` function from a dictionary of properties. For example, a Snowflake schema container would be identified by: ```python { "platform": "snowflake", "instance": "prod_instance", "database": "analytics", "schema": "reporting" } ``` ## Real-World Concepts Containers represent various hierarchical structures in data platforms: - **Databases**: Top-level organizational units in relational systems (MySQL, PostgreSQL, Snowflake) - **Schemas**: Logical groupings within databases (Snowflake schemas, PostgreSQL schemas) - **Projects**: Organizational units in cloud platforms (BigQuery projects) - **Datasets**: Logical groupings in cloud platforms (BigQuery datasets) - **Folders**: Directory structures in file systems and data lakes (S3 folders, ADLS directories) - **Buckets**: Top-level storage containers in cloud object stores (S3 buckets, GCS buckets) - **Workspaces**: Organizational units in BI platforms (Power BI workspaces, Tableau sites) - **Catalogs**: Top-level organizational units in data catalogs (Unity Catalog, Iceberg catalogs) - **Metastores**: Storage metadata repositories (Hive metastore, Unity metastore) ## Important Capabilities ### Container Properties The `containerProperties` aspect contains metadata inherited from the source system: - **name**: Display name of the container (required) - **qualifiedName**: Fully-qualified name (optional, e.g., "prod.analytics.reporting") - **description**: Description from the source system - **env**: Environment indicator (PROD, DEV, QA, etc.) - **customProperties**: Additional key-value properties from the source system - **externalUrl**: Link to the container in the source system - **created**: Timestamp when the container was created in the source system - **lastModified**: Timestamp when the container was last modified in the source system ### Editable Container Properties The `editableContainerProperties` aspect allows users to override or add information via the UI: - **description**: User-provided description that supplements or overrides the source system description This separation ensures that metadata from source systems doesn't conflict with user-provided annotations. ### Hierarchical Relationships Containers support nested hierarchies through the `container` aspect, which links a container to its parent container. This enables multi-level organizational structures: ``` Platform (implicit) └── Database Container └── Schema Container └── Dataset ``` For example, in Snowflake: ``` Snowflake Platform └── ANALYTICS_DB (Database Container) └── REPORTING (Schema Container) └── SALES_METRICS (Dataset) └── REVENUE_TABLE (Dataset) ``` ### Subtypes The `subTypes` aspect specifies the type of container, which helps the UI render appropriate icons and behaviors. Common subtypes include: - **Database**: Relational database containers - **Schema**: Schema-level containers within databases - **Project**: Cloud project containers (GCP, Azure) - **Dataset**: BigQuery dataset containers - **Folder**: File system folders - **Bucket**: Object storage buckets - **Workspace**: BI platform workspaces - **Catalog**: Data catalog containers - **Metastore**: Metadata storage containers - **MLflow Experiment** (`MLAssetSubTypes.MLFLOW_EXPERIMENT`): ML experiment containers that organize training runs ### ML Experiments as Containers Machine learning experiments are modeled as containers with the `MLFLOW_EXPERIMENT` subtype. This pattern enables organizing related training runs (which are `dataProcessInstance` entities) into logical groups for comparison and tracking: ``` ML Experiment (Container) ├── Training Run 1 (DataProcessInstance) ├── Training Run 2 (DataProcessInstance) └── Training Run 3 (DataProcessInstance) ``` Training runs belong to experiments through the `container` aspect. This structure mirrors common ML platform patterns (like MLflow) and enables: - Comparing metrics across multiple training attempts - Tracking the evolution of a model through iterations - Organizing training work by project or objective For more information on ML experiments and training runs, see: - [ML Model entity documentation](mlModel.md#training-runs-and-experiments) - [DataProcessInstance documentation for training runs](dataProcessInstance.md#tracking-ml-training-run-in-a-container) ### Containable Entities The following entity types can be contained within a container: - Datasets - Charts - Dashboards - DataProcessInstances (e.g., training runs in ML experiments) - Other Containers (for nested hierarchies) ## Code Examples ### Create a Database Container
Python SDK: Create a database container ```python {{ inline /metadata-ingestion/examples/library/container_create_database.py show_path_as_comment }} ```
### Create a Schema Container with Parent
Python SDK: Create a schema container with parent database ```python {{ inline /metadata-ingestion/examples/library/container_create_schema.py show_path_as_comment }} ```
### Add Metadata to a Container
Python SDK: Add tags, terms, and ownership to a container ```python {{ inline /metadata-ingestion/examples/library/container_add_metadata.py show_path_as_comment }} ```
### Query Container via REST API Containers can be retrieved using the standard entity retrieval APIs:
Fetch container entity including all aspects ```bash curl 'http://localhost:8080/entities/urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84' ``` The response will include all aspects associated with the container, including properties, ownership, tags, terms, etc.
To find all entities within a container, use the relationships API:
Find all entities contained within a container ```bash curl 'http://localhost:8080/relationships?direction=INCOMING&urn=urn%3Ali%3Acontainer%3Ab5e95fce839e7d78151ed7e0a7420d84&types=IsPartOf' ``` This returns all entities (datasets, charts, dashboards, sub-containers) that have this container as their parent.
## Integration Points ### Relationship with Datasets Datasets are the most common entities contained within containers. The relationship is established through the `container` aspect on the dataset, which points to the container URN. ```python # Dataset links to its parent container (schema) dataset = Dataset( platform="snowflake", name="analytics_db.reporting.sales_table", env="PROD", parent_container=schema_key, # Links to schema container ) ``` ### Hierarchical Navigation Containers enable hierarchical navigation in the DataHub UI through parent-child relationships: 1. **Top-down browsing**: Users can navigate from databases to schemas to tables 2. **Bottom-up breadcrumbs**: Datasets show their parent containers in breadcrumb trails 3. **Browse paths**: Containers are used to generate browse paths automatically ### GraphQL Resolvers The container entity has specialized GraphQL resolvers: - **ContainerEntitiesResolver**: Retrieves all entities (datasets, charts, dashboards, sub-containers) within a container - **ParentContainersResolver**: Retrieves the full hierarchy of parent containers for any entity These resolvers power the UI's hierarchical navigation and container overview pages. ### Common Usage Patterns 1. **Database/Schema Hierarchy**: Relational databases use Database and Schema containers 2. **Project/Dataset Hierarchy**: BigQuery uses Project and Dataset containers 3. **Workspace/Folder Hierarchy**: BI tools use Workspace containers for organization 4. **Bucket/Folder Hierarchy**: Data lakes use Bucket and Folder containers 5. **Catalog/Schema Hierarchy**: Modern catalogs (Unity, Iceberg) use Catalog and Schema containers ## Notable Exceptions ### GUID Stability Container GUIDs must remain stable across ingestion runs. Since containers are identified by GUID rather than explicit properties in the URN, changing the GUID computation will create a new container entity instead of updating the existing one. When creating custom containers, ensure that the properties used to generate the GUID are: - Stable across time - Unique within the platform - Derived from immutable source system identifiers ### Self-Referential Containers While containers can contain other containers, be careful not to create circular references. The parent-child relationship should form a directed acyclic graph (DAG), not a cycle. ### Environment Handling The `env` field in ContainerKey has special handling for backwards compatibility. In some sources, the platform instance was incorrectly set to the environment value. The `backcompat_env_as_instance` flag handles this case. When using the `env` field: - Set it to a valid FabricType (PROD, DEV, QA, etc.) - Don't use it for platform instance identification - Use the separate `instance` field for multi-instance deployments ### Platform Instance Association Unlike datasets which embed platform instance in their URN, containers associate platform instances through the `dataPlatformInstance` aspect. This allows containers to be associated with specific instances of a data platform while maintaining a stable GUID. ### Access Control Containers support the `access` aspect, which can be used to model access control policies at the container level. This is particularly useful for: - Database-level permissions - Schema-level access control - Project-level authorization - Workspace-level security Access controls set on containers can be inherited by contained entities, though this behavior depends on the specific platform's implementation.