An important question that will arise once you've decided to extend the metadata model is whether you need to fork the main repo or not. Use the diagram below to understand how to make this decision.
The green lines represent pathways that create less friction when maintaining your code over the long term. The red lines represent a higher risk of conflicts in the future. We are working hard to move the majority of model extension use-cases to no-code / low-code pathways to ensure that you can extend the core metadata model without having to maintain a custom fork of DataHub.
We will refer to the two options as the **open-source fork** and **custom repository** approaches in the rest of the document below.
## This Guide
This guide will outline what the experience of adding a new Entity should look like through a real example of adding the
Dashboard Entity. If you want to extend an existing Entity, you can skip directly to [Step 3](#step-3-define-custom-aspects-or-attach-existing-aspects-to-your-entity).
A key represents the fields that uniquely identify the entity. For those familiar with DataHub’s legacy architecture,
these fields were previously part of the Urn Java Class that was defined for each entity.
This struct will be used to generate a serialized string key, represented by an Urn. Each field in the key struct will
be converted into a single part of the Urn's tuple, in the order they are defined.
Let’s define a Key aspect for our new Dashboard entity.
```
namespace com.linkedin.metadata.key

/**
 * Key for a Dashboard
 */
@Aspect = {
  "name": "dashboardKey",
}
record DashboardKey {
  /**
   * The name of the dashboard tool such as looker, redash etc.
   */
  @Searchable = {
    ...
  }
  dashboardTool: string

  /**
   * Unique id for the dashboard. This id should be globally unique for a dashboarding tool even when there are multiple deployments of it. As an example, dashboard URL could be used here for Looker such as 'looker.linkedin.com/dashboards/1234'
   */
  dashboardId: string
}
```
The Urn representation of the Key shown above would be:
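
As a rough sketch of that mapping (not DataHub's own implementation), the Python below joins the key fields in declaration order into a URN string. The entity name `dashboard`, the helper `key_to_urn`, and the sample values are assumptions made for illustration; the actual entity name comes from your entity registry.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DashboardKey:
    # Field order mirrors the order of the fields in the key record.
    dashboardTool: str
    dashboardId: str


def key_to_urn(entity_name: str, key) -> str:
    """Join the key's fields, in declaration order, into a URN string."""
    parts = [str(getattr(key, f.name)) for f in fields(key)]
    # Assumed convention: a single-part key is written bare,
    # a multi-part key is written as a parenthesized tuple.
    body = parts[0] if len(parts) == 1 else "(" + ",".join(parts) + ")"
    return f"urn:li:{entity_name}:{body}"


urn = key_to_urn("dashboard", DashboardKey("looker", "1234"))
print(urn)  # urn:li:dashboard:(looker,1234)
```

Reordering the fields in the key record would change the order of the tuple parts, which is why the field order in the key should be treated as part of the entity's identity.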
infra to use the fields in the key to create relationships and index fields for search. See [Step 3](#step-3-define-custom-aspects-or-attach-existing-aspects-to-your-entity) for more details on
Define the entity within an `entity-registry.yml` file. Depending on your approach, the location of this file may vary. More on that in steps [4](#step-4-choose-a-place-to-store-your-model-extension) and [5](#step-5-attaching-your-non-key-aspects-to-the-entity).
At the beginning of this document, we walked you through a flow-chart that should help you decide whether you need to maintain a fork of the open source DataHub repo for your model extensions, or whether you can just use a model extension repository that can stay independent of the DataHub repo. Depending on what path you took, the place you store your aspect model files (the .pdl files) and the entity-registry files (the yaml file called `entity-registry.yaml` or `entity-registry.yml`) will vary.
- Open source Fork: Aspect files go under [`metadata-models`](../../metadata-models) module in the main repo, entity registry goes into [`metadata-models/src/main/resources/entity-registry.yml`](../../metadata-models/src/main/resources/entity-registry.yml). Read on for more details in [Step 5](#step-5-attaching-your-non-key-aspects-to-the-entity).
- Custom repository: Read the [metadata-models-custom](../../metadata-models-custom/README.md) documentation to learn how to store and version your aspect models and registry.
Attaching non-key aspects to an entity can be done simply by adding them to the entity registry yaml file. The location of this file differs based on whether you are following the oss-fork path or the custom-repository path.
Previously, you were required to add all aspects for the entity into an Aspect union. You will see examples of this pattern throughout the code-base (e.g. `DatasetAspect`, `DashboardAspect` etc.). This is no longer required.
### <a name="step_6"></a>Step 6 (Oss-Fork approach): Re-build DataHub to have access to your new or updated entity
If you opted for the open-source fork approach, where you are editing models in the `metadata-models` module of DataHub, you will need to re-build the DataHub metadata service using the steps below. If you are following the custom model repository approach, you just need to build your custom model repository and deploy it to a running metadata service instance to read and write metadata using your new model extensions.
Then, re-deploy metadata-service (gms) and, if you are running them unbundled, mae-consumer and mce-consumer. See [docker development](../../docker/README.md) for details on how
to deploy during development. This will allow DataHub to read and write your new entity or extensions to existing entities, and to serve search and graph queries for that entity type.
If you want to use your custom models beyond your local machine without forking DataHub, then you can generate a custom model package that can be installed from other places.
To enable others to use it, share the file at `custom-package/my-company-datahub-models/dist/<wheelfile>.whl` and have them install it with `pip install <wheelfile>.whl`.
If you are extending an entity with additional aspects, and you can use the auto-render specifications to automatically render these aspects to your satisfaction, you do not need to write any custom code.
However, if you want to write specific code to render your model extensions, or if you introduced a whole new entity and want to give it its own page, you will need to write custom React and GraphQL code to view and mutate your entity in GraphQL or React. For
instructions on how to start extending the GraphQL graph, see [graphql docs](../../datahub-graphql-core/README.md). Once you’ve done that, you can follow the guide [here](../../datahub-web-react/README.md) to add your entity into the React UI.
on entity pages in a tab using a default renderer. **_This is currently only supported for Charts, Dashboards, DataFlows, DataJobs, Datasets, Domains, and GlossaryTerms_**.
- **renderSpec**: RenderSpec (optional) - config for autoRender aspects that controls how they are displayed. **_This is currently only supported for Charts, Dashboards, DataFlows, DataJobs, Datasets, Domains, and GlossaryTerms_**. Contains three fields:
:::note

If you are adding @Searchable to a field that already has data, you'll want to restore indices [via api](https://docs.datahub.com/docs/api/restli/restore-indices/) or [via upgrade step](https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/src/main/java/com/linkedin/metadata/boot/steps/RestoreGlossaryIndices.java) to have it be populated with existing data.

:::
- **fieldType**: string - The settings for how each field is indexed are defined by the field type. In general this defines how the field is indexed in the Elasticsearch document. **Note**: With the new search tier system, `fieldType` primarily determines the field's storage format and individual query capabilities. Fulltext search capabilities are now primarily handled by the common `_search.tier_{tier}` fields that consolidate fields from multiple aspects based on their tier assignments.
1. _KEYWORD_ - Short text fields that only support exact matches, often used only for filtering. **Default length limit**: 100 characters (tier fields), 255 characters (regular fields).
2. _TEXT_ - Text fields delimited by spaces/slashes/periods. Default field type for string variables. **Default length limit**: 100 characters (tier fields), 255 characters (regular fields).
6. _OBJECT_ - Each property in an object will become an extra column in Elasticsearch and can be referenced as
   `field.property` in queries. **Default limits**: Maximum 1000 object keys, maximum 4096 characters per value. You should be careful to not use it on objects with many properties as it can cause a mapping explosion in Elasticsearch.
8. _MAP_ARRAY_ - Array fields that are stored as maps in Elasticsearch. **Default limits**: Maximum 1000 array elements, maximum 4096 characters per value.
10. ~~_TEXT_PARTIAL_~~ - **DEPRECATED**: Text fields with partial matching support. This field type is expensive and should not be applied to fields with long values. Use TEXT instead.
11. ~~_WORD_GRAM_~~ - **DEPRECATED**: Text fields with word gram support. This field type is expensive and should not be applied to fields with long values. Use TEXT instead.
12. ~~_BROWSE_PATH_~~ - **DEPRECATED**: Field type for browse paths. Browse paths are handled by name, use `browsePathV2` field name. There can only be one for a given entity.
13. ~~_URN_~~ - **DEPRECATED**: Urn fields where each sub-component is indexed. Use KEYWORD instead.
14. ~~_URN_PARTIAL_~~ - **DEPRECATED**: Urn fields with partial matching support. Use KEYWORD instead.
**⚠️ Important Length Limitations:**
- **Tier Fields**: Fields with `searchTier` are automatically limited to **100 characters** to optimize search performance
- **Regular Fields**: Fields without `searchTier` are limited to **255 characters** for Elasticsearch compatibility
- **Object Fields**: Maximum **1000 object keys** and **4096 characters per value** to prevent mapping explosion
- **Array Fields**: Maximum **1000 array elements** and **4096 characters per value**
- **Field Names**: Maximum **255 characters** for Elasticsearch field name compatibility
**Configuration Overrides:**
- **Environment Variables**: Some limits can be configured via environment variables:
  - `SEARCH_DOCUMENT_MAX_VALUE_LENGTH`: Override default 4096 character limit for object/array values
  - `SEARCH_DOCUMENT_MAX_ARRAY_LENGTH`: Override default 1000 element limit for arrays
  - `SEARCH_DOCUMENT_MAX_OBJECT_KEYS`: Override default 1000 key limit for objects
- **Special Fields**: Some system fields have different limits:
  - **URN fields**: Automatically set to **512 characters** (`ignore_above: 512`)
  - **Tier fields**: Hard-coded to **100 characters** for performance optimization
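
To make the override behaviour concrete, here is a minimal Python sketch of resolving the three limits from the environment with the defaults listed above. The variable names and defaults come from the list; the helper `resolve_limits` and the integer parsing are assumptions, and DataHub's own parsing may differ.

```python
import os

# Defaults as listed above. The parsing below is an illustrative assumption.
DEFAULTS = {
    "SEARCH_DOCUMENT_MAX_VALUE_LENGTH": 4096,
    "SEARCH_DOCUMENT_MAX_ARRAY_LENGTH": 1000,
    "SEARCH_DOCUMENT_MAX_OBJECT_KEYS": 1000,
}


def resolve_limits(env=None):
    """Return the effective limits, using the default when a variable is unset."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


# Passing a dict instead of os.environ keeps the example deterministic.
limits = resolve_limits({"SEARCH_DOCUMENT_MAX_ARRAY_LENGTH": "250"})
print(limits["SEARCH_DOCUMENT_MAX_ARRAY_LENGTH"])  # 250
print(limits["SEARCH_DOCUMENT_MAX_VALUE_LENGTH"])  # 4096
```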
**Note**: The `ignore_above` settings are automatically applied by the system. While some limits can be configured via environment variables, the tier field limits (100 characters) and regular field limits (255 characters) are hard-coded and cannot be overridden through annotations or configuration.
**Important**: The ability to have longer keyword fields is limited to system-level configurations and special field types. Regular user-defined fields will always be subject to the default limits for performance and compatibility reasons.
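
The practical effect of these limits can be sketched as follows. This assumes the limits are enforced through Elasticsearch's `ignore_above` setting, under which a string longer than the limit is skipped at index time rather than truncated. The `kind` labels and the helper `is_indexed` are hypothetical names used only for this illustration.

```python
# Limits taken from the lists above, keyed by a hypothetical "kind" label.
IGNORE_ABOVE = {
    "tier": 100,     # fields with searchTier
    "regular": 255,  # fields without searchTier
    "urn": 512,      # URN system fields
}


def is_indexed(value: str, kind: str) -> bool:
    """True if a keyword value would be indexed under the assumed ignore_above limit."""
    # ignore_above does not truncate: a value longer than the limit is simply not indexed.
    return len(value) <= IGNORE_ABOVE[kind]


long_title = "x" * 150
print(is_indexed(long_title, "tier"))     # False
print(is_indexed(long_title, "regular"))  # True
```

A value of 150 characters would therefore be searchable through a regular field but would not contribute to a tier field.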
- **queryByDefault**: boolean (optional) - **⚠️ DEPRECATED**: Whether we should match the field for the default search query. True by
default for text and urn fields. **Use `searchTier` instead for better search organization and performance.**
- **enableAutocomplete**: boolean (optional) - **⚠️ DEPRECATED**: Whether we should use the field for autocomplete. Defaults to false. **Use `searchTier: 1` instead, since a field worth autocompleting is typically highly relevant to search.**
- **weightsPerFieldValue**: map[object, double] (optional) - **⚠️ DEPRECATED**: Weights to apply to score for a given value. **Use `searchLabel` with `@SearchScore` annotations instead for value-based scoring.**
- **fieldNameAliases**: array[string] (optional) - Aliases for this field that can be used for sorting and other operations. These aliases are created with the aspect name prefix (e.g., `metadata.aliasName`) and provide alternative names for accessing the same field data. Useful for creating multiple access paths to the same field.
- **includeSystemModifiedAt**: boolean (optional) - **⚠️ DEPRECATED**: Whether to include a system-modified timestamp field for this searchable field. **This will be handled programmatically for all aspects in future versions.**
- **systemModifiedAtFieldName**: string (optional) - **⚠️ DEPRECATED**: Custom name for the system-modified timestamp field. **This will be handled programmatically for all aspects in future versions.**
- **includeQueryEmptyAggregation**: boolean (optional) - Whether to create a missing field aggregation when querying the corresponding field. Only affects query time, not mapping. Useful for analytics and reporting.
- **searchTier**: integer (optional) - Search tier for the field (integer value >= 1). Creates a copy_to field that copies the field value to `_search.tier_{tier}`. Fields with searchTier are automatically set to `index: false` unless `searchIndexed` is true. **Note**: searchTier can only be used with KEYWORD or TEXT field types.
- **searchLabel**: string (optional) - Unified label for search operations. Creates a copy_to field that copies the field value to `_search.{label}` (without prefixes). Replaces the previous `sortLabel` and `boostLabel` annotations. Fields with searchLabel are automatically set to `index: false`.
- **searchIndexed**: boolean (optional) - When combined with `searchTier`, determines whether the field is indexed outside of `_search` for direct access. The field will be indexed using its actual field type (KEYWORD or TEXT), not forced to KEYWORD. **Note**: searchIndexed can only be true when searchTier is specified and can only be used with KEYWORD or TEXT field types. Defaults to false.
- **entityFieldName**: string (optional) - If set, this field will be copied to `_search.{entityFieldName}` and the root alias will point there. This allows multiple aspects to consolidate into a single entity-level field.
- **eagerGlobalOrdinals**: boolean (optional) - Whether to set `eager_global_ordinals` to true for this field. This improves aggregation performance for frequently aggregated keyword fields by pre-building ordinals at index time. **Note**: eagerGlobalOrdinals can only be true for KEYWORD, URN, or URN_PARTIAL field types. Defaults to false.
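
The constraints stated in the notes above can be checked mechanically. The following Python sketch is a hypothetical linter, not part of DataHub: it takes a `@Searchable` annotation as a plain dict and reports violations of the three documented rules. Treating a missing `fieldType` as `TEXT` is an assumption that only holds for string fields.

```python
TIER_TYPES = {"KEYWORD", "TEXT"}
ORDINAL_TYPES = {"KEYWORD", "URN", "URN_PARTIAL"}


def validate_searchable(annotation: dict) -> list:
    """Return a list of rule violations for a @Searchable annotation given as a dict."""
    errors = []
    # Assumption: a missing fieldType means TEXT, the documented default for strings.
    field_type = annotation.get("fieldType", "TEXT")
    tier = annotation.get("searchTier")

    if tier is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(tier, bool) or not isinstance(tier, int) or tier < 1:
            errors.append("searchTier must be an integer >= 1")
        if field_type not in TIER_TYPES:
            errors.append("searchTier requires fieldType KEYWORD or TEXT")

    if annotation.get("searchIndexed", False):
        if tier is None:
            errors.append("searchIndexed requires searchTier")
        if field_type not in TIER_TYPES:
            errors.append("searchIndexed requires fieldType KEYWORD or TEXT")

    if annotation.get("eagerGlobalOrdinals", False) and field_type not in ORDINAL_TYPES:
        errors.append("eagerGlobalOrdinals requires fieldType KEYWORD, URN, or URN_PARTIAL")

    return errors


print(validate_searchable({"fieldType": "TEXT", "searchTier": 1}))
print(validate_searchable({"fieldType": "COUNT", "searchTier": 1}))
```

The first call prints an empty list; the second reports that `searchTier` cannot be combined with a `COUNT` field.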
**⚠️ Note on deprecated parameters:** Some parameters like `queryByDefault`, `enableAutocomplete`, `boostScore`, and `weightsPerFieldValue` are still functional when using search version 2 but will be replaced by newer features in future versions. Consider using the new tier-based and label-based annotations for more advanced search functionality.
**Migration from deprecated parameters:**
- **`queryByDefault`** → Use `searchTier: 1` through `searchTier: 4` to include fields in default search queries
- **`enableAutocomplete`** → Use `searchTier: 1`
- **`boostScore`** → Use `searchLabel` for more sophisticated ranking control
- **`weightsPerFieldValue`** → Use `searchLabel` for value-based scoring control
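
If you have many annotations to review, the mapping above can be turned into a small audit helper. The Python below is a sketch under that assumption: `deprecation_hints` is a hypothetical function that scans an annotation (as a dict) and echoes the replacement suggested in the list. It does not pick a tier or label for you.

```python
# Replacement suggestions copied from the migration list above.
REPLACEMENTS = {
    "queryByDefault": "searchTier: 1 through searchTier: 4",
    "enableAutocomplete": "searchTier: 1",
    "boostScore": "searchLabel",
    "weightsPerFieldValue": "searchLabel",
}


def deprecation_hints(annotation: dict) -> list:
    """List one hint per deprecated parameter found in the annotation."""
    return [
        f"{name} is deprecated; use {REPLACEMENTS[name]}"
        for name in annotation
        if name in REPLACEMENTS
    ]


old_annotation = {
    "fieldType": "TEXT_PARTIAL",
    "queryByDefault": True,
    "enableAutocomplete": True,
    "boostScore": 10.0,
}
for hint in deprecation_hints(old_annotation):
    print(hint)
```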
This annotation tells DataHub to index the title field in Elasticsearch. `searchTier: 1` ensures this field is included in default search queries with high relevancy. `entityFieldName: "name"` consolidates this field into the entity-level `_search.name` field, allowing other aspects to contribute to the same consolidated field.
- **Priority field**: `fieldType: "COUNT"` with `searchLabel: "priority"` creates a numeric field that copies to `_search.priority` for proper numeric sorting operations, and `addToFilters: true` makes it available as a filter
- **Status field**: `addToFilters: true` makes it available as a filter, `filterNameOverride` provides a custom display name "Dashboard Status", and `eagerGlobalOrdinals: true` optimizes aggregation performance for this frequently filtered field
- **Owner field**: `fieldType: "URN"` with `eagerGlobalOrdinals: true` optimizes aggregation performance for owner-based filtering, and `searchLabel: "owner"` copies the field to `_search.owner` for ranking operations
Now, when DataHub ingests Dashboards, it will index the priority and status fields in Elasticsearch. The priority field will be available for sorting operations, and both fields will be available as filters in the UI.
**⚠️ DEPRECATED**: This annotation is deprecated and should not be used in new code. Use `searchLabel` with the new search tier system instead for ranking functionality.
#### Search Tier and Label System
The new search tier and label system provides a powerful way to organize search fields and create specialized search experiences:
**Search Tiers (`searchTier`):**
- Fields with `searchTier` are automatically copied to `_search.tier_{tier}` fields
- This creates a fundamental change in search architecture: **fulltext search capabilities are now determined by the tier, not the individual field type**
- All fields assigned to the same tier (e.g., `_search.tier_1`) are consolidated into a single searchable field, regardless of their individual `fieldType`
- This allows you to create tiered search experiences where different fields contribute to different search priorities
- Use `searchIndexed: true` if you need direct access to the field for filtering/sorting while maintaining tier functionality
**Search Labels (`searchLabel`):**
- Fields with `searchLabel` are copied to `_search.{label}` fields (without prefixes)
- Replaces the previous `sortLabel` and `boostLabel` annotations for a unified approach
- Useful for creating specialized search, sorting, and ranking operations across multiple aspects
- Automatically sets `index: false` to optimize storage
**Entity Field Consolidation (`entityFieldName`):**
- Allows multiple aspects to consolidate into a single entity-level field
- Useful for creating unified search experiences across different aspect types
- Fields are copied to `_search.{entityFieldName}` with root-level aliases
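
Taken together, the three rules above suggest how an annotation translates into `copy_to` targets and an `index` flag. The Python sketch below models only those two properties and makes no claim about the full mapping DataHub generates. The function name `search_mapping_fragment` is hypothetical, and because the text does not say how `searchIndexed` and `searchLabel` interact when both are set, this sketch lets `searchIndexed` win.

```python
def search_mapping_fragment(annotation: dict) -> dict:
    """Derive copy_to targets and the index flag implied by a @Searchable annotation."""
    copy_to = []
    tier = annotation.get("searchTier")
    label = annotation.get("searchLabel")
    entity_field = annotation.get("entityFieldName")

    if tier is not None:
        copy_to.append(f"_search.tier_{tier}")
    if label is not None:
        copy_to.append(f"_search.{label}")
    if entity_field is not None:
        copy_to.append(f"_search.{entity_field}")

    fragment = {}
    if copy_to:
        fragment["copy_to"] = copy_to
    if tier is not None or label is not None:
        # index: false by default; assumed to stay true only when searchIndexed
        # accompanies a searchTier.
        fragment["index"] = tier is not None and bool(annotation.get("searchIndexed", False))
    return fragment


print(search_mapping_fragment({"fieldType": "TEXT", "searchTier": 1, "entityFieldName": "name"}))
# {'copy_to': ['_search.tier_1', '_search.name'], 'index': False}
```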
**Benefits of the New System:**
1. **Organized Search Fields**: All search-related fields are grouped under `_search.*`
2. **Efficient Indexing**: Original fields are not indexed (`index: false`) but copied to search fields
3. **Easy Access**: Aliases provide convenient access to fields at the root level
4. **Flexible Querying**: Search queries can target specific tier, sort, or ranking fields
5. **Performance**: Optimized storage and query patterns for complex search scenarios
**Architectural Impact:**
- **Before**: Each field's `fieldType` determined its individual search capabilities and analyzers
- **After**: The `searchTier` determines fulltext search capabilities, while `fieldType` primarily affects storage format and individual field queries
- **Search Consolidation**: Fields from different aspects with the same tier are automatically consolidated into unified search fields
- **Simplified Search Logic**: Search queries can target entire tiers rather than individual fields, making complex search scenarios more manageable
#### Migration Guide for Deprecated Features
If you're currently using deprecated field types or parameters, here's how to migrate to the new system:
| Deprecated Feature | Replacement | Notes |
| --- | --- | --- |
| `includeSystemModifiedAt: true` | **Automatic** | System modification tracking is now handled automatically for all aspects |
| `systemModifiedAtFieldName: "customName"` | **Automatic** | System modification field names are now standardized automatically |
**Example Migration:**
```aidl
// Old deprecated approach
@Searchable = {
  "fieldType": "TEXT_PARTIAL",
  "queryByDefault": true,
  "enableAutocomplete": true,
  "boostScore": 10.0
}
title: string

// New recommended approach
@Searchable = {
  "fieldType": "TEXT",
  "searchTier": 1,
  "entityFieldName": "name"
}
title: string
```
**Tier Consolidation Example:**
```aidl
// Multiple aspects can now contribute to the same search tier
record DatasetInfo {
  @Searchable = {
    "fieldType": "KEYWORD",
    "searchTier": 1,
    "entityFieldName": "name"
  }
  name: string

  @Searchable = {
    "fieldType": "TEXT",
    "searchTier": 1,
    "entityFieldName": "description"
  }
  description: string
}

record ChartInfo {
  @Searchable = {
    "fieldType": "KEYWORD",
    "searchTier": 1,
    "entityFieldName": "name"
  }
  chartName: string
}
```
In this example, all three fields (`name`, `description`, `chartName`) are automatically consolidated into `_search.tier_1`. A single search query against `_search.tier_1:*` will search across all these fields simultaneously, regardless of their individual aspect and field locations. The `fieldType` now primarily determines how each field is stored and accessed individually, while the tier determines how it participates in fulltext search.
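
The consolidation can be simulated in a few lines of Python. This is a toy model rather than DataHub or Elasticsearch code: it copies annotated field values into a per-document `_search.tier_{tier}` list, in the way a `copy_to` target accumulates values, and uses a case-insensitive substring test as a stand-in for fulltext matching. The sample values and helper names are made up for illustration.

```python
from collections import defaultdict

# (aspect, field, searchTier) triples taken from the PDL example above.
ANNOTATED_FIELDS = [
    ("DatasetInfo", "name", 1),
    ("DatasetInfo", "description", 1),
    ("ChartInfo", "chartName", 1),
]


def build_tier_fields(aspects: dict) -> dict:
    """Collect annotated field values into per-tier lists for one document."""
    tiers = defaultdict(list)
    for aspect_name, field_name, tier in ANNOTATED_FIELDS:
        value = aspects.get(aspect_name, {}).get(field_name)
        if value is not None:
            tiers[f"_search.tier_{tier}"].append(value)
    return dict(tiers)


def matches(doc: dict, term: str, tier: int = 1) -> bool:
    """Substring match against a tier field; a simplification of fulltext search."""
    return any(term.lower() in value.lower() for value in doc.get(f"_search.tier_{tier}", []))


dataset_doc = build_tier_fields(
    {"DatasetInfo": {"name": "orders", "description": "Daily order facts"}}
)
chart_doc = build_tier_fields({"ChartInfo": {"chartName": "Orders by region"}})

print(dataset_doc)  # {'_search.tier_1': ['orders', 'Daily order facts']}
print(matches(chart_doc, "orders"))  # True
```

One query term against the tier field reaches the dataset's `name` and `description` as well as the chart's `chartName`, without the query needing to know which aspect each value came from.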
**Benefits of Migration:**
- Better search performance through optimized indexing
- More organized search field structure
- Enhanced query capabilities with tier-based targeting
- Future-proof annotations that won't be deprecated