datahub/docs/how/search-onboarding.md

85 lines
5.0 KiB
Markdown
Raw Normal View History

# How to onboard to GMA search?
2019-12-20 02:36:24 -08:00
## 1. Define search document model for the entity
Modeling is the most important and crucial part of your design.
[Search document] model contains a list of fields that need to be indexed along with the names and their data types.
Check [here][Search document] to learn more about search document model.
Please note that all fields in the search document model (except the `urn`) are `optional`.
This is because we want to support partial updates to search documents.
[Search document]: ../what/search-document.md
## 2. Create the search index, define its mappings and settings
A [mapping] is created using the information of search document model.
It defines how a document, and the fields it contains, are stored and indexed by various [tokenizer]s, [analyzer]s and data type for the fields.
For certain fields, sub-fields are created using different analyzers.
The analyzers are chosen depending on the needs for each field.
This is currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
Check index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for `dataset` search index.
## 3. Ingestion into search index
The actual indexing process for each [entity] is powered by [index builder]s.
The builders register the metadata [aspect]s of their interest against [MAE Consumer Job] and will be invoked whenever an [MAE] of same interest is received.
Index builders should be extended from [BaseIndexBuilder]. Check [DatasetIndexBuilder] as an example.
For the consumer job to consume those MAEs, you should add your index builder to the [index builder registry].
## 4. Search query configs
Once you have the [search index] built, it's ready to be queried!
The search query is constructed and executed through [Search DAO].
The raw search hits are retrieved and extracted using the base model.
Besides the regular full text search, run time aggregation and relevance are provided in the search queries as well.
[ESSearchDAO] is the implementation for the [BaseSearchDAO] for Elasticsearch.
It's still a generic class which can be used for a specific [entity] and configured using [BaseSearchConfig].
BaseSearchConfig is the abstraction for all query related configurations such as query templates, default field to execute autocomplete on etc.
```java
public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {
public abstract Set<String> getFacetFields();
public String getIndexName() {
return getSearchDocument().getSimpleName().toLowerCase();
}
public abstract Class<DOCUMENT> getSearchDocument();
public abstract String getDefaultAutocompleteField();
public abstract String getSearchQueryTemplate();
public abstract String getAutocompleteQueryTemplate();
}
```
[DatasetSearchConfig] is the implementation of search config for `dataset` entity.
## 5. Add search query endpoints to GMS
Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
[BaseSearchableEntityResource] provides an abstract implementation of search and autocomplete APIs.
Any top level rest.li resource implementation could extend it and this will automatically provide search and autocomplete endpoints.
Refer to [CorpUsers] rest.li resource implementation as an example.
[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
[curl]: https://en.wikipedia.org/wiki/CURL
[entity]: ../what/entity.md
[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
[aspect]: ../what/aspect.md
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
[mae]: ../what/mxe.md#metadata-audit-event-mae
[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
[search index]: ../what/search-index.md
[search dao]: ../architecture/metadata-serving.md#search-dao
[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java