datahub/docs/how/search-over-new-field.md

272 lines
7.8 KiB
Markdown
Raw Normal View History

# Onboarding to GMA Search - searching over a new field
If you need to onboard a new entity to search, refer to [How to onboard to GMA Search](./search-onboarding.md).
For this exercise, we'll add a new field to an existing aspect of corp users and search over this field. Your use case might require searching over an existing field of an aspect or create a brand new aspect and search over it's field(s). For such use cases, similar steps should be followed.
## 1. Add field to aspect (skip this step if the field already exists in an aspect)
For this example, we will add new field `courses` to [CorpUserEditableInfo](../../metadata-models/src/main/pegasus/com/linkedin/identity/CorpUserEditableInfo.pdl) which is an aspect of corp user entity.
```
namespace com.linkedin.identity
/**
* Linkedin corp user information that can be edited from UI
*/
@Aspect.EntityUrns = [ "com.linkedin.common.CorpuserUrn" ]
record CorpUserEditableInfo {
...
/**
* Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
*/
courses: array[string] = [ ]
}
```
## 2. Add field to search document model
For this example, we will add field `courses` to [CorpUserInfoDocument.pdl](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/CorpUserInfoDocument.pdl) which is the search document model for corp user entity.
```
namespace com.linkedin.metadata.search
/**
* Data model for CorpUserInfo entity search
*/
record CorpUserInfoDocument includes BaseDocument {
...
/**
* Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
*/
courses: optional array[string]
}
```
## 3. Modify the mapping of search index
Now, we will modify the mapping of corp user search index. Use the following Elasticsearch command to add new field to an existing index.
```json
2020-08-27 03:16:42 -07:00
curl http://localhost:9200/corpuserinfodocument/doc/_mapping? --data '
{
"properties": {
"courses": {
"type": "text
}
}
}'
```
2020-09-17 16:22:38 -07:00
If this field needs to be a facet i.e. you want to enable sorting, aggregations on this field or use it in scripts, then your mapping may be different depending on the type of field. For **text** fields you will need to enable *fielddata* (disabled by default), as shown below
```json
curl http://localhost:9200/corpuserinfodocument/doc/_mapping? --data '
{
"properties": {
"courses": {
"type": "text,
"fielddata": true
}
}
}'
```
However *fielddata* enablement could consume significant heap space. If possible, use unanalyzed **keyword** field as a facet. For the current example, you could either choose keyword type for the field *courses* or create a subfield of type keyword under *courses* and use the same for sorting, aggregations, etc (second approach described below)
```json
curl http://localhost:9200/corpuserinfodocument/doc/_mapping? --data '
{
"properties": {
"courses": {
"type": "text,
"fields": {
"subfield": {
"type": "keyword"
}
}
}
}
}'
```
## 4. Modify index config, so that the new mapping is picked up next time
If you want corp user search index to contain this new field `courses` next time docker containers are brought up, we need to add this field to [corpuser-index-config.json](../../docker/elasticsearch-setup/corpuser-index-config.json).
```
{
"settings": {
"index": {
"analysis": {
...
}
}
},
"mappings": {
"doc": {
"properties": {
...
"courses": {
"type": "text"
}
}
}
}
}
```
Choose your analyzer wisely. For this example, we store the field `courses` as an array of string and hence use `text` data type. Default analyzer is `standard` and it provides grammar based tokenization.
## 5. Update the index builder logic
Index builder is where the logic to transform an aspect to search document model is defined. For this example, we will add the logic in [CorpUserInfoIndexBuilder](../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/CorpUserInfoIndexBuilder.java).
```java
package com.linkedin.metadata.builders.search;
@Slf4j
public class CorpUserInfoIndexBuilder extends BaseIndexBuilder<CorpUserInfoDocument> {
public CorpUserInfoIndexBuilder() {
super(Collections.singletonList(CorpUserSnapshot.class), CorpUserInfoDocument.class);
}
...
@Nonnull
private CorpUserInfoDocument getDocumentToUpdateFromAspect(@Nonnull CorpuserUrn urn,
@Nonnull CorpUserEditableInfo corpUserEditableInfo) {
final String aboutMe = corpUserEditableInfo.getAboutMe() == null ? "" : corpUserEditableInfo.getAboutMe();
return new CorpUserInfoDocument()
.setUrn(urn)
.setAboutMe(aboutMe)
.setTeams(corpUserEditableInfo.getTeams())
.setSkills(corpUserEditableInfo.getSkills())
.setCourses(corpUserEditableInfo.getCourses());
}
...
}
```
## 6: Update search query template, to start searching over the new field
For this example, we will modify [corpUserESSearchQueryTemplate.json](../../gms/impl/src/main/resources/corpUserESSearchQueryTemplate.json) to start searching over the field `courses`. Here is an example.
```json
{
"function_score": {
"query": {
"query_string": {
"query": "$INPUT",
"fields": [
"fullName^4",
"ldap^2",
"managerLdap",
"skills",
"courses"
"teams",
"title"
],
"default_operator": "and",
"analyzer": "standard"
}
},
"functions": [
{
"filter": {
"term": {
"active": true
}
},
"weight": 2
}
],
"score_mode": "multiply"
}
}
```
As you can see in the above query template, corp user search is performed across multiple fields, to which the field `courses` has been added.
## 7: Test your changes
Make sure relevant docker containers are rebuilt before testing the changes.
If this is a new field that has been added to an existing snapshot, then you can test by ingesting data that contains this new field. Here is an example of ingesting to `/corpUsers` endpoint, with the new field `courses`.
```
curl 'http://localhost:8080/corpUsers?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '
{
"snapshot": {
"aspects": [
{
"com.linkedin.identity.CorpUserEditableInfo": {
"courses": [
"Docker for Data Scientists",
"AI100: Introduction to Artificial Intelligence"
],
"skills": [
],
"pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
"teams": [
]
}
}
],
"urn": "urn:li:corpuser:datahub"
}
}'
```
Once the ingestion is done, you can test your changes by issuing search queries. Here is an example query with response.
```
curl "http://localhost:8080/corpUsers?q=search&input=ai200" -H 'X-RestLi-Protocol-Version: 2.0.0' -s | jq
Response:
{
"metadata": {
"urns": [
"urn:li:corpuser:datahub"
],
"searchResultMetadatas": [
]
},
"elements": [
{
"editableInfo": {
"skills": [
],
"courses": [
"Docker for Data Scientists",
"AI100: Introduction to Artificial Intelligence"
],
"pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
"teams": [
]
},
"username": "datahub",
"info": {
"active": true,
"fullName": "Data Hub",
"title": "CEO",
"displayName": "Data Hub",
"email": "datahub@linkedin.com"
}
}
],
"paging": {
"count": 10,
"start": 0,
"total": 1,
"links": [
]
}
}
```