
# Onboarding to GMA Search - searching over a new field

If you need to onboard a new entity to search, refer to [How to onboard to GMA Search](./search-onboarding.md).

For this exercise, we'll add a new field to an existing aspect of corp users and search over it. Your use case might instead require searching over an existing field of an aspect, or creating a brand new aspect and searching over its field(s). Similar steps apply in those cases.

## 1. Add field to aspect (skip this step if the field already exists in an aspect)
For this example, we will add a new field `courses` to [CorpUserEditableInfo](../../metadata-models/src/main/pegasus/com/linkedin/identity/CorpUserEditableInfo.pdl), which is an aspect of the corp user entity.
```
namespace com.linkedin.identity

/**
 * Linkedin corp user information that can be edited from UI
 */
@Aspect.EntityUrns = [ "com.linkedin.common.CorpuserUrn" ]
record CorpUserEditableInfo {

  ...
  
  /**
   * Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
   */
  courses: array[string] = [ ]
  
}
```

## 2. Add field to search document model
For this example, we will add the field `courses` to [CorpUserInfoDocument.pdl](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/CorpUserInfoDocument.pdl), which is the search document model for the corp user entity.

```
namespace com.linkedin.metadata.search

/**
 * Data model for CorpUserInfo entity search
 */
record CorpUserInfoDocument includes BaseDocument {

  ...

  /**
   * Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
   */
  courses: optional array[string]
  
}
```

## 3. Modify the mapping of search index
Now, we will modify the mapping of the corp user search index. Use the following Elasticsearch command to add the new field to an existing index.

```
curl 'http://localhost:9200/corpuserinfodocument/doc/_mapping' -H 'Content-Type: application/json' --data '
{
  "properties": {
    "courses": {
      "type": "text"
    }
  }
}'
```

If this field needs to be a facet, i.e. you want to sort or aggregate on it or use it in scripts, the mapping depends on the type of the field. For **text** fields you need to enable *fielddata* (disabled by default), as shown below.
```
curl 'http://localhost:9200/corpuserinfodocument/doc/_mapping' -H 'Content-Type: application/json' --data '
{
  "properties": {
    "courses": {
      "type": "text",
      "fielddata": true
    }
  }
}'
```

However, enabling *fielddata* can consume significant heap space. If possible, use an unanalyzed **keyword** field as the facet instead. For the current example, you could either choose the keyword type for the field *courses*, or create a subfield of type keyword under *courses* and use that subfield for sorting, aggregations, etc. The second approach is shown below; the subfield is then addressed as `courses.subfield`.
```
curl 'http://localhost:9200/corpuserinfodocument/doc/_mapping' -H 'Content-Type: application/json' --data '
{
  "properties": {
    "courses": {
      "type": "text",
      "fields": {
        "subfield": {
          "type": "keyword"
        }
      }
    }
  }
}'
```
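
To see why a keyword field usually makes the better facet, the sketch below contrasts the buckets a terms-style aggregation would produce over analyzed tokens (roughly what *fielddata* on a **text** field exposes) with the buckets over whole, unanalyzed values (what a **keyword** field exposes). It is plain Java with no Elasticsearch dependency. The class `FacetBucketsSketch` and its tokenizer are assumptions made for illustration and only approximate the `standard` analyzer.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: approximates how facet buckets differ between an analyzed
// text field and an unanalyzed keyword field. This is not Elasticsearch code.
final class FacetBucketsSketch {

  // Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
  static Set<String> tokenize(String value) {
    Set<String> tokens = new LinkedHashSet<>();
    for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  // Text field with fielddata: one bucket per distinct token, counting documents.
  static Map<String, Integer> textBuckets(List<List<String>> docs) {
    Map<String, Integer> buckets = new TreeMap<>();
    for (List<String> courses : docs) {
      Set<String> termsInDoc = new LinkedHashSet<>();
      for (String course : courses) {
        termsInDoc.addAll(tokenize(course));
      }
      for (String term : termsInDoc) {
        buckets.merge(term, 1, Integer::sum);
      }
    }
    return buckets;
  }

  // Keyword field: one bucket per distinct whole value, counting documents.
  static Map<String, Integer> keywordBuckets(List<List<String>> docs) {
    Map<String, Integer> buckets = new TreeMap<>();
    for (List<String> courses : docs) {
      for (String course : new LinkedHashSet<>(courses)) {
        buckets.merge(course, 1, Integer::sum);
      }
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Two hypothetical corp users and the courses they have taken.
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("Docker for Data Scientists", "AI200: Introduction to Artificial Intelligence"),
        Arrays.asList("AI200: Introduction to Artificial Intelligence"));
    System.out.println(textBuckets(docs).size() + " token buckets: " + textBuckets(docs));
    System.out.println(keywordBuckets(docs).size() + " keyword buckets: " + keywordBuckets(docs));
  }
}
```

With this sample data the analyzed field yields nine single-word buckets (`ai200`, `to`, `for`, ...), while the keyword field yields two buckets, one per course title, which is normally what a facet is expected to show.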

## 4. Modify index config, so that the new mapping is picked up next time
If you want the corp user search index to contain the new field `courses` the next time the docker containers are brought up, add the field to [corpuser-index-config.json](../../docker/elasticsearch-setup/corpuser-index-config.json).

```
{
  "settings": {
    "index": {
      "analysis": {
       ...
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {

        ...

        "courses": {
          "type": "text"
        }
      }
    }
  }
}
```
Choose your analyzer wisely. For this example, the field `courses` is stored as an array of strings, so we use the `text` data type. The default analyzer is `standard`, which provides grammar-based tokenization.
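
As a rough illustration of what that tokenization means for matching, the sketch below lowercases a course title and splits it on non-alphanumeric characters, then checks whether a query term equals one of the resulting tokens. This is a simplified approximation written for this doc, not the real `standard` analyzer (which follows Unicode text segmentation rules), and `StandardAnalyzerSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified approximation of standard-analyzer behaviour, for illustration only.
final class StandardAnalyzerSketch {

  // Lowercase the text and split it into tokens on anything that is not a letter or digit.
  static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!part.isEmpty()) {
        tokens.add(part);
      }
    }
    return tokens;
  }

  // A query matches only if every analyzed query token equals a whole indexed token.
  static boolean matches(String indexedValue, String query) {
    List<String> queryTokens = analyze(query);
    return !queryTokens.isEmpty() && analyze(indexedValue).containsAll(queryTokens);
  }

  public static void main(String[] args) {
    String course = "AI200: Introduction to Artificial Intelligence";
    System.out.println(analyze(course));          // [ai200, introduction, to, artificial, intelligence]
    System.out.println(matches(course, "AI200")); // true: case and the trailing colon are normalized away
    System.out.println(matches(course, "intro")); // false: partial tokens do not match
  }
}
```

If partial matches such as `intro` should succeed, a different analyzer (for example one that produces n-grams) would be needed, which is the kind of trade-off to weigh when picking the analyzer.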

## 5. Update the index builder logic
The index builder is where the logic to transform an aspect into the search document model is defined. For this example, we will add the logic in [CorpUserInfoIndexBuilder](../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/CorpUserInfoIndexBuilder.java).

```java
package com.linkedin.metadata.builders.search;

@Slf4j
public class CorpUserInfoIndexBuilder extends BaseIndexBuilder<CorpUserInfoDocument> {

  public CorpUserInfoIndexBuilder() {
    super(Collections.singletonList(CorpUserSnapshot.class), CorpUserInfoDocument.class);
  }
  
  ...
  
  @Nonnull
  private CorpUserInfoDocument getDocumentToUpdateFromAspect(@Nonnull CorpuserUrn urn,
      @Nonnull CorpUserEditableInfo corpUserEditableInfo) {
    final String aboutMe = corpUserEditableInfo.getAboutMe() == null ? "" : corpUserEditableInfo.getAboutMe();
    return new CorpUserInfoDocument()
        .setUrn(urn)
        .setAboutMe(aboutMe)
        .setTeams(corpUserEditableInfo.getTeams())
        .setSkills(corpUserEditableInfo.getSkills())
        .setCourses(corpUserEditableInfo.getCourses());
  }
  
  ...
  
}

```

## 6. Update search query template, to start searching over the new field
For this example, we will modify [corpUserESSearchQueryTemplate.json](../../gms/impl/src/main/resources/corpUserESSearchQueryTemplate.json) to start searching over the field `courses`. Here is an example.

```json
{
  "function_score": {
    "query": {
      "query_string": {
        "query": "$INPUT",
        "fields": [
          "fullName^4",
          "ldap^2",
          "managerLdap",
          "skills",
          "courses",
          "teams",
          "title"
        ],
        "default_operator": "and",
        "analyzer": "standard"
      }
    },
    "functions": [
      {
        "filter": {
          "term": {
            "active": true
          }
        },
        "weight": 2
      }
    ],
    "score_mode": "multiply"
  }
}
```
As you can see in the above query template, corp user search is performed across multiple fields, to which the field `courses` has been added.
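
This doc does not show how the `$INPUT` placeholder gets filled in. The sketch below assumes a plain string substitution with the user input escaped for the surrounding JSON string literal; `QueryTemplateSketch` and its methods are hypothetical names, and the template is trimmed to two fields for brevity. It does not handle characters that are reserved in `query_string` syntax.

```java
// Hypothetical helper showing one way a "$INPUT" placeholder could be rendered.
// This is an assumption for illustration, not the actual GMS implementation.
final class QueryTemplateSketch {

  // Escape backslashes and double quotes so the input stays inside its JSON string literal.
  static String escapeForJsonString(String input) {
    return input.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // Replace every literal occurrence of $INPUT with the escaped user input.
  static String render(String template, String userInput) {
    return template.replace("$INPUT", escapeForJsonString(userInput));
  }

  public static void main(String[] args) {
    String template =
        "{\"query_string\":{\"query\":\"$INPUT\",\"fields\":[\"fullName^4\",\"courses\"]}}";
    // Renders the template with the query term taken from the command line, or "ai200" by default.
    System.out.println(render(template, args.length > 0 ? args[0] : "ai200"));
  }
}
```

Because `String.replace` works on literal text rather than regular expressions, the `$` in the placeholder needs no special treatment.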

## 7. Test your changes
Make sure the relevant docker containers are rebuilt before testing the changes.
If this is a new field that has been added to an existing snapshot, you can test by ingesting data that contains the new field. Here is an example of ingesting to the `/corpUsers` endpoint with the new field `courses`.

```
curl 'http://localhost:8080/corpUsers?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '
{
  "snapshot": {
    "aspects": [
      {
        "com.linkedin.identity.CorpUserEditableInfo": {
          "courses": [
            "Docker for Data Scientists",
            "AI100: Introduction to Artificial Intelligence"
          ],
          "skills": [
            
          ],
          "pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
          "teams": [
            
          ]
        }
      }
    ],
    "urn": "urn:li:corpuser:datahub"
  }
}'
```

Once the ingestion is done, you can test your changes by issuing search queries. Here is an example query with response.

```
curl "http://localhost:8080/corpUsers?q=search&input=ai100" -H 'X-RestLi-Protocol-Version: 2.0.0' -s | jq

Response:
{
  "metadata": {
    "urns": [
      "urn:li:corpuser:datahub"
    ],
    "searchResultMetadatas": [
      
    ]
  },
  "elements": [
    {
      "editableInfo": {
        "skills": [
          
        ],
        "courses": [
          "Docker for Data Scientists",
          "AI100: Introduction to Artificial Intelligence"
        ],
        "pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
        "teams": [
          
        ]
      },
      "username": "datahub",
      "info": {
        "active": true,
        "fullName": "Data Hub",
        "title": "CEO",
        "displayName": "Data Hub",
        "email": "datahub@linkedin.com"
      }
    }
  ],
  "paging": {
    "count": 10,
    "start": 0,
    "total": 1,
    "links": [
      
    ]
  }
}
```