# Onboarding to GMA Search - searching over a new field

If you need to onboard a new entity to search, refer to [How to onboard to GMA Search](./search-onboarding.md).

For this exercise, we'll add a new field to an existing aspect of corp users and search over this field. Your use case might instead require searching over an existing field of an aspect, or creating a brand new aspect and searching over its field(s). Similar steps apply in those cases.

## 1. Add field to aspect (skip this step if the field already exists in an aspect)
For this example, we will add a new field `courses` to [CorpUserEditableInfo](../../metadata-models/src/main/pegasus/com/linkedin/identity/CorpUserEditableInfo.pdl), which is an aspect of the corp user entity.
```
namespace com.linkedin.identity

/**
 * Linkedin corp user information that can be edited from UI
 */
@Aspect.EntityUrns = [ "com.linkedin.common.CorpuserUrn" ]
record CorpUserEditableInfo {

  ...
  
  /**
   * Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
   */
  courses: array[string] = [ ]
  
}
```

## 2. Add field to search document model
For this example, we will add the field `courses` to [CorpUserInfoDocument.pdl](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/CorpUserInfoDocument.pdl), which is the search document model for the corp user entity.

```
namespace com.linkedin.metadata.search

/**
 * Data model for CorpUserInfo entity search
 */
record CorpUserInfoDocument includes BaseDocument {

  ...

  /**
   * Courses that the user has taken e.g. AI200: Introduction to Artificial Intelligence
   */
  courses: optional array[string]
  
}
```

## 3. Modify the mapping of search index
Now, we will modify the mapping of the corp user search index. Use the following Elasticsearch command to add the new field to an existing index.

```
curl http://localhost:9200/corpuserinfodocument/doc/_mapping --data '
{
  "properties": {
    "courses": {
      "type": "text"
    }
  }
}'
```

## 4. Modify index config, so that the new mapping is picked up next time
If you want the corp user search index to contain the new field `courses` the next time the docker containers are brought up, you need to add this field to [corpuser-index-config.json](../../docker/elasticsearch-setup/corpuser-index-config.json).

```
{
  "settings": {
    "index": {
      "analysis": {
       ...
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {

        ...

        "courses": {
          "type": "text"
        }
      }
    }
  }
}
```
Choose your analyzer wisely. For this example, we store the field `courses` as an array of strings and hence use the `text` data type. The default analyzer is `standard`, which provides grammar-based tokenization.
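
To see why a query such as `ai200` can match a full course title, it helps to look at the tokens an analyzer produces. The sketch below is only a rough approximation of the `standard` analyzer: it splits on anything that is not a letter or digit and lowercases the result, whereas the real analyzer follows Unicode text segmentation rules. The class `StandardAnalyzerSketch` and its `tokenize` method are hypothetical names introduced for illustration; they are not part of DataHub or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StandardAnalyzerSketch {

  // Approximation only: split on runs of characters that are neither letters
  // nor digits, then lowercase each remaining token.
  public static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String part : text.split("[^\\p{L}\\p{N}]+")) {
      if (!part.isEmpty()) { // a leading separator yields an empty first element
        tokens.add(part.toLowerCase(Locale.ROOT));
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // The course title used in the schema comments of this guide.
    System.out.println(tokenize("AI200: Introduction to Artificial Intelligence"));
  }
}
```

Running it prints `[ai200, introduction, to, artificial, intelligence]`. Each of these tokens is indexed separately, which is why a single-word query can match the whole course title. If you need exact-match or prefix behavior instead, a different data type or analyzer may be a better fit.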

## 5. Update the index builder logic
The index builder is where the logic to transform an aspect into a search document model is defined. For this example, we will add the logic in [CorpUserInfoIndexBuilder](../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/CorpUserInfoIndexBuilder.java).

```java
package com.linkedin.metadata.builders.search;

@Slf4j
public class CorpUserInfoIndexBuilder extends BaseIndexBuilder<CorpUserInfoDocument> {

  public CorpUserInfoIndexBuilder() {
    super(Collections.singletonList(CorpUserSnapshot.class), CorpUserInfoDocument.class);
  }
  
  ...
  
  @Nonnull
  private CorpUserInfoDocument getDocumentToUpdateFromAspect(@Nonnull CorpuserUrn urn,
      @Nonnull CorpUserEditableInfo corpUserEditableInfo) {
    final String aboutMe = corpUserEditableInfo.getAboutMe() == null ? "" : corpUserEditableInfo.getAboutMe();
    return new CorpUserInfoDocument()
        .setUrn(urn)
        .setAboutMe(aboutMe)
        .setTeams(corpUserEditableInfo.getTeams())
        .setSkills(corpUserEditableInfo.getSkills())
        // New: copy the courses from the aspect into the search document.
        .setCourses(corpUserEditableInfo.getCourses());
  }
  
  ...
  
}

```

## 6. Update the search query template to start searching over the new field
For this example, we will modify [corpUserESSearchQueryTemplate.json](../../gms/impl/src/main/resources/corpUserESSearchQueryTemplate.json) to start searching over the field `courses`. Here is an example.

```json
{
  "function_score": {
    "query": {
      "query_string": {
        "query": "$INPUT",
        "fields": [
          "fullName^4",
          "ldap^2",
          "managerLdap",
          "skills",
          "courses",
          "teams",
          "title"
        ],
        "default_operator": "and",
        "analyzer": "standard"
      }
    },
    "functions": [
      {
        "filter": {
          "term": {
            "active": true
          }
        },
        "weight": 2
      }
    ],
    "score_mode": "multiply"
  }
}
```
As you can see in the above query template, corp user search is performed across multiple fields, to which the field `courses` has been added.
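
The `$INPUT` placeholder in the template stands for the user's search text. How GMS fills it in is not covered in this guide, so the following is a minimal sketch that assumes plain string substitution. The class `QueryTemplateSketch`, its methods, and the trimmed-down template string are all hypothetical and exist only to illustrate the idea.

```java
public class QueryTemplateSketch {

  // Hypothetical, shortened template; the real file lists more fields and
  // wraps the query in a function_score clause.
  static final String TEMPLATE =
      "{\"query_string\":{\"query\":\"$INPUT\",\"fields\":[\"skills\",\"courses\"]}}";

  // Escape backslashes and double quotes so the input cannot break out of
  // the surrounding JSON string.
  public static String escapeJson(String input) {
    return input.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // Literal replacement of the placeholder; no regex is involved.
  public static String render(String input) {
    return TEMPLATE.replace("$INPUT", escapeJson(input));
  }

  public static void main(String[] args) {
    System.out.println(render("ai200"));
  }
}
```

Running it prints `{"query_string":{"query":"ai200","fields":["skills","courses"]}}`. The point of the sketch is that once `courses` appears in `fields`, every rendered query searches that field along with the others.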

## 7. Test your changes
Make sure the relevant docker containers are rebuilt before testing the changes.
If the new field has been added to an existing snapshot, you can test by ingesting data that contains this field. Here is an example of ingesting to the `/corpUsers` endpoint with the new field `courses`.

```
curl 'http://localhost:8080/corpUsers?action=ingest' -X POST -H 'X-RestLi-Protocol-Version:2.0.0' --data '
{
  "snapshot": {
    "aspects": [
      {
        "com.linkedin.identity.CorpUserEditableInfo": {
          "courses": [
            "Docker for Data Scientists",
            "AI100: Introduction to Artificial Intelligence"
          ],
          "skills": [
            
          ],
          "pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
          "teams": [
            
          ]
        }
      }
    ],
    "urn": "urn:li:corpuser:datahub"
  }
}'
```

Once the ingestion is done, you can test your changes by issuing search queries. Here is an example query with response.

```
curl "http://localhost:8080/corpUsers?q=search&input=ai100" -H 'X-RestLi-Protocol-Version: 2.0.0' -s | jq

Response:
{
  "metadata": {
    "urns": [
      "urn:li:corpuser:datahub"
    ],
    "searchResultMetadatas": [
      
    ]
  },
  "elements": [
    {
      "editableInfo": {
        "skills": [
          
        ],
        "courses": [
          "Docker for Data Scientists",
          "AI100: Introduction to Artificial Intelligence"
        ],
        "pictureLink": "https://raw.githubusercontent.com/linkedin/datahub/master/datahub-web/packages/data-portal/public/assets/images/default_avatar.png",
        "teams": [
          
        ]
      },
      "username": "datahub",
      "info": {
        "active": true,
        "fullName": "Data Hub",
        "title": "CEO",
        "displayName": "Data Hub",
        "email": "datahub@linkedin.com"
      }
    }
  ],
  "paging": {
    "count": 10,
    "start": 0,
    "total": 1,
    "links": [
      
    ]
  }
}
```