datahub/docs/what/search-document.md
2019-12-19 13:17:53 -08:00

What is a search document?

Search documents are also modeled explicitly using PDSC. In many ways, the model for a Document is very similar to the Entity and Relationship models, where each attribute/field contains a value that's derived from various metadata aspects. However, a search document is also allowed to have array-typed attributes, provided the array contains only primitives or enum items. This is because most full-text search engines support membership testing against an array field, e.g. an array field containing all the terms used in a document.

One obvious use of the attributes is search filtering, e.g. give me all the Users whose first name or last name is similar to “Joe” and who report up to userFoo. Since the document also serves as the main interface for the search API, the attributes can also be used to format the search snippet. As a result, one may be tempted to add as many attributes as needed. This is acceptable, as the underlying search engine is designed to index a large number of fields.
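The filtering query described above can be sketched in plain Python. This is only an illustration of the idea, not how the search engine evaluates it: the sample documents, the URN strings, and the helper names (`name_matches`, `reports_up_to`, `search_users`) are all hypothetical, and a case-insensitive substring check stands in for the engine's "similar to" text matching.

```python
from typing import Dict, List

# Hypothetical documents shaped like UserDocument; all values are made up.
DOCS: List[Dict] = [
    {"urn": "urn:li:corpuser:joe1", "firstName": "Joe", "lastName": "Smith",
     "management": ["urn:li:corpuser:userFoo", "urn:li:corpuser:ceo"]},
    {"urn": "urn:li:corpuser:amy", "firstName": "Amy", "lastName": "Joestar",
     "management": ["urn:li:corpuser:userFoo", "urn:li:corpuser:ceo"]},
    {"urn": "urn:li:corpuser:joe2", "firstName": "Joey", "lastName": "Lee",
     "management": ["urn:li:corpuser:userBar", "urn:li:corpuser:ceo"]},
    # firstName is absent here, which is allowed because the field is optional.
    {"urn": "urn:li:corpuser:bob", "lastName": "Jones",
     "management": ["urn:li:corpuser:userFoo", "urn:li:corpuser:ceo"]},
]


def name_matches(doc: Dict, term: str) -> bool:
    """Substring match on either name field (a stand-in for fuzzy text search)."""
    term = term.lower()
    return any(term in (doc.get(f) or "").lower() for f in ("firstName", "lastName"))


def reports_up_to(doc: Dict, manager_urn: str) -> bool:
    """Membership test against the 'management' array field."""
    return manager_urn in doc.get("management", [])


def search_users(docs: List[Dict], name: str, manager_urn: str) -> List[str]:
    """Return the URNs of documents that satisfy both filters."""
    return [d["urn"] for d in docs
            if name_matches(d, name) and reports_up_to(d, manager_urn)]


matches = search_users(DOCS, "joe", "urn:li:corpuser:userFoo")
print(matches)
```

Because the whole management chain is flattened into one array field, "reports up to userFoo" becomes a single membership test rather than a graph traversal at query time.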

Below is an example schema for the User search document. Note that:

  1. Each search document is required to have a type-specific urn field, which generally maps to an entity in the graph.
  2. Similar to Entity, each document has an optional removed field for "soft deletion". This is captured in BaseDocument, which is expected to be included by all documents.
  3. Similar to Entity, all remaining fields are made optional to support partial updates.
  4. management shows an example of a string array field.
  5. ownedDatasets shows an example of how a field can be derived from metadata aspects associated with other types of entity (in this case, Dataset).
{
  "type": "record",
  "name": "BaseDocument",
  "namespace": "com.linkedin.metadata.search",
  "doc": "Common fields that apply to all documents",
  "fields": [
    {
      "name": "removed",
      "type": "boolean",
      "doc": "Whether the entity has been removed or not",
      "optional": true,
      "default": false
    }
  ]
}
{
 "type": "record",
 "name": "UserDocument",
 "namespace": "com.linkedin.metadata.search",
 "doc": "Data model for user entity search",
 "include": [
   "BaseDocument"
 ],
 "fields": [
   {
     "name": "urn",
     "type": "com.linkedin.common.CorpuserUrn",
     "doc": "Urn for the user"
   },
   {
     "name": "firstName",
     "type": "string",
     "doc": "First name of the user",
     "optional": true
   },
   {
     "name": "lastName",
     "type": "string",
     "doc": "Last name of the user",
     "optional": true
   },
   {
     "name": "management",
     "type": {
       "type": "array",
       "items": "com.linkedin.common.CorpuserUrn"
     },
     "doc": "The chain of management all the way to CEO",
     "default": [],
     "optional": true
   },
   {
     "name": "costCenter",
     "type": "int",
     "doc": "Code for the cost center",
     "optional": true
   },
   {
     "name": "ownedDatasets",
     "type": {
       "type": "array",
       "items": "com.linkedin.common.DatasetUrn"
     },
     "doc": "The list of datasets the user owns",
     "default": [],
     "optional": true
   }
 ]
}
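
To make the notes above more concrete, here is a minimal Python sketch of how optional fields allow partial updates, how removed acts as a soft-delete flag, and how ownedDatasets could be derived from Dataset ownership information. It is not DataHub code: the dataclass mirrors the schema by hand, and `apply_partial_update`, `owned_datasets`, the shape of the `ownership` mapping, and the URN strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class UserDocument:
    """Hand-written mirror of the UserDocument schema (illustrative only)."""
    urn: str                                   # the only required field
    removed: bool = False                      # soft-deletion flag from BaseDocument
    firstName: Optional[str] = None
    lastName: Optional[str] = None
    management: List[str] = field(default_factory=list)
    costCenter: Optional[int] = None
    ownedDatasets: List[str] = field(default_factory=list)


def apply_partial_update(current: UserDocument, update: Dict) -> UserDocument:
    """Overwrite only the fields present in 'update'; leave the rest untouched."""
    if update.get("urn", current.urn) != current.urn:
        raise ValueError("a partial update must target the same urn")
    known = {f.name for f in fields(UserDocument)}
    unknown = set(update) - known
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    changes = {k: v for k, v in update.items() if k != "urn"}
    return replace(current, **changes)


def owned_datasets(user_urn: str, ownership: Dict[str, List[str]]) -> List[str]:
    """Derive a user's datasets from a hypothetical dataset-to-owners mapping."""
    return sorted(ds for ds, owners in ownership.items() if user_urn in owners)


# Placeholder URNs; real dataset URNs carry more structure.
ownership = {
    "urn:li:dataset:a": ["urn:li:corpuser:joe"],
    "urn:li:dataset:b": ["urn:li:corpuser:amy"],
}

doc = UserDocument(urn="urn:li:corpuser:joe", firstName="Joe")
doc = apply_partial_update(doc, {"costCenter": 42})          # firstName is preserved
doc = apply_partial_update(doc, {"ownedDatasets": owned_datasets(doc.urn, ownership)})
doc = apply_partial_update(doc, {"removed": True})           # soft delete, data kept
print(doc)
```

Each update carries only the fields that changed, which is why every field other than urn needs to be optional in the schema.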