Documentation update part-3

2025-12-24 08:28:12 +00:00 · 2019-12-20 02:36:24 -08:00 · 2019-12-20 02:36:24 -08:00 · 4342a9b39a
commit 4342a9b39a
parent 6c99764fb1
7 changed files with 283 additions and 6 deletions
--- a/README.md
+++ b/README.md
@@ -5,8 +5,11 @@
 ![DataHub](docs/imgs/datahub-logo.png)

 ## Introduction
-DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our 
-[Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services.
+DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our 
+[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). 
+See [DataHub Architecture](docs/architecture/architecture.md) for a better understanding of how DataHub is implemented, and the
+[DataHub Onboarding Guide](docs/how/entity-onboarding.md) to learn how to extend DataHub for your own use case.
+This repository contains the complete source code needed to build DataHub's frontend & backend services.

 ## Quickstart
 1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
@@ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8
 as username and password.

 ## Quicklinks
+* [DataHub Architecture](docs/architecture/architecture.md)
+* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
 * [Docker Images](docker)
 * [Frontend App](datahub-frontend)
-* [Generalized Metadata Store](gms)
+* [Generalized Metadata Service](gms)
 * [Metadata Consumer Jobs](metadata-jobs)
 * [Metadata Ingestion](metadata-ingestion)

--- a/docs/architecture/metadata-ingestion.md
+++ b/docs/architecture/metadata-ingestion.md
@@ -1,6 +1,6 @@
 # Metadata Ingestion Architecture

-## MCE Consumer Job
+## MCE Consumer Job [WIP]

 ## MAE Consumer Job

--- a/docs/how/entity-onboarding.md
+++ b/docs/how/entity-onboarding.md
@@ -1,2 +1,32 @@
 # How to onboard an entity?

+Currently, DataHub supports only three [entity] types: `datasets`, `users` and `groups`.
+To extend DataHub for your own use cases, such as `metrics`, `charts` or `dashboards`, follow the steps below in order.
+
+## 1. Define URN
+Refer to the [URN documentation](../what/urn.md) for the URN definition.
+
+## 2. Model your metadata
+Refer to the [metadata modelling](metadata-modelling.md) section.
+Make sure to do the following:
+1. Define the [Aspect] models.
+2. Define the aspect union model. Refer to [DatasetAspect] as an example.
+3. Define the [Snapshot] model. Refer to [DatasetSnapshot] as an example.
+4. Add your newly defined snapshot to the [Snapshot Union] model.
+5. Define the [Entity] model. Refer to [DatasetEntity] as an example.
+
+## 3. GMA search onboarding
+Refer to [search onboarding](search-onboarding.md).
+
+## 4. GMA graph onboarding
+Refer to [graph onboarding](graph-onboarding.md).
+
+## 5. UI for entity onboarding [WIP]
+
+[Aspect]: ../what/aspect.md
+[DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc
+[Snapshot]: ../what/snapshot.md
+[DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc
+[Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc
+[Entity]: ../what/entity.md
+[DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc
--- a/docs/how/graph-onboarding.md
+++ b/docs/how/graph-onboarding.md
@@ -1,2 +1,161 @@
 # How to onboard to GMA graph?

+## 1. Define relationship models
+If you need a [relationship] that is not available in the provided set of [relationship models],
+implement that relationship model as the first step of graph onboarding.
+Below is an example model for the `OwnedBy` relationship:
+
+```json
+{
+  "type": "record",
+  "name": "OwnedBy",
+  "namespace": "com.linkedin.metadata.relationship",
+  "doc": "A generic model for the Owned-By relationship",
+  "include": [
+    "BaseRelationship"
+  ],
+  "pairings": [
+    {
+      "source": "com.linkedin.common.urn.DatasetUrn",
+      "destination": "com.linkedin.common.urn.CorpuserUrn"
+    }
+  ],
+  "fields": [
+    {
+      "name": "type",
+      "type": "com.linkedin.common.OwnershipType",
+      "doc": "The type of the ownership"
+    }
+  ]
+}
+```
+
+## 2. Implement relationship builders
+You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined.
+A relationship builder processes an aspect and builds a list of relationships from it; every relationship builder should extend the `BaseRelationshipBuilder` abstract class.
+There is one relationship builder per aspect and per relationship type.
+
+```java
+public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> {
+
+  private Class<ASPECT> _aspectClass;
+
+  public BaseRelationshipBuilder(Class<ASPECT> aspectClass) {
+    _aspectClass = aspectClass;
+  }
+
+  /**
+   * Returns the aspect class this {@link BaseRelationshipBuilder} supports
+   */
+  public Class<ASPECT> supportedAspectClass() {
+    return _aspectClass;
+  }
+
+  /**
+   * Returns a list of corresponding relationship updates for the given metadata aspect
+   */
+  public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect);
+}
+```
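A concrete builder might look like the sketch below. It is only an illustration under stated assumptions: to stay self-contained it does not use the generated DataHub models, so `SketchUrn`, `SketchOwnership`, `SketchOwnedBy` and `SketchRelationshipBuilder` are hypothetical stand-ins, and the builder returns plain relationship objects rather than `GraphBuilder.RelationshipUpdates`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; the real models are generated from PDSC schemas.
final class SketchUrn {
  final String value;
  SketchUrn(String value) { this.value = value; }
}

final class SketchOwnership {
  final List<SketchUrn> owners;
  SketchOwnership(List<SketchUrn> owners) { this.owners = owners; }
}

final class SketchOwnedBy {
  final SketchUrn source;
  final SketchUrn destination;
  SketchOwnedBy(SketchUrn source, SketchUrn destination) {
    this.source = source;
    this.destination = destination;
  }
}

// Mirrors the shape of BaseRelationshipBuilder: one builder per aspect and relationship type.
abstract class SketchRelationshipBuilder<ASPECT, RELATIONSHIP> {
  private final Class<ASPECT> aspectClass;

  SketchRelationshipBuilder(Class<ASPECT> aspectClass) { this.aspectClass = aspectClass; }

  Class<ASPECT> supportedAspectClass() { return aspectClass; }

  abstract List<RELATIONSHIP> buildRelationships(SketchUrn urn, ASPECT aspect);
}

// Emits one OwnedBy edge from the entity to each owner listed in the ownership aspect.
final class SketchOwnedByBuilder extends SketchRelationshipBuilder<SketchOwnership, SketchOwnedBy> {
  SketchOwnedByBuilder() { super(SketchOwnership.class); }

  @Override
  List<SketchOwnedBy> buildRelationships(SketchUrn urn, SketchOwnership aspect) {
    List<SketchOwnedBy> edges = new ArrayList<>();
    for (SketchUrn owner : aspect.owners) {
      edges.add(new SketchOwnedBy(urn, owner));
    }
    return edges;
  }
}
```

The source of each edge is the entity the aspect belongs to and the destination is the owner, which matches the `source`/`destination` pairing in the `OwnedBy` model from step 1.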
+
+## 3. Implement graph builders
+Graph builders build graph updates by processing [snapshot]s.
+Internally, they use relationship builders to generate the edges and nodes of the graph.
+All relationship builders for an [entity] should be registered through its graph builder.
+
+```java
+public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> {
+
+  private final Class<SNAPSHOT> _snapshotClass;
+  private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap;
+
+  public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass,
+      @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) {
+    _snapshotClass = snapshotClass;
+    _relationshipBuildersMap = relationshipBuilders.stream()
+        .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity()));
+  }
+
+  @Nonnull
+  Class<SNAPSHOT> supportedSnapshotClass() {
+    return _snapshotClass;
+  }
+
+  @Nonnull
+  @Override
+  public GraphUpdates build(@Nonnull SNAPSHOT snapshot) {
+    final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class);
+
+    final List<? extends RecordTemplate> entities = buildEntities(snapshot);
+
+    final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>();
+
+    final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot);
+    for (RecordTemplate aspect : aspects) {
+      BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass());
+      if (relationshipBuilder != null) {
+        relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect));
+      }
+    }
+
+    return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates));
+  }
+
+  @Nonnull
+  protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot);
+}
+```
+
+```java
+public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> {
+  private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS =
+      Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() {
+        {
+          add(new DownstreamOfBuilderFromUpstreamLineage());
+          add(new OwnedByBuilderFromOwnership());
+        }
+      });
+
+  public DatasetGraphBuilder() {
+    super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS);
+  }
+
+  @Nonnull
+  @Override
+  protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) {
+    final DatasetUrn urn = snapshot.getUrn();
+    final DatasetEntity entity = new DatasetEntity().setUrn(urn)
+        .setName(urn.getDatasetNameEntity())
+        .setPlatform(urn.getPlatformEntity())
+        .setOrigin(urn.getOriginEntity());
+
+    setRemovedProperty(snapshot, entity);
+
+    return Collections.singletonList(entity);
+  }
+}
+```
+
+## 4. Ingestion into graph
+The ingestion process for each [entity] is done by graph builders.
+The builders are invoked whenever an [MAE] is received by the [MAE Consumer Job].
+Graph builders should extend `BaseGraphBuilder`; see `DatasetGraphBuilder` above for an example.
+For the consumer job to consume those MAEs, add your graph builder to the [graph builder registry].
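The registration step can be pictured as a lookup from snapshot class to graph builder. The following is a minimal sketch, assuming a registry keyed by snapshot class; `SketchGraphBuilder` and `SketchGraphBuilderRegistry` are hypothetical names and do not reflect the actual contents of the graph builder registry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a graph builder; the real one returns GraphUpdates.
interface SketchGraphBuilder<SNAPSHOT> {
  Class<SNAPSHOT> supportedSnapshotClass();

  String build(SNAPSHOT snapshot);
}

// Hypothetical registry: maps each snapshot class to the builder that handles it.
final class SketchGraphBuilderRegistry {
  private final Map<Class<?>, SketchGraphBuilder<?>> builders = new HashMap<>();

  <S> void register(SketchGraphBuilder<S> builder) {
    builders.put(builder.supportedSnapshotClass(), builder);
  }

  // Returns the builder registered for the given snapshot class, or empty if none is registered.
  @SuppressWarnings("unchecked")
  <S> Optional<SketchGraphBuilder<S>> getBuilder(Class<S> snapshotClass) {
    return Optional.ofNullable((SketchGraphBuilder<S>) builders.get(snapshotClass));
  }
}
```

In this sketch, a snapshot whose class has no registered builder yields an empty `Optional`, so a consumer would simply produce no graph updates for it.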
+
+## 5. Graph queries
+You can onboard graph queries that fit your specific use cases using the [Query DAO].
+You also need to create [rest.li](https://rest.li) APIs to serve your graph queries.
+[BaseQueryDAO] provides an abstract implementation of several graph query APIs.
+Refer to the [DownstreamLineageResource] rest.li resource implementation for an example use of graph queries.
+
+[relationship]: ../what/relationship.md
+[relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship
+[aspect]: ../what/aspect.md
+[snapshot]: ../what/snapshot.md
+[entity]: ../what/entity.md
+[mae]: ../what/mxe.md#metadata-audit-event-mae
+[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
+[graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java
+[query dao]: ../architecture/metadata-serving.md#query-dao
+[BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java
+[DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java
--- a/docs/how/search-onboarding.md
+++ b/docs/how/search-onboarding.md
@@ -1,2 +1,85 @@
 # How to onboard to GMA search?

+## 1. Define search document model for the entity
+Modeling is the most important part of your design.
+The [Search document] model lists the fields that need to be indexed, along with their names and data types.
+See [Search document] to learn more about the search document model.
+Note that all fields in the search document model (except `urn`) are `optional`.
+This is because we want to support partial updates to search documents.
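To see why optional fields make partial updates possible, consider the sketch below. `SketchDocument`, its fields and its `merge` method are hypothetical and only illustrate the idea in plain Java; in DataHub the update is applied to the search index itself. A field left unset (`null`) in the update keeps its stored value.

```java
import java.util.Objects;

// Hypothetical search document: only urn is required, the other fields may be null (unset).
final class SketchDocument {
  final String urn;
  final String name;      // optional
  final String platform;  // optional

  SketchDocument(String urn, String name, String platform) {
    this.urn = Objects.requireNonNull(urn, "urn is required");
    this.name = name;
    this.platform = platform;
  }

  // Applies a partial update: fields set in the update win, unset fields keep the stored value.
  SketchDocument merge(SketchDocument update) {
    if (!urn.equals(update.urn)) {
      throw new IllegalArgumentException("cannot merge documents with different urns");
    }
    return new SketchDocument(
        urn,
        update.name != null ? update.name : name,
        update.platform != null ? update.platform : platform);
  }
}
```

If every field were required, each update would have to carry the full document, even when only one aspect of the entity changed.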
+
+[Search document]: ../what/search-document.md
+
+## 2. Create the search index, define its mappings and settings
+
+A [mapping] is created from the information in the search document model.
+It defines how a document and the fields it contains are stored and indexed, including the [tokenizer]s, [analyzer]s and data types used for the fields.
+For certain fields, sub-fields are created using different analyzers.
+The analyzers are chosen depending on the needs of each field.
+The mapping is currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
+Check the index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for the `dataset` search index.
+
+## 3. Ingestion into search index
+The actual indexing process for each [entity] is powered by [index builder]s.
+The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job] and are invoked whenever an [MAE] for one of those aspects is received.
+Index builders should extend [BaseIndexBuilder]. Check [DatasetIndexBuilder] as an example.
+For the consumer job to consume those MAEs, add your index builder to the [index builder registry].
+
+## 4. Search query configs
+Once you have the [search index] built, it's ready to be queried!
+The search query is constructed and executed through the [Search DAO].
+The raw search hits are retrieved and extracted using the base model.
+Besides regular full-text search, runtime aggregation and relevance are provided in the search queries as well.
+
+[ESSearchDAO] is the implementation of [BaseSearchDAO] for Elasticsearch.
+It's still a generic class which can be used for a specific [entity] and configured using [BaseSearchConfig].
+
+`BaseSearchConfig` is the abstraction for all query-related configurations, such as query templates and the default field to execute autocomplete on.
+
+```java
+public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {
+
+  public abstract Set<String> getFacetFields();
+
+  public String getIndexName() {
+    return getSearchDocument().getSimpleName().toLowerCase();
+  }
+
+  public abstract Class<DOCUMENT> getSearchDocument();
+
+  public abstract String getDefaultAutocompleteField();
+
+  public abstract String getSearchQueryTemplate();
+
+  public abstract String getAutocompleteQueryTemplate();
+}
+```
+
+[DatasetSearchConfig] is the implementation of the search config for the `dataset` entity.
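A config for a new entity could follow the same pattern. The sketch below is an assumption-laden illustration, not DataHub code: `SketchSearchConfig` is a trimmed copy of `BaseSearchConfig` without the `RecordTemplate` bound and without the query template methods, so that it compiles on its own, and `ChartDocument`, `SketchChartSearchConfig` and the field names `tool`, `owners` and `title` are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Trimmed, self-contained stand-in for BaseSearchConfig (query template methods omitted).
abstract class SketchSearchConfig<DOCUMENT> {
  abstract Set<String> getFacetFields();

  // Same default as BaseSearchConfig: the index name is the lowercased document class name.
  String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  abstract Class<DOCUMENT> getSearchDocument();

  abstract String getDefaultAutocompleteField();
}

// Hypothetical search document for a `chart` entity.
final class ChartDocument { }

// Hypothetical config: facet and autocomplete field names are made up for illustration.
final class SketchChartSearchConfig extends SketchSearchConfig<ChartDocument> {
  @Override
  Set<String> getFacetFields() {
    return Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList("tool", "owners")));
  }

  @Override
  Class<ChartDocument> getSearchDocument() {
    return ChartDocument.class;
  }

  @Override
  String getDefaultAutocompleteField() {
    return "title";
  }
}
```

With the default `getIndexName()`, this config targets an index named `chartdocument`; override the method if your index uses a different name.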
+
+## 5. Add search query endpoints to GMS
+Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
+[BaseSearchableEntityResource] provides an abstract implementation of the search and autocomplete APIs.
+Any top-level rest.li resource implementation can extend it to automatically provide search and autocomplete endpoints.
+Refer to the [CorpUsers] rest.li resource implementation as an example.
+
+
+[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
+[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
+[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
+[curl]: https://en.wikipedia.org/wiki/CURL
+[entity]: ../what/entity.md
+[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
+[aspect]: ../what/aspect.md
+[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
+[mae]: ../what/mxe.md#metadata-audit-event-mae
+[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
+[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
+[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
+[search index]: ../what/search-index.md
+[search dao]: ../architecture/metadata-serving.md#search-dao
+[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
+[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
+[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
+[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
+[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
+[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java
--- a/docs/what/gms.md
+++ b/docs/what/gms.md
@@ -1,4 +1,4 @@
-# What is Generalized Metadata Service (GMS)?
+# What is Generalized Metadata Service (GMS)? [WIP]



--- a/docs/what/urn.md
+++ b/docs/what/urn.md
@@ -1,2 +1,2 @@
-# What is URN?
+# What is URN? [WIP]