Documentation update part-3

Kerem Sahin 2019-12-20 02:36:24 -08:00
parent 6c99764fb1
commit 4342a9b39a
7 changed files with 283 additions and 6 deletions

@ -5,8 +5,11 @@
![DataHub](docs/imgs/datahub-logo.png)
## Introduction
DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services.
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).
See [DataHub Architecture](docs/architecture/architecture.md) for a better understanding of how DataHub is implemented, and the
[DataHub Onboarding Guide](docs/how/entity-onboarding.md) to learn how to extend DataHub for your own use case.
This repository contains the complete source code needed to build DataHub's frontend & backend services.
## Quickstart
1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
@ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8
as username and password.
## Quicklinks
* [DataHub Architecture](docs/architecture/architecture.md)
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
* [Docker Images](docker)
* [Frontend App](datahub-frontend)
* [Generalized Metadata Store](gms)
* [Generalized Metadata Service](gms)
* [Metadata Consumer Jobs](metadata-jobs)
* [Metadata Ingestion](metadata-ingestion)

@ -1,6 +1,6 @@
# Metadata Ingestion Architecture
## MCE Consumer Job
## MCE Consumer Job [WIP]
## MAE Consumer Job

@ -1,2 +1,32 @@
# How to onboard an entity?
Currently, DataHub supports only 3 [entity] types: `datasets`, `users` and `groups`.
To extend DataHub with your own entity types, such as `metrics`, `charts` or `dashboards`, follow the steps below in order.
## 1. Define URN
Refer to the [URN documentation](../what/urn.md) for the URN definition.
## 2. Model your metadata
Refer to the [metadata modelling](metadata-modelling.md) section.
Make sure to do the following:
1. Define the [Aspect] models.
2. Define the aspect union model. Refer to [DatasetAspect] as an example.
3. Define the [Snapshot] model. Refer to [DatasetSnapshot] as an example.
4. Add your newly defined snapshot to the [Snapshot Union] model.
5. Define the [Entity] model. Refer to [DatasetEntity] as an example.
## 3. GMA search onboarding
Refer to [search onboarding](search-onboarding.md).
## 4. GMA graph onboarding
Refer to [graph onboarding](graph-onboarding.md).
## 5. UI for entity onboarding [WIP]
[Aspect]: ../what/aspect.md
[DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc
[Snapshot]: ../what/snapshot.md
[DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc
[Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc
[Entity]: ../what/entity.md
[DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc

@ -1,2 +1,161 @@
# How to onboard to GMA graph?
## 1. Define relationship models
If you need a [relationship] that is not among the provided [relationship models],
implement that relationship model as the first step of graph onboarding.
Below is an example model for the `OwnedBy` relationship:
```json
{
  "type": "record",
  "name": "OwnedBy",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "A generic model for the Owned-By relationship",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.DatasetUrn",
      "destination": "com.linkedin.common.urn.CorpuserUrn"
    }
  ],
  "fields": [
    {
      "name": "type",
      "type": "com.linkedin.common.OwnershipType",
      "doc": "The type of the ownership"
    }
  ]
}
```
## 2. Implement relationship builders
You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined.
A relationship builder processes an aspect and builds a list of relationships from it; every relationship builder should extend the `BaseRelationshipBuilder` abstract class.
Relationship builders are defined per aspect and per relationship type.
```java
public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> {

  private Class<ASPECT> _aspectClass;

  public BaseRelationshipBuilder(Class<ASPECT> aspectClass) {
    _aspectClass = aspectClass;
  }

  /**
   * Returns the aspect class this {@link BaseRelationshipBuilder} supports
   */
  public Class<ASPECT> supportedAspectClass() {
    return _aspectClass;
  }

  /**
   * Returns a list of corresponding relationship updates for the given metadata aspect
   */
  public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect);
}
```
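As an illustration of "one builder per aspect and per relationship type", the following is a minimal, self-contained sketch of a builder that turns an ownership aspect into `OwnedBy` edges. It does not use the real DataHub classes: `Owner`, `Ownership`, `OwnedByEdge`, the simplified `Builder` base class and the sample URN strings are all hypothetical stand-ins, and the real `buildRelationships` returns `GraphBuilder.RelationshipUpdates` rather than a plain list of edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RelationshipBuilderSketch {

  // Hypothetical stand-in for one owner entry inside an ownership aspect.
  static final class Owner {
    final String ownerUrn;
    final String ownershipType;

    Owner(String ownerUrn, String ownershipType) {
      this.ownerUrn = ownerUrn;
      this.ownershipType = ownershipType;
    }
  }

  // Hypothetical stand-in for the ownership aspect.
  static final class Ownership {
    final List<Owner> owners;

    Ownership(List<Owner> owners) {
      this.owners = owners;
    }
  }

  // Hypothetical stand-in for an OwnedBy relationship (source is owned by destination).
  static final class OwnedByEdge {
    final String source;
    final String destination;
    final String type;

    OwnedByEdge(String source, String destination, String type) {
      this.source = source;
      this.destination = destination;
      this.type = type;
    }
  }

  // Simplified counterpart of BaseRelationshipBuilder, typed by the aspect it handles.
  abstract static class Builder<ASPECT> {
    private final Class<ASPECT> aspectClass;

    Builder(Class<ASPECT> aspectClass) {
      this.aspectClass = aspectClass;
    }

    Class<ASPECT> supportedAspectClass() {
      return aspectClass;
    }

    abstract List<OwnedByEdge> buildRelationships(String urn, ASPECT aspect);
  }

  // One builder for one (aspect, relationship) pair: Ownership -> OwnedBy.
  static final class OwnedByFromOwnership extends Builder<Ownership> {
    OwnedByFromOwnership() {
      super(Ownership.class);
    }

    @Override
    List<OwnedByEdge> buildRelationships(String urn, Ownership aspect) {
      List<OwnedByEdge> edges = new ArrayList<>();
      for (Owner owner : aspect.owners) {
        // The entity's URN is the source; each owner becomes a destination.
        edges.add(new OwnedByEdge(urn, owner.ownerUrn, owner.ownershipType));
      }
      return edges;
    }
  }

  public static void main(String[] args) {
    // Sample values only; these URNs are made up for the sketch.
    Ownership aspect = new Ownership(Arrays.asList(
        new Owner("urn:li:corpuser:alice", "DATAOWNER"),
        new Owner("urn:li:corpuser:bob", "DEVELOPER")));
    List<OwnedByEdge> edges =
        new OwnedByFromOwnership().buildRelationships("urn:li:dataset:sample", aspect);
    for (OwnedByEdge edge : edges) {
      System.out.println(edge.source + " -OwnedBy(" + edge.type + ")-> " + edge.destination);
    }
  }
}
```

A second relationship derived from the same aspect would get its own builder class rather than a second branch inside this one.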
## 3. Implement graph builders
Graph builders build graph updates by processing [snapshot]s.
They internally use relationship builders to generate edges and nodes of the graph.
All relationship builders for an [entity] should be registered through its graph builder.
```java
public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> {

  private final Class<SNAPSHOT> _snapshotClass;
  private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap;

  public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass,
      @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) {
    _snapshotClass = snapshotClass;
    _relationshipBuildersMap = relationshipBuilders.stream()
        .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity()));
  }

  @Nonnull
  Class<SNAPSHOT> supportedSnapshotClass() {
    return _snapshotClass;
  }

  @Nonnull
  @Override
  public GraphUpdates build(@Nonnull SNAPSHOT snapshot) {
    final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class);

    final List<? extends RecordTemplate> entities = buildEntities(snapshot);

    final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>();

    final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot);
    for (RecordTemplate aspect : aspects) {
      BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass());
      if (relationshipBuilder != null) {
        relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect));
      }
    }

    return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates));
  }

  @Nonnull
  protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot);
}
```
```java
public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> {

  private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS =
      Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() {
        {
          add(new DownstreamOfBuilderFromUpstreamLineage());
          add(new OwnedByBuilderFromOwnership());
        }
      });

  public DatasetGraphBuilder() {
    super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS);
  }

  @Nonnull
  @Override
  protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) {
    final DatasetUrn urn = snapshot.getUrn();
    final DatasetEntity entity = new DatasetEntity().setUrn(urn)
        .setName(urn.getDatasetNameEntity())
        .setPlatform(urn.getPlatformEntity())
        .setOrigin(urn.getOriginEntity());

    setRemovedProperty(snapshot, entity);

    return Collections.singletonList(entity);
  }
}
```
## 4. Ingestion into graph
The ingestion process for each [entity] is done by graph builders.
The builders are invoked whenever an [MAE] is received by the [MAE Consumer Job].
Graph builders should extend `BaseGraphBuilder`; see `DatasetGraphBuilder` above for an example.
For the consumer job to consume those MAEs, add your graph builder to the [graph builder registry].
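To make the role of the registry concrete, here is a minimal sketch of how a consumer could look up the right builder for an incoming snapshot. It is an assumption about the general shape, not the actual `RegisteredGraphBuilders` implementation: the `SnapshotGraphBuilder` interface, `ChartSnapshot`, `ChartGraphBuilder`, the `register`/`lookup` methods and the sample URN are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class GraphBuilderRegistrySketch {

  // Simplified stand-in for a graph builder: handles exactly one snapshot class.
  interface SnapshotGraphBuilder<SNAPSHOT> {
    Class<SNAPSHOT> supportedSnapshotClass();

    String build(SNAPSHOT snapshot);
  }

  // Hypothetical snapshot type for a newly onboarded "chart" entity.
  static final class ChartSnapshot {
    final String urn;

    ChartSnapshot(String urn) {
      this.urn = urn;
    }
  }

  static final class ChartGraphBuilder implements SnapshotGraphBuilder<ChartSnapshot> {
    @Override
    public Class<ChartSnapshot> supportedSnapshotClass() {
      return ChartSnapshot.class;
    }

    @Override
    public String build(ChartSnapshot snapshot) {
      // A real builder would return graph updates; a string keeps the sketch short.
      return "node(" + snapshot.urn + ")";
    }
  }

  private final Map<Class<?>, SnapshotGraphBuilder<?>> builders = new HashMap<>();

  <SNAPSHOT> void register(SnapshotGraphBuilder<SNAPSHOT> builder) {
    builders.put(builder.supportedSnapshotClass(), builder);
  }

  // The cast is safe because register() keys each builder by its own snapshot class.
  @SuppressWarnings("unchecked")
  <SNAPSHOT> Optional<SnapshotGraphBuilder<SNAPSHOT>> lookup(Class<SNAPSHOT> snapshotClass) {
    return Optional.ofNullable((SnapshotGraphBuilder<SNAPSHOT>) builders.get(snapshotClass));
  }

  public static void main(String[] args) {
    GraphBuilderRegistrySketch registry = new GraphBuilderRegistrySketch();
    registry.register(new ChartGraphBuilder());

    ChartSnapshot snapshot = new ChartSnapshot("urn:li:chart:sample");
    String update = registry.lookup(ChartSnapshot.class)
        .map(builder -> builder.build(snapshot))
        .orElse("no builder registered; snapshot skipped");
    System.out.println(update);
  }
}
```

In this sketch, a snapshot type without a registered builder is simply skipped, which is why forgetting the registration step would leave a new entity out of the graph.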
## 5. Graph queries
You can onboard graph queries that fit your specific use cases using the [Query DAO].
You also need to create [rest.li](https://rest.li) APIs to serve your graph queries.
[BaseQueryDAO] provides an abstract implementation of several graph query APIs.
Refer to the [DownstreamLineageResource] rest.li resource implementation for an example use of graph queries.
[relationship]: ../what/relationship.md
[relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship
[aspect]: ../what/aspect.md
[snapshot]: ../what/snapshot.md
[entity]: ../what/entity.md
[mae]: ../what/mxe.md#metadata-audit-event-mae
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
[graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java
[query dao]: ../architecture/metadata-serving.md#query-dao
[BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java
[DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java

@ -1,2 +1,85 @@
# How to onboard to GMA search?
## 1. Define search document model for the entity
Modeling is the most crucial part of your design.
The [Search document] model lists the fields that need to be indexed, along with their names and data types.
See the [search document][Search document] page to learn more about the search document model.
Note that all fields in the search document model (except `urn`) are `optional`.
This is because we want to support partial updates to search documents.
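The effect of optional fields on partial updates can be sketched as follows. This is a conceptual illustration only, assuming documents are represented as plain maps; the `merge` helper, the field names other than `urn`, and the sample values are hypothetical and not part of DataHub.

```java
import java.util.HashMap;
import java.util.Map;

final class PartialDocumentUpdateSketch {

  // Merges a partial document into the stored one; absent or null fields keep their old value.
  static Map<String, Object> merge(Map<String, Object> stored, Map<String, Object> partial) {
    Object urn = partial.get("urn");
    if (urn == null || !urn.equals(stored.get("urn"))) {
      throw new IllegalArgumentException("urn is required and must match the stored document");
    }
    Map<String, Object> merged = new HashMap<>(stored);
    for (Map.Entry<String, Object> field : partial.entrySet()) {
      if (field.getValue() != null) {
        merged.put(field.getKey(), field.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> stored = new HashMap<>();
    stored.put("urn", "urn:li:dataset:sample");
    stored.put("name", "sample");
    stored.put("owners", "alice");

    // Only the urn and the changed field are supplied.
    Map<String, Object> partial = new HashMap<>();
    partial.put("urn", "urn:li:dataset:sample");
    partial.put("owners", "bob");

    Map<String, Object> merged = merge(stored, partial);
    System.out.println(merged.get("name") + " is now owned by " + merged.get("owners"));
  }
}
```

If `name` were a required field, the update above would have to resend it even though only the owners changed.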
[Search document]: ../what/search-document.md
## 2. Create the search index, define its mappings and settings
A [mapping] is created from the search document model.
It defines how a document and the fields it contains are stored and indexed, including the data type of each field and the [tokenizer]s and [analyzer]s applied to it.
For certain fields, sub-fields are created using different analyzers.
The analyzers are chosen depending on the needs of each field.
The index and its mapping are currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
See the index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for the `dataset` search index.
## 3. Ingestion into search index
The actual indexing process for each [entity] is powered by [index builder]s.
The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job] and are invoked whenever a matching [MAE] is received.
Index builders should extend [BaseIndexBuilder]; see [DatasetIndexBuilder] for an example.
For the consumer job to consume those MAEs, add your index builder to the [index builder registry].
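The flow described above (a builder declares what it is interested in and emits documents only for matching events) might look roughly like the sketch below. It is a simplified assumption, not the real `BaseIndexBuilder` API: `DatasetSnapshotStub`, the `Builder` base class, `isInterestedIn`, `getDocumentsToUpdate` as declared here, and the map-based document are all hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexBuilderSketch {

  // Hypothetical snapshot: a URN plus an optional list of owners.
  static final class DatasetSnapshotStub {
    final String urn;
    final List<String> owners; // null when the snapshot carries no ownership aspect

    DatasetSnapshotStub(String urn, List<String> owners) {
      this.urn = urn;
      this.owners = owners;
    }
  }

  // Simplified counterpart of an index builder: snapshot in, partial documents out.
  abstract static class Builder<SNAPSHOT> {
    private final Class<SNAPSHOT> snapshotClass;

    Builder(Class<SNAPSHOT> snapshotClass) {
      this.snapshotClass = snapshotClass;
    }

    // A consumer would call this to decide whether to invoke the builder.
    boolean isInterestedIn(Object snapshot) {
      return snapshotClass.isInstance(snapshot);
    }

    abstract List<Map<String, Object>> getDocumentsToUpdate(SNAPSHOT snapshot);
  }

  static final class DatasetBuilder extends Builder<DatasetSnapshotStub> {
    DatasetBuilder() {
      super(DatasetSnapshotStub.class);
    }

    @Override
    List<Map<String, Object>> getDocumentsToUpdate(DatasetSnapshotStub snapshot) {
      if (snapshot.owners == null) {
        return Collections.emptyList(); // nothing this builder indexes has changed
      }
      Map<String, Object> document = new HashMap<>();
      document.put("urn", snapshot.urn);       // the only mandatory field
      document.put("owners", snapshot.owners); // every other field is optional
      return Collections.singletonList(document);
    }
  }

  public static void main(String[] args) {
    DatasetBuilder builder = new DatasetBuilder();
    Object event = new DatasetSnapshotStub("urn:li:dataset:sample", Arrays.asList("alice", "bob"));
    if (builder.isInterestedIn(event)) {
      System.out.println(builder.getDocumentsToUpdate((DatasetSnapshotStub) event));
    }
  }
}
```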
## 4. Search query configs
Once the [search index] is built, it's ready to be queried!
The search query is constructed and executed through the [Search DAO].
The raw search hits are retrieved and extracted using the base model.
Besides regular full-text search, run-time aggregation and relevance are supported in the search queries as well.
[ESSearchDAO] is the Elasticsearch implementation of [BaseSearchDAO].
It is still a generic class, which can be used for a specific [entity] once configured with a [BaseSearchConfig].
`BaseSearchConfig` is the abstraction for all query-related configuration, such as query templates and the default field to run autocomplete on.
```java
public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {

  public abstract Set<String> getFacetFields();

  public String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  public abstract Class<DOCUMENT> getSearchDocument();

  public abstract String getDefaultAutocompleteField();

  public abstract String getSearchQueryTemplate();

  public abstract String getAutocompleteQueryTemplate();
}
```
[DatasetSearchConfig] is the search config implementation for the `dataset` entity.
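For a new entity, a config along the same lines could be sketched as below. This is not the content of `DatasetSearchConfig`: the `ChartDocument` type, the facet and autocomplete field names, the inline query templates and the `$INPUT` placeholder are hypothetical, and the abstract class is a local copy of the one above without the `RecordTemplate` bound so that the example compiles on its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class SearchConfigSketch {

  // Hypothetical search document type; only its class name matters here.
  static final class ChartDocument {
  }

  // Local copy of the abstract config shown above, without the RecordTemplate bound.
  abstract static class Config<DOCUMENT> {
    abstract Set<String> getFacetFields();

    // By default the index name is derived from the document class name.
    String getIndexName() {
      return getSearchDocument().getSimpleName().toLowerCase();
    }

    abstract Class<DOCUMENT> getSearchDocument();

    abstract String getDefaultAutocompleteField();

    abstract String getSearchQueryTemplate();

    abstract String getAutocompleteQueryTemplate();
  }

  static final class ChartSearchConfig extends Config<ChartDocument> {
    @Override
    Set<String> getFacetFields() {
      return Collections.unmodifiableSet(new HashSet<>(Arrays.asList("tool", "owners")));
    }

    @Override
    Class<ChartDocument> getSearchDocument() {
      return ChartDocument.class;
    }

    @Override
    String getDefaultAutocompleteField() {
      return "title";
    }

    @Override
    String getSearchQueryTemplate() {
      // Hypothetical template; $INPUT stands for the user's query text.
      return "{\"query\":{\"query_string\":{\"query\":\"$INPUT\"}}}";
    }

    @Override
    String getAutocompleteQueryTemplate() {
      // Hypothetical template targeting the default autocomplete field.
      return "{\"query\":{\"match_phrase_prefix\":{\"title\":\"$INPUT\"}}}";
    }
  }

  public static void main(String[] args) {
    ChartSearchConfig config = new ChartSearchConfig();
    System.out.println("index: " + config.getIndexName());
    System.out.println("facets: " + config.getFacetFields());
    System.out.println("autocomplete on: " + config.getDefaultAutocompleteField());
  }
}
```

Because `getIndexName` is not abstract, a config only needs to override it when the index name should differ from the lower-cased document class name.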
## 5. Add search query endpoints to GMS
Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
[BaseSearchableEntityResource] provides an abstract implementation of the search and autocomplete APIs.
Any top-level rest.li resource implementation can extend it to get search and autocomplete endpoints automatically.
Refer to the [CorpUsers] rest.li resource implementation as an example.
[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
[curl]: https://en.wikipedia.org/wiki/CURL
[entity]: ../what/entity.md
[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
[aspect]: ../what/aspect.md
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
[mae]: ../what/mxe.md#metadata-audit-event-mae
[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
[search index]: ../what/search-index.md
[search dao]: ../architecture/metadata-serving.md#search-dao
[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java

@ -1,4 +1,4 @@
# What is Generalized Metadata Service (GMS)?
# What is Generalized Metadata Service (GMS)? [WIP]

@ -1,2 +1,2 @@
# What is URN?
# What is URN? [WIP]