mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-27 08:54:32 +00:00
Documentation update part-3
parent
6c99764fb1
commit
4342a9b39a
11
README.md
@ -5,8 +5,11 @@


## Introduction
DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services.
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).
You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented and
[DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case.
This repository contains the complete source code to be able to build DataHub's frontend & backend services.

## Quickstart
1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
@ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8
as username and password.

## Quicklinks
* [DataHub Architecture](docs/architecture/architecture.md)
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
* [Docker Images](docker)
* [Frontend App](datahub-frontend)
* [Generalized Metadata Store](gms)
* [Generalized Metadata Service](gms)
* [Metadata Consumer Jobs](metadata-jobs)
* [Metadata Ingestion](metadata-ingestion)

@ -1,6 +1,6 @@
# Metadata Ingestion Architecture

## MCE Consumer Job
## MCE Consumer Job [WIP]

## MAE Consumer Job

@ -1,2 +1,32 @@
# How to onboard an entity?

Currently, DataHub only has a support for 3 [entity] types: `datasets`, `users` and `groups`.
If you want to extend DataHub with your own use cases such as `metrics`, `charts`, `dashboards` etc, you should follow the below steps in order.

## 1. Define URN
Refer to [here](../what/urn.md) for URN definition.

## 2. Model your metadata
Refer to [metadata modelling](metadata-modelling.md) section.
Make sure to do the following:
1. Define [Aspect] models.
2. Define aspect union model. Refer to [DatasetAspect] as an example.
3. Define [Snapshot] model. Refer to [DatasetSnapshot] as an example.
4. Add your newly defined snapshot to [Snapshot Union] model.
5. Define [Entity] model. Refer to [DatasetEntity] as an example.

## 3. GMA search onboarding
Refer to [search onboarding](search-onboarding.md).

## 4. GMA graph onboarding
Refer to [graph onboarding](graph-onboarding.md).

## 5. UI for entity onboarding [WIP]

[Aspect]: ../what/aspect.md
[DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc
[Snapshot]: ../what/snapshot.md
[DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc
[Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc
[Entity]: ../what/entity.md
[DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc
@ -1,2 +1,161 @@
# How to onboard to GMA graph?

## 1. Define relationship models
If you need to define a [relationship] which is not available in the set of [relationship models] provided,
that relationship model should be implemented as a first step for graph onboarding.
Below is an example model for `OwnedBy` relationship:

```json
{
  "type": "record",
  "name": "OwnedBy",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "A generic model for the Owned-By relationship",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.DatasetUrn",
      "destination": "com.linkedin.common.urn.CorpuserUrn"
    }
  ],
  "fields": [
    {
      "name": "type",
      "type": "com.linkedin.common.OwnershipType",
      "doc": "The type of the ownership"
    }
  ]
}
```

## 2. Implement relationship builders
You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined.
Relationship builders build list of relationships after processing aspects and any relationship builder should implement `BaseRelationshipBuilder` abstract class.
Relationship builders are per aspect and per relationship type.

```java
public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> {

  private Class<ASPECT> _aspectClass;

  public BaseRelationshipBuilder(Class<ASPECT> aspectClass) {
    _aspectClass = aspectClass;
  }

  /**
   * Returns the aspect class this {@link BaseRelationshipBuilder} supports
   */
  public Class<ASPECT> supportedAspectClass() {
    return _aspectClass;
  }

  /**
   * Returns a list of corresponding relationship updates for the given metadata aspect
   */
  public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect);
}
```

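To make the builder contract more concrete, here is a minimal, self-contained sketch of what a builder for the `OwnedBy` relationship might look like. It is not the actual DataHub implementation: `Ownership`, `OwnedBy`, `SketchRelationshipBuilder` and `OwnedBySketchBuilder` are hypothetical, simplified stand-ins (plain strings replace the real URN and record types), and the URN strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an ownership aspect: just a list of owner URNs.
final class Ownership {
  final List<String> ownerUrns;

  Ownership(List<String> ownerUrns) {
    this.ownerUrns = ownerUrns;
  }
}

// Hypothetical stand-in for the OwnedBy relationship: a source/destination pair.
final class OwnedBy {
  final String source;
  final String destination;

  OwnedBy(String source, String destination) {
    this.source = source;
    this.destination = destination;
  }
}

// Simplified analogue of BaseRelationshipBuilder: one builder per aspect type.
abstract class SketchRelationshipBuilder<ASPECT> {
  private final Class<ASPECT> aspectClass;

  SketchRelationshipBuilder(Class<ASPECT> aspectClass) {
    this.aspectClass = aspectClass;
  }

  Class<ASPECT> supportedAspectClass() {
    return aspectClass;
  }

  abstract List<OwnedBy> buildRelationships(String urn, ASPECT aspect);
}

// Emits one OwnedBy edge from the entity to each owner listed in the aspect.
final class OwnedBySketchBuilder extends SketchRelationshipBuilder<Ownership> {

  OwnedBySketchBuilder() {
    super(Ownership.class);
  }

  @Override
  List<OwnedBy> buildRelationships(String urn, Ownership ownership) {
    List<OwnedBy> edges = new ArrayList<>();
    for (String ownerUrn : ownership.ownerUrns) {
      edges.add(new OwnedBy(urn, ownerUrn));
    }
    return Collections.unmodifiableList(edges);
  }

  public static void main(String[] args) {
    Ownership ownership = new Ownership(Arrays.asList("urn:li:corpuser:alice", "urn:li:corpuser:bob"));
    for (OwnedBy edge : new OwnedBySketchBuilder().buildRelationships("urn:li:dataset:example", ownership)) {
      System.out.println(edge.source + " -OwnedBy-> " + edge.destination);
    }
  }
}
```

The sketch keeps the same two responsibilities as the abstract class above: reporting which aspect class it handles, and turning one aspect into a list of edges.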
## 3. Implement graph builders
Graph builders build graph updates by processing [snapshot]s.
They internally use relationship builders to generate edges and nodes of the graph.
All relationship builders for an [entity] should be registered through graph builder.

```java
public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> {

  private final Class<SNAPSHOT> _snapshotClass;
  private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap;

  public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass,
      @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) {
    _snapshotClass = snapshotClass;
    _relationshipBuildersMap = relationshipBuilders.stream()
        .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity()));
  }

  @Nonnull
  Class<SNAPSHOT> supportedSnapshotClass() {
    return _snapshotClass;
  }

  @Nonnull
  @Override
  public GraphUpdates build(@Nonnull SNAPSHOT snapshot) {
    final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class);

    final List<? extends RecordTemplate> entities = buildEntities(snapshot);

    final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>();

    final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot);
    for (RecordTemplate aspect : aspects) {
      BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass());
      if (relationshipBuilder != null) {
        relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect));
      }
    }

    return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates));
  }

  @Nonnull
  protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot);
}
```

```java
public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> {
  private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS =
      Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() {
        {
          add(new DownstreamOfBuilderFromUpstreamLineage());
          add(new OwnedByBuilderFromOwnership());
        }
      });

  public DatasetGraphBuilder() {
    super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS);
  }

  @Nonnull
  @Override
  protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) {
    final DatasetUrn urn = snapshot.getUrn();
    final DatasetEntity entity = new DatasetEntity().setUrn(urn)
        .setName(urn.getDatasetNameEntity())
        .setPlatform(urn.getPlatformEntity())
        .setOrigin(urn.getOriginEntity());

    setRemovedProperty(snapshot, entity);

    return Collections.singletonList(entity);
  }
}
```

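The dispatch step inside `build` (look up a relationship builder by the aspect's class, skip aspects that have none) can be illustrated in isolation. The following is a self-contained sketch under simplifying assumptions, not DataHub code: `AspectDispatchSketch`, `EdgeBuilder` and the nested `Ownership` class are hypothetical, and edges are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class AspectDispatchSketch {

  /** Hypothetical stand-in for an aspect. */
  static final class Ownership {
    final String ownerUrn;

    Ownership(String ownerUrn) {
      this.ownerUrn = ownerUrn;
    }
  }

  /** Hypothetical, simplified builder contract; edges are plain strings here. */
  interface EdgeBuilder {
    Class<?> supportedAspectClass();

    List<String> buildRelationships(String urn, Object aspect);
  }

  /** Example builder that emits one OwnedBy edge per Ownership aspect. */
  static EdgeBuilder ownedByBuilder() {
    return new EdgeBuilder() {
      @Override
      public Class<?> supportedAspectClass() {
        return Ownership.class;
      }

      @Override
      public List<String> buildRelationships(String urn, Object aspect) {
        Ownership ownership = (Ownership) aspect;
        return Collections.singletonList(urn + " -OwnedBy-> " + ownership.ownerUrn);
      }
    };
  }

  private final Map<Class<?>, EdgeBuilder> buildersByAspect;

  AspectDispatchSketch(Collection<EdgeBuilder> builders) {
    // Index the builders by the aspect class they support. Without a merge
    // function, Collectors.toMap throws IllegalStateException on duplicate keys.
    this.buildersByAspect = builders.stream()
        .collect(Collectors.toMap(EdgeBuilder::supportedAspectClass, Function.identity()));
  }

  /** Collects edges from every aspect that has a registered builder. */
  List<String> build(String urn, List<Object> aspects) {
    List<String> edges = new ArrayList<>();
    for (Object aspect : aspects) {
      EdgeBuilder builder = buildersByAspect.get(aspect.getClass());
      if (builder != null) {
        edges.addAll(builder.buildRelationships(urn, aspect));
      }
    }
    return edges;
  }

  public static void main(String[] args) {
    AspectDispatchSketch sketch = new AspectDispatchSketch(Collections.singletonList(ownedByBuilder()));
    List<Object> aspects = Arrays.asList(new Ownership("urn:li:corpuser:jdoe"), "aspect without a builder");
    System.out.println(sketch.build("urn:li:dataset:example", aspects));
  }
}
```

One consequence of keying the map by aspect class alone, at least in this sketch, is that registering two builders for the same aspect class fails at construction time, so each aspect class can map to only one builder.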
## 4. Ingestion into graph
The ingestion process for each [entity] is done by graph builders.
The builders will be invoked whenever an [MAE] is received by [MAE Consumer Job].
Graph builders should be extended from BaseGraphBuilder. Check DatasetGraphBuilder as an example above.
For the consumer job to consume those MAEs, you should add your graph builder to the [graph builder registry].

## 5. Graph queries
You can onboard the graph queries which fit to your specific use cases using [Query DAO].
You also need to create [rest.li](https://rest.li) APIs to serve your graph queries.
[BaseQueryDAO] provides an abstract implementation of several graph query APIs.
Refer to [DownstreamLineageResource] rest.li resource implementation to see a use case of graph queries.

[relationship]: ../what/relationship.md
[relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship
[aspect]: ../what/aspect.md
[snapshot]: ../what/snapshot.md
[entity]: ../what/entity.md
[mae]: ../what/mxe.md#metadata-audit-event-mae
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
[graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java
[query dao]: ../architecture/metadata-serving.md#query-dao
[BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java
[DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java
@ -1,2 +1,85 @@
# How to onboard to GMA search?

## 1. Define search document model for the entity
Modeling is the most important and crucial part of your design.
[Search document] model contains a list of fields that need to be indexed along with the names and their data types.
Check [here][Search document] to learn more about search document model.
Please note that all fields in the search document model (except the `urn`) are `optional`.
This is because we want to support partial updates to search documents.

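The reason optional fields enable partial updates can be shown with a small, self-contained sketch. It is only an illustration of the idea, not how DataHub or Elasticsearch applies updates: the document is modelled as a plain map, and `PartialUpdateSketch` and the field names (`name`, `platform`, `owners`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class PartialUpdateSketch {

  /**
   * Merges a partial document into an existing one. Only the urn is mandatory;
   * fields absent from the update keep their existing values.
   */
  static Map<String, Object> applyPartialUpdate(Map<String, Object> existing, Map<String, Object> update) {
    Object urn = Objects.requireNonNull(update.get("urn"), "urn is the only required field");
    if (!urn.equals(existing.get("urn"))) {
      throw new IllegalArgumentException("update targets a different document");
    }
    Map<String, Object> merged = new HashMap<>(existing);
    merged.putAll(update); // only fields present in the update are overwritten
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> existing = new HashMap<>();
    existing.put("urn", "urn:li:dataset:example");
    existing.put("name", "pageviews");
    existing.put("platform", "kafka");

    // A partial document carrying only the urn and the field that changed.
    Map<String, Object> update = new HashMap<>();
    update.put("urn", "urn:li:dataset:example");
    update.put("owners", Arrays.asList("jdoe"));

    System.out.println(applyPartialUpdate(existing, update));
  }
}
```

If every field were required, each update would have to resend the whole document; with optional fields, a producer that only knows about ownership can send just the `urn` and the ownership field.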
[Search document]: ../what/search-document.md

## 2. Create the search index, define its mappings and settings

A [mapping] is created using the information of search document model.
It defines how a document, and the fields it contains, are stored and indexed by various [tokenizer]s, [analyzer]s and data type for the fields.
For certain fields, sub-fields are created using different analyzers.
The analyzers are chosen depending on the needs for each field.
This is currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
Check index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for `dataset` search index.

## 3. Ingestion into search index
The actual indexing process for each [entity] is powered by [index builder]s.
The builders register the metadata [aspect]s of their interest against [MAE Consumer Job] and will be invoked whenever an [MAE] of same interest is received.
Index builders should be extended from [BaseIndexBuilder]. Check [DatasetIndexBuilder] as an example.
For the consumer job to consume those MAEs, you should add your index builder to the [index builder registry].

## 4. Search query configs
Once you have the [search index] built, it's ready to be queried!
The search query is constructed and executed through [Search DAO].
The raw search hits are retrieved and extracted using the base model.
Besides the regular full text search, run time aggregation and relevance are provided in the search queries as well.

[ESSearchDAO] is the implementation for the [BaseSearchDAO] for Elasticsearch.
It's still a generic class which can be used for a specific [entity] and configured using [BaseSearchConfig].

BaseSearchConfig is the abstraction for all query related configurations such as query templates, default field to execute autocomplete on etc.

```java
public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {

  public abstract Set<String> getFacetFields();

  public String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  public abstract Class<DOCUMENT> getSearchDocument();

  public abstract String getDefaultAutocompleteField();

  public abstract String getSearchQueryTemplate();

  public abstract String getAutocompleteQueryTemplate();
}
```

[DatasetSearchConfig] is the implementation of search config for `dataset` entity.

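As a rough illustration of how a concrete config plugs into this abstraction, the sketch below reproduces only the index-name default from `BaseSearchConfig`. It is not the real `DatasetSearchConfig`: `SketchSearchConfig`, `DatasetSketchSearchConfig`, the empty `DatasetDocument` class and the facet field names are hypothetical stand-ins, and the remaining abstract methods are omitted for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a generated search document class.
final class DatasetDocument {
}

// Reduced analogue of BaseSearchConfig keeping only two of its abstract methods.
abstract class SketchSearchConfig<DOCUMENT> {

  abstract Set<String> getFacetFields();

  abstract Class<DOCUMENT> getSearchDocument();

  // Same default as in BaseSearchConfig: lower-cased simple name of the document class.
  String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }
}

final class DatasetSketchSearchConfig extends SketchSearchConfig<DatasetDocument> {

  @Override
  Set<String> getFacetFields() {
    // Illustrative facet names only.
    return Collections.unmodifiableSet(new HashSet<>(Arrays.asList("platform", "origin")));
  }

  @Override
  Class<DatasetDocument> getSearchDocument() {
    return DatasetDocument.class;
  }

  public static void main(String[] args) {
    System.out.println(new DatasetSketchSearchConfig().getIndexName());
  }
}
```

Because the default index name is derived from the document class, a subclass only needs to override `getIndexName()` when the index it queries is named differently.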
## 5. Add search query endpoints to GMS
Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
[BaseSearchableEntityResource] provides an abstract implementation of search and autocomplete APIs.
Any top level rest.li resource implementation could extend it and this will automatically provide search and autocomplete endpoints.
Refer to [CorpUsers] rest.li resource implementation as an example.


[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
[curl]: https://en.wikipedia.org/wiki/CURL
[entity]: ../what/entity.md
[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
[aspect]: ../what/aspect.md
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
[mae]: ../what/mxe.md#metadata-audit-event-mae
[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
[search index]: ../what/search-index.md
[search dao]: ../architecture/metadata-serving.md#search-dao
[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java
@ -1,4 +1,4 @@
# What is Generalized Metadata Service (GMS)?
# What is Generalized Metadata Service (GMS)? [WIP]



@ -1,2 +1,2 @@
# What is URN?
# What is URN? [WIP]
