mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-29 01:42:08 +00:00
Documentation update part-3
parent 6c99764fb1
commit 4342a9b39a
11
README.md
@ -5,8 +5,11 @@
|
|||||||

|

|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our
|
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
|
||||||
[Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services.
|
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).
|
||||||
|
You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented and
|
||||||
|
[DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case.
|
||||||
|
This repository contains the complete source code to be able to build DataHub's frontend & backend services.
|
||||||
|
|
||||||
## Quickstart
|
## Quickstart
|
||||||
1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
|
1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
|
||||||
@ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8
|
|||||||
as username and password.
|
as username and password.
|
||||||
|
|
||||||
## Quicklinks
|
## Quicklinks
|
||||||
|
* [DataHub Architecture](docs/architecture/architecture.md)
|
||||||
|
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
|
||||||
* [Docker Images](docker)
|
* [Docker Images](docker)
|
||||||
* [Frontend App](datahub-frontend)
|
* [Frontend App](datahub-frontend)
|
||||||
* [Generalized Metadata Store](gms)
|
* [Generalized Metadata Service](gms)
|
||||||
* [Metadata Consumer Jobs](metadata-jobs)
|
* [Metadata Consumer Jobs](metadata-jobs)
|
||||||
* [Metadata Ingestion](metadata-ingestion)
|
* [Metadata Ingestion](metadata-ingestion)
|
||||||
|
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
# Metadata Ingestion Architecture
|
# Metadata Ingestion Architecture
|
||||||
|
|
||||||
## MCE Consumer Job
|
## MCE Consumer Job [WIP]
|
||||||
|
|
||||||
## MAE Consumer Job
|
## MAE Consumer Job
|
||||||
|
|
||||||
|
|||||||
@ -1,2 +1,32 @@
|
|||||||
# How to onboard an entity?
|
# How to onboard an entity?
|
||||||
|
|
||||||
|
Currently, DataHub supports only three [entity] types: `datasets`, `users` and `groups`.
|
||||||
|
To extend DataHub with your own use cases, such as `metrics`, `charts` or `dashboards`, follow the steps below in order.
|
||||||
|
|
||||||
|
## 1. Define URN
|
||||||
|
Refer to the [URN definition](../what/urn.md).
|
||||||
|
|
||||||
|
## 2. Model your metadata
|
||||||
|
Refer to the [metadata modelling](metadata-modelling.md) section.
|
||||||
|
Make sure to do the following:
|
||||||
|
1. Define the [Aspect] models.
2. Define the aspect union model. Refer to [DatasetAspect] as an example.
3. Define the [Snapshot] model. Refer to [DatasetSnapshot] as an example.
4. Add your newly defined snapshot to the [Snapshot Union] model.
5. Define the [Entity] model. Refer to [DatasetEntity] as an example.
|
||||||
|
|
||||||
|
## 3. GMA search onboarding
|
||||||
|
Refer to [search onboarding](search-onboarding.md).
|
||||||
|
|
||||||
|
## 4. GMA graph onboarding
|
||||||
|
Refer to [graph onboarding](graph-onboarding.md).
|
||||||
|
|
||||||
|
## 5. UI for entity onboarding [WIP]
|
||||||
|
|
||||||
|
[Aspect]: ../what/aspect.md
|
||||||
|
[DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc
|
||||||
|
[Snapshot]: ../what/snapshot.md
|
||||||
|
[DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc
|
||||||
|
[Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc
|
||||||
|
[Entity]: ../what/entity.md
|
||||||
|
[DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc
|
||||||
@ -1,2 +1,161 @@
|
|||||||
# How to onboard to GMA graph?
|
# How to onboard to GMA graph?
|
||||||
|
|
||||||
|
## 1. Define relationship models
|
||||||
|
If you need a [relationship] that is not available in the provided set of [relationship models],
implement that relationship model as the first step of graph onboarding.
|
||||||
|
Below is an example model for the `OwnedBy` relationship:
|
||||||
|
|
||||||
|
```json
{
  "type": "record",
  "name": "OwnedBy",
  "namespace": "com.linkedin.metadata.relationship",
  "doc": "A generic model for the Owned-By relationship",
  "include": [
    "BaseRelationship"
  ],
  "pairings": [
    {
      "source": "com.linkedin.common.urn.DatasetUrn",
      "destination": "com.linkedin.common.urn.CorpuserUrn"
    }
  ],
  "fields": [
    {
      "name": "type",
      "type": "com.linkedin.common.OwnershipType",
      "doc": "The type of the ownership"
    }
  ]
}
```
|
||||||
|
|
||||||
|
## 2. Implement relationship builders
|
||||||
|
You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined.
|
||||||
|
Relationship builders build a list of relationships after processing aspects, and every relationship builder should extend the `BaseRelationshipBuilder` abstract class.
|
||||||
|
Relationship builders are per aspect and per relationship type.
|
||||||
|
|
||||||
|
```java
public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> {

  private Class<ASPECT> _aspectClass;

  public BaseRelationshipBuilder(Class<ASPECT> aspectClass) {
    _aspectClass = aspectClass;
  }

  /**
   * Returns the aspect class this {@link BaseRelationshipBuilder} supports
   */
  public Class<ASPECT> supportedAspectClass() {
    return _aspectClass;
  }

  /**
   * Returns a list of corresponding relationship updates for the given metadata aspect
   */
  public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect);
}
```
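To illustrate "per aspect and per relationship type", here is a minimal, self-contained sketch of a builder that turns an ownership aspect into `OwnedBy` edges. It is not the real `OwnedByBuilderFromOwnership`: every type below (`OwnershipSketch`, `OwnedBySketch`, `RelationshipBuilderSketch`, `OwnedByFromOwnershipSketch`) is a hypothetical stand-in, URNs are plain strings, and the return type is simplified from `GraphBuilder.RelationshipUpdates` to a plain list so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an ownership aspect: just a list of owner URNs.
class OwnershipSketch {
  final List<String> ownerUrns;

  OwnershipSketch(List<String> ownerUrns) {
    this.ownerUrns = ownerUrns;
  }
}

// Hypothetical stand-in for the OwnedBy relationship model (source -> destination).
class OwnedBySketch {
  final String source;
  final String destination;

  OwnedBySketch(String source, String destination) {
    this.source = source;
    this.destination = destination;
  }
}

// Mirrors the shape of BaseRelationshipBuilder: one builder handles one aspect class.
abstract class RelationshipBuilderSketch<ASPECT> {
  private final Class<ASPECT> aspectClass;

  RelationshipBuilderSketch(Class<ASPECT> aspectClass) {
    this.aspectClass = aspectClass;
  }

  Class<ASPECT> supportedAspectClass() {
    return aspectClass;
  }

  abstract List<OwnedBySketch> buildRelationships(String urn, ASPECT aspect);
}

// Emits one OwnedBy edge per owner listed in the ownership aspect.
class OwnedByFromOwnershipSketch extends RelationshipBuilderSketch<OwnershipSketch> {
  OwnedByFromOwnershipSketch() {
    super(OwnershipSketch.class);
  }

  @Override
  List<OwnedBySketch> buildRelationships(String urn, OwnershipSketch aspect) {
    List<OwnedBySketch> edges = new ArrayList<>();
    for (String owner : aspect.ownerUrns) {
      edges.add(new OwnedBySketch(urn, owner));
    }
    return Collections.unmodifiableList(edges);
  }
}
```

A second relationship derived from the same aspect would, under this design, get its own builder class rather than being folded into this one.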
|
||||||
|
|
||||||
|
## 3. Implement graph builders
|
||||||
|
Graph builders build graph updates by processing [snapshot]s.
|
||||||
|
They internally use relationship builders to generate edges and nodes of the graph.
|
||||||
|
All relationship builders for an [entity] should be registered through its graph builder.
|
||||||
|
|
||||||
|
```java
public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> {

  private final Class<SNAPSHOT> _snapshotClass;
  private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap;

  public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass,
      @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) {
    _snapshotClass = snapshotClass;
    _relationshipBuildersMap = relationshipBuilders.stream()
        .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity()));
  }

  @Nonnull
  Class<SNAPSHOT> supportedSnapshotClass() {
    return _snapshotClass;
  }

  @Nonnull
  @Override
  public GraphUpdates build(@Nonnull SNAPSHOT snapshot) {
    final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class);

    final List<? extends RecordTemplate> entities = buildEntities(snapshot);

    final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>();

    final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot);
    for (RecordTemplate aspect : aspects) {
      BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass());
      if (relationshipBuilder != null) {
        relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect));
      }
    }

    return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates));
  }

  @Nonnull
  protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot);
}
```
|
||||||
|
|
||||||
|
```java
public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> {
  private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS =
      Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() {
        {
          add(new DownstreamOfBuilderFromUpstreamLineage());
          add(new OwnedByBuilderFromOwnership());
        }
      });

  public DatasetGraphBuilder() {
    super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS);
  }

  @Nonnull
  @Override
  protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) {
    final DatasetUrn urn = snapshot.getUrn();
    final DatasetEntity entity = new DatasetEntity().setUrn(urn)
        .setName(urn.getDatasetNameEntity())
        .setPlatform(urn.getPlatformEntity())
        .setOrigin(urn.getOriginEntity());

    setRemovedProperty(snapshot, entity);

    return Collections.singletonList(entity);
  }
}
```
|
||||||
|
|
||||||
|
## 4. Ingestion into graph
|
||||||
|
The ingestion process for each [entity] is done by graph builders.
|
||||||
|
The builders are invoked whenever an [MAE] is received by the [MAE Consumer Job].
|
||||||
|
Graph builders should extend `BaseGraphBuilder`. See `DatasetGraphBuilder` above for an example.
|
||||||
|
For the consumer job to consume those MAEs, you should add your graph builder to the [graph builder registry].
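The registration step can be pictured as a lookup table from snapshot class to graph builder. The sketch below is an illustration only and is not the contents of `RegisteredGraphBuilders`: all names (`SnapshotGraphBuilderSketch`, `DatasetSnapshotSketch`, `DatasetGraphBuilderSketch`, `GraphBuilderRegistrySketch`) are hypothetical, and the builder returns a plain string instead of `GraphUpdates` to stay self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a graph builder keyed by the snapshot class it supports.
interface SnapshotGraphBuilderSketch<S> {
  Class<S> supportedSnapshotClass();

  String build(S snapshot);
}

// Hypothetical stand-in for a dataset snapshot; only the URN is modeled.
class DatasetSnapshotSketch {
  final String urn;

  DatasetSnapshotSketch(String urn) {
    this.urn = urn;
  }
}

class DatasetGraphBuilderSketch implements SnapshotGraphBuilderSketch<DatasetSnapshotSketch> {
  @Override
  public Class<DatasetSnapshotSketch> supportedSnapshotClass() {
    return DatasetSnapshotSketch.class;
  }

  @Override
  public String build(DatasetSnapshotSketch snapshot) {
    return "node:" + snapshot.urn;
  }
}

// A consumer job could use a registry like this to find the builder for an incoming snapshot.
class GraphBuilderRegistrySketch {
  private final Map<Class<?>, SnapshotGraphBuilderSketch<?>> builders = new HashMap<>();

  void register(SnapshotGraphBuilderSketch<?> builder) {
    builders.put(builder.supportedSnapshotClass(), builder);
  }

  // Returns an empty Optional when no builder is registered for the snapshot class.
  Optional<SnapshotGraphBuilderSketch<?>> lookup(Class<?> snapshotClass) {
    return Optional.ofNullable(builders.get(snapshotClass));
  }
}
```

In this sketch, a snapshot type without a registered builder yields an empty lookup, which mirrors why an unregistered graph builder would never be invoked.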
|
||||||
|
|
||||||
|
## 5. Graph queries
|
||||||
|
You can onboard the graph queries that fit your specific use cases using the [Query DAO].
|
||||||
|
You also need to create [rest.li](https://rest.li) APIs to serve your graph queries.
|
||||||
|
[BaseQueryDAO] provides an abstract implementation of several graph query APIs.
|
||||||
|
Refer to the [DownstreamLineageResource] rest.li resource implementation for an example use of graph queries.
|
||||||
|
|
||||||
|
[relationship]: ../what/relationship.md
|
||||||
|
[relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship
|
||||||
|
[aspect]: ../what/aspect.md
|
||||||
|
[snapshot]: ../what/snapshot.md
|
||||||
|
[entity]: ../what/entity.md
|
||||||
|
[mae]: ../what/mxe.md#metadata-audit-event-mae
|
||||||
|
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
|
||||||
|
[graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java
|
||||||
|
[query dao]: ../architecture/metadata-serving.md#query-dao
|
||||||
|
[BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java
|
||||||
|
[DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java
|
||||||
@ -1,2 +1,85 @@
|
|||||||
# How to onboard to GMA search?
|
# How to onboard to GMA search?
|
||||||
|
|
||||||
|
## 1. Define search document model for the entity
|
||||||
|
Modeling is the most crucial part of your design.
|
||||||
|
The [search document][Search document] model lists the fields that need to be indexed, along with their names and data types.
See the [search document][Search document] page to learn more about the model.
|
||||||
|
Please note that all fields in the search document model (except the `urn`) are `optional`.
|
||||||
|
This is because we want to support partial updates to search documents.
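A minimal sketch of why optional fields make partial updates possible: a partial document carries the `urn` plus only the fields that changed, and everything else is left as stored. The class `PartialDocumentUpdateSketch`, its `merge` method and the map-based document representation are assumptions for illustration, not part of GMA.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class PartialDocumentUpdateSketch {

  // Merges a partial document into the stored one. The urn is the only required
  // field and must match; fields absent (or null) in the partial document are kept as stored.
  static Map<String, Object> merge(Map<String, Object> stored, Map<String, Object> partial) {
    if (!Objects.equals(stored.get("urn"), partial.get("urn"))) {
      throw new IllegalArgumentException("urn mismatch");
    }
    Map<String, Object> merged = new HashMap<>(stored);
    for (Map.Entry<String, Object> field : partial.entrySet()) {
      if (field.getValue() != null) {
        merged.put(field.getKey(), field.getValue());
      }
    }
    return merged;
  }
}
```

If non-`urn` fields were required instead, every update would have to resend the full document, even when a single aspect changed.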
|
||||||
|
|
||||||
|
[Search document]: ../what/search-document.md
|
||||||
|
|
||||||
|
## 2. Create the search index, define its mappings and settings
|
||||||
|
|
||||||
|
A [mapping] is created from the information in the search document model.
It defines how a document and its fields are stored and indexed: the data type of each field and the [tokenizer]s and [analyzer]s applied to it.
|
||||||
|
For certain fields, sub-fields are created using different analyzers.
|
||||||
|
The analyzers are chosen depending on the needs for each field.
|
||||||
|
This is currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
|
||||||
|
See the index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for the `dataset` search index.
|
||||||
|
|
||||||
|
## 3. Ingestion into search index
|
||||||
|
The actual indexing process for each [entity] is powered by [index builder]s.
|
||||||
|
The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job] and are invoked whenever an [MAE] for one of those aspects is received.
|
||||||
|
Index builders should extend [BaseIndexBuilder]. See [DatasetIndexBuilder] for an example.
|
||||||
|
For the consumer job to consume those MAEs, you should add your index builder to the [index builder registry].
|
||||||
|
|
||||||
|
## 4. Search query configs
|
||||||
|
Once you have the [search index] built, it's ready to be queried!
|
||||||
|
The search query is constructed and executed through the [Search DAO].
|
||||||
|
The raw search hits are retrieved and extracted using the base model.
|
||||||
|
Besides regular full-text search, the search queries also provide run-time aggregation and relevance.
|
||||||
|
|
||||||
|
[ESSearchDAO] is the Elasticsearch implementation of [BaseSearchDAO].
|
||||||
|
It's still a generic class which can be used for a specific [entity] and configured using [BaseSearchConfig].
|
||||||
|
|
||||||
|
`BaseSearchConfig` is the abstraction for all query-related configuration, such as query templates and the default field to run autocomplete on.
|
||||||
|
|
||||||
|
```java
public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {

  public abstract Set<String> getFacetFields();

  public String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  public abstract Class<DOCUMENT> getSearchDocument();

  public abstract String getDefaultAutocompleteField();

  public abstract String getSearchQueryTemplate();

  public abstract String getAutocompleteQueryTemplate();
}
```
|
||||||
|
|
||||||
|
[DatasetSearchConfig] is the implementation of the search config for the `dataset` entity.
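For a new entity, a config along the following lines could be written. This is a self-contained sketch, not the real `DatasetSearchConfig`: `SearchConfigSketch` mirrors the shape of `BaseSearchConfig` shown above but drops the `RecordTemplate` bound, and `ChartDocument`, `ChartSearchConfigSketch`, the field names and the template file names are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Stand-in mirroring BaseSearchConfig, without the RecordTemplate bound.
abstract class SearchConfigSketch<DOCUMENT> {
  abstract Set<String> getFacetFields();

  // Same default as above: the index name is the lower-cased document class name.
  String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  abstract Class<DOCUMENT> getSearchDocument();

  abstract String getDefaultAutocompleteField();

  abstract String getSearchQueryTemplate();

  abstract String getAutocompleteQueryTemplate();
}

// Hypothetical search document for a "chart" entity.
class ChartDocument {
}

class ChartSearchConfigSketch extends SearchConfigSketch<ChartDocument> {
  @Override
  Set<String> getFacetFields() {
    // Hypothetical fields to aggregate on.
    return Collections.unmodifiableSet(new HashSet<>(Arrays.asList("tool", "owners")));
  }

  @Override
  Class<ChartDocument> getSearchDocument() {
    return ChartDocument.class;
  }

  @Override
  String getDefaultAutocompleteField() {
    return "title";
  }

  @Override
  String getSearchQueryTemplate() {
    // Hypothetical template name; a real config would supply its own query template.
    return "chartSearchQueryTemplate.json";
  }

  @Override
  String getAutocompleteQueryTemplate() {
    return "chartAutocompleteQueryTemplate.json";
  }
}
```

Because `getIndexName()` is derived from the document class, the sketch above would target an index named `chartdocument` unless the method is overridden.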
|
||||||
|
|
||||||
|
## 5. Add search query endpoints to GMS
|
||||||
|
Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
|
||||||
|
[BaseSearchableEntityResource] provides an abstract implementation of search and autocomplete APIs.
|
||||||
|
Any top-level rest.li resource implementation can extend it to get search and autocomplete endpoints automatically.
|
||||||
|
Refer to the [CorpUsers] rest.li resource implementation as an example.
|
||||||
|
|
||||||
|
|
||||||
|
[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
|
||||||
|
[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
|
||||||
|
[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
|
||||||
|
[curl]: https://en.wikipedia.org/wiki/CURL
|
||||||
|
[entity]: ../what/entity.md
|
||||||
|
[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
|
||||||
|
[aspect]: ../what/aspect.md
|
||||||
|
[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
|
||||||
|
[mae]: ../what/mxe.md#metadata-audit-event-mae
|
||||||
|
[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
|
||||||
|
[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
|
||||||
|
[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
|
||||||
|
[search index]: ../what/search-index.md
|
||||||
|
[search dao]: ../architecture/metadata-serving.md#search-dao
|
||||||
|
[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
|
||||||
|
[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
|
||||||
|
[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
|
||||||
|
[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
|
||||||
|
[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
|
||||||
|
[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java
|
||||||
@ -1,4 +1,4 @@
|
|||||||
# What is Generalized Metadata Service (GMS)?
|
# What is Generalized Metadata Service (GMS)? [WIP]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@ -1,2 +1,2 @@
|
|||||||
# What is URN?
|
# What is URN? [WIP]
|
||||||
|
|
||||||
|
|||||||