diff --git a/README.md b/README.md
index c7b6b84911..34cc078616 100644
--- a/README.md
+++ b/README.md
@@ -5,8 +5,11 @@
 ![DataHub](docs/imgs/datahub-logo.png)
 
 ## Introduction
-DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our
-[Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services.
+DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
+[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).
+You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented and
+[DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case.
+This repository contains the complete source code to be able to build DataHub's frontend & backend services.
 
 ## Quickstart
 1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/).
@@ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8
 as username and password.
 
 ## Quicklinks
+* [DataHub Architecture](docs/architecture/architecture.md)
+* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
 * [Docker Images](docker)
 * [Frontend App](datahub-frontend)
-* [Generalized Metadata Store](gms)
+* [Generalized Metadata Service](gms)
 * [Metadata Consumer Jobs](metadata-jobs)
 * [Metadata Ingestion](metadata-ingestion)
 
diff --git a/docs/architecture/metadata-ingestion.md b/docs/architecture/metadata-ingestion.md
index 01e48dbe9b..b0aa81d5a9 100644
--- a/docs/architecture/metadata-ingestion.md
+++ b/docs/architecture/metadata-ingestion.md
@@ -1,6 +1,6 @@
 # Metadata Ingestion Architecture
 
-## MCE Consumer Job
+## MCE Consumer Job [WIP]
 
 ## MAE Consumer Job
 
diff --git a/docs/how/entity-onboarding.md b/docs/how/entity-onboarding.md
index 6fbf3189d3..0615f87499 100644
--- a/docs/how/entity-onboarding.md
+++ b/docs/how/entity-onboarding.md
@@ -1,2 +1,32 @@
 # How to onboard an entity?
 
+Currently, DataHub supports only 3 [entity] types: `datasets`, `users` and `groups`.
+To extend DataHub with your own entities, such as `metrics`, `charts` or `dashboards`, follow the steps below in order.
+
+## 1. Define URN
+Refer to [What is URN?](../what/urn.md) for the URN definition.
+
+## 2. Model your metadata
+Refer to the [metadata modelling](metadata-modelling.md) section.
+Make sure to do the following:
+1. Define [Aspect] models.
+2. Define the aspect union model. Refer to [DatasetAspect] as an example.
+3. Define the [Snapshot] model. Refer to [DatasetSnapshot] as an example.
+4. Add your newly defined snapshot to the [Snapshot Union] model.
+5. Define the [Entity] model. Refer to [DatasetEntity] as an example.
+
+## 3. GMA search onboarding
+Refer to [search onboarding](search-onboarding.md).
+
+## 4. GMA graph onboarding
+Refer to [graph onboarding](graph-onboarding.md).
+
+## 5. UI for entity onboarding [WIP]
+
+[Aspect]: ../what/aspect.md
+[DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc
+[Snapshot]: ../what/snapshot.md
+[DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc
+[Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc
+[Entity]: ../what/entity.md
+[DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc
\ No newline at end of file
diff --git a/docs/how/graph-onboarding.md b/docs/how/graph-onboarding.md
index 3ec0268ab3..8b448447cf 100644
--- a/docs/how/graph-onboarding.md
+++ b/docs/how/graph-onboarding.md
@@ -1,2 +1,161 @@
 # How to onboard to GMA graph?
 
+## 1. Define relationship models
+If you need a [relationship] that is not available in the set of [relationship models] provided,
+implement that relationship model as the first step of graph onboarding.
+Below is an example model for the `OwnedBy` relationship:
+
+```json
+{
+  "type": "record",
+  "name": "OwnedBy",
+  "namespace": "com.linkedin.metadata.relationship",
+  "doc": "A generic model for the Owned-By relationship",
+  "include": [
+    "BaseRelationship"
+  ],
+  "pairings": [
+    {
+      "source": "com.linkedin.common.urn.DatasetUrn",
+      "destination": "com.linkedin.common.urn.CorpuserUrn"
+    }
+  ],
+  "fields": [
+    {
+      "name": "type",
+      "type": "com.linkedin.common.OwnershipType",
+      "doc": "The type of the ownership"
+    }
+  ]
+}
+```
+
+## 2. Implement relationship builders
+You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined.
+A relationship builder processes an aspect and builds a list of relationships; every relationship builder should extend the `BaseRelationshipBuilder` abstract class.
+There is one relationship builder per aspect and per relationship type.
+
+```java
+public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> {
+
+  private Class<ASPECT> _aspectClass;
+
+  public BaseRelationshipBuilder(Class<ASPECT> aspectClass) {
+    _aspectClass = aspectClass;
+  }
+
+  /**
+   * Returns the aspect class this {@link BaseRelationshipBuilder} supports
+   */
+  public Class<ASPECT> supportedAspectClass() {
+    return _aspectClass;
+  }
+
+  /**
+   * Returns a list of corresponding relationship updates for the given metadata aspect
+   */
+  public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect);
+}
+```
+
+## 3. Implement graph builders
+Graph builders build graph updates by processing [snapshot]s.
+They internally use relationship builders to generate edges and nodes of the graph.
+All relationship builders for an [entity] should be registered through the graph builder.
+
+```java
+public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> {
+
+  private final Class<SNAPSHOT> _snapshotClass;
+  private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap;
+
+  public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass,
+      @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) {
+    _snapshotClass = snapshotClass;
+    _relationshipBuildersMap = relationshipBuilders.stream()
+        .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity()));
+  }
+
+  @Nonnull
+  Class<SNAPSHOT> supportedSnapshotClass() {
+    return _snapshotClass;
+  }
+
+  @Nonnull
+  @Override
+  public GraphUpdates build(@Nonnull SNAPSHOT snapshot) {
+    final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class);
+
+    final List<? extends RecordTemplate> entities = buildEntities(snapshot);
+
+    final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>();
+
+    final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot);
+    for (RecordTemplate aspect : aspects) {
+      BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass());
+      if (relationshipBuilder != null) {
+        relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect));
+      }
+    }
+
+    return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates));
+  }
+
+  @Nonnull
+  protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot);
+}
+```
+
+```java
+public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> {
+  private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS =
+      Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() {
+        {
+          add(new DownstreamOfBuilderFromUpstreamLineage());
+          add(new OwnedByBuilderFromOwnership());
+        }
+      });
+
+  public DatasetGraphBuilder() {
+    super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS);
+  }
+
+  @Nonnull
+  @Override
+  protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) {
+    final DatasetUrn urn = snapshot.getUrn();
+    final DatasetEntity entity = new DatasetEntity().setUrn(urn)
+        .setName(urn.getDatasetNameEntity())
+        .setPlatform(urn.getPlatformEntity())
+        .setOrigin(urn.getOriginEntity());
+
+    setRemovedProperty(snapshot, entity);
+
+    return Collections.singletonList(entity);
+  }
+}
+```
+
+## 4. Ingestion into graph
+The ingestion process for each [entity] is done by graph builders.
+The builders will be invoked whenever an [MAE] is received by the [MAE Consumer Job].
+Graph builders should extend `BaseGraphBuilder`. See `DatasetGraphBuilder` above for an example.
+For the consumer job to consume those MAEs, you should add your graph builder to the [graph builder registry].
+
+## 5. Graph queries
+You can onboard graph queries that fit your specific use cases using [Query DAO].
+You also need to create [rest.li](https://rest.li) APIs to serve your graph queries.
+[BaseQueryDAO] provides an abstract implementation of several graph query APIs.
+Refer to the [DownstreamLineageResource] rest.li resource implementation to see a use case of graph queries.
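Steps 2 to 4 above can be sketched in isolation as follows. This is a minimal, hedged illustration of the registration and dispatch pattern only: every type here (`OwnershipSketch`, `EdgeSketch`, `RelationshipBuilderSketch`, `OwnedBySketch`, `GraphBuilderSketch`) and the `urn:example:...` strings are hypothetical stand-ins invented for this sketch, not DataHub classes or real URNs. It assumes plain strings for URNs and plain objects for aspects so that it runs without any GMA dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an aspect such as Ownership.
final class OwnershipSketch {
  final List<String> ownerUrns;

  OwnershipSketch(List<String> ownerUrns) {
    this.ownerUrns = ownerUrns;
  }
}

// Hypothetical stand-in for a relationship record such as OwnedBy.
final class EdgeSketch {
  final String type;
  final String source;
  final String destination;

  EdgeSketch(String type, String source, String destination) {
    this.type = type;
    this.source = source;
    this.destination = destination;
  }

  @Override
  public String toString() {
    return source + " -" + type + "-> " + destination;
  }
}

// One builder per aspect type and per relationship type (step 2).
abstract class RelationshipBuilderSketch<A> {
  private final Class<A> aspectClass;

  RelationshipBuilderSketch(Class<A> aspectClass) {
    this.aspectClass = aspectClass;
  }

  Class<A> supportedAspectClass() {
    return aspectClass;
  }

  abstract List<EdgeSketch> buildRelationships(String urn, A aspect);
}

// Turns each owner listed in the aspect into one OwnedBy edge.
final class OwnedBySketch extends RelationshipBuilderSketch<OwnershipSketch> {
  OwnedBySketch() {
    super(OwnershipSketch.class);
  }

  @Override
  List<EdgeSketch> buildRelationships(String urn, OwnershipSketch aspect) {
    List<EdgeSketch> edges = new ArrayList<>();
    for (String owner : aspect.ownerUrns) {
      edges.add(new EdgeSketch("OwnedBy", urn, owner));
    }
    return edges;
  }
}

// Registers builders by aspect class and dispatches each aspect to its builder (step 3).
final class GraphBuilderSketch {
  private final Map<Class<?>, RelationshipBuilderSketch<?>> builders = new HashMap<>();

  GraphBuilderSketch(Collection<? extends RelationshipBuilderSketch<?>> registered) {
    for (RelationshipBuilderSketch<?> builder : registered) {
      builders.put(builder.supportedAspectClass(), builder);
    }
  }

  List<EdgeSketch> build(String urn, List<?> aspects) {
    List<EdgeSketch> edges = new ArrayList<>();
    for (Object aspect : aspects) {
      RelationshipBuilderSketch<?> builder = builders.get(aspect.getClass());
      if (builder != null) { // aspects without a registered builder are skipped
        edges.addAll(apply(builder, urn, aspect));
      }
    }
    return edges;
  }

  // The cast is safe because the builder was looked up by the aspect's own class.
  private static <A> List<EdgeSketch> apply(RelationshipBuilderSketch<A> builder, String urn, Object aspect) {
    return builder.buildRelationships(urn, builder.supportedAspectClass().cast(aspect));
  }

  public static void main(String[] args) {
    GraphBuilderSketch graphBuilder = new GraphBuilderSketch(Collections.singletonList(new OwnedBySketch()));
    List<Object> aspects = Arrays.<Object>asList(
        new OwnershipSketch(Collections.singletonList("urn:example:user:jdoe")), "unregistered aspect");
    for (EdgeSketch edge : graphBuilder.build("urn:example:dataset:1", aspects)) {
      System.out.println(edge); // urn:example:dataset:1 -OwnedBy-> urn:example:user:jdoe
    }
  }
}
```

Keying the map by aspect class is what lets a graph builder ignore aspects nobody registered for, which mirrors the `relationshipBuilder != null` check in `BaseGraphBuilder.build` above.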
+
+[relationship]: ../what/relationship.md
+[relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship
+[aspect]: ../what/aspect.md
+[snapshot]: ../what/snapshot.md
+[entity]: ../what/entity.md
+[mae]: ../what/mxe.md#metadata-audit-event-mae
+[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
+[graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java
+[query dao]: ../architecture/metadata-serving.md#query-dao
+[BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java
+[DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java
\ No newline at end of file
diff --git a/docs/how/search-onboarding.md b/docs/how/search-onboarding.md
index 90e140a609..32cc331634 100644
--- a/docs/how/search-onboarding.md
+++ b/docs/how/search-onboarding.md
@@ -1,2 +1,85 @@
 # How to onboard to GMA search?
 
+## 1. Define search document model for the entity
+Modeling is the most important and crucial part of your design.
+The [Search document] model lists the fields that need to be indexed, along with their names and data types.
+See [Search document] to learn more about the search document model.
+Please note that all fields in the search document model (except the `urn`) are `optional`.
+This is because we want to support partial updates to search documents.
+
+[Search document]: ../what/search-document.md
+
+## 2. Create the search index, define its mappings and settings
+
+A [mapping] is created from the search document model.
+It defines how a document and the fields it contains are stored and indexed, using various [tokenizer]s, [analyzer]s and field data types.
+For certain fields, sub-fields are created using different analyzers.
+The analyzers are chosen depending on the needs of each field.
+The index is currently created manually using [curl] commands; we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future.
+See the index [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) for the `dataset` search index.
+
+## 3. Ingestion into search index
+The actual indexing process for each [entity] is powered by [index builder]s.
+The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job] and are invoked whenever an [MAE] for one of those aspects is received.
+Index builders should extend [BaseIndexBuilder]. See [DatasetIndexBuilder] for an example.
+For the consumer job to consume those MAEs, you should add your index builder to the [index builder registry].
+
+## 4. Search query configs
+Once you have the [search index] built, it's ready to be queried!
+The search query is constructed and executed through [Search DAO].
+The raw search hits are retrieved and extracted using the base model.
+Besides regular full-text search, search queries also provide run-time aggregation and relevance.
+
+[ESSearchDAO] is the Elasticsearch implementation of [BaseSearchDAO].
+It is still a generic class, which can be used for a specific [entity] and is configured using [BaseSearchConfig].
+
+`BaseSearchConfig` is the abstraction for all query-related configuration, such as query templates and the default field to run autocomplete on.
+
+```java
+public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> {
+
+  public abstract Set<String> getFacetFields();
+
+  public String getIndexName() {
+    return getSearchDocument().getSimpleName().toLowerCase();
+  }
+
+  public abstract Class<DOCUMENT> getSearchDocument();
+
+  public abstract String getDefaultAutocompleteField();
+
+  public abstract String getSearchQueryTemplate();
+
+  public abstract String getAutocompleteQueryTemplate();
+}
+```
+
+[DatasetSearchConfig] is the search config implementation for the `dataset` entity.
+
+## 5. Add search query endpoints to GMS
+Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.
+[BaseSearchableEntityResource] provides an abstract implementation of search and autocomplete APIs.
+Any top-level rest.li resource implementation can extend it to get search and autocomplete endpoints automatically.
+Refer to the [CorpUsers] rest.li resource implementation as an example.
+
+
+[mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html
+[tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html
+[analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html
+[curl]: https://en.wikipedia.org/wiki/CURL
+[entity]: ../what/entity.md
+[index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders
+[aspect]: ../what/aspect.md
+[mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job
+[mae]: ../what/mxe.md#metadata-audit-event-mae
+[baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java
+[datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java
+[index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java
+[search index]: ../what/search-index.md
+[search dao]: ../architecture/metadata-serving.md#search-dao
+[essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java
+[basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java
+[basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java
+[datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java
+[basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java
+[corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java
\ No newline at end of file
diff --git a/docs/what/gms.md b/docs/what/gms.md
index f1ee05af1f..35b9efc2fe 100644
--- a/docs/what/gms.md
+++ b/docs/what/gms.md
@@ -1,4 +1,4 @@
-# What is Generalized Metadata Service (GMS)?
+# What is Generalized Metadata Service (GMS)? [WIP]
 
 
 
diff --git a/docs/what/urn.md b/docs/what/urn.md
index 6680260707..1c106ae462 100644
--- a/docs/what/urn.md
+++ b/docs/what/urn.md
@@ -1,2 +1,2 @@
-# What is URN?
+# What is URN? [WIP]
 
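As a note on the `BaseSearchConfig` contract added in `docs/how/search-onboarding.md` above, the sketch below shows how a per-entity config might fill it in and what the default `getIndexName()` rule produces. It is an illustration under assumptions, not the real `DatasetSearchConfig`: `SearchConfigSketch`, `DatasetDocumentSketch` and `DatasetSearchConfigSketch` are hypothetical names, the facet and autocomplete field names are placeholders, and the query template methods are left out for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical search document type; only its class name matters in this sketch.
final class DatasetDocumentSketch {
}

// Simplified stand-in for the BaseSearchConfig contract (query templates omitted).
abstract class SearchConfigSketch<D> {

  abstract Set<String> getFacetFields();

  abstract Class<D> getSearchDocument();

  abstract String getDefaultAutocompleteField();

  // Default rule from the patch: lower-cased simple name of the search document class.
  String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }
}

// Per-entity configuration; the field names below are placeholders, not the real dataset fields.
final class DatasetSearchConfigSketch extends SearchConfigSketch<DatasetDocumentSketch> {

  @Override
  Set<String> getFacetFields() {
    return Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList("origin", "platform")));
  }

  @Override
  Class<DatasetDocumentSketch> getSearchDocument() {
    return DatasetDocumentSketch.class;
  }

  @Override
  String getDefaultAutocompleteField() {
    return "name";
  }

  public static void main(String[] args) {
    DatasetSearchConfigSketch config = new DatasetSearchConfigSketch();
    System.out.println(config.getIndexName()); // datasetdocumentsketch
    System.out.println(config.getFacetFields()); // [origin, platform]
  }
}
```

Because the index name is derived from the document class, renaming that class silently changes the index being queried; a config that needs a stable name can override `getIndexName()`.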