mirror of
				https://github.com/datahub-project/datahub.git
				synced 2025-10-31 10:49:00 +00:00 
			
		
		
		
	Documentation update part-3
This commit is contained in:
		
							parent
							
								
									6c99764fb1
								
							
						
					
					
						commit
						4342a9b39a
					
				
							
								
								
									
										11
									
								
								README.md
									
									
									
									
									
								
							
							
						
						
									
										11
									
								
								README.md
									
									
									
									
									
								
							| @ -5,8 +5,11 @@ | |||||||
|  |  | ||||||
| 
 | 
 | ||||||
| ## Introduction | ## Introduction | ||||||
| DataHub is Linkedin's generalized metadata search & discovery tool. To learn more about DataHub, check out our  | DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our  | ||||||
| [Linkedin blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019). This repository contains the complete source code to be able to build DataHub's frontend & backend services. | [LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).  | ||||||
|  | You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented and  | ||||||
|  | [DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case. | ||||||
|  | This repository contains the complete source code to be able to build DataHub's frontend & backend services. | ||||||
| 
 | 
 | ||||||
| ## Quickstart | ## Quickstart | ||||||
| 1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/). | 1. Install [docker](https://docs.docker.com/install/) and [docker-compose](https://docs.docker.com/compose/install/). | ||||||
| @ -25,9 +28,11 @@ Note: Make sure that you're using Java 8, we have a strict dependency to Java 8 | |||||||
| as username and password. | as username and password. | ||||||
| 
 | 
 | ||||||
| ## Quicklinks | ## Quicklinks | ||||||
|  | * [DataHub Architecture](docs/architecture/architecture.md) | ||||||
|  | * [DataHub Onboarding Guide](docs/how/entity-onboarding.md) | ||||||
| * [Docker Images](docker) | * [Docker Images](docker) | ||||||
| * [Frontend App](datahub-frontend) | * [Frontend App](datahub-frontend) | ||||||
| * [Generalized Metadata Store](gms) | * [Generalized Metadata Service](gms) | ||||||
| * [Metadata Consumer Jobs](metadata-jobs) | * [Metadata Consumer Jobs](metadata-jobs) | ||||||
| * [Metadata Ingestion](metadata-ingestion) | * [Metadata Ingestion](metadata-ingestion) | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -1,6 +1,6 @@ | |||||||
| # Metadata Ingestion Architecture | # Metadata Ingestion Architecture | ||||||
| 
 | 
 | ||||||
| ## MCE Consumer Job | ## MCE Consumer Job [WIP] | ||||||
| 
 | 
 | ||||||
| ## MAE Consumer Job | ## MAE Consumer Job | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -1,2 +1,32 @@ | |||||||
| # How to onboard an entity? | # How to onboard an entity? | ||||||
| 
 | 
 | ||||||
|  | Currently, DataHub supports only 3 [entity] types: `datasets`, `users` and `groups`. | ||||||
|  | If you want to extend DataHub to cover your own use cases, such as `metrics`, `charts` or `dashboards`, follow the steps below in order. | ||||||
|  | 
 | ||||||
|  | ## 1. Define URN | ||||||
|  | Refer to the [URN documentation](../what/urn.md) for the URN definition. | ||||||
|  | 
 | ||||||
|  | ## 2. Model your metadata | ||||||
|  | Refer to [metadata modelling](metadata-modelling.md) section. | ||||||
|  | Make sure to do the following: | ||||||
|  | 1. Define [Aspect] models. | ||||||
|  | 2. Define aspect union model. Refer to [DatasetAspect] as an example. | ||||||
|  | 3. Define [Snapshot] model. Refer to [DatasetSnapshot] as an example. | ||||||
|  | 4. Add your newly defined snapshot to [Snapshot Union] model. | ||||||
|  | 5. Define [Entity] model. Refer to [DatasetEntity] as an example. | ||||||
|  | 
 | ||||||
|  | ## 3. GMA search onboarding | ||||||
|  | Refer to [search onboarding](search-onboarding.md). | ||||||
|  | 
 | ||||||
|  | ## 4. GMA graph onboarding | ||||||
|  | Refer to [graph onboarding](graph-onboarding.md). | ||||||
|  | 
 | ||||||
|  | ## 5. UI for entity onboarding [WIP] | ||||||
|  | 
 | ||||||
|  | [Aspect]: ../what/aspect.md | ||||||
|  | [DatasetAspect]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/aspect/DatasetAspect.pdsc | ||||||
|  | [Snapshot]: ../what/snapshot.md | ||||||
|  | [DatasetSnapshot]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/DatasetSnapshot.pdsc | ||||||
|  | [Snapshot Union]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/snapshot/Snapshot.pdsc | ||||||
|  | [Entity]: ../what/entity.md | ||||||
|  | [DatasetEntity]: ../../metadata-models/src/main/pegasus/com/linkedin/metadata/entity/DatasetEntity.pdsc | ||||||
| @ -1,2 +1,161 @@ | |||||||
| # How to onboard to GMA graph? | # How to onboard to GMA graph? | ||||||
| 
 | 
 | ||||||
|  | ## 1. Define relationship models | ||||||
|  | If you need a [relationship] that is not available in the provided set of [relationship models], | ||||||
|  | implement that relationship model as the first step of graph onboarding. | ||||||
|  | Below is an example model for the `OwnedBy` relationship: | ||||||
|  | 
 | ||||||
|  | ```json | ||||||
|  | { | ||||||
|  |   "type": "record", | ||||||
|  |   "name": "OwnedBy", | ||||||
|  |   "namespace": "com.linkedin.metadata.relationship", | ||||||
|  |   "doc": "A generic model for the Owned-By relationship", | ||||||
|  |   "include": [ | ||||||
|  |     "BaseRelationship" | ||||||
|  |   ], | ||||||
|  |   "pairings": [ | ||||||
|  |     { | ||||||
|  |       "source": "com.linkedin.common.urn.DatasetUrn", | ||||||
|  |       "destination": "com.linkedin.common.urn.CorpuserUrn" | ||||||
|  |     } | ||||||
|  |   ], | ||||||
|  |   "fields": [ | ||||||
|  |     { | ||||||
|  |       "name": "type", | ||||||
|  |       "type": "com.linkedin.common.OwnershipType", | ||||||
|  |       "doc": "The type of the ownership" | ||||||
|  |     } | ||||||
|  |   ] | ||||||
|  | } | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ## 2. Implement relationship builders | ||||||
|  | You need to implement relationship builders for your specific [aspect]s and [relationship]s if they are not already defined. | ||||||
|  | A relationship builder processes an aspect and builds a list of relationships; every relationship builder should extend the `BaseRelationshipBuilder` abstract class. | ||||||
|  | Relationship builders are defined per aspect and per relationship type. | ||||||
|  | 
 | ||||||
|  | ```java | ||||||
|  | public abstract class BaseRelationshipBuilder<ASPECT extends RecordTemplate> { | ||||||
|  | 
 | ||||||
|  |   private Class<ASPECT> _aspectClass; | ||||||
|  | 
 | ||||||
|  |   public BaseRelationshipBuilder(Class<ASPECT> aspectClass) { | ||||||
|  |     _aspectClass = aspectClass; | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   /** | ||||||
|  |    * Returns the aspect class this {@link BaseRelationshipBuilder} supports | ||||||
|  |    */ | ||||||
|  |   public Class<ASPECT> supportedAspectClass() { | ||||||
|  |     return _aspectClass; | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   /** | ||||||
|  |    * Returns a list of corresponding relationship updates for the given metadata aspect | ||||||
|  |    */ | ||||||
|  |   public abstract <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(URN urn, ASPECT aspect); | ||||||
|  | } | ||||||
|  | ``` | ||||||
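As a rough illustration of the builder contract above, the sketch below emits one `OwnedBy`-style edge per owner found in an ownership aspect. It is a minimal, self-contained sketch rather than the project's implementation: `SketchOwnership`, `SketchOwnedBy`, `SketchRelationshipBuilder` and `SketchOwnedByBuilder` are hypothetical stand-ins, URNs are plain strings, and the builder returns relationships directly instead of `GraphBuilder.RelationshipUpdates`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an ownership aspect: just a list of owner URNs.
class SketchOwnership {
  final List<String> ownerUrns;
  SketchOwnership(List<String> ownerUrns) { this.ownerUrns = ownerUrns; }
}

// Hypothetical stand-in for the OwnedBy relationship model (source -> destination).
class SketchOwnedBy {
  final String source;
  final String destination;
  SketchOwnedBy(String source, String destination) {
    this.source = source;
    this.destination = destination;
  }
}

// Mirrors the shape of BaseRelationshipBuilder: one builder per aspect class.
abstract class SketchRelationshipBuilder<ASPECT, RELATIONSHIP> {
  private final Class<ASPECT> aspectClass;
  SketchRelationshipBuilder(Class<ASPECT> aspectClass) { this.aspectClass = aspectClass; }
  Class<ASPECT> supportedAspectClass() { return aspectClass; }
  abstract List<RELATIONSHIP> buildRelationships(String urn, ASPECT aspect);
}

// Builds one edge from the entity URN to each owner URN in the aspect.
class SketchOwnedByBuilder extends SketchRelationshipBuilder<SketchOwnership, SketchOwnedBy> {
  SketchOwnedByBuilder() { super(SketchOwnership.class); }

  @Override
  List<SketchOwnedBy> buildRelationships(String urn, SketchOwnership aspect) {
    List<SketchOwnedBy> edges = new ArrayList<>();
    for (String owner : aspect.ownerUrns) {
      edges.add(new SketchOwnedBy(urn, owner));
    }
    return edges;
  }
}

class SketchOwnedByDemo {
  public static void main(String[] args) {
    SketchOwnership aspect = new SketchOwnership(Arrays.asList("urn:li:corpuser:a", "urn:li:corpuser:b"));
    List<SketchOwnedBy> edges = new SketchOwnedByBuilder().buildRelationships("urn:li:dataset:1", aspect);
    for (SketchOwnedBy edge : edges) {
      System.out.println(edge.source + " -> " + edge.destination);
    }
  }
}
```

Keeping one builder per aspect and relationship type means a graph builder can pick the right one with a simple lookup on the aspect class.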
|  | 
 | ||||||
|  | ## 3. Implement graph builders | ||||||
|  | Graph builders build graph updates by processing [snapshot]s.  | ||||||
|  | They internally use relationship builders to generate edges and nodes of the graph. | ||||||
|  | All relationship builders for an [entity] should be registered through its graph builder. | ||||||
|  | 
 | ||||||
|  | ```java | ||||||
|  | public abstract class BaseGraphBuilder<SNAPSHOT extends RecordTemplate> implements GraphBuilder<SNAPSHOT> { | ||||||
|  | 
 | ||||||
|  |   private final Class<SNAPSHOT> _snapshotClass; | ||||||
|  |   private final Map<Class<? extends RecordTemplate>, BaseRelationshipBuilder> _relationshipBuildersMap; | ||||||
|  | 
 | ||||||
|  |   public BaseGraphBuilder(@Nonnull Class<SNAPSHOT> snapshotClass, | ||||||
|  |       @Nonnull Collection<BaseRelationshipBuilder> relationshipBuilders) { | ||||||
|  |     _snapshotClass = snapshotClass; | ||||||
|  |     _relationshipBuildersMap = relationshipBuilders.stream() | ||||||
|  |         .collect(Collectors.toMap(builder -> builder.supportedAspectClass(), Function.identity())); | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   @Nonnull | ||||||
|  |   Class<SNAPSHOT> supportedSnapshotClass() { | ||||||
|  |     return _snapshotClass; | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   @Nonnull | ||||||
|  |   @Override | ||||||
|  |   public GraphUpdates build(@Nonnull SNAPSHOT snapshot) { | ||||||
|  |     final Urn urn = RecordUtils.getRecordTemplateField(snapshot, "urn", Urn.class); | ||||||
|  | 
 | ||||||
|  |     final List<? extends RecordTemplate> entities = buildEntities(snapshot); | ||||||
|  | 
 | ||||||
|  |     final List<RelationshipUpdates> relationshipUpdates = new ArrayList<>(); | ||||||
|  | 
 | ||||||
|  |     final List<RecordTemplate> aspects = ModelUtils.getAspectsFromSnapshot(snapshot); | ||||||
|  |     for (RecordTemplate aspect : aspects) { | ||||||
|  |       BaseRelationshipBuilder relationshipBuilder = _relationshipBuildersMap.get(aspect.getClass()); | ||||||
|  |       if (relationshipBuilder != null) { | ||||||
|  |         relationshipUpdates.addAll(relationshipBuilder.buildRelationships(urn, aspect)); | ||||||
|  |       } | ||||||
|  |     } | ||||||
|  | 
 | ||||||
|  |     return new GraphUpdates(Collections.unmodifiableList(entities), Collections.unmodifiableList(relationshipUpdates)); | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   @Nonnull | ||||||
|  |   protected abstract List<? extends RecordTemplate> buildEntities(@Nonnull SNAPSHOT snapshot); | ||||||
|  | } | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ```java | ||||||
|  | public class DatasetGraphBuilder extends BaseGraphBuilder<DatasetSnapshot> { | ||||||
|  |   private static final Set<BaseRelationshipBuilder> RELATIONSHIP_BUILDERS = | ||||||
|  |       Collections.unmodifiableSet(new HashSet<BaseRelationshipBuilder>() { | ||||||
|  |         { | ||||||
|  |           add(new DownstreamOfBuilderFromUpstreamLineage()); | ||||||
|  |           add(new OwnedByBuilderFromOwnership()); | ||||||
|  |         } | ||||||
|  |       }); | ||||||
|  | 
 | ||||||
|  |   public DatasetGraphBuilder() { | ||||||
|  |     super(DatasetSnapshot.class, RELATIONSHIP_BUILDERS); | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   @Nonnull | ||||||
|  |   @Override | ||||||
|  |   protected List<? extends RecordTemplate> buildEntities(@Nonnull DatasetSnapshot snapshot) { | ||||||
|  |     final DatasetUrn urn = snapshot.getUrn(); | ||||||
|  |     final DatasetEntity entity = new DatasetEntity().setUrn(urn) | ||||||
|  |         .setName(urn.getDatasetNameEntity()) | ||||||
|  |         .setPlatform(urn.getPlatformEntity()) | ||||||
|  |         .setOrigin(urn.getOriginEntity()); | ||||||
|  | 
 | ||||||
|  |     setRemovedProperty(snapshot, entity); | ||||||
|  | 
 | ||||||
|  |     return Collections.singletonList(entity); | ||||||
|  |   } | ||||||
|  | } | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | ## 4. Ingestion into graph | ||||||
|  | The ingestion process for each [entity] is done by graph builders.  | ||||||
|  | The builders are invoked whenever an [MAE] is received by the [MAE Consumer Job]. | ||||||
|  | Graph builders should extend `BaseGraphBuilder`; see `DatasetGraphBuilder` above for an example. | ||||||
|  | For the consumer job to consume those MAEs, you should add your graph builder to the [graph builder registry]. | ||||||
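To make the idea of a registry concrete, here is a minimal sketch of a lookup keyed by snapshot class. This is an assumption about the general shape only and is not taken from `RegisteredGraphBuilders`; `SketchGraphBuilder`, `SketchDatasetSnapshot`, `SketchDatasetGraphBuilder` and `SketchGraphBuilderRegistry` are hypothetical names, and `build` returns a plain string instead of `GraphUpdates`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a graph builder; the real base class is BaseGraphBuilder<SNAPSHOT>.
interface SketchGraphBuilder<SNAPSHOT> {
  Class<SNAPSHOT> supportedSnapshotClass();
  String build(SNAPSHOT snapshot);
}

// Hypothetical snapshot type used only for this illustration.
class SketchDatasetSnapshot {
  final String urn;
  SketchDatasetSnapshot(String urn) { this.urn = urn; }
}

class SketchDatasetGraphBuilder implements SketchGraphBuilder<SketchDatasetSnapshot> {
  @Override
  public Class<SketchDatasetSnapshot> supportedSnapshotClass() { return SketchDatasetSnapshot.class; }

  @Override
  public String build(SketchDatasetSnapshot snapshot) { return "node:" + snapshot.urn; }
}

// Registry keyed by snapshot class, so a consumer can find the builder for an incoming snapshot.
class SketchGraphBuilderRegistry {
  private final Map<Class<?>, SketchGraphBuilder<?>> builders = new HashMap<>();

  <S> void register(SketchGraphBuilder<S> builder) {
    builders.put(builder.supportedSnapshotClass(), builder);
  }

  // The cast is safe because register() stores each builder under its own snapshot class.
  @SuppressWarnings("unchecked")
  <S> Optional<SketchGraphBuilder<S>> lookup(Class<S> snapshotClass) {
    return Optional.ofNullable((SketchGraphBuilder<S>) builders.get(snapshotClass));
  }
}

class SketchRegistryDemo {
  public static void main(String[] args) {
    SketchGraphBuilderRegistry registry = new SketchGraphBuilderRegistry();
    registry.register(new SketchDatasetGraphBuilder());
    registry.lookup(SketchDatasetSnapshot.class)
        .ifPresent(b -> System.out.println(b.build(new SketchDatasetSnapshot("urn:li:dataset:1"))));
  }
}
```

A snapshot type without a registered builder simply yields an empty `Optional` in this sketch, which a consumer could treat as "nothing to ingest".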
|  | 
 | ||||||
|  | ## 5. Graph queries | ||||||
|  | You can onboard the graph queries that fit your specific use cases using [Query DAO]. | ||||||
|  | You also need to create [rest.li](https://rest.li) APIs to serve your graph queries. | ||||||
|  | [BaseQueryDAO] provides an abstract implementation of several graph query APIs. | ||||||
|  | Refer to [DownstreamLineageResource] rest.li resource implementation to see a use case of graph queries. | ||||||
|  | 
 | ||||||
|  | [relationship]: ../what/relationship.md | ||||||
|  | [relationship models]: ../../metadata-models/build/mainSchemas/com/linkedin/metadata/relationship | ||||||
|  | [aspect]: ../what/aspect.md | ||||||
|  | [snapshot]: ../what/snapshot.md | ||||||
|  | [entity]: ../what/entity.md | ||||||
|  | [mae]: ../what/mxe.md#metadata-audit-event-mae | ||||||
|  | [mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job | ||||||
|  | [graph builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/graph/RegisteredGraphBuilders.java | ||||||
|  | [query dao]: ../architecture/metadata-serving.md#query-dao | ||||||
|  | [BaseQueryDAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseQueryDAO.java | ||||||
|  | [DownstreamLineageResource]: ../../gms/impl/src/main/java/com/linkedin/dataset/rest/resources/DownstreamLineageResource.java | ||||||
| @ -1,2 +1,85 @@ | |||||||
| # How to onboard to GMA search? | # How to onboard to GMA search? | ||||||
| 
 | 
 | ||||||
|  | ## 1. Define search document model for the entity | ||||||
|  | Modeling is the most important part of your design. | ||||||
|  | The [Search document] model lists the fields that need to be indexed, along with their names and data types. | ||||||
|  | Check [here][Search document] to learn more about the search document model. | ||||||
|  | Please note that all fields in the search document model (except the `urn`) are `optional`. | ||||||
|  | This is because we want to support partial updates to search documents. | ||||||
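One way to picture why optional fields enable partial updates: a document that sets only some fields can be merged into what is already indexed, leaving the absent fields untouched. The sketch below is an illustration under that assumption and does not show how DataHub or Elasticsearch actually applies updates; `PartialUpdateSketch`, its `merge` helper, and the field names `name` and `owners` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class PartialUpdateSketch {

  // Copies the indexed document and overwrites only the fields that the partial document sets.
  // Fields that are missing or null in the partial document keep their indexed values.
  static Map<String, Object> merge(Map<String, Object> indexed, Map<String, Object> partial) {
    Map<String, Object> merged = new HashMap<>(indexed);
    for (Map.Entry<String, Object> field : partial.entrySet()) {
      if (field.getValue() != null) {
        merged.put(field.getKey(), field.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> indexed = new HashMap<>();
    indexed.put("urn", "urn:li:dataset:1");
    indexed.put("name", "pageviews");

    // A partial document carrying only the urn and the changed field.
    Map<String, Object> partial = new HashMap<>();
    partial.put("urn", "urn:li:dataset:1");
    partial.put("owners", Arrays.asList("alice"));

    System.out.println(merge(indexed, partial));
  }
}
```

If every field were required, each update would have to carry the full document, even when only one aspect changed.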
|  | 
 | ||||||
|  | [Search document]: ../what/search-document.md | ||||||
|  | 
 | ||||||
|  | ## 2. Create the search index, define its mappings and settings | ||||||
|  | 
 | ||||||
|  | A [mapping] is created using the information in the search document model. | ||||||
|  | It defines how a document and the fields it contains are stored and indexed, including the [tokenizer]s, [analyzer]s and data types used for the fields. | ||||||
|  | For certain fields, sub-fields are created using different analyzers. | ||||||
|  | The analyzers are chosen depending on the needs for each field. | ||||||
|  | The index is currently created manually using [curl] commands, and we plan to [automate](../what/search-index.md#search-automation-tbd) the process in the near future. | ||||||
|  | See the [mappings & settings](../../docker/elasticsearch/dataset-index-config.json) of the `dataset` search index for an example. | ||||||
|  | 
 | ||||||
|  | ## 3. Ingestion into search index | ||||||
|  | The actual indexing process for each [entity] is powered by [index builder]s.  | ||||||
|  | The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job] and are invoked whenever a matching [MAE] is received. | ||||||
|  | Index builders should extend [BaseIndexBuilder]. Check [DatasetIndexBuilder] as an example. | ||||||
|  | For the consumer job to consume those MAEs, you should add your index builder to the [index builder registry]. | ||||||
|  | 
 | ||||||
|  | ## 4. Search query configs | ||||||
|  | Once you have the [search index] built, it's ready to be queried!  | ||||||
|  | The search query is constructed and executed through [Search DAO].  | ||||||
|  | The raw search hits are retrieved and extracted using the base model.  | ||||||
|  | Besides the regular full text search, run time aggregation and relevance are provided in the search queries as well.  | ||||||
|  | 
 | ||||||
|  | [ESSearchDAO] is the implementation for the [BaseSearchDAO] for Elasticsearch. | ||||||
|  | It's still a generic class which can be used for a specific [entity] and configured using [BaseSearchConfig].  | ||||||
|  | 
 | ||||||
|  | `BaseSearchConfig` is the abstraction for all query-related configuration, such as the query templates and the default field to run autocomplete on. | ||||||
|  | 
 | ||||||
|  | ```java | ||||||
|  | public abstract class BaseSearchConfig<DOCUMENT extends RecordTemplate> { | ||||||
|  | 
 | ||||||
|  |   public abstract Set<String> getFacetFields(); | ||||||
|  | 
 | ||||||
|  |   public String getIndexName() { | ||||||
|  |     return getSearchDocument().getSimpleName().toLowerCase(); | ||||||
|  |   } | ||||||
|  | 
 | ||||||
|  |   public abstract Class<DOCUMENT> getSearchDocument(); | ||||||
|  | 
 | ||||||
|  |   public abstract String getDefaultAutocompleteField(); | ||||||
|  | 
 | ||||||
|  |   public abstract String getSearchQueryTemplate(); | ||||||
|  | 
 | ||||||
|  |   public abstract String getAutocompleteQueryTemplate(); | ||||||
|  | } | ||||||
|  | ``` | ||||||
|  | 
 | ||||||
|  | [DatasetSearchConfig] is the implementation of search config for `dataset` entity. | ||||||
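For orientation, a concrete config might look roughly like the sketch below. It is not the actual `DatasetSearchConfig`: the classes are simplified stand-ins without the `RecordTemplate` bound, and the facet fields, autocomplete field and template file names are made-up placeholders. The only behavior carried over from the abstract class above is the default index name, derived from the lowercased simple name of the document class.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical search document type used only for this illustration.
class SketchDatasetDocument { }

// Simplified copy of the BaseSearchConfig shape shown above (no RecordTemplate bound).
abstract class SketchSearchConfig<DOCUMENT> {
  abstract Set<String> getFacetFields();

  // Default index name: lowercased simple name of the search document class.
  String getIndexName() {
    return getSearchDocument().getSimpleName().toLowerCase();
  }

  abstract Class<DOCUMENT> getSearchDocument();
  abstract String getDefaultAutocompleteField();
  abstract String getSearchQueryTemplate();
  abstract String getAutocompleteQueryTemplate();
}

// All returned values below are placeholders chosen for the example.
class SketchDatasetSearchConfig extends SketchSearchConfig<SketchDatasetDocument> {
  @Override
  Set<String> getFacetFields() {
    return Collections.unmodifiableSet(new HashSet<>(Arrays.asList("origin", "platform")));
  }

  @Override
  Class<SketchDatasetDocument> getSearchDocument() { return SketchDatasetDocument.class; }

  @Override
  String getDefaultAutocompleteField() { return "name"; }

  @Override
  String getSearchQueryTemplate() { return "sketchSearchTemplate.json"; }

  @Override
  String getAutocompleteQueryTemplate() { return "sketchAutocompleteTemplate.json"; }
}

class SketchSearchConfigDemo {
  public static void main(String[] args) {
    SketchDatasetSearchConfig config = new SketchDatasetSearchConfig();
    System.out.println(config.getIndexName() + " facets=" + config.getFacetFields());
  }
}
```

Because the index name is derived from the document class, a config only needs to override `getIndexName()` when the index is named differently.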
|  | 
 | ||||||
|  | ## 5. Add search query endpoints to GMS | ||||||
|  | Finally, you need to create [rest.li](https://rest.li) APIs to serve your search queries.  | ||||||
|  | [BaseSearchableEntityResource] provides an abstract implementation of search and autocomplete APIs. | ||||||
|  | Any top-level rest.li resource implementation can extend it to get search and autocomplete endpoints automatically. | ||||||
|  | Refer to [CorpUsers] rest.li resource implementation as an example. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | [mapping]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping.html | ||||||
|  | [tokenizer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-tokenizers.html | ||||||
|  | [analyzer]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/analysis-analyzers.html | ||||||
|  | [curl]: https://en.wikipedia.org/wiki/CURL | ||||||
|  | [entity]: ../what/entity.md | ||||||
|  | [index builder]: ../architecture/metadata-ingestion.md#search-and-graph-index-builders | ||||||
|  | [aspect]: ../what/aspect.md | ||||||
|  | [mae consumer job]: ../architecture/metadata-ingestion.md#mae-consumer-job | ||||||
|  | [mae]: ../what/mxe.md#metadata-audit-event-mae | ||||||
|  | [baseindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/BaseIndexBuilder.java | ||||||
|  | [datasetindexbuilder]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/DatasetIndexBuilder.java | ||||||
|  | [index builder registry]: ../../metadata-builders/src/main/java/com/linkedin/metadata/builders/search/RegisteredIndexBuilders.java | ||||||
|  | [search index]: ../what/search-index.md | ||||||
|  | [search dao]: ../architecture/metadata-serving.md#search-dao | ||||||
|  | [essearchdao]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/ESSearchDAO.java | ||||||
|  | [basesearchdao]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseSearchDAO.java | ||||||
|  | [basesearchconfig]: ../../metadata-dao-impl/elasticsearch-dao/src/main/java/com/linkedin/metadata/dao/search/BaseSearchConfig.java | ||||||
|  | [datasetsearchconfig]: ../../gms/impl/src/main/java/com/linkedin/dataset/dao/search/DatasetSearchConfig.java | ||||||
|  | [basesearchableentityresource]: ../../metadata-restli-resource/src/main/java/com/linkedin/metadata/restli/BaseSearchableEntityResource.java | ||||||
|  | [corpusers]: ../../gms/impl/src/main/java/com/linkedin/identity/rest/resources/CorpUsers.java | ||||||
| @ -1,4 +1,4 @@ | |||||||
| # What is Generalized Metadata Service (GMS)? | # What is Generalized Metadata Service (GMS)? [WIP] | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -1,2 +1,2 @@ | |||||||
| # What is URN? | # What is URN? [WIP] | ||||||
| 
 | 
 | ||||||
|  | |||||||
	 Kerem Sahin