feat(docs) Improves docs around developing datahub, removes deprecated docs on building metadata service (#4552)

This commit is contained in:
Pedro Silva 2022-04-05 03:15:21 +01:00 committed by GitHub
parent 179fe07393
commit a20012fd6c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 62 additions and 196 deletions


@ -73,6 +73,10 @@ DataHub is an open-source metadata platform for the modern data stack. Read abou
Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quickstart) to get a copy of DataHub up & running locally using [Docker](https://docker.com). As the guide assumes some basic knowledge of Docker, we recommend going through the "Hello World" example of [A Docker Tutorial for Beginners](https://docker-curriculum.com) if Docker is completely foreign to you.
## Development
If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers).
## Demo and Screenshots
There's a [hosted demo environment](https://datahubproject.io/docs/demo) where you can play around with DataHub before installing.


@ -34,12 +34,7 @@ To ensure that metadata changes are processed in the correct chronological order
### Metadata Query Serving
Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternately can use the strongly consistent secondary index support described [here]()). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index.
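The read-path routing described above can be pictured as a simple dispatch. This is a toy illustration only; the enum and class names are invented for this sketch and do not exist in DataHub:

```java
// Toy sketch of read-path routing: each query shape maps to one backing store.
enum QueryType { PRIMARY_KEY, SECONDARY_INDEX, FULL_TEXT, GRAPH }
enum Backend { DOCUMENT_STORE, SEARCH_INDEX, GRAPH_INDEX }

class ReadRouter {
    static Backend route(QueryType q) {
        switch (q) {
            case PRIMARY_KEY:     return Backend.DOCUMENT_STORE; // e.g. schema metadata by dataset-urn
            case SECONDARY_INDEX:
            case FULL_TEXT:       return Backend.SEARCH_INDEX;   // secondary-index & search queries
            case GRAPH:           return Backend.GRAPH_INDEX;    // e.g. lineage traversal
            default: throw new IllegalArgumentException("unknown query type");
        }
    }
}
```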
### Further Reading
Read the [metadata service developer guide](../how/build-metadata-service.md) to understand how to customize the DataHub metadata service tier.
[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java


@ -4,6 +4,12 @@ title: "Local Development"
# DataHub Developer's Guide
## Pre-requirements
- [Java 1.8 SDK](https://adoptopenjdk.net/?variant=openjdk8&jvmVariant=hotspot)
- [Docker](https://www.docker.com/)
- [Docker Compose](https://docs.docker.com/compose/)
- Docker engine with at least 8GB of memory to run tests.
## Building the Project
Fork and clone the repository if you haven't done so already
@ -21,6 +27,56 @@ Use [gradle wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.ht
./gradlew build
```
Note that the above will also run tests and a number of validations, which makes the process considerably slower.
We suggest partially compiling DataHub according to your needs:
- Build DataHub's backend GMS (Generalized Metadata Service):
```
./gradlew :metadata-service:war:build
```
- Build DataHub's frontend:
```
./gradlew :datahub-frontend:build -x yarnTest -x yarnLint
```
- Build DataHub's command line tool:
```
./gradlew :metadata-ingestion:installDev
```
- Build DataHub's documentation:
```
./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript
# To preview the documentation
./gradlew :docs-website:serve
```
## Deploying local versions
Run the following just once to install the local `datahub` CLI tool on your `$PATH`:
```
cd smoke-test/
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel setuptools
pip install -r requirements.txt
cd ../
```
Once you have compiled & packaged the project or appropriate module you can deploy the entire system via docker-compose by running:
```
datahub docker quickstart --build-locally
```
Alternatively, replace any individual container in the existing deployment.
E.g., to replace DataHub's backend (GMS):
```
(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate datahub-gms)
```
To run the local version of the frontend:
```
(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate datahub-frontend-react)
```
## IDE Support
The recommended IDE for DataHub development is [IntelliJ IDEA](https://www.jetbrains.com/idea/).
You can run the following command to generate or update the IntelliJ project file:


@ -1,190 +0,0 @@
# Metadata Service Developer Guide
This guide assumes that you are already familiar with the architecture of DataHub's Metadata Serving Layer, as described [here](../architecture/metadata-serving.md).
Read on to understand how to build and extend the DataHub service tier for your specific needs.
## Using DAOs to store and query Metadata
DataHub metadata service uses the excellent `datahub-gma` library to store and query metadata in a standard way.
There are four types of Data Access Objects ([DAO]) that standardize the way metadata is accessed.
This section describes each type of DAO, its purpose, and the interface.
These DAOs rely heavily on [Java Generics](https://docs.oracle.com/javase/tutorial/extra/generics/index.html) so that the core logic can remain type-neutral.
However, as there's no inheritance in [Pegasus], the generics often fall back to extending [RecordTemplate] instead of the desired types (i.e., [entity], [relationship], metadata [aspect], etc.). Additional runtime type checking has been added to the DAOs to avoid binding to unexpected types. We also cache the type-checking results to minimize the runtime overhead.
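The cached runtime type checking mentioned above might look roughly like this. This is a simplified, hypothetical sketch, not DataHub's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch: validate that a candidate class is compatible with a
// required supertype, caching the verdict so reflection runs at most once
// per class rather than on every DAO call.
class AspectTypeChecker {
    private final Map<Class<?>, Boolean> cache = new ConcurrentHashMap<>();
    private final Class<?> requiredSuperType;

    AspectTypeChecker(Class<?> requiredSuperType) {
        this.requiredSuperType = requiredSuperType;
    }

    boolean isValid(Class<?> candidate) {
        // computeIfAbsent memoizes the isAssignableFrom check per class
        return cache.computeIfAbsent(candidate, requiredSuperType::isAssignableFrom);
    }
}
```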
### Key-value DAO (Local DAO)
A [GMS] uses the [Local DAO] to store and retrieve metadata [aspect]s from the local document store.
The base class and its simple key-value interface are shown below.
As the DAO is a generic class, it needs to be bound to a specific type during instantiation;
each entity type will need to instantiate its own version of the DAO.
```java
public abstract class BaseLocalDAO<ASPECT extends UnionTemplate> {

  public abstract <URN extends Urn, METADATA extends RecordTemplate> void add(
      Class<METADATA> type, URN urn, METADATA value);

  public abstract <URN extends Urn, METADATA extends RecordTemplate> Optional<METADATA> get(
      Class<METADATA> type, URN urn, int version);

  public abstract <URN extends Urn, METADATA extends RecordTemplate> ListResult<Integer> listVersions(
      Class<METADATA> type, URN urn, int start, int pageSize);

  public abstract <METADATA extends RecordTemplate> ListResult<Urn> listUrns(
      Class<METADATA> type, int start, int pageSize);

  public abstract <URN extends Urn, METADATA extends RecordTemplate> ListResult<METADATA> list(
      Class<METADATA> type, URN urn, int start, int pageSize);
}
```
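To make the key-value contract concrete, here is a toy in-memory analog of the add/get flow. This is purely illustrative: real Local DAO implementations are backed by a document store, and the version-numbering convention below is an assumption of this sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Toy versioned key-value store: each (urn, aspect type) pair keeps an
// append-only list of versions, loosely mirroring add/get/listVersions.
class InMemoryAspectStore {
    private final Map<String, List<Object>> versions = new HashMap<>();

    private String key(String urn, Class<?> type) { return urn + "|" + type.getName(); }

    void add(Class<?> type, String urn, Object value) {
        versions.computeIfAbsent(key(urn, type), k -> new ArrayList<>()).add(value);
    }

    // In this sketch, version 0 is the latest write; higher numbers go back in time.
    Optional<Object> get(Class<?> type, String urn, int version) {
        List<Object> vs = versions.getOrDefault(key(urn, type), List.of());
        int idx = vs.size() - 1 - version;
        return (idx >= 0 && idx < vs.size()) ? Optional.of(vs.get(idx)) : Optional.empty();
    }

    int countVersions(Class<?> type, String urn) {
        return versions.getOrDefault(key(urn, type), List.of()).size();
    }
}
```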
Another important function of [Local DAO] is to automatically emit [MAE]s whenever the metadata is updated.
This is possible because MAEs use the same [Pegasus] models, so a [RecordTemplate] can be easily converted into the corresponding [GenericRecord].
### Search DAO
Search DAO is also a generic class that can be bound to a specific type of search document.
The DAO provides three APIs:
* A `search` API that takes the search input, a [Filter], a [SortCriterion], and pagination parameters, and returns a [SearchResult].
* An `autoComplete` API that supports typeahead-style autocomplete based on the current input and a [Filter], and returns an [AutocompleteResult].
* A `filter` API that allows filtering without a search input. It takes a [Filter] and a [SortCriterion] as input and returns a [SearchResult].
```java
public abstract class BaseSearchDAO<DOCUMENT extends RecordTemplate> {

  public abstract SearchResult<DOCUMENT> search(String input, Filter filter,
      SortCriterion sortCriterion, int from, int size);

  public abstract AutoCompleteResult autoComplete(String input, String field,
      Filter filter, int limit);

  public abstract SearchResult<DOCUMENT> filter(Filter filter, SortCriterion sortCriterion,
      int from, int size);
}
```
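A toy illustration of the filter-plus-sort behavior, using plain maps as stand-in documents. The types and helper below are invented for this sketch; the real DAO delegates to the search index:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy "filter API": select documents whose field matches a required value,
// then sort by another field — loosely mirroring filter(Filter, SortCriterion).
class ToyFilterDao {
    static List<Map<String, String>> filter(List<Map<String, String>> docs,
                                            String field, String value,
                                            String sortField) {
        return docs.stream()
                .filter(d -> value.equals(d.get(field)))
                .sorted(Comparator.comparing(d -> d.getOrDefault(sortField, "")))
                .collect(Collectors.toList());
    }
}
```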
### Query DAO
Query DAO allows clients (e.g. [GMS](../what/gms.md), the [MAE Consumer Job](../architecture/metadata-serving.md#metadata-index-applier-mae-consumer-job)) to perform both graph and non-graph queries against the metadata graph.
For instance, a GMS can use the Query DAO to find "all the datasets owned by users who are part of the group `foo` and report to `bar`," which naturally translates to a graph query.
Alternatively, a client may wish to retrieve "all the datasets stored under /jobs/metrics," which doesn't involve any graph traversal.
Below is the base class for Query DAOs, which contains the `findEntities` and `findRelationships` methods.
Both methods have two versions: one involves graph traversal, and the other doesn't.
You can use `findMixedTypesEntities` and `findMixedTypesRelationships` for queries that return a mixture of different types of entities or relationships.
As these methods return a list of [RecordTemplate], callers will need to manually cast them back to the specific entity type using [isInstance()](https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#isInstance-java.lang.Object-) or reflection.
Note that the generics (ENTITY, RELATIONSHIP) are purposely left untyped, as these types are native to the underlying graph DB and will most likely differ from one implementation to another.
```java
public abstract class BaseQueryDAO<ENTITY, RELATIONSHIP> {

  public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
      Class<ENTITY> type, Filter filter, int offset, int count);

  public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
      Class<ENTITY> type, Statement function);

  public abstract List<RecordTemplate> findMixedTypesEntities(Statement function);

  public abstract <ENTITY extends RecordTemplate, RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP> findRelationships(
      Class<ENTITY> entityType, Class<RELATIONSHIP> relationshipType, Filter filter, int offset, int count);

  public abstract <RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP> findRelationships(
      Class<RELATIONSHIP> type, Statement function);

  public abstract List<RecordTemplate> findMixedTypesRelationships(Statement function);
}
```
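Since the mixed-type methods return untyped records, callers narrow the results manually. The pattern looks like this, with toy classes standing in for generated [RecordTemplate] subclasses:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy stand-ins for generated entity records (hypothetical names).
class DatasetEntity { final String urn; DatasetEntity(String urn) { this.urn = urn; } }
class UserEntity { final String urn; UserEntity(String urn) { this.urn = urn; } }

class MixedResults {
    // Narrow a mixed-type result list to one entity type via isInstance + cast,
    // as a caller of findMixedTypesEntities would.
    static <T> List<T> narrow(List<Object> mixed, Class<T> type) {
        return mixed.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .collect(Collectors.toList());
    }
}
```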
### Remote DAO
[Remote DAO] is essentially a specialized read-only implementation of [Local DAO].
Rather than retrieving metadata from a local storage, Remote DAO will fetch the metadata from another [GMS].
The mapping between [entity] type and GMS is implemented as a hard-coded map.
To prevent a circular dependency (the [rest.li] service depends on the Remote DAO, which in turn would depend on the rest.li client generated by each rest.li service),
the Remote DAO constructs raw rest.li requests directly, instead of using each entity's rest.li request builder.
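The hard-coded entity-type-to-GMS mapping can be pictured as a static map. The names and URLs below are hypothetical, purely for illustration:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of Remote DAO routing: each entity type is pinned to the base URL of
// the GMS that owns it, so reads can be dispatched without service discovery.
class RemoteDaoRouting {
    static final Map<String, String> ENTITY_TO_GMS = Map.of(
            "dataset", "http://dataset-gms.example:8080",   // hypothetical hosts
            "corpuser", "http://identity-gms.example:8080");

    static Optional<String> gmsFor(String entityType) {
        return Optional.ofNullable(ENTITY_TO_GMS.get(entityType));
    }
}
```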
## Customizing Search and Graph Index Updates
In addition to storing and querying metadata, a common requirement is to customize and extend the fields that are being stored in the search or the graph index.
As described in [Metadata Modelling] section, [Entity], [Relationship], and [Search Document] models do not directly encode the logic of how each field should be derived from metadata.
Instead, this logic needs to be provided in the form of a Java class: a graph or search index builder.
The builders register the [metadata aspect]s they are interested in with the [MAE Consumer Job](#mae-consumer-job) and will be invoked whenever an MAE involving the corresponding aspect is received.
If the MAE itself doesn't contain all the metadata needed, builders can use the Remote DAO to fetch it from GMS directly.
```java
public abstract class BaseIndexBuilder<DOCUMENT extends RecordTemplate> {

  BaseIndexBuilder(@Nonnull List<Class<? extends RecordTemplate>> snapshotsInterested);

  @Nullable
  public abstract List<DOCUMENT> getDocumentsToUpdate(@Nonnull RecordTemplate snapshot);

  @Nonnull
  public abstract Class<DOCUMENT> getDocumentType();
}
```
```java
public interface GraphBuilder<SNAPSHOT extends RecordTemplate> {

  GraphUpdates build(SNAPSHOT snapshot);

  @Value
  class GraphUpdates {
    List<? extends RecordTemplate> entities;
    List<RelationshipUpdates> relationshipUpdates;
  }

  @Value
  class RelationshipUpdates {
    List<? extends RecordTemplate> relationships;
    BaseGraphWriterDAO.RemovalOption preUpdateOperation;
  }
}
```
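The register-then-dispatch contract can be sketched with plain Java. The registry class and builder names here are invented for illustration; the real builders extend `BaseIndexBuilder` and are driven by the MAE Consumer Job:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy dispatch: builders register the aspect classes they are interested in;
// when an event for an aspect arrives, only the matching builders are invoked.
class BuilderRegistry {
    private final Map<Class<?>, List<String>> buildersByAspect = new HashMap<>();

    void register(String builderName, List<Class<?>> interestedAspects) {
        for (Class<?> aspect : interestedAspects) {
            buildersByAspect.computeIfAbsent(aspect, k -> new ArrayList<>()).add(builderName);
        }
    }

    List<String> buildersFor(Class<?> aspect) {
        return buildersByAspect.getOrDefault(aspect, List.of());
    }
}
```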
[AutocompleteResult]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/AutoCompleteResult.pdl
[Filter]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/Filter.pdl
[SortCriterion]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/SortCriterion.pdl
[SearchResult]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/SearchResult.java
[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
[DAO]: https://en.wikipedia.org/wiki/Data_access_object
[Pegasus]: https://linkedin.github.io/rest.li/pdl_schema
[relationship]: ../what/relationship.md
[entity]: ../what/entity.md
[aspect]: ../what/aspect.md
[GMS]: ../what/gms.md
[Local DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/ebean/EbeanAspectDAO.java
[Remote DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseRemoteDAO.java
[MAE]: ../what/mxe.md#metadata-audit-event-mae
[rest.li]: https://rest.li
[Metadata Change Event (MCE)]: ../what/mxe.md#metadata-change-event-mce
[Metadata Audit Event (MAE)]: ../what/mxe.md#metadata-audit-event-mae
[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
[graph]: ../what/graph.md
[search index]: ../what/search-index.md
[mce-consumer-job]: ../../metadata-jobs/mce-consumer-job
[mae-consumer-job]: ../../metadata-jobs/mae-consumer-job
[URN]: ../what/urn.md
[Metadata Modelling]: ../modeling/metadata-model.md
[Entity]: ../what/entity.md
[Relationship]: ../what/relationship.md
[Search Document]: ../what/search-document.md
[metadata aspect]: ../what/aspect.md
[Python emitters]: https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library


@ -65,6 +65,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
| `token` | | | Bearer token used for authentication. |
| `extra_headers` | | | Extra headers which will be added to the request. |
| `max_threads` | | `1` | Experimental: Max parallelism for REST API calls |
| `ca_certificate_path` | | | Path to CA certificate for HTTPS communications |
## DataHub Kafka