diff --git a/README.md b/README.md
index 5eda3c39a7..022e5dd880 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,10 @@ DataHub is an open-source metadata platform for the modern data stack. Read abou
 
 Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quickstart) to get a copy of DataHub up & running locally using [Docker](https://docker.com). As the guide assumes some basic knowledge of Docker, we'd recommend you to go through the "Hello World" example of [A Docker Tutorial for Beginners](https://docker-curriculum.com) if Docker is completely foreign to you.
 
+## Development
+
+If you're looking to build & modify DataHub, please take a look at our [Development Guide](https://datahubproject.io/docs/developers).
+
 ## Demo and Screenshots
 
 There's a [hosted demo environment](https://datahubproject.io/docs/demo) where you can play around with DataHub before installing.
diff --git a/docs/architecture/metadata-serving.md b/docs/architecture/metadata-serving.md
index f26e6d1738..3056790d29 100644
--- a/docs/architecture/metadata-serving.md
+++ b/docs/architecture/metadata-serving.md
@@ -34,12 +34,7 @@ To ensure that metadata changes are processed in the correct chronological order
 
 ### Metadata Query Serving
 
-Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternately can use the strongly consistent secondary index support described [here]()). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index.
-
-### Further Reading
-
-Read the [metadata service developer guide](../how/build-metadata-service.md) to understand how to customize the DataHub metadata service tier.
-
+Primary-key based reads (e.g. getting schema metadata for a dataset based on the `dataset-urn`) on metadata are routed to the document store. Secondary index based reads on metadata are routed to the search index (or alternately can use the strongly consistent secondary index support described [here]()). Full-text and advanced search queries are routed to the search index. Complex graph queries such as lineage are routed to the graph index.
 
 [RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
 [GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
diff --git a/docs/developers.md b/docs/developers.md
index 5e8e1713c1..7d96b748d8 100644
--- a/docs/developers.md
+++ b/docs/developers.md
@@ -4,6 +4,12 @@ title: "Local Development"
 
 # DataHub Developer's Guide
 
+## Prerequisites
+ - [Java 1.8 SDK](https://adoptopenjdk.net/?variant=openjdk8&jvmVariant=hotspot)
+ - [Docker](https://www.docker.com/)
+ - [Docker Compose](https://docs.docker.com/compose/)
+ - Docker engine with at least 8GB of memory to run tests.
+
 ## Building the Project
 
 Fork and clone the repository if haven't done so already
@@ -21,6 +27,56 @@ Use [gradle wrapper](https://docs.gradle.org/current/userguide/gradle_wrapper.html) to build the project
 ```
 ./gradlew build
 ```
 
+Note that the above will also run tests and a number of validations, which makes the process considerably slower.
+
+We suggest partially compiling DataHub according to your needs:
+
+ - Build DataHub's backend GMS (Generalized Metadata Service):
+```
+./gradlew :metadata-service:war:build
+```
+ - Build DataHub's frontend:
+```
+./gradlew :datahub-frontend:build -x yarnTest -x yarnLint
+```
+ - Build DataHub's command line tool:
+```
+./gradlew :metadata-ingestion:installDev
+```
+ - Build DataHub's documentation:
+```
+./gradlew :docs-website:yarnLintFix :docs-website:build -x :metadata-ingestion:runPreFlightScript
+# To preview the documentation
+./gradlew :docs-website:serve
+```
+
+## Deploying Local Versions
+
+Run the following once to install the local `datahub` CLI tool in your $PATH:
+```
+cd smoke-test/
+python3 -m venv venv
+source venv/bin/activate
+pip install --upgrade pip wheel setuptools
+pip install -r requirements.txt
+cd ../
+```
+
+Once you have compiled & packaged the project or the appropriate module, you can deploy the entire system via docker-compose by running:
+```
+datahub docker quickstart --build-locally
+```
+
+You can also replace individual containers in the existing deployment. For example, to replace DataHub's backend (GMS):
+```
+(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate datahub-gms)
+```
+
+To run the local version of the frontend:
+```
+(cd docker && COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub -f docker-compose-without-neo4j.yml -f docker-compose-without-neo4j.override.yml -f docker-compose.dev.yml up -d --no-deps --force-recreate datahub-frontend-react)
+```
 
 ## IDE Support
 
 The recommended IDE for DataHub development is [IntelliJ IDEA](https://www.jetbrains.com/idea/).
 You can run the following command to generate or update the IntelliJ project file
diff --git a/docs/how/build-metadata-service.md b/docs/how/build-metadata-service.md
deleted file mode 100644
index 9e59b72026..0000000000
--- a/docs/how/build-metadata-service.md
+++ /dev/null
@@ -1,190 +0,0 @@
-# Metadata Service Developer Guide
-
-This guide assumes that you are already familar with the architecture of DataHub's Metadata Serving Layer, as described [here](../architecture/metadata-serving.md).
-
-Read on to understand how to build and extend the DataHub service tier for your specific needs.
-
-
-## Using DAOs to store and query Metadata
-
-DataHub metadata service uses the excellent `datahub-gma` library to store and query metadata in a standard way.
-There are four types of Data Access Objects ([DAO]) that standardize the way metadata is accessed.
-This section describes each type of DAO, its purpose, and the interface.
-
-These DAOs rely heavily on [Java Generics](https://docs.oracle.com/javase/tutorial/extra/generics/index.html) so that the core logics can remain type-neutral.
-However, as there’s no inheritance in [Pegasus], the generics often fallback to extending [RecordTemplate] instead of the desired types (i.e. [entity], [relationship], metadata [aspect] etc). Additional runtime type checking has been added to the DAOs to avoid binding to unexpected types. We also cache the type checking result to minimize runtime overhead.
-
-### Key-value DAO (Local DAO)
-
-[GMS] use [Local DAO] to store and retrieve metadata [aspect]s from the local document store.
-Below shows the base class and its simple key-value interface.
-As the DAO is a generic class, it needs to be bound to specific type during instantiation.
-Each entity type will need to instantiate its own version of DAO.
-
-```java
-public abstract class BaseLocalDAO {
-
-  public abstract void
-  add(Class type, URN urn, METADATA value);
-
-  public abstract
-  Optional get(Class type, URN urn, int version);
-
-  public abstract
-  ListResult listVersions(Class type, URN urn, int start,
-      int pageSize);
-
-  public abstract ListResult listUrns(
-      Class type, int start, int pageSize);
-
-  public abstract
-  ListResult list(Class type, URN urn, int start, int pageSize);
-}
-```
-
-Another important function of [Local DAO] is to automatically emit [MAE]s whenever the metadata is updated.
-This is doable because MAE effectively use the same [Pegasus] models so [RecordTemplate] can be easily converted into the corresponding [GenericRecord].
-
-### Search DAO
-
-Search DAO is also a generic class that can be bound to a specific type of search document.
-The DAO provides 3 APIs:
-* A `search` API that takes the search input, a [Filter], a [SortCriterion], some pagination parameters, and returns a [SearchResult].
-* An `autoComplete` API which allows typeahead-style autocomplete based on the current input and a [Filter], and returns [AutocompleteResult].
-* A `filter` API which allows for filtering only without a search input. It takes a a [Filter] and a [SortCriterion] as input and returns [SearchResult].
-
-```java
-public abstract class BaseSearchDAO {
-
-  public abstract SearchResult search(String input, Filter filter,
-      SortCriterion sortCriterion, int from, int size);
-
-  public abstract AutoCompleteResult autoComplete(String input, String field,
-      Filter filter, int limit);
-
-  public abstract SearchResult filter(Filter filter, SortCriterion sortCriterion,
-      int from, int size);
-}
-```
-
-### Query DAO
-
-Query DAO allows clients, e.g. [GMS](../what/gms.md), [MAE Consumer Job](../architecture/metadata-serving.md#metadata-index-applier-mae-consumer-job) etc, to perform both graph & non-graph queries against the metadata graph.
-For instance, a GMS can use the Query DAO to find out “all the dataset owned by the users who is part of the group `foo` and report to `bar`,” which naturally translates to a graph query.
-Alternatively, a client may wish to retrieve “all the datasets that stored under /jobs/metrics”, which doesn’t involve any graph traversal.
-
-Below is the base class for Query DAOs, which contains the `findEntities` and `findRelationships` methods.
-Both methods also have two versions, one involves graph traversal, and the other doesn’t.
-You can use `findMixedTypesEntities` and `findMixedTypesRelationships` for queries that return a mixture of different types of entities or relationships.
-As these methods return a list of [RecordTemplate], callers will need to manually cast them back to the specific entity type using [isInstance()](https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#isInstance-java.lang.Object-) or reflection.
-
-Note that the generics (ENTITY, RELATIONSHIP) are purposely left untyped, as these types are native to the underlying graph DB and will most likely differ from one implementation to another.
-
-```java
-public abstract class BaseQueryDAO {
-
-  public abstract List findEntities(
-      Class type, Filter filter, int offset, int count);
-
-  public abstract List findEntities(
-      Class type, Statement function);
-
-  public abstract List findMixedTypesEntities(Statement function);
-
-  public abstract List
-  findRelationships(Class entityType, Class relationshipType, Filter filter, int offset, int count);
-
-  public abstract List
-  findRelationships(Class type, Statement function);
-
-  public abstract List findMixedTypesRelationships(
-      Statement function);
-}
-```
-
-### Remote DAO
-
-[Remote DAO] is nothing but a specialized readonly implementation of [Local DAO].
-Rather than retrieving metadata from a local storage, Remote DAO will fetch the metadata from another [GMS].
-The mapping between [entity] type and GMS is implemented as a hard-coded map.
-
-To prevent circular dependency ([rest.li] service depends on remote DAO, which in turn depends on rest.li client generated by each rest.li service),
-Remote DAO will need to construct raw rest.li requests directly, instead of using each entity’s rest.li request builder.
-
-## Customizing Search and Graph Index Updates
-
-In addition to storing and querying metadata, a common requirement is to customize and extend the fields that are being stored in the search or the graph index.
-
-As described in [Metadata Modelling] section, [Entity], [Relationship], and [Search Document] models do not directly encode the logic of how each field should be derived from metadata.
-Instead, this logic needs to be provided in the form of a Java class: a graph or search index builder.
-
-The builders register the [metadata aspect]s of their interest against [MAE Consumer Job](#mae-consumer-job) and will be invoked whenever a MAE involving the corresponding aspect is received.
-If the MAE itself doesn’t contain all the metadata needed, builders can use Remote DAO to fetch from GMS directly.
-
-```java
-public abstract class BaseIndexBuilder {
-
-  BaseIndexBuilder(@Nonnull List> snapshotsInterested);
-
-  @Nullable
-  public abstract List getDocumentsToUpdate(@Nonnull RecordTemplate snapshot);
-
-  @Nonnull
-  public abstract Class getDocumentType();
-}
-```
-
-```java
-public interface GraphBuilder {
-  GraphUpdates build(SNAPSHOT snapshot);
-
-  @Value
-  class GraphUpdates {
-    List entities;
-    List relationshipUpdates;
-  }
-
-  @Value
-  class RelationshipUpdates {
-    List relationships;
-    BaseGraphWriterDAO.RemovalOption preUpdateOperation;
-  }
-}
-```
-
-
-
-[AutocompleteResult]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/AutoCompleteResult.pdl
-[Filter]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/Filter.pdl
-[SortCriterion]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/SortCriterion.pdl
-[SearchResult]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/SearchResult.java
-[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
-[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
-[DAO]: https://en.wikipedia.org/wiki/Data_access_object
-[Pegasus]: https://linkedin.github.io/rest.li/pdl_schema
-[relationship]: ../what/relationship.md
-[entity]: ../what/entity.md
-[aspect]: ../what/aspect.md
-[GMS]: ../what/gms.md
-[Local DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/ebean/EbeanAspectDAO.java
-[Remote DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseRemoteDAO.java
-[MAE]: ../what/mxe.md#metadata-audit-event-mae
-[rest.li]: https://rest.li
-
-
-[Metadata Change Event (MCE)]: ../what/mxe.md#metadata-change-event-mce
-[Metadata Audit Event (MAE)]: ../what/mxe.md#metadata-audit-event-mae
-[MAE]: ../what/mxe.md#metadata-audit-event-mae
-[equivalent Pegasus format]: https://linkedin.github.io/rest.li/how_data_is_represented_in_memory#the-data-template-layer
-[graph]: ../what/graph.md
-[search index]: ../what/search-index.md
-[mce-consumer-job]: ../../metadata-jobs/mce-consumer-job
-[mae-consumer-job]: ../../metadata-jobs/mae-consumer-job
-[Remote DAO]: ../architecture/metadata-serving.md#remote-dao
-[URN]: ../what/urn.md
-[Metadata Modelling]: ../modeling/metadata-model.md
-[Entity]: ../what/entity.md
-[Relationship]: ../what/relationship.md
-[Search Document]: ../what/search-document.md
-[metadata aspect]: ../what/aspect.md
-[Python emitters]: https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library
diff --git a/metadata-ingestion/sink_docs/datahub.md b/metadata-ingestion/sink_docs/datahub.md
index 77048b5e2e..5b0b81d44d 100644
--- a/metadata-ingestion/sink_docs/datahub.md
+++ b/metadata-ingestion/sink_docs/datahub.md
@@ -65,6 +65,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
 | `token` | | | Bearer token used for authentication. |
 | `extra_headers` | | | Extra headers which will be added to the request. |
 | `max_threads` | | `1` | Experimental: Max parallelism for REST API calls |
+| `ca_certificate_path` | | | Path to CA certificate for HTTPS communications |
 
 ## DataHub Kafka
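
For reviewers: the `ca_certificate_path` option added above is a nested field of the REST sink's `config` block, like `token` and `extra_headers`. A minimal illustrative recipe might look like the sketch below; the source, server URL, and certificate path are placeholders, not taken from this change:

```yml
source:
  type: mysql
  config:
    host_port: "localhost:3306"

sink:
  type: datahub-rest
  config:
    server: "https://datahub-gms.example.com:8080"
    ca_certificate_path: "/etc/ssl/certs/datahub-ca.pem"
```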