# DataHub: A Generalized Metadata Search & Discovery Tool
[![Version](https://img.shields.io/github/v/release/linkedin/datahub?include_prereleases)](https://github.com/linkedin/datahub/releases)
[![Build Status](https://travis-ci.org/linkedin/datahub.svg)](https://travis-ci.org/linkedin/datahub)
[![License](https://img.shields.io/github/license/linkedin/datahub)](LICENSE)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/linkedin/datahub)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/linkedin/datahub/blob/master/CONTRIBUTING.md)

![DataHub](docs/imgs/datahub-logo.png)

> :mega: First DataHub town hall meeting on March 6th, 10am-11am PST:
> - Video conference link: https://bluejeans.com/4642477444
> - [Signup sheet & questions](https://docs.google.com/spreadsheets/d/1hCTFQZnhYHAPa-DeIfyye4MlwmrY7GF4hBds5pTZJYM)

> :sparkles: Feb 2020 Update:
> - Our [blog post](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) on open sourcing DataHub is out!
> - *DataHub v0.3.0* is [released](https://github.com/linkedin/datahub/releases/tag/v0.3.0)!
## Introduction
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our [LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).

You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented, and the [DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case.

This repository contains the complete source code for both DataHub's frontend & backend. You can also read about [how we sync the changes](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) between our internal fork and GitHub.
## Quickstart
1. Install [Docker](https://docs.docker.com/install/) and [Docker Compose](https://docs.docker.com/compose/install/). Make sure to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 4 CPUs, 8 GB RAM, 2 GB swap.
2. Start Docker, either from the command line or from the desktop app, and make sure it is up and running.
3. Clone this repo and `cd` into the root directory of the cloned repository.
4. Run the command below to download and run all Docker containers locally:
    ```
    cd docker/quickstart && docker-compose pull && docker-compose up --build
    ```
    This step takes a long time, and it can be hard to tell when DataHub is fully up. Refer to [this guide](https://github.com/linkedin/datahub/blob/master/docs/debugging.md#how-can-i-confirm-if-all-docker-containers-are-running-as-expected-after-a-quickstart) to verify that DataHub is up and running.
5. At this point, you should be able to start `DataHub` by opening [http://localhost:9001](http://localhost:9001) in your browser. You can sign in using `datahub` as both the username and the password. However, there is no data yet.
6. To ingest the [provided sample data](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mce-cli/bootstrap_mce.dat) into DataHub, open a new terminal, `cd` into the cloned `datahub` repo, and run the command below:
    ```
    docker build -t ingestion -f docker/ingestion/Dockerfile . && cd docker/ingestion && docker-compose up
    ```
    After running this, you should be able to see the sample data in DataHub.

Refer to the [debugging guide](docs/debugging.md) if you have issues with any of the steps above.
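Since step 4 can take a while, a small polling helper may save some guesswork. The sketch below is not part of the DataHub repo: the function name `wait_for_url` and its default retry count and delay are assumptions chosen for illustration, and it requires `curl`. It only checks that the frontend URL from step 5 answers over HTTP, so it does not replace the container checks in the debugging guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with DataHub.
# Polls a URL until it answers without an HTTP error, or gives up.
# Usage: wait_for_url URL [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  local url="$1"
  local max_attempts="${2:-60}"  # assumed default, not from the DataHub docs
  local delay="${3:-5}"          # assumed default, in seconds
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # --silent hides progress output; --fail makes curl exit non-zero
    # when the server responds with an HTTP status of 400 or above.
    if curl --silent --fail --output /dev/null "$url"; then
      echo "up after $attempt attempt(s): $url"
      return 0
    fi
    # Pause before the next try, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done

  echo "still not reachable after $max_attempts attempt(s): $url" >&2
  return 1
}
```

After sourcing the function, running `wait_for_url http://localhost:9001` in a second terminal while `docker-compose up` is still starting would block until the frontend responds or the attempts run out.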
## Documents
* [DataHub Architecture](docs/architecture/architecture.md)
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
* [Docker Images](docker)
* [Frontend](datahub-frontend)
* [Web App](datahub-web)
* [Generalized Metadata Service](gms)
* [Metadata Ingestion](metadata-ingestion)
* [Metadata Processing Jobs](metadata-jobs)
## Releases
See the [Releases](https://github.com/linkedin/datahub/releases) page for more details.
## Roadmap
Check out DataHub's [Roadmap](docs/roadmap.md).
## Contributing
We welcome contributions from the community. Please refer to [the guidelines](CONTRIBUTING.md) for more details. We also have a [contrib](contrib) directory for incubation.
## Related Articles & Presentations
* [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub)
* [Open sourcing DataHub: LinkedIn's metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p)
* [The evolution of metadata: LinkedIn's story @ Strata Data Conference 2019](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019)
* [Journey of metadata at LinkedIn @ Crunch Data Conference 2019](https://www.youtube.com/watch?v=OB-O0Y6OYDE)
* [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900)
* [How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions](https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb)
* [LinkedIn元数据之旅的最新进展—Data Hub](https://zhuanlan.zhihu.com/p/80459081)
* [LinkedIn gibt die Datenplattform DataHub als Open Source frei](https://www.heise.de/developer/meldung/LinkedIn-gibt-die-Datenplattform-DataHub-als-Open-Source-frei-4663773.html)