# DataHub: A Generalized Metadata Search & Discovery Tool
[![Version](https://img.shields.io/github/v/release/linkedin/datahub?include_prereleases)](https://github.com/linkedin/datahub/releases)
[![Build Status](https://travis-ci.org/linkedin/datahub.svg)](https://travis-ci.org/linkedin/datahub)
[![Get on Slack](https://img.shields.io/badge/slack-join-orange.svg)](https://datahubspace.slack.com/join/shared_invite/zt-cl60ng6o-6odCh_I~ejZKE~a9GG30PA)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/linkedin/datahub/blob/master/CONTRIBUTING.md)
[![License](https://img.shields.io/github/license/linkedin/datahub)](LICENSE)

![DataHub](docs/imgs/datahub-logo.png)

> :mega: Next DataHub town hall meeting on April 3rd, 9am-10am PDT:
> - [Signup sheet & questions](https://docs.google.com/spreadsheets/d/1hCTFQZnhYHAPa-DeIfyye4MlwmrY7GF4hBds5pTZJYM)
> - Details and recordings of past meetings can be found [here](docs/townhalls.md)

> :sparkles: Mar 2020 Update:
> - DataHub v0.3.1 has just been released. See the [release notes](https://github.com/linkedin/datahub/releases/tag/v0.3.1) for more details.
> - We're on [Slack](https://datahubspace.slack.com/join/shared_invite/zt-cl60ng6o-6odCh_I~ejZKE~a9GG30PA) now! Ask questions and keep up with the latest announcements.

## Introduction
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).

See [DataHub Architecture](docs/architecture/architecture.md) for how DataHub is implemented, and the [DataHub Onboarding Guide](docs/how/entity-onboarding.md) for how to extend DataHub for your own use case.

This repository contains the complete source code for both DataHub's frontend and backend. You can also read about [how we sync the changes](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) between our internal fork and GitHub.

## Quickstart
1. Install [Docker](https://docs.docker.com/install/) and, if using Linux, [docker-compose](https://docs.docker.com/compose/install/). Make sure to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 2 CPUs, 8 GB RAM, 2 GB swap.
2. Open Docker either from the command line or the desktop app and ensure it is up and running.
3. Clone this repo and `cd` into the root directory of the cloned repository.
4. Run the following command to download and run all Docker containers locally:
    ```
    cd docker/quickstart && docker-compose pull && docker-compose up --build
    ```
    This step takes a while to run the first time, and it may be difficult to tell if DataHub is fully up and running from the combined log. Please use [this guide](https://github.com/linkedin/datahub/blob/master/docs/debugging.md#how-can-i-confirm-if-all-docker-containers-are-running-as-expected-after-a-quickstart) to verify that each container is running correctly.
5. At this point, you should be able to access DataHub by opening [http://localhost:9001](http://localhost:9001) in your browser. You can sign in using `datahub` as both username and password. However, you'll notice that no data has been ingested yet.
6. To ingest the provided [sample data](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mce-cli/bootstrap_mce.dat) into DataHub, switch to a new terminal window, `cd` into the cloned `datahub` repo, and run the following command:
    ```
    docker build -t ingestion -f docker/ingestion/Dockerfile . && cd docker/ingestion && docker-compose up
    ```
    After running this, you should be able to see and search sample datasets in DataHub.

Please refer to the [debugging guide](docs/debugging.md) if you encounter any issues during the quickstart.
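
The check in step 5 can also be scripted. The following is a minimal sketch and is not part of the DataHub repository: it defines a hypothetical helper, `wait_for_url`, that polls a URL with `curl` until it answers successfully or the retry budget runs out. The function name, the default of 60 attempts, and the 5-second delay are assumptions chosen for illustration; only the address `http://localhost:9001` comes from the steps above.

```shell
# Hypothetical helper (not shipped with DataHub): poll a URL until it responds.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
    url="$1"
    attempts="${2:-60}"   # assumed default: 60 tries
    delay="${3:-5}"       # assumed default: 5 seconds between tries
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP errors (4xx/5xx).
        if curl --silent --fail --output /dev/null "$url"; then
            echo "up: $url"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "timed out: $url" >&2
    return 1
}
```

After pasting the function into a shell session, `wait_for_url http://localhost:9001` blocks until the frontend answers or roughly five minutes have passed. A successful response only indicates that the web frontend is serving requests; it does not confirm that every container is healthy, so the debugging guide remains the reference for a full check.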

## Documents
* [DataHub Architecture](docs/architecture/architecture.md)
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
* [Docker Images](docker)
* [Frontend](datahub-frontend)
* [Web App](datahub-web)
* [Generalized Metadata Service](gms)
* [Metadata Ingestion](metadata-ingestion)
* [Metadata Processing Jobs](metadata-jobs)

## Releases
See the [Releases](https://github.com/linkedin/datahub/releases) page for more details. We follow the [SemVer Specification](https://semver.org) when versioning releases.

## Features & Roadmap
Check out DataHub's [Features](docs/features.md) & [Roadmap](docs/roadmap.md).

## Contributing
We welcome contributions from the community. Please refer to [the guidelines](CONTRIBUTING.md) for more details. We also have a [contrib](contrib) directory for incubation.

## Community
Join our [Slack channel](https://app.slack.com/client/TUMKD5EGJ/DV0SB2ZQV/thread/GV2TEEZ5L-1583704023.001100) for important discussions and announcements. You can also find out more about our past and upcoming [town hall meetings](https://github.com/linkedin/datahub/blob/master/docs/townhalls.md).

## FAQs
Frequently Asked Questions about DataHub can be found [here](https://github.com/linkedin/datahub/blob/master/docs/faq.md).

## Related Articles & Presentations
* [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub)
* [Open sourcing DataHub: LinkedIn's metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p)
* [The evolution of metadata: LinkedIn's story @ Strata Data Conference 2019](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019)
* [Journey of metadata at LinkedIn @ Crunch Data Conference 2019](https://www.youtube.com/watch?v=OB-O0Y6OYDE)
* [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900)
* [How LinkedIn, Uber, Lyft, Airbnb and Netflix are Solving Data Management and Discovery for Machine Learning Solutions](https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb)
* [LinkedIn元数据之旅的最新进展—Data Hub](https://zhuanlan.zhihu.com/p/80459081)
* [数据治理篇: 元数据之datahub-概述](https://www.jianshu.com/p/04630b0c63f7)
* [LinkedIn gibt die Datenplattform DataHub als Open Source frei](https://www.heise.de/developer/meldung/LinkedIn-gibt-die-Datenplattform-DataHub-als-Open-Source-frei-4663773.html)
* [Linkedin bringt Open-Source-Datahub](https://www.itmagazine.ch/artikel/71532/Linkedin_bringt_Open-Source-Datahub.html)