# DataHub: A Generalized Metadata Search & Discovery Tool
[![Version](https://img.shields.io/github/v/release/linkedin/datahub?include_prereleases)](https://github.com/linkedin/datahub/releases)
[![Build Status](https://travis-ci.org/linkedin/datahub.svg)](https://travis-ci.org/linkedin/datahub)
[![Get on Slack](https://img.shields.io/badge/slack-join-orange.svg)](https://join.slack.com/t/datahubspace/shared_invite/zt-dkzbxfck-dzNl96vBzB06pJpbRwP6RA)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/linkedin/datahub/blob/master/CONTRIBUTING.md)
[![License](https://img.shields.io/github/license/linkedin/datahub)](LICENSE)

---
[Quickstart](#quickstart) |
[Documentation](#documentation) |
[Features](https://github.com/linkedin/datahub/blob/master/docs/features.md) |
[Roadmap](https://github.com/linkedin/datahub/blob/master/docs/roadmap.md) |
[FAQ](https://github.com/linkedin/datahub/blob/master/docs/faq.md) |
[Town Hall](https://github.com/linkedin/datahub/blob/master/docs/townhalls.md)

---

![DataHub](docs/imgs/datahub-logo.png)
> :mega: Next DataHub town hall meeting on July 31st, 9am-10am PDT:
> - [Signup sheet & questions](https://docs.google.com/spreadsheets/d/1hCTFQZnhYHAPa-DeIfyye4MlwmrY7GF4hBds5pTZJYM)
> - Details and recordings of past meetings can be found [here](docs/townhalls.md)

> :sparkles: Latest Update:
> - We released v0.4.1. You can find the release notes [here](https://github.com/linkedin/datahub/releases/tag/v0.4.1).
> - We're on Slack now! [Join](https://join.slack.com/t/datahubspace/shared_invite/zt-dkzbxfck-dzNl96vBzB06pJpbRwP6RA) or [log in with an existing account](https://datahubspace.slack.com). Ask questions and keep up with the latest announcements.
## Introduction
DataHub is LinkedIn's generalized metadata search & discovery tool. To learn more about DataHub, check out our
[LinkedIn blog post](https://engineering.linkedin.com/blog/2019/data-hub) and [Strata presentation](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019).

You should also visit [DataHub Architecture](docs/architecture/architecture.md) to get a better understanding of how DataHub is implemented and [DataHub Onboarding Guide](docs/how/entity-onboarding.md) to understand how to extend DataHub for your own use case.

This repository contains the complete source code for both DataHub's frontend & backend. You can also read about [how we sync the changes](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p) between our internal fork and GitHub.
## Quickstart
1. Install [Docker](https://docs.docker.com/install/) and [Docker Compose](https://docs.docker.com/compose/install/) (if using Linux). Make sure to allocate enough hardware resources to the Docker engine. Tested & confirmed config: 2 CPUs, 8GB RAM, 2GB swap area.
2. Open Docker either from the command line or the desktop app and ensure it is up and running.
3. Clone this repo and `cd` into the root directory of the cloned repository.
4. Run the following command to download and run all Docker containers locally:
    ```
    ./docker/quickstart/quickstart.sh
    ```
    This step takes a while to run the first time, and it may be difficult to tell from the combined log whether DataHub is fully up and running. Please use [this guide](https://github.com/linkedin/datahub/blob/master/docs/debugging.md#how-can-i-confirm-if-all-docker-containers-are-running-as-expected-after-a-quickstart) to verify that each container is running correctly.
5. At this point, you should be able to start DataHub by opening [http://localhost:9001](http://localhost:9001) in your browser. You can sign in using `datahub` as both username and password. However, you'll notice that no data has been ingested yet.
6. To ingest the provided [sample data](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/mce-cli/bootstrap_mce.dat) into DataHub, switch to a new terminal window, `cd` into the cloned `datahub` repo, and run the following command:
    ```
    ./docker/ingestion/ingestion.sh
    ```
    After running this, you should be able to see and search sample datasets in DataHub.

Please refer to the [debugging guide](docs/debugging.md) if you encounter any issues during the quickstart.
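
If you would rather script the wait between steps 4 and 5 than watch the combined log, something like the following may help. It is a minimal sketch and not part of the DataHub repository: `wait_until` is a hypothetical helper that retries an arbitrary command until it succeeds, and the attempt count and delay are assumptions you should tune for your machine.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DataHub): retry a command until it
# succeeds or the given number of attempts is used up.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 60 5 curl --silent --fail http://localhost:9001` would poll the quickstart UI every 5 seconds for roughly 5 minutes, assuming `curl` is installed. A successful response only shows that the frontend answers; the debugging guide remains the way to confirm that every container is healthy.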
## Documentation
* [DataHub Developer's Guide](docs/developers.md)
* [DataHub Architecture](docs/architecture/architecture.md)
* [DataHub Onboarding Guide](docs/how/entity-onboarding.md)
* [Docker Images](docker)
* [Frontend](datahub-frontend)
* [Web App](datahub-web)
* [Generalized Metadata Service](gms)
* [Metadata Ingestion](metadata-ingestion)
* [Metadata Processing Jobs](metadata-jobs)
## Releases
See the [Releases](https://github.com/linkedin/datahub/releases) page for more details. We follow the [SemVer Specification](https://semver.org) when versioning the releases and adopt the [Keep a Changelog convention](https://keepachangelog.com/) for the changelog format.
## FAQs
Frequently Asked Questions about DataHub can be found [here](https://github.com/linkedin/datahub/blob/master/docs/faq.md).
## Features & Roadmap
Check out DataHub's [Features](docs/features.md) & [Roadmap](docs/roadmap.md).
## Contributing
We welcome contributions from the community. Please refer to our [Contributing Guidelines](CONTRIBUTING.md) for more details. We also have a [contrib](contrib) directory for incubating experimental features.
## Community
Join our [Slack workspace](https://app.slack.com/client/TUMKD5EGJ/DV0SB2ZQV/thread/GV2TEEZ5L-1583704023.001100) for important discussions and announcements. You can also find out more about our past and upcoming [town hall meetings](https://github.com/linkedin/datahub/blob/master/docs/townhalls.md).
## Adoption
Here are the companies officially using DataHub. Please feel free to add your company to the list if we missed it.
* [LinkedIn](http://linkedin.com)
* [Expedia Group](http://expedia.com)
* [TypeForm](http://typeform.com)

Here is a list of companies currently building a POC or seriously evaluating DataHub.
* [Microsoft](https://microsoft.com)
* [Saxo Bank](https://www.home.saxo)
* [Morgan Stanley](https://www.morganstanley.com)
* [Instructure](https://www.instructure.com)
* [SpotHero](https://spothero.com)
* [Geotab](https://www.geotab.com)
## Select Articles & Talks
* [DataHub: A Generalized Metadata Search & Discovery Tool](https://engineering.linkedin.com/blog/2019/data-hub)
* [Open sourcing DataHub: LinkedIn's metadata search and discovery platform](https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p)
* [The evolution of metadata: LinkedIn's story @ Strata Data Conference 2019](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019)
* [Journey of metadata at LinkedIn @ Crunch Data Conference 2019](https://www.youtube.com/watch?v=OB-O0Y6OYDE)
* [DataHub Journey with Expedia Group by Arun Vasudevan](https://www.youtube.com/watch?v=ajcRdB22s5o)
* [Data Catalogue — Knowing your data](https://medium.com/albert-franzi/data-catalogue-knowing-your-data-15f7d0724900)
* [LinkedIn Datahub Application Architecture Quick Understanding](https://medium.com/@liangjunjiang/linkedin-datahub-application-architecture-quick-understanding-a5b7868ee205)
* [25 Hot New Data Tools and What They DON'T Do](https://blog.amplifypartners.com/25-hot-new-data-tools-and-what-they-dont-do/)

See the full list [here](https://github.com/linkedin/datahub/blob/mars-lan-patch-2/docs/links.md).