docs(ingestion): remove outdated data-source-onboarding.md docs (#2360)

Harshal Sheth 2021-04-09 17:33:09 -07:00 committed by GitHub
parent 98813ed2b5
commit 011ab9a09a
3 changed files with 2 additions and 19 deletions

sidebars.js (Docusaurus sidebar config)

@@ -40,7 +40,6 @@ module.exports = {
"docs/quickstart",
"docs/debugging",
"metadata-ingestion/README",
// TODO "docs/how/data-source-onboarding",
],
Architecture: [
"docs/architecture/architecture",
@@ -50,8 +49,6 @@ module.exports = {
//"docs/what/gms",
"datahub-web-react/README",
],
// },
// developerGuideSidebar: {
"Metadata Modeling": [
// TODO: change the titles of these, removing the "What is..." portion from the sidebar"
"docs/what/entity",
@@ -67,6 +64,7 @@ module.exports = {
// TODO: the titles of these should not be in question form in the sidebar
"docs/developers",
"docs/docker/development",
"metadata-ingestion/README",
"docs/what/graph",
"docs/what/search-index",
"docs/how/add-new-aspect",
@@ -104,8 +102,6 @@ module.exports = {
// WIP "docs/advanced/partial-update",
// WIP "docs/advanced/pdl-best-practices",
],
// },
// operatorGuideSidebar: {
Deployment: [
"docs/how/kafka-config",
"docker/README",
@@ -125,7 +121,6 @@ module.exports = {
// - "docker/neo4j/README",
// - "docker/postgres/README",
],
// },
Community: [
"docs/slack",
"docs/links",

docs/faq.md

@@ -91,7 +91,7 @@ MCE is the ideal way to push metadata from different security zones, assuming th
Currently, DataHub supports all major database providers that are supported by Ebean as the document store i.e. Oracle, Postgres, MySQL, H2. We also support [Espresso](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store), which is LinkedIn's proprietary document store. Other than that, we support Elasticsearch and Neo4j for search and graph use cases, respectively. However, as data stores in the backend are all abstracted and accessed through DAOs, you should be able to easily support other data stores by plugging in your own DAO implementations. Please refer to [Metadata Serving](architecture/metadata-serving.md) for more details.
## For which stores do you have discovery services?
Supported data sources are listed [here](https://github.com/linkedin/datahub/tree/master/metadata-ingestion). To onboard your own data source which is not listed there, you can refer to the [onboarding guide](how/data-source-onboarding.md).
Supported data sources are listed [here](../metadata-ingestion/README.md). It's also fairly easy to add your own sources.
## How is metadata ingested in DataHub? Is it real-time?
You can call the [rest.li](https://github.com/linkedin/rest.li) API to ingest metadata in DataHub directly instead of using Kafka event. Metadata ingestion is real-time if you're updating via rest.li API. It's near real-time in the case of Kafka events due to the asynchronous nature of Kafka processing.
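
For illustration, here is a minimal sketch of a direct rest.li ingestion call with Python's `requests`; the endpoint path, headers, URN, and payload shape are assumptions for this example and should be checked against the rest.li resources exposed by your GMS deployment.

```python
# Hypothetical sketch: push a dataset snapshot to GMS over its rest.li API.
# The endpoint path and payload shape are assumptions; verify them against the
# rest.li resource definitions of your GMS version.
import json
import requests

GMS_URL = "http://localhost:8080"  # assumed local GMS instance

snapshot = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.users,PROD)",
    "aspects": [
        {"com.linkedin.dataset.DatasetProperties": {"description": "Example dataset"}}
    ],
}

response = requests.post(
    f"{GMS_URL}/datasets?action=ingest",  # assumed ingest action path
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
    data=json.dumps({"snapshot": snapshot}),
)
response.raise_for_status()
```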

docs/how/data-source-onboarding.md (deleted)

@@ -1,12 +0,0 @@
# How to onboard a new data source?
In the [metadata-ingestion](https://github.com/linkedin/datahub/tree/master/metadata-ingestion) module, DataHub provides ETL scripts for onboarding metadata from various kinds of sources, including [Hive](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/hive-etl), [Kafka](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/kafka-etl), [LDAP](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/ldap-etl), [MySQL](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/mysql-etl), and generic [RDBMS](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/rdbms-etl) systems; these scripts feed the extracted metadata into the [GMS](../what/gms.md).
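
For orientation, here is a minimal sketch of the overall shape such an ETL script takes; the function names and sample data are placeholders, not the actual scripts.

```python
# Hypothetical skeleton of a metadata ETL script; the helpers are placeholders
# that the Extract / Transform / Load sections below sketch in more detail.

def extract_metadata():
    # A real script would query the source system (Hive, LDAP, MySQL, ...) here.
    return [{"name": "example_db.users", "columns": ["id", "email"]}]

def transform_to_mce(record):
    # A real script would build a schema-conformant MetadataChangeEvent here.
    return {"urn": f"urn:li:dataset:(urn:li:dataPlatform:mysql,{record['name']},PROD)"}

def load_to_kafka(events):
    # A real script would produce the events to the MetadataChangeEvent Kafka topic.
    for event in events:
        print(event)

if __name__ == "__main__":
    load_to_kafka([transform_to_mce(r) for r in extract_metadata()])
```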
## 1. Extract
The extract step is tightly specific to the data source; the [data accessor](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L103) must therefore faithfully capture the metadata of the underlying data platform.
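
As a hedged illustration of the extract step (not the actual hive-etl/mysql-etl code), the sketch below pulls table and column metadata from a relational source with SQLAlchemy's inspector; the connection string is a placeholder.

```python
# Illustrative extract step against an assumed MySQL source; the DSN is a placeholder.
from sqlalchemy import create_engine, inspect

engine = create_engine("mysql+pymysql://user:password@localhost:3306/example_db")
inspector = inspect(engine)

extracted = []
for table in inspector.get_table_names():
    columns = inspector.get_columns(table)
    extracted.append({
        "name": table,
        "columns": [{"name": col["name"], "type": str(col["type"])} for col in columns],
    })
print(f"Extracted metadata for {len(extracted)} tables")
```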
## 2. Transform
In the transform stage, the extracted metadata is [encapsulated in a valid MetadataChangeEvent](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L56) using the defined aspects and snapshots.
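
A simplified sketch of the transform step follows; the real event must validate against the MetadataChangeEvent Avro/PDL schema, so the dict layout and aspect contents here only approximate that structure.

```python
# Illustrative transform step: wrap extracted table metadata in an MCE-like dict.
# The real MCE must conform to the MetadataChangeEvent schema; this only
# approximates the proposedSnapshot / aspects layout.
def build_mce(table, platform="mysql", env="PROD"):
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table['name']},{env})"
    return {
        "proposedSnapshot": {
            "com.linkedin.metadata.snapshot.DatasetSnapshot": {
                "urn": urn,
                "aspects": [
                    {
                        "com.linkedin.dataset.DatasetProperties": {
                            "description": "Extracted by the example ETL sketch",
                        }
                    }
                ],
            }
        }
    }

example_mce = build_mce({"name": "example_db.users"})
```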
## 3. Load
The load step uses the [Kafka producer](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L80) to publish the event for pub/sub-based ingestion; schema validation is applied along the way to check metadata quality.
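
To round out the example, here is a hedged sketch of the load step using confluent-kafka's AvroProducer, whose Avro serialization against a registered schema provides the schema validation mentioned above; the broker address, schema registry URL, topic name, and schema file path are all assumptions.

```python
# Illustrative load step: publish an MCE to Kafka with Avro serialization so the
# event is checked against the MetadataChangeEvent schema before being sent.
# Broker address, schema registry URL, topic name, and schema path are assumptions.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.load("MetadataChangeEvent.avsc")  # assumed local copy of the MCE schema

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)

mce = {"proposedSnapshot": {}}  # placeholder; use the event built in the transform step
producer.produce(topic="MetadataChangeEvent_v4", value=mce)  # topic name is an assumption
producer.flush()
```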