docs(ingestion): remove outdated data-source-onboarding.md docs (#2360)
parent 98813ed2b5
commit 011ab9a09a
@@ -40,7 +40,6 @@ module.exports = {
"docs/quickstart",
"docs/debugging",
"metadata-ingestion/README",
// TODO "docs/how/data-source-onboarding",
],
Architecture: [
"docs/architecture/architecture",
@@ -50,8 +49,6 @@ module.exports = {
//"docs/what/gms",
"datahub-web-react/README",
],
// },
// developerGuideSidebar: {
"Metadata Modeling": [
// TODO: change the titles of these, removing the "What is..." portion from the sidebar"
"docs/what/entity",
@@ -67,6 +64,7 @@ module.exports = {
// TODO: the titles of these should not be in question form in the sidebar
"docs/developers",
"docs/docker/development",
"metadata-ingestion/README",
"docs/what/graph",
"docs/what/search-index",
"docs/how/add-new-aspect",
@@ -104,8 +102,6 @@ module.exports = {
// WIP "docs/advanced/partial-update",
// WIP "docs/advanced/pdl-best-practices",
],
// },
// operatorGuideSidebar: {
Deployment: [
"docs/how/kafka-config",
"docker/README",
@@ -125,7 +121,6 @@ module.exports = {
// - "docker/neo4j/README",
// - "docker/postgres/README",
],
// },
Community: [
"docs/slack",
"docs/links",
@@ -91,7 +91,7 @@ MCE is the ideal way to push metadata from different security zones, assuming th

Currently, DataHub supports all major database providers that are supported by Ebean as the document store, i.e. Oracle, Postgres, MySQL, and H2. We also support [Espresso](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store), which is LinkedIn's proprietary document store. Beyond that, we support Elasticsearch and Neo4j for search and graph use cases, respectively. However, since the backend data stores are all abstracted and accessed through DAOs, you should be able to support other data stores easily by plugging in your own DAO implementations. Please refer to [Metadata Serving](architecture/metadata-serving.md) for more details.

## For which stores do you have discovery services?

Supported data sources are listed [here](https://github.com/linkedin/datahub/tree/master/metadata-ingestion). To onboard your own data source that is not listed there, refer to the [onboarding guide](how/data-source-onboarding.md).

Supported data sources are listed [here](../metadata-ingestion/README.md). It's also fairly easy to add your own sources.

## How is metadata ingested in DataHub? Is it real-time?

You can call the [rest.li](https://github.com/linkedin/rest.li) API to ingest metadata into DataHub directly instead of using Kafka events. Metadata ingestion is real-time when you update via the rest.li API, and near real-time for Kafka events due to the asynchronous nature of Kafka processing.
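As a rough sketch of such a direct rest.li write: the GMS host, the `/datasets?action=ingest` endpoint, and the snapshot payload shape below are assumptions for illustration and may vary between releases.

```python
import json
import requests

# Hypothetical GMS host and ingest action; check the rest.li resources of your
# DataHub version before relying on these.
GMS_URL = "http://localhost:8080"

snapshot = {
    "snapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,example_db.example_table,PROD)",
        "aspects": [
            {
                "com.linkedin.dataset.DatasetProperties": {
                    "description": "Dataset ingested directly through the rest.li API"
                }
            }
        ],
    }
}

response = requests.post(
    f"{GMS_URL}/datasets?action=ingest",
    headers={"X-RestLi-Protocol-Version": "2.0.0"},
    data=json.dumps(snapshot),
)
response.raise_for_status()  # a 200 means the write was applied synchronously
```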
@@ -1,12 +0,0 @@

# How to onboard a new data source?

Under [metadata-ingestion](https://github.com/linkedin/datahub/tree/master/metadata-ingestion), DataHub provides onboarding for various kinds of metadata sources, including [Hive](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/hive-etl), [Kafka](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/kafka-etl), [LDAP](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/ldap-etl), [mySQL](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/mysql-etl), and generic [RDBMS](https://github.com/linkedin/datahub/tree/master/metadata-ingestion/rdbms-etl), as ETL scripts that feed the metadata to the [GMS](../what/gms.md).

## 1. Extract

The extract step is specific to each data source, so the [data accessor](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L103) must faithfully reflect the metadata of the underlying data platform.
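A hedged sketch of what such a source-specific accessor can look like; a relational source read through `information_schema` is used purely for illustration, and the driver, connection settings, and `extract_tables` helper are assumptions rather than part of the DataHub codebase.

```python
import pymysql  # any source-specific client library could sit here instead


def extract_tables(host: str, user: str, password: str, database: str) -> dict:
    """Read table and column metadata straight from the source system."""
    conn = pymysql.connect(host=host, user=user, password=password, database=database)
    try:
        with conn.cursor() as cursor:
            cursor.execute(
                "SELECT table_name, column_name, data_type "
                "FROM information_schema.columns WHERE table_schema = %s",
                (database,),
            )
            rows = cursor.fetchall()
    finally:
        conn.close()

    # Group the raw rows by table so later stages can build one event per table.
    tables: dict = {}
    for table_name, column_name, data_type in rows:
        tables.setdefault(table_name, []).append({"name": column_name, "type": data_type})
    return tables
```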
## 2. Transform

In the transform stage, the extracted metadata is [encapsulated in a valid MetadataChangeEvent](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L56) under the defined aspects and snapshots.
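A minimal sketch of that wrapping step, using the columns extracted above; the exact field names are dictated by the MCE Avro schema, so the keys and the `build_mce` helper below are illustrative only and should be checked against the schema in use.

```python
def build_mce(table_name: str, columns: list, platform: str = "mysql", env: str = "PROD") -> dict:
    """Wrap extracted metadata in an MCE-shaped dict (keys are illustrative)."""
    dataset_urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table_name},{env})"
    return {
        "auditHeader": None,
        "proposedSnapshot": {
            "urn": dataset_urn,
            "aspects": [
                {
                    "com.linkedin.dataset.DatasetProperties": {
                        "description": f"Imported from {platform}",
                        # Assumed aspect: a real pipeline would emit SchemaMetadata
                        # with typed fields instead of a flat property string.
                        "customProperties": {
                            "columns": ", ".join(c["name"] for c in columns)
                        },
                    }
                }
            ],
        },
        "proposedDelta": None,
    }
```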
## 3. Load

The load step leverages the [Kafka producer](https://github.com/linkedin/datahub/blob/master/metadata-ingestion/ldap-etl/ldap_etl.py#L80) to enable pub-sub, event-based ingestion. Schema validation is performed along the way to check metadata quality.
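A hedged sketch of that publishing step using `confluent_kafka`'s `AvroProducer`; the topic name, schema registry URL, and schema file path are assumptions and should be adjusted to the deployment at hand.

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Assumed schema file path and topic name, for illustration only.
value_schema = avro.load("MetadataChangeEvent.avsc")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",
        "schema.registry.url": "http://localhost:8081",
    },
    default_value_schema=value_schema,
)


def load_mce(mce: dict) -> None:
    # Serializing against the Avro schema doubles as the quality check:
    # a record that does not match the schema is rejected before it is sent.
    producer.produce(topic="MetadataChangeEvent_v4", value=mce)
    producer.flush()
```

Chaining the three stages then amounts to `extract_tables(...)` → `build_mce(...)` → `load_mce(...)` for each table discovered in the source.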