datahub/metadata-ingestion/README.md

# Introduction to Metadata Ingestion

<a
    className='button button--primary button--lg'
    href="https://datahubproject.io/integrations">
    Find Integration Source
</a>

## Integration Options

DataHub supports both **push-based** and **pull-based** metadata integration.

Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible.

Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). This allows you to get low-latency metadata integration from the "active" agents in your data ecosystem. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.

This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.

## Getting Started

### Prerequisites

Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. You can either run ingestion via the [UI](../docs/ui-ingestion.md) or via the [CLI](../docs/cli.md). You can reference the CLI usage guide given there as you go through this page.

## Core Concepts

### Sources

Please see our [Integrations page](https://datahubproject.io/integrations) to browse our ingestion sources and filter on their features.

Data systems that we are extracting metadata from are referred to as **Sources**. The `Sources` tab on the left in the sidebar shows you all the sources that are available for you to ingest metadata from. For example, we have sources for [BigQuery](https://datahubproject.io/docs/generated/ingestion/sources/bigquery), [Looker](https://datahubproject.io/docs/generated/ingestion/sources/looker), [Tableau](https://datahubproject.io/docs/generated/ingestion/sources/tableau) and many others.

#### Metadata Ingestion Source Status

We apply a Support Status to each Metadata Source to help you understand the integration reliability at a glance.

![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen): Certified Sources are well-tested & widely-adopted by the DataHub Community. We expect the integration to be stable with few user-facing issues.

![Incubating](https://img.shields.io/badge/support%20status-incubating-blue): Incubating Sources are ready for DataHub Community adoption but have not been tested for a wide variety of edge-cases. We eagerly solicit feedback from the Community to streghten the connector; minor version changes may arise in future releases.

![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey): Testing Sources are available for experiementation by DataHub Community members, but may change without notice.

### Sinks

Sinks are destinations for metadata. When configuring ingestion for DataHub, you're likely to be sending the metadata to DataHub over either the [REST (datahub-sink)](./sink_docs/datahub.md#datahub-rest) or the [Kafka (datahub-kafka)](./sink_docs/datahub.md#datahub-kafka) sink. In some cases, the [File](./sink_docs/file.md) sink is also helpful to store a persistent offline copy of the metadata during debugging.

The default sink that most of the ingestion systems and guides assume is the `datahub-rest` sink, but you should be able to adapt all of them for the other sinks as well!

### Recipes

A recipe is the main configuration file that puts it all together. It tells our ingestion scripts where to pull data from (source) and where to put it (sink).

:::tip
Name your recipe with **.dhub.yaml** extension like _myrecipe.dhub.yaml_ to use vscode or intellij as a recipe editor with autocomplete
and syntax validation.

Make sure yaml plugin is installed for your editor:

- For vscode install [Redhat's yaml plugin](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml)
- For intellij install [official yaml plugin](https://plugins.jetbrains.com/plugin/13126-yaml)

:::

Since `acryl-datahub` version `>=0.8.33.2`, the default sink is assumed to be a DataHub REST endpoint:

- Hosted at "http://localhost:8080" or the environment variable `${DATAHUB_GMS_URL}` if present
- With an empty auth token or the environment variable `${DATAHUB_GMS_TOKEN}` if present.

Here's a simple recipe that pulls metadata from MSSQL (source) and puts it into the default sink (datahub rest).

```yaml
# The simplest recipe that pulls metadata from MSSQL and puts it into DataHub
# using the Rest API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData
# sink section omitted as we want to use the default datahub-rest sink
```

Running this recipe is as simple as:

```shell
datahub ingest -c recipe.dhub.yaml
```

or if you want to override the default endpoints, you can provide the environment variables as part of the command like below:

```shell
DATAHUB_GMS_URL="https://my-datahub-server:8080" DATAHUB_GMS_TOKEN="my-datahub-token" datahub ingest -c recipe.dhub.yaml
```

A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](../docs/cli.md#installing-plugins).

> Note that one recipe file can only have 1 source and 1 sink. If you want multiple sources then you will need multiple recipe files.

### Handling sensitive information in recipes

We automatically expand environment variables in the config (e.g. `${MSSQL_PASSWORD}`),
similar to variable substitution in GNU bash or in docker-compose files. For details, see
https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution. This environment variable substitution should be used to mask sensitive information in recipe files. As long as you can get env variables securely to the ingestion process there would not be any need to store sensitive information in recipes.

### Basic Usage of CLI for ingestion

```shell
pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
datahub ingest -c ./examples/recipes/mssql_to_datahub.dhub.yml
```

The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the
ingestion recipe is producing the desired metadata events before ingesting them into datahub.

```shell
# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n
```

The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
This option helps with quick end-to-end smoke testing of the ingestion recipe.

```shell
# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview
```

By default `--preview` creates 10 workunits. But if you wish to try producing more workunits you can use another option `--preview-workunits`

```shell
# Preview 20 workunits without sending anything to sink
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20
```

#### Reporting

By default, the cli sends an ingestion report to DataHub, which allows you to see the result of all cli-based ingestion in the UI. This can be turned off with the `--no-default-report` flag.

```shell
# Running ingestion with reporting to DataHub turned off
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report
```

The reports include the recipe that was used for ingestion. This can be turned off by adding an additional section to the ingestion recipe.

```yaml
source:
  # source configs

sink:
  # sink configs

# Add configuration for the datahub reporter
reporting:
  - type: datahub
    config:
      report_recipe: false
```

## Transformations

If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can use a transformer to write your own module and integrate it with DataHub. Transformers require extending the recipe with a new section to describe the transformers that you want to run.

For example, a pipeline that ingests metadata from MSSQL and applies a default "important" tag to all datasets is described below:

```yaml
# A recipe to ingest metadata from MSSQL and apply default tags to all tables
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers: # an array of transformers applied sequentially
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - "urn:li:tag:Important"
# default sink, no config needed
```

Check out the [transformers guide](./docs/transformer/intro.md) to learn more about how you can create really flexible pipelines for processing metadata using Transformers!

## Using as a library (SDK)

In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub. In this case, take a look at the [Python emitter](./as-a-library.md) and the [Java emitter](../metadata-integration/java/as-a-library.md) libraries which can be called from your own code.

### Programmatic Pipeline

In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it.

- [programmatic_pipeline.py](./examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline.

## Developing

See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./docs/transformer/intro.md).

## Compatibility

DataHub server uses a 3 digit versioning scheme, while the CLI uses a 4 digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.

For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
-												docs: Ingestion Source Docs Template (#4275)

* testing img.shield for status

* update to hyperlink

* changing link format

* adding status options

* updating prerequisities and quickstart

* update to ingestion docs

* updating template with collapse details

* adding linebreak between pip install commands

* Removed incomplete sentence

* typo fix

* pushing current changes

* testing logos in markdown table

* markdown table fix

* markdown table fix

* adding in additional logos

* transposing markdown table

* settling on final table format

* adding commented-out source template to sidebar.js

* moving reference sidebar and adding trailing comma

* fixing docs build
											
										
										
											2022-03-30 17:36:39 -05:00
+								# Introduction to Metadata Ingestion
-												Start updating readme

											
										
										
											2021-02-12 10:46:28 -08:00
-												fix: refactor toc (#7862)


											
										
										
											2023-04-22 10:36:10 +09:00
+								<a
 								    className='button button--primary button--lg'
 								    href="https://datahubproject.io/integrations">
 								    Find Integration Source
 								</a>
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								## Integration Options
-												docs(ingest): add python versions badge (#2278)


											
										
										
											2021-03-23 02:12:41 -04:00
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								DataHub supports both **push-based** and **pull-based** metadata integration.
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								Push-based integrations allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Supporting both mechanisms means that you can integrate with all your systems in the most flexible way possible.
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
 								Examples of push-based integrations include [Airflow](../docs/lineage/airflow.md), [Spark](../metadata-integration/java/spark-lineage/README.md), [Great Expectations](./integration_docs/great-expectations.md) and [Protobuf Schemas](../metadata-integration/java/datahub-protobuf/README.md). This allows you to get low-latency metadata integration from the "active" agents in your data ecosystem. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.
 								This document describes the pull-based metadata ingestion system that is built into DataHub for easy integration with a wide variety of sources in your data stack.
 								## Getting Started
 								### Prerequisites
 								Before running any metadata ingestion job, you should make sure that DataHub backend services are all running. You can either run ingestion via the [UI](../docs/ui-ingestion.md) or via the [CLI](../docs/cli.md). You can reference the CLI usage guide given there as you go through this page.
 								## Core Concepts
 								### Sources
-												docs(): re-add sources summary page (#7563)

Co-authored-by: Jeff Merrick <jeff@wireform.io>
											
										
										
											2023-03-13 11:37:24 -07:00
+								Please see our [Integrations page](https://datahubproject.io/integrations) to browse our ingestion sources and filter on their features.
-												fix(docs): Fix broken links in ingestion docs (#7183)


											
										
										
											2023-01-30 19:19:31 -05:00
+								Data systems that we are extracting metadata from are referred to as **Sources**. The `Sources` tab on the left in the sidebar shows you all the sources that are available for you to ingest metadata from. For example, we have sources for [BigQuery](https://datahubproject.io/docs/generated/ingestion/sources/bigquery), [Looker](https://datahubproject.io/docs/generated/ingestion/sources/looker), [Tableau](https://datahubproject.io/docs/generated/ingestion/sources/tableau) and many others.
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
 								#### Metadata Ingestion Source Status
-												docs: Ingestion Source Docs Template (#4275)

* testing img.shield for status

* update to hyperlink

* changing link format

* adding status options

* updating prerequisities and quickstart

* update to ingestion docs

* updating template with collapse details

* adding linebreak between pip install commands

* Removed incomplete sentence

* typo fix

* pushing current changes

* testing logos in markdown table

* markdown table fix

* markdown table fix

* adding in additional logos

* transposing markdown table

* settling on final table format

* adding commented-out source template to sidebar.js

* moving reference sidebar and adding trailing comma

* fixing docs build
											
										
										
											2022-03-30 17:36:39 -05:00
 								We apply a Support Status to each Metadata Source to help you understand the integration reliability at a glance.
 								![Certified](https://img.shields.io/badge/support%20status-certified-brightgreen): Certified Sources are well-tested & widely-adopted by the DataHub Community. We expect the integration to be stable with few user-facing issues.
 								![Incubating](https://img.shields.io/badge/support%20status-incubating-blue): Incubating Sources are ready for DataHub Community adoption but have not been tested for a wide variety of edge-cases. We eagerly solicit feedback from the Community to streghten the connector; minor version changes may arise in future releases.
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								![Testing](https://img.shields.io/badge/support%20status-testing-lightgrey): Testing Sources are available for experiementation by DataHub Community members, but may change without notice.
-												docs: Ingestion Source Docs Template (#4275)

* testing img.shield for status

* update to hyperlink

* changing link format

* adding status options

* updating prerequisities and quickstart

* update to ingestion docs

* updating template with collapse details

* adding linebreak between pip install commands

* Removed incomplete sentence

* typo fix

* pushing current changes

* testing logos in markdown table

* markdown table fix

* markdown table fix

* adding in additional logos

* transposing markdown table

* settling on final table format

* adding commented-out source template to sidebar.js

* moving reference sidebar and adding trailing comma

* fixing docs build
											
										
										
											2022-03-30 17:36:39 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								### Sinks
-												feat(ingest): use plugin system based on Python extras (#2224)


											
										
										
											2021-03-11 16:41:05 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								Sinks are destinations for metadata. When configuring ingestion for DataHub, you're likely to be sending the metadata to DataHub over either the [REST (datahub-sink)](./sink_docs/datahub.md#datahub-rest) or the [Kafka (datahub-kafka)](./sink_docs/datahub.md#datahub-kafka) sink. In some cases, the [File](./sink_docs/file.md) sink is also helpful to store a persistent offline copy of the metadata during debugging.
-												Modifying README to bring in old content

											
										
										
											2021-02-14 11:35:45 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								The default sink that most of the ingestion systems and guides assume is the `datahub-rest` sink, but you should be able to adapt all of them for the other sinks as well!
-												Update README.md
											
										
										
											2021-02-15 11:03:38 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								### Recipes
-												docs: Ingestion Source Docs Template (#4275)

* testing img.shield for status

* update to hyperlink

* changing link format

* adding status options

* updating prerequisities and quickstart

* update to ingestion docs

* updating template with collapse details

* adding linebreak between pip install commands

* Removed incomplete sentence

* typo fix

* pushing current changes

* testing logos in markdown table

* markdown table fix

* markdown table fix

* adding in additional logos

* transposing markdown table

* settling on final table format

* adding commented-out source template to sidebar.js

* moving reference sidebar and adding trailing comma

* fixing docs build
											
										
										
											2022-03-30 17:36:39 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								A recipe is the main configuration file that puts it all together. It tells our ingestion scripts where to pull data from (source) and where to put it (sink).
-												docs(ingest): add solutions for common install issues (#2123)


											
										
										
											2021-02-18 20:06:30 -08:00
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								:::tip
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								Name your recipe with **.dhub.yaml** extension like _myrecipe.dhub.yaml_ to use vscode or intellij as a recipe editor with autocomplete
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								and syntax validation.
 								Make sure yaml plugin is installed for your editor:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								- For vscode install [Redhat's yaml plugin](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml)
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								- For intellij install [official yaml plugin](https://plugins.jetbrains.com/plugin/13126-yaml)
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
 								:::
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								Since `acryl-datahub` version `>=0.8.33.2`, the default sink is assumed to be a DataHub REST endpoint:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												fix(cli): correct handling of env variables (#5203)


											
										
										
											2022-06-20 20:53:47 +05:30
+								- Hosted at "http://localhost:8080" or the environment variable `${DATAHUB_GMS_URL}` if present
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								- With an empty auth token or the environment variable `${DATAHUB_GMS_TOKEN}` if present.
-												feat(ingest): standalone metadata emitters (#2207)


											
										
										
											2021-03-10 17:32:12 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								Here's a simple recipe that pulls metadata from MSSQL (source) and puts it into the default sink (datahub rest).
-												feat(ingest): standalone metadata emitters (#2207)


											
										
										
											2021-03-10 17:32:12 -05:00
-												fix(docs): make intro to metadata ingestion easier for beginners (#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
											
										
										
											2022-02-11 22:33:01 +05:30
+								```yaml
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								# The simplest recipe that pulls metadata from MSSQL and puts it into DataHub
-												fix(docs): make intro to metadata ingestion easier for beginners (#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
											
										
										
											2022-02-11 22:33:01 +05:30
+								# using the Rest API.
 								source:
 								  type: mssql
 								  config:
 								    username: sa
 								    password: ${MSSQL_PASSWORD}
 								    database: DemoData
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								# sink section omitted as we want to use the default datahub-rest sink
 								```
-												feat(ingest): use plugin system based on Python extras (#2224)


											
										
										
											2021-03-11 16:41:05 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								Running this recipe is as simple as:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								```shell
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c recipe.dhub.yaml
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								```
-												feat(ingest): use plugin system based on Python extras (#2224)


											
										
										
											2021-03-11 16:41:05 -05:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								or if you want to override the default endpoints, you can provide the environment variables as part of the command like below:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								```shell
-												fix(cli): correct handling of env variables (#5203)


											
										
										
											2022-06-20 20:53:47 +05:30
+								DATAHUB_GMS_URL="https://my-datahub-server:8080" DATAHUB_GMS_TOKEN="my-datahub-token" datahub ingest -c recipe.dhub.yaml
-												feat(ingest): use plugin system based on Python extras (#2224)


											
										
										
											2021-03-11 16:41:05 -05:00
+								```
-												fix(docs): fix broken link (#4242)


											
										
										
											2022-02-25 05:15:52 +05:30
+								A number of recipes are included in the [examples/recipes](./examples/recipes) directory. For full info and context on each source and sink, see the pages described in the [table of plugins](../docs/cli.md#installing-plugins).
-												fix(docs): make intro to metadata ingestion easier for beginners (#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
											
										
										
											2022-02-11 22:33:01 +05:30
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								> Note that one recipe file can only have 1 source and 1 sink. If you want multiple sources then you will need multiple recipe files.
-												fix(docs): make intro to metadata ingestion easier for beginners (#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
											
										
										
											2022-02-11 22:33:01 +05:30
+								### Handling sensitive information in recipes
 								We automatically expand environment variables in the config (e.g. `${MSSQL_PASSWORD}`),
 								similar to variable substitution in GNU bash or in docker-compose files. For details, see
 								https://docs.docker.com/compose/compose-file/compose-file-v2/#variable-substitution. This environment variable substitution should be used to mask sensitive information in recipe files. As long as you can get env variables securely to the ingestion process there would not be any need to store sensitive information in recipes.
-												feat(ingest): use plugin system based on Python extras (#2224)


											
										
										
											2021-03-11 16:41:05 -05:00
-												fix(docs): make intro to metadata ingestion easier for beginners (#4039)

* fix(docs): fix sidebar titles for clarity

* re-arrange docs to make Intro to Metadata ingestion easier for beginners

* minor changes for readability

* add heading

* docs: add note for common question
											
										
										
											2022-02-11 22:33:01 +05:30
+								### Basic Usage of CLI for ingestion
-												feat(ingest): standalone metadata emitters (#2207)


											
										
										
											2021-03-10 17:32:12 -05:00
-												docs: enable better syntax highlighting (#2529)


											
										
										
											2021-05-11 15:16:12 -07:00
+								```shell
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/mssql_to_datahub.dhub.yml
-												More autofixes

											
										
										
											2021-02-11 22:48:08 -08:00
+								```
-												Start updating readme

											
										
										
											2021-02-12 10:46:28 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the
 								ingestion recipe is producing the desired metadata events before ingesting them into datahub.
-												feat(ingestion): Adds --dry-run and --preview options to datahub ingest command. (#3584)


											
										
										
											2021-11-17 09:24:58 -08:00
 								```shell
 								# Dry run
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run
-												feat(ingestion): Adds --dry-run and --preview options to datahub ingest command. (#3584)


											
										
										
											2021-11-17 09:24:58 -08:00
+								# Short-form
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n
-												feat(ingestion): Adds --dry-run and --preview options to datahub ingest command. (#3584)


											
										
										
											2021-11-17 09:24:58 -08:00
+								```
 								The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
 								This option helps with quick end-to-end smoke testing of the ingestion recipe.
 								```shell
 								# Preview
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview
-												feat(ingestion): Adds --dry-run and --preview options to datahub ingest command. (#3584)


											
										
										
											2021-11-17 09:24:58 -08:00
+								# Preview with dry-run
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview
-												feat(ingestion): Adds --dry-run and --preview options to datahub ingest command. (#3584)


											
										
										
											2021-11-17 09:24:58 -08:00
+								```
-												feat(ingest): option for number of workunits in preview (#4517)

* feat(ingest): option for number of workunits in preview + documentation update
											
										
										
											2022-03-29 19:40:22 +05:30
 								By default `--preview` creates 10 workunits. But if you wish to try producing more workunits you can use another option `--preview-workunits`
 								```shell
 								# Preview 20 workunits without sending anything to sink
-												feat(docs): Updating example files with the new ingestion recipe suffix (#5103)


											
										
										
											2022-06-08 00:52:26 +02:00
+								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20
-												feat(ingest): option for number of workunits in preview (#4517)

* feat(ingest): option for number of workunits in preview + documentation update
											
										
										
											2022-03-29 19:40:22 +05:30
+								```
-												feat(ingestion): send reports of ingestion runs to datahub (#5639)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
											
										
										
											2022-08-19 09:08:17 -07:00
+								#### Reporting
 								By default, the cli sends an ingestion report to DataHub, which allows you to see the result of all cli-based ingestion in the UI. This can be turned off with the `--no-default-report` flag.
 								```shell
 								# Running ingestion with reporting to DataHub turned off
 								datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report
 								```
 								The reports include the recipe that was used for ingestion. This can be turned off by adding an additional section to the ingestion recipe.
 								```yaml
 								source:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								  # source configs
-												feat(ingestion): send reports of ingestion runs to datahub (#5639)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
											
										
										
											2022-08-19 09:08:17 -07:00
 								sink:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								  # sink configs
-												feat(ingestion): send reports of ingestion runs to datahub (#5639)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
											
										
										
											2022-08-19 09:08:17 -07:00
 								# Add configuration for the datahub reporter
 								reporting:
 								  - type: datahub
 								    config:
 								      report_recipe: false
 								```
-												feature(ingestion): Adding the concept of transformers (#2411)

Fixes: #2410

Co-authored-by: thomas.larsson <thomas.larsson@klarna.com>
											
										
										
											2021-04-18 20:15:05 +02:00
+								## Transformations
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can use a transformer to write your own module and integrate it with DataHub. Transformers require extending the recipe with a new section to describe the transformers that you want to run.
 								For example, a pipeline that ingests metadata from MSSQL and applies a default "important" tag to all datasets is described below:
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								```yaml
 								# A recipe to ingest metadata from MSSQL and apply default tags to all tables
 								source:
 								  type: mssql
 								  config:
 								    username: sa
 								    password: ${MSSQL_PASSWORD}
 								    database: DemoData
 								transformers: # an array of transformers applied sequentially
 								  - type: simple_add_dataset_tags
 								    config:
 								      tag_urns:
 								        - "urn:li:tag:Important"
 								# default sink, no config needed
 								```
-												feature(ingestion): Adding the concept of transformers (#2411)

Fixes: #2410

Co-authored-by: thomas.larsson <thomas.larsson@klarna.com>
											
										
										
											2021-04-18 20:15:05 +02:00
-												feat(transformers): Add semantics & transform_aspect support in transformers (#5514)

Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2022-09-07 03:14:14 +05:30
+								Check out the [transformers guide](./docs/transformer/intro.md) to learn more about how you can create really flexible pipelines for processing metadata using Transformers!
-												feature(ingestion): Adding the concept of transformers (#2411)

Fixes: #2410

Co-authored-by: thomas.larsson <thomas.larsson@klarna.com>
											
										
										
											2021-04-18 20:15:05 +02:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								## Using as a library (SDK)
-												feat(ingest): standalone metadata emitters (#2207)


											
										
										
											2021-03-10 17:32:12 -05:00
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								In some cases, you might want to construct Metadata events directly and use programmatic ways to emit that metadata to DataHub. In this case, take a look at the [Python emitter](./as-a-library.md) and the [Java emitter](../metadata-integration/java/as-a-library.md) libraries which can be called from your own code.
-												docs(scheduling): re-arrange docs related to scheduling, lineage, CLI (#3669)


											
										
										
											2021-12-07 23:39:59 +05:30
 								### Programmatic Pipeline
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
-												feat(cli): improve error reporting, make sink config optional (#4718)


											
										
										
											2022-04-24 17:12:21 -07:00
+								In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it.
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
 								- [programmatic_pipeline.py](./examples/library/programatic_pipeline.py) - a basic mysql to REST programmatic pipeline.
-												docs(ingestion): Emitter api examples + Documentation (#3599)


											
										
										
											2021-11-18 23:30:25 -08:00
-												feat(ingest): start airflow integration + metadata builders (#2331)


											
										
										
											2021-04-05 19:11:28 -07:00
+								## Developing
-												docs(ingest): clarify docs for new ingestion framework (#2108)


											
										
										
											2021-02-16 15:31:13 -08:00
-												feat(transformers): Add semantics & transform_aspect support in transformers (#5514)

Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
											
										
										
											2022-09-07 03:14:14 +05:30
+								See the guides on [developing](./developing.md), [adding a source](./adding-source.md) and [using transformers](./docs/transformer/intro.md).
-												feat: Adding support for nested schemas in ingestion and visualization (#3079)


											
										
										
											2021-08-11 15:47:18 -07:00
-												docs(ingest): add details about backwards compatibility guarantees (#7439)


											
										
										
											2023-02-28 13:33:58 -08:00
+								## Compatibility
 								DataHub server uses a 3 digit versioning scheme, while the CLI uses a 4 digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
 								We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.
 								For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.