# DataHub's Reporting Framework for Ingestion Job Telemetry
  
						 
					
						
							
								
									
										
										
										
DataHub's reporting framework lets you configure reporting providers on ingestion pipelines so that telemetry about
ingestion job runs is sent to external systems for monitoring. It is powered by DataHub's stateful ingestion
framework. The `datahub` reporting provider ships with the standard client installation and reports ingestion job
telemetry to the DataHub backend.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
**_NOTE_**: This feature requires the server to be `statefulIngestion` capable.
Stateful ingestion is available in the metadata service from version `0.8.20` onward.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
To check whether you are running a stateful-ingestion-capable server, query its `/config` endpoint:
							 
						 
					
						
							
								
									
										
										
										
```console
curl http://<datahub-gms-endpoint>/config
{
  models: { },
  statefulIngestionCapable: true, # <-- this should be present and true
  retention: "true",
  noCode: "true"
}
```
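
The same check can be scripted. The sketch below is a minimal illustration, assuming the `/config` endpoint returns a JSON object with a boolean `statefulIngestionCapable` field as in the output above. The helper names are hypothetical and are not part of the DataHub client.

```python
import json
from urllib.request import urlopen


def is_stateful_ingestion_capable(config: dict) -> bool:
    """Return True only if the flag is present and is exactly the boolean True."""
    return config.get("statefulIngestionCapable") is True


def fetch_gms_config(gms_endpoint: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON document served at <gms_endpoint>/config."""
    url = gms_endpoint.rstrip("/") + "/config"
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a reachable server, `is_stateful_ingestion_capable(fetch_gms_config("http://<datahub-gms-endpoint>"))` would report whether reporting can be enabled. Keeping the flag check separate from the network call makes it easy to test against a saved response.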
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Config details
  
						 
					
						
							
								
									
										
										
										
Ingestion reporting providers are configured as a list under the `reporting` config parameter of the pipeline.
Each entry in the list is an object with a `type` and a `config` field. Telemetry data is sent to every
reporting provider in the list.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Note that a `.` is used to denote nested fields, and `[idx]` is used to denote an element of an array of objects in the YAML recipe.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `reporting[idx].type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. |
| `reporting[idx].config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19). | The configuration required for initializing the datahub reporting provider. |
| `pipeline_name` | ✅ | | The name of the ingestion pipeline. This is used as part of the identifying key for the telemetry data reported by each job in the ingestion pipeline. |
							 
						 
					
						
							
								
									
										
										
										
								#### Supported sources
  
						 
					
						
							
								
									
										
										
										
- All SQL-based sources.
- `snowflake_usage`.
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
								#### Sample configuration
  
						 
					
						
							
								
									
										
										
										
```yaml
source:
  type: "snowflake"
  config:
    username: <user_name>
    password: <password>
    role: <role>
    host_port: <host_port>
    warehouse: <ware_house>
    # Rest of the source specific params ...

# This is mandatory. Changing it will cause old telemetry correlation to be lost.
pipeline_name: "my_snowflake_pipeline_1"

# Pipeline-level datahub_api configuration.
datahub_api: # Optional. If provided, this config is used by the "datahub" reporting provider.
  server: "http://localhost:8080"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

reporting:
  - type: "datahub" # Required
    config: # Optional.
      datahub_api: # default value
        server: "http://localhost:8080"
```
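
To make the fallback order from the config table concrete, the sketch below resolves the effective configuration of each reporting provider in a recipe that has already been parsed into a Python dictionary. It only illustrates the documented behaviour (provider-level `config`, then the pipeline-level `datahub_api`, then a default) and is not DataHub's actual implementation. The function name and the `DEFAULT_DATAHUB_API` constant are hypothetical, and the default server value is simply the one used in the sample.

```python
from __future__ import annotations

# Assumed default, mirroring the "default value" shown in the sample recipe.
DEFAULT_DATAHUB_API = {"server": "http://localhost:8080"}


def resolve_reporting_configs(recipe: dict) -> list[dict]:
    """Return one {"type", "config"} dict per entry under `reporting`."""
    if not recipe.get("pipeline_name"):
        raise ValueError("pipeline_name is required")

    # Pipeline-level datahub_api wins over the built-in default.
    fallback = recipe.get("datahub_api") or DEFAULT_DATAHUB_API

    resolved = []
    for idx, provider in enumerate(recipe.get("reporting", [])):
        if "type" not in provider:
            raise ValueError(f"reporting[{idx}].type is required")
        # An explicit provider-level config takes precedence over the fallback.
        config = provider.get("config") or {"datahub_api": fallback}
        resolved.append({"type": provider["type"], "config": config})
    return resolved


recipe = {
    "pipeline_name": "my_snowflake_pipeline_1",
    "datahub_api": {"server": "http://localhost:8080"},
    "reporting": [{"type": "datahub"}],  # no provider-level config
}
print(resolve_reporting_configs(recipe))
```

Here the single `datahub` provider has no `config` of its own, so it picks up the pipeline-level `datahub_api` block.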
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								## Reporting Ingestion State Provider (Developer Guide)
  
						 
					
						
							
								
									
										
										
										
An ingestion reporting state provider is responsible for saving and retrieving the ingestion telemetry
associated with the ingestion runs of various jobs inside the source connector of the ingestion pipeline.
The data model used for capturing the telemetry is [DatahubIngestionRunSummary](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/datajob/datahub/DatahubIngestionRunSummary.pdl).
A reporting ingestion state provider needs to implement the `IngestionReportingProviderBase`
interface and register itself with DataHub by adding an entry under the `datahub.ingestion.reporting_provider.plugins`
key of the `entry_points` section in [setup.py](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py)
with its type and implementation class as shown below.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
```python
entry_points = {
    # <snip other keys>
    "datahub.ingestion.reporting_provider.plugins": [
        "datahub = datahub.ingestion.reporting.datahub_ingestion_run_summary_provider:DatahubIngestionRunSummaryProvider",
        "file = datahub.ingestion.reporting.file_reporter:FileReporter",
    ],
}
```
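
The strings in that list follow the standard Python entry point syntax, `name = module:attribute`, where the name is the provider type used in the recipe. As a rough sketch, the snippet below splits such a string into its parts and lists whatever providers are registered in the current environment through `importlib.metadata`. The helper names are illustrative, and this is not how DataHub's own plugin registry is implemented.

```python
from __future__ import annotations

from importlib.metadata import entry_points

GROUP = "datahub.ingestion.reporting_provider.plugins"


def parse_entry_point(spec: str) -> tuple[str, str, str]:
    """Split 'name = module:attr' into (name, module, attr)."""
    name, _, target = spec.partition("=")
    module, _, attr = target.strip().partition(":")
    return name.strip(), module.strip(), attr.strip()


def installed_reporting_providers() -> dict[str, str]:
    """Map each registered provider type to its 'module:attr' target."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group=GROUP)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(GROUP, [])
    return {ep.name: ep.value for ep in selected}


print(parse_entry_point("file = datahub.ingestion.reporting.file_reporter:FileReporter"))
print(installed_reporting_providers())
```

In an environment without any package registering this group, the second call returns an empty dictionary.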
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
### DataHub Reporting Ingestion State Provider
  
						 
					
						
							
								
									
										
										
										
This is the reporting state provider implementation that is available out of the box in DataHub. Its type is `datahub`, and it is implemented on top
of the `datahub_api` client and the timeseries aspect capabilities of the DataHub backend.
							 
						 
					
						
							
								
									
										
										
										
								#### Config details
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
Note that a `.` is used to denote nested fields in the YAML recipe.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `type` | ✅ | `datahub` | The type of the ingestion reporting provider registered with datahub. |
| `config` | | The `datahub_api` config if set at pipeline level. Otherwise, the default `DatahubClientConfig`. See the [defaults](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19). | The configuration required for initializing the datahub reporting provider. |