datahub/metadata-integration/java/spark-lineage/README.md

# Spark

To integrate Spark with DataHub, we provide a lightweight Java agent that listens for Spark application and job events and pushes metadata out to DataHub in real-time. The agent listens to events such application start/end, and SQLExecution start/end to create pipelines (i.e. DataJob) and tasks (i.e. DataFlow) in Datahub along with lineage to datasets that are being read from and written to. Read on to learn how to configure this for different Spark scenarios.

## Configuring Spark agent

The Spark agent can be configured using a config file or while creating a spark Session.

## Before you begin: Versions and Release Notes

Versioning of the jar artifact will follow the semantic versioning of the main [DataHub repo](https://github.com/datahub-project/datahub) and release notes will be available [here](https://github.com/datahub-project/datahub/releases).
Always check [the Maven central repository](https://search.maven.org/search?q=a:datahub-spark-lineage) for the latest released version.

### Configuration Instructions: spark-submit

When running jobs using spark-submit, the agent needs to be configured in the config file.

```text
#Configuring datahub spark agent jar
spark.jars.packages                          io.acryl:datahub-spark-lineage:0.8.23
spark.extraListeners                         datahub.spark.DatahubSparkListener
spark.datahub.rest.server                    http://localhost:8080
```

#### Configuration for Amazon EMR

Set the following spark-defaults configuration properties as it stated [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html)

```
spark.jars.packages        io.acryl:datahub-spark-lineage:0.8.23
spark.extraListeners                         datahub.spark.DatahubSparkListener
spark.datahub.rest.server                    https://your_datahub_host/gms
#If you have authentication set up then you also need to specify the Datahub access token
spark.datahub.rest.token                     yourtoken
```

### Configuration Instructions: Notebooks

When running interactive jobs from a notebook, the listener can be configured while building the Spark Session.

```python
spark = SparkSession.builder \
          .master("spark://spark-master:7077") \
          .appName("test-application") \
          .config("spark.jars.packages","io.acryl:datahub-spark-lineage:0.8.23") \
          .config("spark.extraListeners","datahub.spark.DatahubSparkListener") \
          .config("spark.datahub.rest.server", "http://localhost:8080") \
          .enableHiveSupport() \
          .getOrCreate()
```

### Configuration Instructions: Standalone Java Applications

The configuration for standalone Java apps is very similar.

```java
spark = SparkSession.builder()
        .appName("test-application")
        .config("spark.master", "spark://spark-master:7077")
        .config("spark.jars.packages","io.acryl:datahub-spark-lineage:0.8.23")
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .enableHiveSupport()
        .getOrCreate();
 ```

### Configuration details

| Field                                           | Required | Default | Description                                                             |
|-------------------------------------------------|----------|---------|-------------------------------------------------------------------------|
| spark.jars.packages                              | ✅        |         | Set with latest/required version  io.acryl:datahub-spark-lineage:0.8.23 |
| spark.extraListeners                             | ✅        |         | datahub.spark.DatahubSparkListener                                      |
| spark.datahub.rest.server                        | ✅        |         | Datahub server url  eg:<http://localhost:8080>                            |
| spark.datahub.rest.token                         |          |         | Authentication token.                         |
| spark.datahub.rest.disable_ssl_verification                 |          |  false  | Disable SSL certificate validation. Caution: Only use this if you know what you are doing!                         |
| spark.datahub.metadata.pipeline.platformInstance|          |         | Pipeline level platform instance                                        |
| spark.datahub.metadata.dataset.platformInstance|          |         | dataset level platform instance                                        |
| spark.datahub.metadata.dataset.env              |          | PROD    | [Supported values](https://datahubproject.io/docs/graphql/enums#fabrictype). In all other cases, will fallback to PROD           |
| spark.datahub.coalesce_jobs              |          |  false     |  Only one datajob(taask) will be emitted containing all input and output datasets for the spark application          |
| spark.datahub.parent.datajob_urn              |          |       | Specified dataset will be set as upstream dataset for datajob created. Effective only when spark.datahub.coalesce_jobs is set to true     |

## What to Expect: The Metadata Model

As of current writing, the Spark agent produces metadata related to the Spark job, tasks and lineage edges to datasets.

- A pipeline is created per Spark <master, appName>.
- A task is created per unique Spark query execution within an app.

### Custom properties & relating to Spark UI

The following custom properties in pipelines and tasks relate to the Spark UI:

- appName and appId in a pipeline can be used to determine the Spark application
- description and SQLQueryId in a task can be used to determine the Query Execution within the application on the SQL tab of Spark UI

Other custom properties of pipelines and tasks capture the start and end times of execution etc.
The query plan is captured in the *queryPlan* property of a task.

### Spark versions supported

The primary version tested is Spark/Scala version 2.4.8/2_11.
This library has also been tested to work with Spark versions(2.2.0 - 2.4.8) and Scala versions(2.10 - 2.12).
For the Spark 3.x series, this has been tested to work with Spark 3.1.2 and 3.2.0 with Scala 2.12. Other combinations are not guaranteed to work currently.
Support for other Spark versions is planned in the very near future.

### Environments tested with

This initial release has been tested with the following environments:

- spark-submit of Python/Java applications to local and remote servers
- Jupyter notebooks with pyspark code
- Standalone Java applications

Note that testing for other environments such as Databricks is planned in near future.

### Spark commands supported

Below is a list of Spark commands that are parsed currently:

- InsertIntoHadoopFsRelationCommand
- SaveIntoDataSourceCommand (jdbc)
- CreateHiveTableAsSelectCommand
- InsertIntoHiveTable

Effectively, these support data sources/sinks corresponding to Hive, HDFS and JDBC.

DataFrame.persist command is supported for below LeafExecNodes:

- FileSourceScanExec
- HiveTableScanExec
- RowDataSourceScanExec
- InMemoryTableScanExec

### Spark commands not yet supported

- View related commands
- Cache commands and implications on lineage
- RDD jobs

### Important notes on usage

- It is advisable to ensure appName is used appropriately to ensure you can trace lineage from a pipeline back to your source code.
- If multiple apps with the same appName run concurrently, dataset-lineage will be captured correctly but the custom-properties e.g. app-id, SQLQueryId would be unreliable. We expect this to be quite rare.
- If spark execution fails, then an empty pipeline would still get created, but it may not have any tasks.
- For HDFS sources, the folder (name) is regarded as the dataset (name) to align with typical storage of parquet/csv formats.

### Debugging

- Following info logs are generated

On Spark context startup

```text
YY/MM/DD HH:mm:ss INFO DatahubSparkListener: DatahubSparkListener initialised.
YY/MM/DD HH:mm:ss INFO SparkContext: Registered listener datahub.spark.DatahubSparkListener
```

On application start

```
YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application started: SparkListenerApplicationStart(AppName,Some(local-1644489736794),1644489735772,user,None,None)
YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: GMS url <rest.server>
YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: Token XXXXX
```

On pushing data to server

```
YY/MM/DD HH:mm:ss INFO McpEmitter: MetadataWriteResponse(success=true, responseContent={"value":"<URN>"}, underlyingResponse=HTTP/1.1 200 OK [Date: day, DD month year HH:mm:ss GMT, Content-Type: application/json, X-RestLi-Protocol-Version: 2.0.0, Content-Length: 97, Server: Jetty(9.4.46.v20220331)] [Content-Length: 97,Chunked: false])
```

On application end

```
YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application ended : AppName AppID
```

- To enable debugging logs, add below configuration in log4j.properties file

```
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
```

## Known limitations

- Only postgres supported for JDBC sources in this initial release. Support for other driver URL formats will be added in future.
- Behavior with cached datasets is not fully specified/defined in context of lineage.
- There is a possibility that very short-lived jobs that run within a few milliseconds may not be captured by the listener. This should not cause an issue for realistic Spark applications.
feat(cli): improve error reporting, make sink config optional (#4718) 2022-04-24 17:12:21 -07:00			`# Spark`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`To integrate Spark with DataHub, we provide a lightweight Java agent that listens for Spark application and job events and pushes metadata out to DataHub in real-time. The agent listens to events such application start/end, and SQLExecution start/end to create pipelines (i.e. DataJob) and tasks (i.e. DataFlow) in Datahub along with lineage to datasets that are being read from and written to. Read on to learn how to configure this for different Spark scenarios.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`## Configuring Spark agent`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`The Spark agent can be configured using a config file or while creating a spark Session.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`## Before you begin: Versions and Release Notes`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
fix: Replace old repository link with new link (#4446) 2022-03-18 22:12:19 +01:00			`Versioning of the jar artifact will follow the semantic versioning of the main [DataHub repo](https://github.com/datahub-project/datahub) and release notes will be available [here](https://github.com/datahub-project/datahub/releases).`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`Always check [the Maven central repository](https://search.maven.org/search?q=a:datahub-spark-lineage) for the latest released version.`

			`### Configuration Instructions: spark-submit`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			`When running jobs using spark-submit, the agent needs to be configured in the config file.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			```text
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`#Configuring datahub spark agent jar`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			`spark.jars.packages io.acryl:datahub-spark-lineage:0.8.23`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`spark.extraListeners datahub.spark.DatahubSparkListener`
feat(datahub-client): add Java REST emitter (#3781) 2022-01-02 22:48:38 +05:30			`spark.datahub.rest.server http://localhost:8080`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			```

docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			`#### Configuration for Amazon EMR`

			`Set the following spark-defaults configuration properties as it stated [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html)`

			```
			`spark.jars.packages io.acryl:datahub-spark-lineage:0.8.23`
			`spark.extraListeners datahub.spark.DatahubSparkListener`
			`spark.datahub.rest.server https://your_datahub_host/gms`
			`#If you have authentication set up then you also need to specify the Datahub access token`
			`spark.datahub.rest.token yourtoken`
			```

feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`### Configuration Instructions: Notebooks`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`When running interactive jobs from a notebook, the listener can be configured while building the Spark Session.`

			```python
			`spark = SparkSession.builder \`
			`.master("spark://spark-master:7077") \`
			`.appName("test-application") \`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`.config("spark.jars.packages","io.acryl:datahub-spark-lineage:0.8.23") \`
			`.config("spark.extraListeners","datahub.spark.DatahubSparkListener") \`
feat(datahub-client): add Java REST emitter (#3781) 2022-01-02 22:48:38 +05:30			`.config("spark.datahub.rest.server", "http://localhost:8080") \`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`.enableHiveSupport() \`
			`.getOrCreate()`
			```

feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`### Configuration Instructions: Standalone Java Applications`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
			`The configuration for standalone Java apps is very similar.`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00
			```java
			`spark = SparkSession.builder()`
			`.appName("test-application")`
			`.config("spark.master", "spark://spark-master:7077")`
			`.config("spark.jars.packages","io.acryl:datahub-spark-lineage:0.8.23")`
			`.config("spark.extraListeners", "datahub.spark.DatahubSparkListener")`
			`.config("spark.datahub.rest.server", "http://localhost:8080")`
			`.enableHiveSupport()`
			`.getOrCreate();`
			```
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30
feat(spark-lineage): add support for custom env and platform_instance (#4208) 2022-03-14 23:18:27 +05:30			`### Configuration details`

			`\| Field \| Required \| Default \| Description \|`
			`\|-------------------------------------------------\|----------\|---------\|-------------------------------------------------------------------------\|`
			`\| spark.jars.packages \| ✅ \| \| Set with latest/required version io.acryl:datahub-spark-lineage:0.8.23 \|`
			`\| spark.extraListeners \| ✅ \| \| datahub.spark.DatahubSparkListener \|`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			`\| spark.datahub.rest.server \| ✅ \| \| Datahub server url eg:<http://localhost:8080> \|`
feat(spark-lineage): add support for custom env and platform_instance (#4208) 2022-03-14 23:18:27 +05:30			`\| spark.datahub.rest.token \| \| \| Authentication token. \|`
feat(spark-lineage, java-emitter): Support ssl cert disable verification functionality (#5488) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2022-07-27 06:36:03 +05:30			`\| spark.datahub.rest.disable_ssl_verification \| \| false \| Disable SSL certificate validation. Caution: Only use this if you know what you are doing! \|`
feat(spark-lineage): add support for custom env and platform_instance (#4208) 2022-03-14 23:18:27 +05:30			`\| spark.datahub.metadata.pipeline.platformInstance\| \| \| Pipeline level platform instance \|`
			`\| spark.datahub.metadata.dataset.platformInstance\| \| \| dataset level platform instance \|`
			`\| spark.datahub.metadata.dataset.env \| \| PROD \| [Supported values](https://datahubproject.io/docs/graphql/enums#fabrictype). In all other cases, will fallback to PROD \|`
feat(spark-lineage): coalesce spark jobs (#5077) 2022-06-03 20:02:22 +05:30			`\| spark.datahub.coalesce_jobs \| \| false \| Only one datajob(taask) will be emitted containing all input and output datasets for the spark application \|`
			`\| spark.datahub.parent.datajob_urn \| \| \| Specified dataset will be set as upstream dataset for datajob created. Effective only when spark.datahub.coalesce_jobs is set to true \|`
feat(spark-lineage): add support for custom env and platform_instance (#4208) 2022-03-14 23:18:27 +05:30
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`## What to Expect: The Metadata Model`

			`As of current writing, the Spark agent produces metadata related to the Spark job, tasks and lineage edges to datasets.`

			`- A pipeline is created per Spark <master, appName>.`
			`- A task is created per unique Spark query execution within an app.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
			`### Custom properties & relating to Spark UI`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`The following custom properties in pipelines and tasks relate to the Spark UI:`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`- appName and appId in a pipeline can be used to determine the Spark application`
			`- description and SQLQueryId in a task can be used to determine the Query Execution within the application on the SQL tab of Spark UI`

docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00			`Other custom properties of pipelines and tasks capture the start and end times of execution etc.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`The query plan is captured in the queryPlan property of a task.`

			`### Spark versions supported`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`The primary version tested is Spark/Scala version 2.4.8/2_11.`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`This library has also been tested to work with Spark versions(2.2.0 - 2.4.8) and Scala versions(2.10 - 2.12).`
			`For the Spark 3.x series, this has been tested to work with Spark 3.1.2 and 3.2.0 with Scala 2.12. Other combinations are not guaranteed to work currently.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`Support for other Spark versions is planned in the very near future.`

			`### Environments tested with`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`This initial release has been tested with the following environments:`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`- spark-submit of Python/Java applications to local and remote servers`
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`- Jupyter notebooks with pyspark code`
			`- Standalone Java applications`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
feat(spark-lineage): simplified jars, config, auto publish to maven (#3924) 2022-01-20 00:48:09 -08:00			`Note that testing for other environments such as Databricks is planned in near future.`
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30
			`### Spark commands supported`
feat(spark-lineage): support for persist API (#4980) 2022-05-24 00:23:57 +05:30
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`Below is a list of Spark commands that are parsed currently:`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`- InsertIntoHadoopFsRelationCommand`
			`- SaveIntoDataSourceCommand (jdbc)`
			`- CreateHiveTableAsSelectCommand`
			`- InsertIntoHiveTable`

			`Effectively, these support data sources/sinks corresponding to Hive, HDFS and JDBC.`

feat(spark-lineage): support for persist API (#4980) 2022-05-24 00:23:57 +05:30			`DataFrame.persist command is supported for below LeafExecNodes:`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): support for persist API (#4980) 2022-05-24 00:23:57 +05:30			`- FileSourceScanExec`
			`- HiveTableScanExec`
			`- RowDataSourceScanExec`
			`- InMemoryTableScanExec`

feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`### Spark commands not yet supported`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`- View related commands`
			`- Cache commands and implications on lineage`
			`- RDD jobs`

			`### Important notes on usage`

			`- It is advisable to ensure appName is used appropriately to ensure you can trace lineage from a pipeline back to your source code.`
			`- If multiple apps with the same appName run concurrently, dataset-lineage will be captured correctly but the custom-properties e.g. app-id, SQLQueryId would be unreliable. We expect this to be quite rare.`
			`- If spark execution fails, then an empty pipeline would still get created, but it may not have any tasks.`
			`- For HDFS sources, the folder (name) is regarded as the dataset (name) to align with typical storage of parquet/csv formats.`

refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			`### Debugging`

			`- Following info logs are generated`

			`On Spark context startup`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
			```text
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			`YY/MM/DD HH:mm:ss INFO DatahubSparkListener: DatahubSparkListener initialised.`
			`YY/MM/DD HH:mm:ss INFO SparkContext: Registered listener datahub.spark.DatahubSparkListener`
			```
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			`On application start`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			```
			`YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application started: SparkListenerApplicationStart(AppName,Some(local-1644489736794),1644489735772,user,None,None)`
			`YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: GMS url <rest.server>`
			`YY/MM/DD HH:mm:ss INFO McpEmitter: REST Emitter Configuration: Token XXXXX`
			```
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			`On pushing data to server`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			```
chore(jetty): upgrade jetty to 9.4.46 for CVE (#4857) 2022-05-06 16:18:20 -05:00			`YY/MM/DD HH:mm:ss INFO McpEmitter: MetadataWriteResponse(success=true, responseContent={"value":"<URN>"}, underlyingResponse=HTTP/1.1 200 OK [Date: day, DD month year HH:mm:ss GMT, Content-Type: application/json, X-RestLi-Protocol-Version: 2.0.0, Content-Length: 97, Server: Jetty(9.4.46.v20220331)] [Content-Length: 97,Chunked: false])`
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			```
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			`On application end`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
refactor(spark-lineage): enhance logging and documentation (#4113) 2022-02-19 02:24:26 +05:30			```
			`YY/MM/DD HH:mm:ss INFO DatahubSparkListener: Application ended : AppName AppID`
			```

			`- To enable debugging logs, add below configuration in log4j.properties file`

			```
			`log4j.logger.datahub.spark=DEBUG`
			`log4j.logger.datahub.client.rest=DEBUG`
			```

feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`## Known limitations`
docs: spark-lineage - configuration details for Amazon EMR (#5459) 2022-07-21 17:57:26 +02:00
feat(spark-lineage): add ability to push data lineage from spark to d… (#3664) Co-authored-by: Shirshanka Das <shirshanka@apache.org> 2021-12-14 01:30:51 +05:30			`- Only postgres supported for JDBC sources in this initial release. Support for other driver URL formats will be added in future.`
			`- Behavior with cached datasets is not fully specified/defined in context of lineage.`
			`- There is a possibility that very short-lived jobs that run within a few milliseconds may not be captured by the listener. This should not cause an issue for realistic Spark applications.`