The Spark lineage emitter is a Java library that provides a Spark listener implementation, "DatahubLineageEmitter". The DatahubLineageEmitter listens to events such as application start/end and SQLExecution start/end to create pipelines (i.e. DataFlow) and tasks (i.e. DataJob) in DataHub, along with lineage.
## Configuring Spark emitter
Listener configuration can be done using a config file or while creating a SparkSession.
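As a minimal sketch of the in-code route, the listener can be attached when building the session. The listener's fully qualified class name and the `spark.datahub.rest.server` property below are assumptions for illustration; confirm the exact values documented for your release:

```java
import org.apache.spark.sql.SparkSession;

public class LineageEnabledApp {
  public static void main(String[] args) {
    // NOTE: the listener class path and the DataHub REST property are assumed
    // here for illustration; use the values documented for your release.
    SparkSession spark = SparkSession.builder()
        .appName("my-pipeline")  // appName becomes the pipeline name in DataHub
        .config("spark.extraListeners",
            "com.linkedin.datahub.lineage.spark.interceptor.DatahubLineageEmitter")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .enableHiveSupport()
        .getOrCreate();

    // ... run the job ...

    spark.stop();
  }
}
```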
### Config file for spark-submit
When running jobs using spark-submit, the listener should be configured in the Spark config file.
In this version, basic dataset-level lineage is captured using the model mapping described earlier.
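A sketch of the relevant entries is shown below. `spark.jars.packages` and `spark.extraListeners` are standard Spark properties; the artifact coordinates, the listener's package prefix and the `spark.datahub.rest.server` key are illustrative and should be confirmed against the release you are using.

```
# spark-defaults.conf (each entry can also be passed with --conf on spark-submit)
spark.jars.packages        io.acryl:datahub-spark-lineage:<version>
spark.extraListeners       com.linkedin.datahub.lineage.spark.interceptor.DatahubLineageEmitter
spark.datahub.rest.server  http://localhost:8080
```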
### Spark versions supported
The primary version tested is Spark 2.4.8 with Scala 2.11.
We anticipate this to work well with other Spark 2.4.x versions on Scala 2.11.
Support for other Spark versions is planned for the near future.
### Environments tested with
This initial release has been tested with the following environments:
- spark-submit of Python/Java applications to local and remote servers
- notebooks
Note that testing for other environments, such as Databricks and standalone applications, is planned in the near future.
### Spark commands supported
Below is the list of Spark commands that are currently parsed:
- InsertIntoHadoopFsRelationCommand
- SaveIntoDataSourceCommand (jdbc)
- CreateHiveTableAsSelectCommand
- InsertIntoHiveTable
Effectively, these support data sources/sinks corresponding to Hive, HDFS and JDBC.
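For illustration, write paths like the following typically compile down to the commands above (a hedged sketch; the table names, output paths and JDBC URL are made up):

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class LineageWriteExamples {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("lineage-write-examples")
        .enableHiveSupport()
        .getOrCreate();

    Dataset<Row> df = spark.sql("SELECT * FROM source_db.orders");

    // Typically planned as InsertIntoHadoopFsRelationCommand: file output on HDFS
    df.write().mode(SaveMode.Overwrite).parquet("hdfs:///data/orders_out");

    // Typically planned as CreateHiveTableAsSelectCommand: Hive CTAS
    spark.sql("CREATE TABLE target_db.orders_copy STORED AS PARQUET AS "
        + "SELECT * FROM source_db.orders");

    // Typically planned as InsertIntoHiveTable: insert into an existing Hive table
    df.write().insertInto("target_db.orders_copy");

    // Typically planned as SaveIntoDataSourceCommand (jdbc): write to Postgres
    Properties props = new Properties();
    props.setProperty("user", "postgres");
    df.write().mode(SaveMode.Append)
        .jdbc("jdbc:postgresql://localhost:5432/analytics", "public.orders_copy", props);

    spark.stop();
  }
}
```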
### Spark commands not yet supported
- View related commands
- Cache commands and implications on lineage
- RDD jobs
### Important notes on usage
- Set appName to something meaningful so that lineage captured for a pipeline can be traced back to your source code.
- If multiple apps with the same appName run concurrently, dataset lineage will still be captured correctly, but custom properties such as app-id and SQLQueryId may be unreliable. We expect this to be quite rare.
- If Spark execution fails, an empty pipeline is still created, but it may not have any tasks.
- For HDFS sources, the folder (name) is regarded as the dataset (name) to align with the typical storage of parquet/csv formats; e.g. parquet part files written under a folder such as hdfs:///data/customers are captured as a single dataset for the customers folder.
## Known limitations
- Only Postgres is supported for JDBC sources in this initial release. Support for other driver URL formats will be added in the future.
- Behavior with cached datasets is not fully specified/defined in the context of lineage.
- Very short-lived jobs that complete within a few milliseconds may not be captured by the listener. This should not be an issue for realistic Spark applications.