| spark.jars.packages | ✅ | | Set to the latest/required version: io.acryl:acryl-spark-lineage_2.12:0.2.18 (or io.acryl:acryl-spark-lineage_2.13:0.2.18 for Scala 2.13) |
| spark.datahub.emitter | | rest | Specifies how metadata is emitted. By default it is sent to DataHub using the REST emitter. Valid options are rest, kafka or file (see the example configuration after this table) |
| spark.datahub.rest.disable_ssl_verification | | false | Disable SSL certificate validation. Caution: Only use this if you know what you are doing! |
| spark.datahub.rest.disable_chunked_encoding | | false | Disable chunked transfer encoding. In some environments chunked encoding causes issues; this option lets you disable it. |
| spark.datahub.rest.max_retries | | 0 | Number of times a failed request is retried |
| spark.datahub.rest.retry_interval | | 10 | Number of seconds to wait between retries |
| spark.datahub.file.filename | | | The file where metadata will be written if the file emitter is set |
| spark.datahub.kafka.bootstrap | | | The Kafka bootstrap server URL to use if the Kafka emitter is set |
| spark.datahub.kafka.schema_registry_url | | | The Schema Registry URL to use if the Kafka emitter is set |
| spark.datahub.kafka.schema_registry_config. | | | Additional config to pass in to the Schema Registry Client |
| spark.datahub.kafka.producer_config. | | | Additional config to pass in to the Kafka producer. For example: `--conf "spark.datahub.kafka.producer_config.client.id=my_client_id"` |
| spark.datahub.metadata.dataset.platformInstance | | | Dataset level platform instance (useful when you need to match dataset URNs with those created by other ingestion sources) |
| spark.datahub.metadata.dataset.env | | PROD | [Supported values](https://docs.datahub.com/docs/graphql/enums#fabrictype). In all other cases, it will fall back to PROD |
| spark.datahub.metadata.dataset.hivePlatformAlias | | hive | By default, DataHub assigns Hive-like tables to the Hive platform. If you are using Glue as your Hive metastore, set this config flag to `glue` |
| spark.datahub.metadata.include_scheme | | true | Include scheme from the path URI (e.g. hdfs://, s3://) in the dataset URN. We recommend setting this value to false, but it is set to true for backwards compatibility with previous versions |
| spark.datahub.metadata.remove_partition_pattern | | | Remove the partition pattern (e.g. /partition=\d+) from the dataset path, e.g. database/table/partition=123 becomes database/table |
| spark.datahub.coalesce_jobs | | true | Only one DataJob (task) will be emitted containing all input and output datasets for the Spark application |
| spark.datahub.parent.datajob_urn | | | The specified dataset will be set as an upstream dataset of the created DataJob. Effective only when spark.datahub.coalesce_jobs is set to true |
| spark.datahub.platform.s3.path_spec_list | | | List of path specs per platform |
| spark.datahub.metadata.dataset.include_schema_metadata | | false | Emit dataset schema metadata based on the Spark execution. Since this is less reliable, it is recommended to get schema information from platform-specific DataHub sources instead |
| spark.datahub.flow_name | | | If set, it will be used as the DataFlow name; otherwise, it uses the Spark app name as flow_name |
| spark.datahub.file_partition_regexp | | | Strip partition part from the path if the path end matches the specified regexp. Example: `year=.*/month=.*/day=.*` |
| spark.datahub.tags | | | Comma-separated list of tags to attach to the DataFlow |
| spark.datahub.domains | | | Comma-separated list of domain URNs to attach to the DataFlow |
| spark.datahub.stage_metadata_coalescing | | false | Normally metadata is coalesced and sent at the onApplicationEnd event, which is never called on Databricks or on Glue. Enable this on Databricks if you want coalesced runs. |
| spark.datahub.patch.enabled | | false | Set this to true to send lineage as a patch, which appends rather than overwrites existing Dataset lineage edges. By default, it is disabled. |
| spark.datahub.metadata.dataset.lowerCaseUrns | | false | Set this to true to lowercase dataset URNs. By default, it is disabled. |
| spark.datahub.disableSymlinkResolution | | false | Set this to true if you prefer using the S3 location instead of the Hive table. By default, it is disabled. |
| spark.datahub.s3.bucket | | | The name of the bucket where metadata will be written if the S3 emitter is set |
| spark.datahub.s3.prefix | | | The prefix for the file where metadata will be written on S3 if the S3 emitter is set |
| spark.datahub.s3.filename | | | The name of the file where metadata will be written on S3 if the S3 emitter is set. If not set, a random filename will be used |
| spark.datahub.log.mcps | | true | Set this to true to log MCPs (MetadataChangeProposals) to the application log. By default, it is enabled. |
| spark.datahub.legacyLineageCleanup.enabled | | false | Set this to true to remove legacy lineage emitted by older Spark plugin runs. The lineage edges that the plugin now attaches to the DataJob are removed from the Datasets. By default, it is disabled. |
| spark.datahub.capture_spark_plan | | false | Set this to true to capture the Spark plan. By default, it is disabled. |
| spark.datahub.metadata.dataset.enableEnhancedMergeIntoExtraction | | false | Set this to true to enable enhanced table name extraction for Delta Lake MERGE INTO commands. This improves lineage tracking by including the target table name in the job name. By default, it is disabled. |
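To make the table above more concrete, here is a minimal sketch of wiring some of these options into a PySpark session. All option values are placeholders, and the REST endpoint setting (`spark.datahub.rest.server`) and the listener class registration (`spark.extraListeners`) are assumptions not listed in the table above; verify them against the full plugin documentation and your environment.

```python
from pyspark.sql import SparkSession

# Sketch only: endpoint, listener class and option values are placeholders.
spark = (
    SparkSession.builder
    .appName("datahub-lineage-example")
    # Pull the agent jar; pick the artifact matching your Scala version.
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage_2.12:0.2.18")
    # Register the lineage listener (assumed class name, as in the upstream examples).
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Emit over REST (the default); "kafka" and "file" are the other valid options.
    .config("spark.datahub.emitter", "rest")
    # Assumed REST endpoint option; replace with your DataHub GMS URL.
    .config("spark.datahub.rest.server", "http://localhost:8080")
    # Dataset URN shaping.
    .config("spark.datahub.metadata.dataset.env", "PROD")
    .config("spark.datahub.metadata.include_scheme", "false")
    # Coalesce all inputs/outputs into a single DataJob and append (patch) lineage.
    .config("spark.datahub.coalesce_jobs", "true")
    .config("spark.datahub.patch.enabled", "true")
    .getOrCreate()
)
```

The same settings can equally be passed as `--conf` flags to spark-submit, as in the `producer_config` example in the table.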
The Spark agent captures fine-grained lineage information, including column-level lineage with transformation types. When available, OpenLineage's [transformation types](https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/#transformation-type) are captured and mapped to DataHub's FinegrainedLineage `TransformOption`, providing detailed insights into how data transformations occur at the column level.
- **Column-level Lineage Enhancement**: OpenLineage's transformation types are now captured and mapped to DataHub's FinegrainedLineage `TransformOption` as per the [OpenLineage column lineage specification](https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/#transformation-type)
- **Dependency Cleanup**: Removed logback dependency to reduce potential conflicts with user applications
- Support for FileStreamMicroBatchStream and foreachBatch in Spark Structured Streaming
- MERGE INTO operations now capture both dataset-level AND column-level lineage
- Fine-grained lineage is now emitted on the DataJob and not on the emitted Datasets. This is the correct behaviour, which earlier versions did not follow. As a consequence, fine-grained lineage emitted by earlier versions will not be overwritten by the new lineage.
  You can remove the old lineage by setting `spark.datahub.legacyLineageCleanup.enabled=true`. Make sure you run an up-to-date server if you enable this together with patch support (this was introduced in 0.2.17-rc5).
- Add option to disable chunked encoding in the DataHub REST sink -> `spark.datahub.rest.disable_chunked_encoding`
- Add option to specify the MCP Kafka topic for the DataHub Kafka sink -> `spark.datahub.kafka.mcp_topic`
- Add option to remove legacy lineage from older Spark plugin runs. This removes from the Datasets the lineage edges that the plugin now adds to the DataJob -> `spark.datahub.legacyLineageCleanup.enabled`
- Add option to set platform instance and/or env per platform with the `spark.datahub.platform.<platform_name>.env` and `spark.datahub.platform.<platform_name>.platform_instance` config parameters (see the sketch after this list)
- Fix platform instance setting for datasets when `spark.datahub.metadata.dataset.platformInstance` is set
- Fix column-level lineage support when patch is enabled
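As a rough sketch of how the options called out in these notes fit together, the snippet below sets them on a session builder. All values (the topic name, the `s3` platform key, the env and instance strings) are illustrative placeholders, and REST- or Kafka-specific options only take effect with the corresponding emitter.

```python
from pyspark.sql import SparkSession

# Sketch of the options referenced in the notes above; all values are placeholders.
spark = (
    SparkSession.builder
    .appName("datahub-lineage-new-options")
    # REST emitter only: work around environments where chunked encoding causes issues.
    .config("spark.datahub.rest.disable_chunked_encoding", "true")
    # Kafka emitter only: custom topic for emitted MCPs.
    .config("spark.datahub.kafka.mcp_topic", "MetadataChangeProposal_v1")
    # Remove dataset-level lineage left behind by older plugin versions.
    .config("spark.datahub.legacyLineageCleanup.enabled", "true")
    # Per-platform env / platform instance (here "s3" stands in for <platform_name>).
    .config("spark.datahub.platform.s3.env", "DEV")
    .config("spark.datahub.platform.s3.platform_instance", "my_instance")
    .getOrCreate()
)
```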