A Spark job may involve the movement or transfer of data, which results in data lineage. To capture such lineage, you can use the `OpenMetadata Spark Agent`: configure it with your Spark session and it will publish the lineage to your OpenMetadata instance.
In this guide we will explain how to use the `OpenMetadata Spark Agent` to capture such lineage.
## Requirements
To use the `OpenMetadata Spark Agent`, you will need to download the latest jar from [here](https://github.com/open-metadata/openmetadata-spark-agent/releases).
We support Spark versions 3.1 and above.
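As a quick sanity check, the sketch below (the jar path is a placeholder) starts a session with the downloaded jar on its classpath and prints the Spark version, which should be 3.1 or above:

```python
from pyspark.sql import SparkSession

# Placeholder path: point this at the jar downloaded from the releases page.
spark = (
    SparkSession.builder
    .appName("openmetadata-agent-requirements-check")
    .config("spark.jars", "/path/to/openmetadata-spark-agent.jar")
    .getOrCreate()
)

# The agent supports Spark 3.1 and above.
print("Spark version:", spark.version)

spark.stop()
```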
## Configuration
In this guide we will use PySpark to demonstrate how to configure the Spark session with the `OpenMetadata Spark Agent`.
{% codePreview %}
{% codeInfoContainer %}
{% codeInfo srNumber=1 %}
Once you have downloaded the jar from [here](https://github.com/open-metadata/openmetadata-spark-agent/releases), add the path to `openmetadata-spark-agent.jar` to your Spark configuration, along with any other jars required to run your Spark job; in this example that is `mysql-connector-java.jar`.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
`openmetadata-spark-agent.jar` comes with a custom Spark listener, `org.openmetadata.spark.agent.OpenMetadataSparkListener`, which you will need to register via the `spark.extraListeners` Spark configuration.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
`spark.openmetadata.transport.hostPort`: Specify the host and port of the instance where OpenMetadata is hosted.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
`spark.openmetadata.transport.type` is a required configuration and must be set to `openmetadata`.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
`spark.openmetadata.transport.jwtToken`: Specify your OpenMetadata JWT token here. Check out [this](/deployment/security/enable-jwt-tokens#generate-token) documentation to learn how to generate a JWT token in OpenMetadata.
{% /codeInfo %}
{% codeInfo srNumber=6 %}
`spark.openmetadata.transport.pipelineServiceName`: This Spark job will create a new pipeline service of type `Spark`; use this configuration to customize the pipeline service name.
Note: If a pipeline service with the specified name already exists, it will be reused and updated.
{% /codeInfo %}
{% codeInfo srNumber=7 %}
`spark.openmetadata.transport.pipelineName`: This Spark job will also create a new pipeline within the pipeline service defined above. Use this configuration to customize the name of the pipeline.
Note: If a pipeline with the specified name already exists, it will be reused and updated.
{% /codeInfo %}
{% codeInfo srNumber=8 %}
`spark.openmetadata.transport.pipelineSourceUrl`: You can use this configuration to provide additional context to your pipeline by specifying a URL related to it.
{% /codeInfo %}
{% codeInfo srNumber=9 %}
`spark.openmetadata.transport.pipelineDescription`: Provide the pipeline description using this Spark configuration.
{% /codeInfo %}
{% codeInfo srNumber=10 %}
`spark.openmetadata.transport.databaseServiceNames`: Provide a comma-separated list of the database service names that contain the source tables used in this job. If you do not provide this configuration, all database services available in OpenMetadata will be searched.
{% /codeInfo %}
{% codeInfo srNumber=11 %}
`spark.openmetadata.transport.timeout`: Provide the timeout for communicating with the OpenMetadata APIs.
{% /codeInfo %}
{% codeInfo srNumber=12 %}
In this job we read data from the `employee` table and write it into another table, `employee_new`, within the same MySQL source.
Once this PySpark job finishes, you will see a new pipeline service named `my_pipeline_service` in your OpenMetadata instance. It will contain a pipeline named `my_pipeline`, as per the above example, and you should also see lineage between the `employee` and `employee_new` tables via `my_pipeline`.
{% /codeInfo %}

{% /codeInfoContainer %}

{% /codePreview %}
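Putting the configurations above together, here is a minimal PySpark sketch of such a job. The jar paths, `hostPort`, JWT token, service names, and MySQL connection details are placeholders for illustration and should be adjusted to your environment.

```python
from pyspark.sql import SparkSession

# Minimal sketch of the job described above. All paths, hosts, tokens,
# and credentials below are placeholders.
spark = (
    SparkSession.builder
    .appName("employee_lineage_job")
    .config(
        "spark.jars",
        "path/to/openmetadata-spark-agent.jar,path/to/mysql-connector-java.jar",
    )
    .config("spark.extraListeners", "org.openmetadata.spark.agent.OpenMetadataSparkListener")
    .config("spark.openmetadata.transport.hostPort", "http://localhost:8585")
    .config("spark.openmetadata.transport.type", "openmetadata")
    .config("spark.openmetadata.transport.jwtToken", "<jwt-token>")
    .config("spark.openmetadata.transport.pipelineServiceName", "my_pipeline_service")
    .config("spark.openmetadata.transport.pipelineName", "my_pipeline")
    .config("spark.openmetadata.transport.pipelineSourceUrl", "http://your-spark-host/jobs/my_pipeline")
    .config("spark.openmetadata.transport.pipelineDescription", "My ETL Pipeline")
    .config("spark.openmetadata.transport.databaseServiceNames", "local_mysql")
    .config("spark.openmetadata.transport.timeout", "30")
    .getOrCreate()
)

# Read the source table from MySQL.
employee_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/my_database")
    .option("dbtable", "employee")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Write the data into the target table in the same MySQL source; the
# listener reports the resulting read/write lineage to OpenMetadata.
(
    employee_df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/my_database")
    .option("dbtable", "employee_new")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("overwrite")
    .save()
)

spark.stop()
```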
## Using the OpenMetadata Spark Agent with Databricks

Follow the steps below to use the `OpenMetadata Spark Agent` with Databricks.
### 1. Upload the jar to compute cluster
To use the `OpenMetadata Spark Agent`, you will have to download the latest jar from [here](https://github.com/open-metadata/openmetadata-spark-agent/releases) and upload it to your Databricks compute cluster.
To upload the jar, visit the compute details page and then go to the libraries tab.
Once you have created an initialization script, you will need to attach it to your compute instance. To do that, go to advanced config > init scripts and add your script path.