OpenMetadata/openmetadata-docs/content/v1.9.x-SNAPSHOT/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites.md

---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---

# Spark Engine Prerequisites

## Required Infrastructure

### Spark Cluster

- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server

### Database Drivers in Spark Cluster

Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:

- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`

{% note %}
The specific driver versions should match your Spark version and database version for optimal compatibility.
{% /note %}

## Network Connectivity

The pipeline execution environment must have:

- **Outbound access** to your Spark Connect endpoint (typically port 15002)
- **Outbound access** to your OpenMetadata server (typically port 8585)
- **Inbound access** from Spark workers to your source database

## Verification Steps

1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster

## Example Verification Commands

### Test Spark Connect Connectivity

```bash
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002

# Or using curl if available
curl -X GET http://your_spark_connect_host:15002
```

### Test OpenMetadata Connectivity

```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```

### Test Database Connectivity from Spark

```python
# Test database connectivity using Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

df.head()
```

{% inlineCalloutContainer %}
 {% inlineCallout
  color="violet-70"
  bold="Partitioning Requirements"
  icon="MdOutlineSchema"
  href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
  Learn about the partitioning requirements for Spark Engine.
 {% /inlineCallout %}
 {% inlineCallout
    icon="MdAnalytics"
    bold="Configuration"
    href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
    Configure your profiler pipeline to use Spark Engine.
 {% /inlineCallout %}
{% /inlineCalloutContainer %}
Add Spark Engine docs (#22739) * Add Spark Engine docs * Update structure 2025-08-04 18:28:46 +02:00			`---`
			`title: Spark Engine Prerequisites \| OpenMetadata Spark Profiling Setup`
			`description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.`
			`slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites`
			`collate: true`
			`---`

			`# Spark Engine Prerequisites`

			`## Required Infrastructure`

			`### Spark Cluster`

			`- Spark Connect available (versions 3.5.2 to 3.5.6 supported)`
			`- Network access from the pipeline execution environment to the Spark Connect endpoint`
			`- Network access from the pipeline execution environment to the OpenMetadata server`

			`### Database Drivers in Spark Cluster`

			`Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:`

			- PostgreSQL: `org.postgresql.Driver`
			- MySQL: `com.mysql.cj.jdbc.Driver`

			`{% note %}`
			`The specific driver versions should match your Spark version and database version for optimal compatibility.`
			`{% /note %}`

			`## Network Connectivity`

			`The pipeline execution environment must have:`

			`- Outbound access to your Spark Connect endpoint (typically port 15002)`
			`- Outbound access to your OpenMetadata server (typically port 8585)`
			`- Inbound access from Spark workers to your source database`

			`## Verification Steps`

			`1. Test Spark Connect: Verify connectivity from your pipeline environment to Spark Connect`
			`2. Test OpenMetadata: Ensure your pipeline environment can reach the OpenMetadata API`
			`3. Test Database: Confirm Spark workers can connect to your source database`
			`4. Verify Drivers: Check that the appropriate database driver is available in your Spark cluster`

			`## Example Verification Commands`

			`### Test Spark Connect Connectivity`

			```bash
			`# Test basic connectivity to Spark Connect`
			`telnet your_spark_connect_host 15002`

			`# Or using curl if available`
			`curl -X GET http://your_spark_connect_host:15002`
			```

			`### Test OpenMetadata Connectivity`

			```bash
			`# Test OpenMetadata API connectivity`
			`curl -X GET http://your_openmetadata_host:8585/api/v1/version`
			```

			`### Test Database Connectivity from Spark`

			```python
			`# Test database connectivity using Spark`
			`from pyspark.sql import SparkSession`

			`spark = SparkSession.builder \`
			`.appName("DatabaseConnectivityTest") \`
			`.remote("<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \`
			`.config("spark.jars", "/path/to/your/database/driver.jar") \`
			`.getOrCreate()`

			`# Test connection to your database`
			`df = spark.read \`
			`.format("jdbc") \`
			`.option("url", "jdbc:your_database_url") \`
			`.option("dbtable", "your_test_table") \`
			`.option("user", "your_username") \`
			`.option("password", "your_password") \`
			`.load()`

			`df.head()`
			```

			`{% inlineCalloutContainer %}`
			`{% inlineCallout`
			`color="violet-70"`
			`bold="Partitioning Requirements"`
			`icon="MdOutlineSchema"`
			`href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}`
			`Learn about the partitioning requirements for Spark Engine.`
			`{% /inlineCallout %}`
			`{% inlineCallout`
			`icon="MdAnalytics"`
			`bold="Configuration"`
			`href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}`
			`Configure your profiler pipeline to use Spark Engine.`
			`{% /inlineCallout %}`
			`{% /inlineCalloutContainer %}`