---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---

# Spark Engine Prerequisites

## Required Infrastructure

### Spark Cluster

- **Spark Connect available** (versions 3.5.2 to 3.5.6 are supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server

### Database Drivers in Spark Cluster

Depending on your source database, ensure the appropriate JDBC driver is installed in your Spark cluster:

- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`

{% note %}
The driver version should match your Spark version and database version for optimal compatibility.
{% /note %}
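How the driver gets onto the cluster depends on your deployment. As a sketch of two common options (the jar names, versions, and Maven coordinates below are illustrative, not requirements), you can either copy the jar into Spark's `jars/` directory on every node or pass `--packages` when starting the Spark Connect server so the driver is resolved from Maven:

```bash
# Option 1: copy the driver jar onto every Spark node (version is illustrative)
cp postgresql-42.7.3.jar "$SPARK_HOME/jars/"

# Option 2: resolve the driver from Maven when starting the Spark Connect server
# (coordinates and versions are examples; pin ones matching your environment)
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.2,org.postgresql:postgresql:42.7.3
```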
## Network Connectivity

The pipeline execution environment must have:

- **Outbound access** to your Spark Connect endpoint (typically port 15002)
- **Outbound access** to your OpenMetadata server (typically port 8585)
- **Connectivity** from the Spark workers to your source database (the database must accept inbound connections from the workers)

## Verification Steps

1. **Test Spark Connect**: Verify connectivity from your pipeline environment to the Spark Connect endpoint
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster

## Example Verification Commands

### Test Spark Connect Connectivity

```bash
# Check basic TCP reachability of the Spark Connect endpoint
telnet your_spark_connect_host 15002

# Or, with netcat
nc -zv your_spark_connect_host 15002
```

Note that Spark Connect speaks gRPC, so a plain HTTP request (for example with `curl`) will not return a meaningful response; the checks above only confirm that the port is reachable.
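For an end-to-end check that goes beyond raw reachability, here is a minimal sketch using the PySpark Spark Connect client. It assumes the client is installed in the pipeline environment (for example via `pip install "pyspark[connect]"`) and that the host and port are placeholders for your own endpoint:

```python
# Open a remote Spark Connect session and run a trivial query.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# A successful collect() confirms both the gRPC channel and the server work.
print(spark.range(5).collect())

spark.stop()
```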
### Test OpenMetadata Connectivity

```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```
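The same check can be run from Python, which is handy when the pipeline environment has no `curl`. This sketch assumes the `requests` library is installed; the host and port are placeholders:

```python
# Hit the same version endpoint from Python.
import requests

response = requests.get(
    "http://your_openmetadata_host:8585/api/v1/version", timeout=10
)
response.raise_for_status()  # raises if OpenMetadata is unreachable or returns an error
print(response.text)
```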
### Test Database Connectivity from Spark

```python
# Test connectivity from the Spark cluster to your source database.
# Replace the placeholders with your own JDBC URL, table, and credentials.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# The JDBC driver must already be installed on the Spark cluster
# (see Database Drivers in Spark Cluster above).
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

# If this returns a row, the Spark workers can reach the database.
print(df.head())
```
{% inlineCalloutContainer %}
{% inlineCallout color="violet-70" bold="Partitioning Requirements" icon="MdOutlineSchema" href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Learn about the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout icon="MdAnalytics" bold="Configuration" href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}