title | description | slug | collate |
---|---|---|---|
Spark Engine Prerequisites \| OpenMetadata Spark Profiling Setup | Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata. | /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites | true |
Spark Engine Prerequisites
Required Infrastructure
Spark Cluster
- A running Spark Connect server (Spark versions 3.5.2 through 3.5.6 are supported)
- Network access from the pipeline execution environment to the Spark Connect endpoint
- Network access from the pipeline execution environment to the OpenMetadata server
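The supported version range can be checked programmatically before a pipeline is configured. The following is a minimal sketch, not part of OpenMetadata: the helper name `is_supported_spark_version` and the two constants are hypothetical, and the bounds simply restate the 3.5.2 to 3.5.6 range given above.

```python
import re

# Bounds restate the supported range from this page (3.5.2 to 3.5.6, inclusive).
SUPPORTED_MIN = (3, 5, 2)
SUPPORTED_MAX = (3, 5, 6)


def is_supported_spark_version(version: str) -> bool:
    """Return True if a version string such as "3.5.4" falls in the supported range.

    Only the leading major.minor.patch numbers are compared, so vendor
    suffixes (for example "3.5.4-custom") are ignored.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        return False
    parsed = tuple(int(part) for part in match.groups())
    # Tuples compare element by element, which gives the usual version ordering.
    return SUPPORTED_MIN <= parsed <= SUPPORTED_MAX
```

With an open Spark Connect session, the string to check would typically come from `spark.version`.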
Database Drivers in Spark Cluster
Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:
- PostgreSQL: org.postgresql.Driver
- MySQL: com.mysql.cj.jdbc.Driver
{% note %} Choose a driver version that is compatible with both your Spark version and your database version. {% /note %}
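As an illustration of how the driver class follows from the source database, the sketch below picks the class name from the JDBC URL prefix. The function `driver_for_jdbc_url` and the `JDBC_DRIVERS` mapping are hypothetical helpers that cover only the two databases listed above; they are not an OpenMetadata API.

```python
# Maps the JDBC subprotocol (the part after "jdbc:") to the driver classes
# listed on this page. Other databases would need their own entries.
JDBC_DRIVERS = {
    "postgresql": "org.postgresql.Driver",
    "mysql": "com.mysql.cj.jdbc.Driver",
}


def driver_for_jdbc_url(jdbc_url: str) -> str:
    """Return the driver class name for a URL like "jdbc:postgresql://host/db"."""
    parts = jdbc_url.split(":", 2)
    if len(parts) < 3 or parts[0] != "jdbc":
        raise ValueError(f"Not a JDBC URL: {jdbc_url!r}")
    subprotocol = parts[1]
    if subprotocol not in JDBC_DRIVERS:
        raise ValueError(f"No driver listed for subprotocol {subprotocol!r}")
    return JDBC_DRIVERS[subprotocol]
```

For example, `driver_for_jdbc_url("jdbc:mysql://db-host:3306/sales")` returns `com.mysql.cj.jdbc.Driver`.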
Network Connectivity
Connectivity is required along three paths:
- From the pipeline execution environment to your Spark Connect endpoint (outbound, typically port 15002)
- From the pipeline execution environment to your OpenMetadata server (outbound, typically port 8585)
- From the Spark workers to your source database (the database must accept inbound connections from the workers)
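The two checks that originate in the pipeline execution environment can be scripted with the Python standard library, which helps where tools such as telnet are not installed. This is a minimal sketch: `can_connect` and `check_endpoints` are hypothetical helper names, and a successful TCP connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(endpoints: dict) -> dict:
    """Check several named (host, port) pairs and return name -> reachable."""
    return {name: can_connect(host, port) for name, (host, port) in endpoints.items()}
```

A call such as `check_endpoints({"spark_connect": ("your_spark_connect_host", 15002), "openmetadata": ("your_openmetadata_host", 8585)})` returns a dictionary of booleans. The host names are placeholders and the ports are the typical defaults mentioned above.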
Verification Steps
- Test Spark Connect: Verify connectivity from your pipeline environment to Spark Connect
- Test OpenMetadata: Ensure your pipeline environment can reach the OpenMetadata API
- Test Database: Confirm Spark workers can connect to your source database
- Verify Drivers: Check that the appropriate database driver is available in your Spark cluster
Example Verification Commands
Test Spark Connect Connectivity
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002
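# Or using curl if available. Spark Connect speaks gRPC, so curl will not
# receive a normal HTTP response; anything other than "connection refused"
# or a timeout indicates that the port is reachable.
curl -v http://your_spark_connect_host:15002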
Test OpenMetadata Connectivity
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
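The same request can be made from Python when the check needs to run inside the pipeline environment itself. The sketch below uses only the standard library and the `/api/v1/version` path from the curl command above. The function name `fetch_openmetadata_version` is hypothetical, and it assumes the endpoint returns a JSON object, which should be confirmed against your OpenMetadata version.

```python
import json
import urllib.request


def fetch_openmetadata_version(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/api/v1/version and return the decoded JSON body.

    Raises urllib.error.URLError (or a subclass) if the server cannot be
    reached, and ValueError if the response body is not valid JSON.
    """
    url = base_url.rstrip("/") + "/api/v1/version"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_openmetadata_version("http://your_openmetadata_host:8585")` should return the server's version information if the API is reachable.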
Test Database Connectivity from Spark
# Test database connectivity using Spark
from pyspark.sql import SparkSession

# Spark Connect URLs use the sc:// scheme
spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

# Fetch one row to confirm the table can be read
df.head()
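The "Verify Drivers" step can be approximated by running a trivial query with the driver class named explicitly, so that a missing driver tends to surface as a class-loading error in the failure message. This is a sketch under assumptions: `check_jdbc_driver` is a hypothetical helper, `spark` is an existing Spark Connect session, and `driver` and `query` are standard options of Spark's JDBC data source.

```python
def check_jdbc_driver(spark, jdbc_url, driver_class, user, password):
    """Run a one-row query through Spark's JDBC source with an explicit driver.

    Returns (True, "") on success or (False, error_message) on failure.
    A failure can also come from bad credentials or a network problem, so
    inspect the message (for example for "ClassNotFoundException").
    """
    reader = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("driver", driver_class)
        # A constant query avoids depending on any particular table.
        .option("query", "SELECT 1 AS ok")
        .option("user", user)
        .option("password", password)
    )
    try:
        reader.load().head()
        return True, ""
    except Exception as exc:  # Spark surfaces JVM errors as Python exceptions
        return False, str(exc)
```

For PostgreSQL, the call might look like `check_jdbc_driver(spark, "jdbc:your_database_url", "org.postgresql.Driver", "your_username", "your_password")`.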
{% inlineCalloutContainer %} {% inlineCallout color="violet-70" bold="Partitioning Requirements" icon="MdOutlineSchema" href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %} Learn about the partitioning requirements for Spark Engine. {% /inlineCallout %} {% inlineCallout icon="MdAnalytics" bold="Configuration" href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %} Configure your profiler pipeline to use Spark Engine. {% /inlineCallout %} {% /inlineCalloutContainer %}