---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---
# Spark Engine Prerequisites
## Required Infrastructure
### Spark Cluster
- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server
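The supported version range above can be checked programmatically. The following is a minimal sketch, assuming you already have the cluster's version string (for example from `spark.version` on a connected session); the helper name `is_supported_spark_version` is hypothetical and not part of OpenMetadata.

```python
import re

# Supported range as stated on this page (inclusive on both ends).
SUPPORTED_MIN = (3, 5, 2)
SUPPORTED_MAX = (3, 5, 6)


def is_supported_spark_version(version: str) -> bool:
    """Return True if the version string falls within 3.5.2 to 3.5.6."""
    # Read the leading major.minor.patch and ignore any vendor suffix.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    parsed = tuple(int(group) for group in match.groups())
    return SUPPORTED_MIN <= parsed <= SUPPORTED_MAX
```

Calling `is_supported_spark_version("3.5.4")` returns `True`, while `"3.5.1"` and `"3.4.6"` return `False`.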
### Database Drivers in Spark Cluster
Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:
- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`
{% note %}
Choose a driver version that is compatible with both your Spark version and your database version.
{% /note %}
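To make the driver mapping concrete, here is a small sketch that picks the driver class from a JDBC URL. It covers only the two databases listed above; the names `JDBC_DRIVERS` and `driver_for_url` are hypothetical helpers for illustration.

```python
from typing import Optional

# Driver classes listed on this page, keyed by JDBC URL prefix.
JDBC_DRIVERS = {
    "jdbc:postgresql:": "org.postgresql.Driver",
    "jdbc:mysql:": "com.mysql.cj.jdbc.Driver",
}


def driver_for_url(jdbc_url: str) -> Optional[str]:
    """Return the driver class for a JDBC URL, or None if it is not listed."""
    for prefix, driver_class in JDBC_DRIVERS.items():
        if jdbc_url.startswith(prefix):
            return driver_class
    return None
```

The returned class name can be passed to Spark's JDBC reader through its `driver` option, which fails fast if the class is not available on the cluster.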
## Network Connectivity
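The following connectivity is required:
- **Outbound access** from the pipeline execution environment to your Spark Connect endpoint (typically port 15002)
- **Outbound access** from the pipeline execution environment to your OpenMetadata server (typically port 8585)
- **Network access** from Spark workers to your source database (the database must accept inbound connections from the workers)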
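A port-level check for these requirements can be sketched with the Python standard library. This is an illustrative helper (the name `can_connect` is hypothetical), and it only confirms that a TCP connection can be opened from the machine where it runs, so the database check has to be run from a Spark worker to be meaningful.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `can_connect("your_spark_connect_host", 15002)` and `can_connect("your_openmetadata_host", 8585)` cover the two outbound requirements from the pipeline execution environment.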
## Verification Steps
1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster
## Example Verification Commands
### Test Spark Connect Connectivity
```bash
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002
# Or using netcat if available. Spark Connect speaks gRPC, so a plain
# HTTP request with curl will not return a meaningful response.
nc -vz your_spark_connect_host 15002
```
### Test OpenMetadata Connectivity
```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```
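The same check can be done from Python when `curl` is not available in the pipeline environment. This is a minimal sketch using only the standard library, assuming the version endpoint returns a JSON body and needs no authentication; the function name `fetch_openmetadata_version` is hypothetical.

```python
import json
import urllib.request


def fetch_openmetadata_version(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/api/v1/version and return the parsed JSON body."""
    url = base_url.rstrip("/") + "/api/v1/version"
    # urlopen raises URLError or HTTPError if the server is unreachable
    # or responds with an error status.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Call it as `fetch_openmetadata_version("http://your_openmetadata_host:8585")`; an exception indicates that the pipeline environment cannot reach the API.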
### Test Database Connectivity from Spark
```python
# Test database connectivity using Spark
from pyspark.sql import SparkSession

# Spark Connect URLs use the sc:// scheme
spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

# Fetching a row confirms the read works end to end
df.head()
```
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Partitioning Requirements"
icon="MdOutlineSchema"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Learn about the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}