mirror of
https://github.com/open-metadata/OpenMetadata.git
synced 2025-08-10 10:08:11 +00:00
101 lines
3.2 KiB
Markdown
101 lines
3.2 KiB
Markdown
![]() |
---
|
||
|
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
|
||
|
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
|
||
|
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
|
||
|
collate: true
|
||
|
---
|
||
|
|
||
|
# Spark Engine Prerequisites
|
||
|
|
||
|
## Required Infrastructure
|
||
|
|
||
|
### Spark Cluster
|
||
|
|
||
|
- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
|
||
|
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
|
||
|
- **Network access** from the pipeline execution environment to the OpenMetadata server
|
||
|
|
||
|
### Database Drivers in Spark Cluster
|
||
|
|
||
|
Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:
|
||
|
|
||
|
- **PostgreSQL**: `org.postgresql.Driver`
|
||
|
- **MySQL**: `com.mysql.cj.jdbc.Driver`
|
||
|
|
||
|
{% note %}
|
||
|
The specific driver versions should match your Spark version and database version for optimal compatibility.
|
||
|
{% /note %}
|
||
|
|
||
|
## Network Connectivity
|
||
|
|
||
|
The pipeline execution environment must have:
|
||
|
|
||
|
- **Outbound access** to your Spark Connect endpoint (typically port 15002)
|
||
|
- **Outbound access** to your OpenMetadata server (typically port 8585)
|
||
|
- **Inbound access** from Spark workers to your source database
|
||
|
|
||
|
## Verification Steps
|
||
|
|
||
|
1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
|
||
|
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
|
||
|
3. **Test Database**: Confirm Spark workers can connect to your source database
|
||
|
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster
|
||
|
|
||
|
## Example Verification Commands
|
||
|
|
||
|
### Test Spark Connect Connectivity
|
||
|
|
||
|
```bash
|
||
|
# Test basic connectivity to Spark Connect
|
||
|
telnet your_spark_connect_host 15002
|
||
|
|
||
|
# Or using curl if available
|
||
|
curl -X GET http://your_spark_connect_host:15002
|
||
|
```
|
||
|
|
||
|
### Test OpenMetadata Connectivity
|
||
|
|
||
|
```bash
|
||
|
# Test OpenMetadata API connectivity
|
||
|
curl -X GET http://your_openmetadata_host:8585/api/v1/version
|
||
|
```
|
||
|
|
||
|
### Test Database Connectivity from Spark
|
||
|
|
||
|
```python
|
||
|
# Test database connectivity using Spark
|
||
|
from pyspark.sql import SparkSession
|
||
|
|
||
|
spark = SparkSession.builder \
|
||
|
.appName("DatabaseConnectivityTest") \
|
||
|
.remote("<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
|
||
|
.config("spark.jars", "/path/to/your/database/driver.jar") \
|
||
|
.getOrCreate()
|
||
|
|
||
|
# Test connection to your database
|
||
|
df = spark.read \
|
||
|
.format("jdbc") \
|
||
|
.option("url", "jdbc:your_database_url") \
|
||
|
.option("dbtable", "your_test_table") \
|
||
|
.option("user", "your_username") \
|
||
|
.option("password", "your_password") \
|
||
|
.load()
|
||
|
|
||
|
df.head()
|
||
|
```
|
||
|
|
||
|
{% inlineCalloutContainer %}
|
||
|
{% inlineCallout
|
||
|
color="violet-70"
|
||
|
bold="Partitioning Requirements"
|
||
|
icon="MdOutlineSchema"
|
||
|
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
|
||
|
Learn about the partitioning requirements for Spark Engine.
|
||
|
{% /inlineCallout %}
|
||
|
{% inlineCallout
|
||
|
icon="MdAnalytics"
|
||
|
bold="Configuration"
|
||
|
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
|
||
|
Configure your profiler pipeline to use Spark Engine.
|
||
|
{% /inlineCallout %}
|
||
|
{% /inlineCalloutContainer %}
|