Add Spark Engine docs (#22739)

* Add Spark Engine docs

* Update structure
IceS2 2025-08-04 18:28:46 +02:00 committed by GitHub
parent b92e9d0e06
commit 5adf32b731
8 changed files with 484 additions and 0 deletions


@@ -954,6 +954,18 @@ site_menu:
url: /how-to-guides/data-quality-observability/incident-manager/workflow
- category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis
url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine
url: /how-to-guides/data-quality-observability/profiler/spark-engine
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Prerequisites
url: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Partitioning
url: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / UI Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / External Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
- category: How-to Guides / Data Quality and Observability / Anomaly Detection
url: /how-to-guides/data-quality-observability/anomaly-detection


@@ -38,4 +38,10 @@ Watch the video to understand OpenMetadata's native Data Profiler and Data Qua
href="/how-to-guides/data-quality-observability/profiler/external-workflow"%}
Run a single workflow profiler for the entire source externally.
{%/inlineCallout%}
{%inlineCallout
icon="MdRocket"
bold="Spark Engine"
href="/how-to-guides/data-quality-observability/profiler/spark-engine"%}
Use distributed processing with Apache Spark for large-scale data profiling.
{%/inlineCallout%}
{%/inlineCalloutContainer%}


@@ -0,0 +1,32 @@
---
title: Spark Engine Configuration | OpenMetadata Spark Profiling Setup
description: Learn how to configure your profiler pipeline to use Spark Engine for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
collate: true
---
# Spark Engine Configuration
## Overview
There are two ways to configure Spark Engine in OpenMetadata:
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="External Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}


@@ -0,0 +1,146 @@
---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---
# Spark Engine External Configuration
## Overview
To configure your profiler pipeline to use Spark Engine, you need to add the `processingEngine` configuration to your existing YAML file.
{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}
## Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```
{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}
## Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```
## Complete Example
### Before (Native Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
processor:
  type: orm-profiler
  config: {}
```
### After (Spark Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```
## Required Changes
1. **Add `processingEngine`** to `sourceConfig.config`
2. **Add `sparkTableProfilerConfig`** to your table configuration
3. **Specify a partition column** for Spark processing (otherwise the profiler falls back to the primary key, if any, or skips the table entirely)
## Run the Pipeline
Use the same command as before:
```bash
metadata profile -c your_profiler_config.yaml
```
The pipeline will now use Spark Engine instead of the Native engine for processing.
## Troubleshooting Configuration
### Common Issues
1. **Missing Partition Column**: Ensure you've specified a suitable partition column
2. **Network Connectivity**: Verify Spark Connect and database connectivity
3. **Driver Issues**: Check that the appropriate database drivers are installed in the Spark cluster
4. **Configuration Errors**: Validate YAML syntax and required fields
### Debugging Steps
1. **Check Logs**: Review profiler logs for specific error messages
2. **Test Connectivity**: Verify all network connections are working
3. **Validate Configuration**: Ensure all required fields are properly set
4. **Test with Small Dataset**: Start with a small table to verify the setup
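For the last step, a minimal smoke test along the lines of the sketch below can confirm that the Spark Connect endpoint, the JDBC driver, and the source database are all reachable before running the full profiler. Hostnames, credentials, and the table name are placeholders, and the JDBC URL depends on your source database (PostgreSQL is assumed here).
```python
from pyspark.sql import SparkSession

# Connect to the same Spark Connect endpoint configured in the profiler YAML
spark = SparkSession.builder \
    .appName("SparkEngineSmokeTest") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# Read a small table over JDBC to confirm driver availability and network access
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_small_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

print(df.count())
```
If this count succeeds, the remaining issues are usually in the profiler YAML itself rather than in connectivity.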
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,53 @@
---
title: Spark Engine Overview | OpenMetadata Distributed Profiling
description: Learn about OpenMetadata's Spark Engine for distributed data profiling of large-scale datasets using Apache Spark.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine
collate: true
---
# Spark Engine Overview
## What is Spark Engine?
The Spark Engine is a distributed processing engine in OpenMetadata that enables large-scale data profiling using Apache Spark. It's an alternative to the default Native engine, designed specifically for handling massive datasets that would be impractical or impossible to profile directly on the source database.
## When to Use Spark Engine
### Use Spark Engine when:
- You have access to a Spark cluster (local, standalone, YARN, or Kubernetes)
- Your datasets are too large to profile directly on the source database
- You need distributed processing capabilities for enterprise-scale data profiling
- Your source database doesn't have built-in distributed processing capabilities
### Stick with Native Engine when:
- You are using a database with built-in distributed processing, such as BigQuery or Snowflake
- Your profiler pipeline runs smoothly directly on the source database
- You're doing development or testing with small tables
- You don't have access to a Spark cluster
- You need the simplest possible setup
The Spark Engine integrates seamlessly with OpenMetadata's existing profiling framework while providing the distributed processing capabilities needed for enterprise-scale data profiling operations.
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Prerequisites"
icon="MdSecurity"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites" %}
Learn about the required infrastructure and setup for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="Partitioning Requirements"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Understand the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,112 @@
---
title: Spark Engine Partitioning Requirements | OpenMetadata Spark Profiling
description: Learn about the partitioning requirements for Spark Engine and how to choose the right partition column for optimal performance.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
collate: true
---
# Spark Engine Partitioning Requirements
## Why Partitioning is Required
The Spark Engine requires a partition column to efficiently process large datasets. This is because:
1. **Parallel Processing**: Each partition can be processed independently across different Spark workers
2. **Resource Optimization**: Prevents memory overflow and ensures stable processing of large datasets
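As an illustration of what a partition column enables, the sketch below shows the generic Spark pattern for a partitioned JDBC read (not necessarily the exact calls the profiler issues): Spark splits the range between `lowerBound` and `upperBound` on `partitionColumn` into `numPartitions` slices, and each worker reads one slice in parallel. Hosts, credentials, and column names are placeholders.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PartitionedReadSketch") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# Each of the 8 partitions covers a slice of user_id between the bounds,
# so the read is spread across Spark workers instead of a single connection.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .option("partitionColumn", "user_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "10000000") \
    .option("numPartitions", "8") \
    .load()

print(df.count())
```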
## How Partitioning Works
The Spark Engine automatically detects and uses partition columns based on this logic:
### Automatic Detection Priority
1. **Manual Configuration**: You can explicitly specify a partition column in the table configuration
2. **Primary Key Columns**: If a table has a primary key with numeric or date/time data types, it's automatically selected
### Supported Data Types for Partitioning
- **Numeric**: `SMALLINT`, `INT`, `BIGINT`, `NUMBER`
- **Date/Time**: `DATE`, `DATETIME`, `TIMESTAMP`, `TIMESTAMPZ`, `TIME`
## What Happens Without a Suitable Partition Column
If no suitable partition column is found, the table will be skipped during profiling. This ensures that only tables with proper partitioning can be processed by the Spark Engine, preventing potential performance issues or failures.
## Choosing the Right Partition Column
### Best Practices
1. **High Cardinality**: Choose columns with many unique values to ensure even data distribution
2. **Even Distribution**: Avoid columns with heavily skewed data (e.g., mostly NULL values)
3. **Query Performance**: Select columns that have an index created on them
4. **Data Type Compatibility**: Ensure the column is of a supported data type for partitioning
### Examples
| Column Type | Good Partition Column | Poor Partition Column |
| --- | --- | --- |
| **Numeric** | `user_id`, `order_id`, `age` | `status_code` (limited values) |
| **Date/Time** | `created_date`, `updated_at`, `event_timestamp` | `last_login` (many NULLs) |
### Common Patterns
- **Primary Keys**: Usually excellent partition columns
- **Timestamps**: Great for time-based partitioning
- **Foreign Keys**: Good if they have high cardinality
- **Business Keys**: Customer IDs, order IDs, etc.
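To sanity-check a candidate column before committing to it, a quick Spark query like the sketch below can report its cardinality and NULL ratio. The column name, table, and connection details are placeholders; this is a generic check, not part of the profiler workflow itself.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("PartitionColumnCheck") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

total = df.count()
stats = df.agg(
    F.countDistinct("candidate_column").alias("distinct_values"),
    F.sum(F.col("candidate_column").isNull().cast("int")).alias("null_count"),
).first()

# A good partition column has many distinct values and few NULLs
print(f"distinct: {stats['distinct_values']}, "
      f"null ratio: {stats['null_count'] / total:.2%}")
```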
## Configuration Examples
### Basic Partition Configuration
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: user_id
```
### Advanced Partition Configuration
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: created_date
            lowerBound: "2023-01-01"
            upperBound: "2024-01-01"
```
## Troubleshooting Partitioning Issues
### Common Issues
1. **No Suitable Partition Column**: Ensure your table has a column with supported data types
2. **Low Cardinality**: Choose a column with more unique values
3. **Data Type Mismatch**: Verify the column is of a supported data type
4. **Missing Index**: Consider adding an index to improve partitioning performance
### Debugging Steps
1. **Check Table Schema**: Verify available columns and their data types
2. **Analyze Column Distribution**: Check for NULL values and cardinality
3. **Test Partition Column**: Validate the chosen column works with your data
4. **Review Logs**: Check profiler logs for specific partitioning errors
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,101 @@
---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---
# Spark Engine Prerequisites
## Required Infrastructure
### Spark Cluster
- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server
### Database Drivers in Spark Cluster
Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:
- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`
{% note %}
The specific driver versions should match your Spark version and database version for optimal compatibility.
{% /note %}
## Network Connectivity
The pipeline execution environment must have:
- **Outbound access** to your Spark Connect endpoint (typically port 15002)
- **Outbound access** to your OpenMetadata server (typically port 8585)
- **Inbound access** from Spark workers to your source database
## Verification Steps
1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster
## Example Verification Commands
### Test Spark Connect Connectivity
```bash
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002
# Or using curl if available
curl -X GET http://your_spark_connect_host:15002
```
### Test OpenMetadata Connectivity
```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```
### Test Database Connectivity from Spark
```python
# Test database connectivity using Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

df.head()
```
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Partitioning Requirements"
icon="MdOutlineSchema"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Learn about the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,22 @@
---
title: Spark Engine UI Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine through the OpenMetadata UI for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
collate: true
---
# Spark Engine UI Configuration
{% note %}
UI configuration for Spark Engine is currently being implemented and will be available in a future release.
{% /note %}
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="External Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}