Mirror of https://github.com/open-metadata/OpenMetadata.git (synced 2025-08-06 16:18:05 +00:00)

Add Spark Engine docs (#22739)

* Add Spark Engine docs
* Update structure

parent b92e9d0e06
commit 5adf32b731
@@ -954,6 +954,18 @@ site_menu:
    url: /how-to-guides/data-quality-observability/incident-manager/workflow
  - category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis
    url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine
    url: /how-to-guides/data-quality-observability/profiler/spark-engine
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Prerequisites
    url: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Partitioning
    url: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration
    url: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / UI Configuration
    url: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
  - category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / External Configuration
    url: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration

  - category: How-to Guides / Data Quality and Observability / Anomaly Detection
    url: /how-to-guides/data-quality-observability/anomaly-detection
@@ -38,4 +38,10 @@ Watch the video to understand OpenMetadata’s native Data Profiler and Data Qua
href="/how-to-guides/data-quality-observability/profiler/external-workflow"%}
Run a single workflow profiler for the entire source externally.
{%/inlineCallout%}
{%inlineCallout
icon="MdRocket"
bold="Spark Engine"
href="/how-to-guides/data-quality-observability/profiler/spark-engine"%}
Use distributed processing with Apache Spark for large-scale data profiling.
{%/inlineCallout%}
{%/inlineCalloutContainer%}
@@ -0,0 +1,32 @@
---
title: Spark Engine Configuration | OpenMetadata Spark Profiling Setup
description: Learn how to configure your profiler pipeline to use Spark Engine for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
collate: true
---

# Spark Engine Configuration

## Overview

There are two ways to configure Spark Engine in OpenMetadata:

{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="External Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}

{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}
@@ -0,0 +1,146 @@
---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---

# Spark Engine External Configuration

## Overview

To configure your profiler pipeline to use Spark Engine, you need to add the `processingEngine` configuration to your existing YAML file.

{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}

## Step 1: Add Spark Engine Configuration

In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:

```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```

{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}
## Step 2: Add Partition Configuration

In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```

## Complete Example

### Before (Native Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name

processor:
  type: orm-profiler
  config: {}
```

### After (Spark Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value

processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```
## Required Changes

1. **Add `processingEngine`** to `sourceConfig.config`
2. **Add `sparkTableProfilerConfig`** to your table configuration
3. **Specify a partition column** for Spark processing (otherwise the engine falls back to the primary key, if one exists, or skips the table entirely)

## Run the Pipeline

Use the same command as before:

```bash
metadata profile -c your_profiler_config.yaml
```

The pipeline will now use the Spark Engine instead of the Native engine for processing.

## Troubleshooting Configuration

### Common Issues

1. **Missing Partition Column**: Ensure you've specified a suitable partition column
2. **Network Connectivity**: Verify Spark Connect and database connectivity
3. **Driver Issues**: Check that the appropriate database drivers are installed in the Spark cluster
4. **Configuration Errors**: Validate YAML syntax and required fields
### Debugging Steps

1. **Check Logs**: Review profiler logs for specific error messages
2. **Test Connectivity**: Verify all network connections are working (see the sketch below)
3. **Validate Configuration**: Ensure all required fields are properly set
4. **Test with Small Dataset**: Start with a small table to verify the setup
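
As a quick connectivity check, here is a minimal PySpark sketch, assuming a hypothetical Spark Connect endpoint at `sc://your_spark_connect_host:15002`. If the trivial job below completes, the endpoint is reachable and able to schedule work before you re-run the profiler:

```python
from pyspark.sql import SparkSession

# Hypothetical Spark Connect endpoint; replace with your own host and port.
spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

# A trivial job: if this prints 10, the Spark Connect endpoint is reachable
# and can schedule work on the cluster.
print(spark.range(10).count())
```
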
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
@@ -0,0 +1,53 @@
---
title: Spark Engine Overview | OpenMetadata Distributed Profiling
description: Learn about OpenMetadata's Spark Engine for distributed data profiling of large-scale datasets using Apache Spark.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine
collate: true
---

# Spark Engine Overview

## What is Spark Engine?

The Spark Engine is a distributed processing engine in OpenMetadata that enables large-scale data profiling using Apache Spark. It's an alternative to the default Native engine, designed specifically for handling massive datasets that would be impractical or impossible to profile directly on the source database.

## When to Use Spark Engine

### Use Spark Engine when:

- You have access to a Spark cluster (local, standalone, YARN, or Kubernetes)
- Your datasets are too large to profile directly on the source database
- You need distributed processing capabilities for enterprise-scale data profiling
- Your source database doesn't have built-in distributed processing capabilities

### Stick with Native Engine when:
- You are using a database that already does distributed processing, such as BigQuery or Snowflake
- Your profiler pipeline runs smoothly directly on the source database
- You're doing development or testing with small tables
- You don't have access to a Spark cluster
- You need the simplest possible setup

The Spark Engine integrates seamlessly with OpenMetadata's existing profiling framework while providing the distributed processing capabilities needed for enterprise-scale data profiling operations.

{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Prerequisites"
icon="MdSecurity"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites" %}
Learn about the required infrastructure and setup for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="Partitioning Requirements"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Understand the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
@@ -0,0 +1,112 @@
---
title: Spark Engine Partitioning Requirements | OpenMetadata Spark Profiling
description: Learn about the partitioning requirements for Spark Engine and how to choose the right partition column for optimal performance.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
collate: true
---

# Spark Engine Partitioning Requirements

## Why Partitioning is Required

The Spark Engine requires a partition column to efficiently process large datasets. This is because:
1. **Parallel Processing**: Each partition can be processed independently across different Spark workers (see the sketch after this list)
2. **Resource Optimization**: Prevents memory overflow and ensures stable processing of large datasets
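
For intuition, here is a minimal PySpark sketch of the kind of partitioned JDBC read that a partition column makes possible. The connection details, column name, and bounds are hypothetical placeholders, and this illustrates Spark's standard partitioned read rather than the engine's internal implementation:

```python
from pyspark.sql import SparkSession

# Hypothetical Spark Connect endpoint; replace with your own host and port.
spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

# With a numeric partition column and bounds, Spark splits the read into
# numPartitions range predicates (e.g. user_id >= 0 AND user_id < 1000000, ...)
# and each range is read and processed by a separate worker in parallel.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database")  # hypothetical JDBC URL
    .option("dbtable", "your_schema.your_table")
    .option("user", "your_username")
    .option("password", "your_password")
    .option("partitionColumn", "user_id")   # hypothetical partition column
    .option("lowerBound", "0")
    .option("upperBound", "10000000")
    .option("numPartitions", "10")
    .load()
)

print(df.count())  # the count is computed across the 10 partitions in parallel
```
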
## How Partitioning Works

The Spark Engine automatically detects and uses partition columns based on this logic:

### Automatic Detection Priority

1. **Manual Configuration**: You can explicitly specify a partition column in the table configuration
2. **Primary Key Columns**: If a table has a primary key with numeric or date/time data types, it's automatically selected

### Supported Data Types for Partitioning

- **Numeric**: `SMALLINT`, `INT`, `BIGINT`, `NUMBER`
- **Date/Time**: `DATE`, `DATETIME`, `TIMESTAMP`, `TIMESTAMPZ`, `TIME`

## What Happens Without a Suitable Partition Column

If no suitable partition column is found, the table will be skipped during profiling. This ensures that only tables with proper partitioning can be processed by the Spark Engine, preventing potential performance issues or failures.

## Choosing the Right Partition Column

### Best Practices

1. **High Cardinality**: Choose columns with many unique values to ensure even data distribution
2. **Even Distribution**: Avoid columns with heavily skewed data (e.g., mostly NULL values)
3. **Query Performance**: Select columns that have an index created on them
4. **Data Type Compatibility**: Ensure the column is of a supported data type for partitioning

### Examples

| Column Type | Good Partition Column | Poor Partition Column |
| --- | --- | --- |
| **Numeric** | `user_id`, `order_id`, `age` | `status_code` (limited values) |
| **Date/Time** | `created_date`, `updated_at`, `event_timestamp` | `last_login` (many NULLs) |

### Common Patterns

- **Primary Keys**: Usually excellent partition columns
- **Timestamps**: Great for time-based partitioning
- **Foreign Keys**: Good if they have high cardinality
- **Business Keys**: Customer IDs, order IDs, etc.
## Configuration Examples

### Basic Partition Configuration

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: user_id
```

### Advanced Partition Configuration

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: created_date
            lowerBound: "2023-01-01"
            upperBound: "2024-01-01"
```
## Troubleshooting Partitioning Issues

### Common Issues

1. **No Suitable Partition Column**: Ensure your table has a column with supported data types
2. **Low Cardinality**: Choose a column with more unique values
3. **Data Type Mismatch**: Verify the column is of a supported data type
4. **Missing Index**: Consider adding an index to improve partitioning performance

### Debugging Steps
1. **Check Table Schema**: Verify available columns and their data types
2. **Analyze Column Distribution**: Check for NULL values and cardinality (see the sketch below)
3. **Test Partition Column**: Validate the chosen column works with your data
4. **Review Logs**: Check profiler logs for specific partitioning errors
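
To support step 2, the following is a minimal PySpark sketch for checking how suitable a candidate partition column is. The connection details and the `created_date` column are hypothetical placeholders; a good candidate shows many distinct values and few NULLs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical Spark Connect endpoint and JDBC details; replace with your own.
spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database")
    .option("dbtable", "your_schema.your_table")
    .option("user", "your_username")
    .option("password", "your_password")
    .load()
)

# Row count, distinct values, and NULL count for the candidate partition column.
df.select(
    F.count("*").alias("rows"),
    F.countDistinct("created_date").alias("distinct_values"),
    F.sum(F.col("created_date").isNull().cast("int")).alias("null_count"),
).show()
```
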
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
@@ -0,0 +1,101 @@
---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---

# Spark Engine Prerequisites

## Required Infrastructure

### Spark Cluster

- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server

### Database Drivers in Spark Cluster

Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:

- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`

{% note %}
The specific driver versions should match your Spark version and database version for optimal compatibility.
{% /note %}

## Network Connectivity

The pipeline execution environment must have:

- **Outbound access** to your Spark Connect endpoint (typically port 15002)
- **Outbound access** to your OpenMetadata server (typically port 8585)
- **Database access** from the Spark workers to your source database

## Verification Steps

1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster

## Example Verification Commands

### Test Spark Connect Connectivity

```bash
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002

# Or using curl if available
curl -X GET http://your_spark_connect_host:15002
```

### Test OpenMetadata Connectivity

```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```

### Test Database Connectivity from Spark
```python
# Test database connectivity using Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

df.head()
```
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Partitioning Requirements"
icon="MdOutlineSchema"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Learn about the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
@@ -0,0 +1,22 @@
---
title: Spark Engine UI Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine through the OpenMetadata UI for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
collate: true
---

# Spark Engine UI Configuration

{% note %}
UI configuration for Spark Engine is currently being implemented and will be available in future releases.
{% /note %}

{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="External Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}