Add Spark Engine docs (#22739)

* Add Spark Engine docs

* Update structure
IceS2 2025-08-04 18:28:46 +02:00 committed by GitHub
parent b92e9d0e06
commit 5adf32b731
8 changed files with 484 additions and 0 deletions


@@ -954,6 +954,18 @@ site_menu:
url: /how-to-guides/data-quality-observability/incident-manager/workflow
- category: How-to Guides / Data Quality and Observability / Incident Manager / Root Cause Analysis
url: /how-to-guides/data-quality-observability/incident-manager/root-cause-analysis
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine
url: /how-to-guides/data-quality-observability/profiler/spark-engine
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Prerequisites
url: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Partitioning
url: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / UI Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
- category: How-to Guides / Data Quality and Observability / Data Profiler / Spark Engine / Configuration / External Configuration
url: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
- category: How-to Guides / Data Quality and Observability / Anomaly Detection
url: /how-to-guides/data-quality-observability/anomaly-detection


@@ -38,4 +38,10 @@ Watch the video to understand OpenMetadata's native Data Profiler and Data Qua
href="/how-to-guides/data-quality-observability/profiler/external-workflow"%}
Run a single workflow profiler for the entire source externally.
{%/inlineCallout%}
{%inlineCallout
icon="MdRocket"
bold="Spark Engine"
href="/how-to-guides/data-quality-observability/profiler/spark-engine"%}
Use distributed processing with Apache Spark for large-scale data profiling.
{%/inlineCallout%}
{%/inlineCalloutContainer%}


@@ -0,0 +1,32 @@
---
title: Spark Engine Configuration | OpenMetadata Spark Profiling Setup
description: Learn how to configure your profiler pipeline to use Spark Engine for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/configuration
collate: true
---
# Spark Engine Configuration
## Overview
There are two ways to configure Spark Engine in OpenMetadata:
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="External Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}
{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}


@@ -0,0 +1,146 @@
---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---
# Spark Engine External Configuration
## Overview
To configure your profiler pipeline to use Spark Engine, you need to add the `processingEngine` configuration to your existing YAML file.
{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}
## Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```
{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}
## Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```
## Complete Example
### Before (Native Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
processor:
  type: orm-profiler
  config: {}
```
### After (Spark Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```
## Required Changes
1. **Add `processingEngine`** to `sourceConfig.config`
2. **Add `sparkTableProfilerConfig`** to your table configuration
3. **Specify a partition column** for Spark processing (otherwise the profiler falls back to the primary key, if any, or skips the table entirely)
## Run the Pipeline
Use the same command as before:
```bash
metadata profile -c your_profiler_config.yaml
```
The pipeline will now use Spark Engine instead of the Native engine for processing.
## Troubleshooting Configuration
### Common Issues
1. **Missing Partition Column**: Ensure you've specified a suitable partition column
2. **Network Connectivity**: Verify Spark Connect and database connectivity
3. **Driver Issues**: Check that the appropriate database drivers are installed in the Spark cluster
4. **Configuration Errors**: Validate YAML syntax and required fields
### Debugging Steps
1. **Check Logs**: Review profiler logs for specific error messages
2. **Test Connectivity**: Verify all network connections are working
3. **Validate Configuration**: Ensure all required fields are properly set
4. **Test with Small Dataset**: Start with a small table to verify the setup
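For the last step, a minimal smoke test along the lines of the sketch below can confirm that the Spark Connect endpoint, the JDBC driver, and the source database are all reachable before running the full profiler. Hostnames, credentials, and the table name are placeholders, and the JDBC URL depends on your source database (PostgreSQL is assumed here).
```python
from pyspark.sql import SparkSession

# Connect to the same Spark Connect endpoint configured in the profiler YAML
spark = SparkSession.builder \
    .appName("SparkEngineSmokeTest") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# Read a small table over JDBC to confirm driver availability and network access
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_small_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

print(df.count())
```
If this count succeeds, the remaining issues are usually in the profiler YAML itself rather than in connectivity.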
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="UI Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
Configure Spark Engine through the OpenMetadata UI.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,53 @@
---
title: Spark Engine Overview | OpenMetadata Distributed Profiling
description: Learn about OpenMetadata's Spark Engine for distributed data profiling of large-scale datasets using Apache Spark.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine
collate: true
---
# Spark Engine Overview
## What is Spark Engine?
The Spark Engine is a distributed processing engine in OpenMetadata that enables large-scale data profiling using Apache Spark. It's an alternative to the default Native engine, designed specifically for handling massive datasets that would be impractical or impossible to profile directly on the source database.
## When to Use Spark Engine
### Use Spark Engine when:
- You have access to a Spark cluster (local, standalone, YARN, or Kubernetes)
- Your datasets are too large to profile directly on the source database
- You need distributed processing capabilities for enterprise-scale data profiling
- Your source database doesn't have built-in distributed processing capabilities
### Stick with Native Engine when:
- You are using a database with built-in distributed processing, such as BigQuery or Snowflake
- Your profiler pipeline runs smoothly directly on the source database
- You're doing development or testing with small tables
- You don't have access to a Spark cluster
- You need the simplest possible setup
The Spark Engine integrates seamlessly with OpenMetadata's existing profiling framework while providing the distributed processing capabilities needed for enterprise-scale data profiling operations.
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Prerequisites"
icon="MdSecurity"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites" %}
Learn about the required infrastructure and setup for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdOutlineSchema"
bold="Partitioning Requirements"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Understand the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,112 @@
---
title: Spark Engine Partitioning Requirements | OpenMetadata Spark Profiling
description: Learn about the partitioning requirements for Spark Engine and how to choose the right partition column for optimal performance.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/partitioning
collate: true
---
# Spark Engine Partitioning Requirements
## Why Partitioning is Required
The Spark Engine requires a partition column to efficiently process large datasets. This is because:
1. **Parallel Processing**: Each partition can be processed independently across different Spark workers
2. **Resource Optimization**: Prevents memory overflow and ensures stable processing of large datasets
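As an illustration of what a partition column enables, the sketch below shows the generic Spark pattern for a partitioned JDBC read (not necessarily the exact calls the profiler issues): Spark splits the range between `lowerBound` and `upperBound` on `partitionColumn` into `numPartitions` slices, and each worker reads one slice in parallel. Hosts, credentials, and column names are placeholders.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PartitionedReadSketch") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

# Each of the 8 partitions covers a slice of user_id between the bounds,
# so the read is spread across Spark workers instead of a single connection.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .option("partitionColumn", "user_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "10000000") \
    .option("numPartitions", "8") \
    .load()

print(df.count())
```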
## How Partitioning Works
The Spark Engine automatically detects and uses partition columns based on this logic:
### Automatic Detection Priority
1. **Manual Configuration**: You can explicitly specify a partition column in the table configuration
2. **Primary Key Columns**: If a table has a primary key with numeric or date/time data types, it's automatically selected
### Supported Data Types for Partitioning
- **Numeric**: `SMALLINT`, `INT`, `BIGINT`, `NUMBER`
- **Date/Time**: `DATE`, `DATETIME`, `TIMESTAMP`, `TIMESTAMPZ`, `TIME`
## What Happens Without a Suitable Partition Column
If no suitable partition column is found, the table will be skipped during profiling. This ensures that only tables with proper partitioning can be processed by the Spark Engine, preventing potential performance issues or failures.
## Choosing the Right Partition Column
### Best Practices
1. **High Cardinality**: Choose columns with many unique values to ensure even data distribution
2. **Even Distribution**: Avoid columns with heavily skewed data (e.g., mostly NULL values)
3. **Query Performance**: Select columns that have an index created on them
4. **Data Type Compatibility**: Ensure the column is of a supported data type for partitioning
### Examples
| Column Type | Good Partition Column | Poor Partition Column |
| --- | --- | --- |
| **Numeric** | `user_id`, `order_id`, `age` | `status_code` (limited values) |
| **Date/Time** | `created_date`, `updated_at`, `event_timestamp` | `last_login` (many NULLs) |
### Common Patterns
- **Primary Keys**: Usually excellent partition columns
- **Timestamps**: Great for time-based partitioning
- **Foreign Keys**: Good if they have high cardinality
- **Business Keys**: Customer IDs, order IDs, etc.
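To sanity-check a candidate column before committing to it, a quick Spark query like the sketch below can report its cardinality and NULL ratio. The column name, table, and connection details are placeholders; this is a generic check, not part of the profiler workflow itself.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("PartitionColumnCheck") \
    .remote("sc://your_spark_connect_host:15002") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
    .option("dbtable", "your_schema.your_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

total = df.count()
stats = df.agg(
    F.countDistinct("candidate_column").alias("distinct_values"),
    F.sum(F.col("candidate_column").isNull().cast("int")).alias("null_count"),
).first()

# A good partition column has many distinct values and few NULLs
print(f"distinct: {stats['distinct_values']}, "
      f"null ratio: {stats['null_count'] / total:.2%}")
```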
## Configuration Examples
### Basic Partition Configuration
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: user_id
```
### Advanced Partition Configuration
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: created_date
            lowerBound: "2023-01-01"
            upperBound: "2024-01-01"
```
## Troubleshooting Partitioning Issues
### Common Issues
1. **No Suitable Partition Column**: Ensure your table has a column with supported data types
2. **Low Cardinality**: Choose a column with more unique values
3. **Data Type Mismatch**: Verify the column is of a supported data type
4. **Missing Index**: Consider adding an index to improve partitioning performance
### Debugging Steps
1. **Check Table Schema**: Verify available columns and their data types
2. **Analyze Column Distribution**: Check for NULL values and cardinality
3. **Test Partition Column**: Validate the chosen column works with your data
4. **Review Logs**: Check profiler logs for specific partitioning errors
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,101 @@
---
title: Spark Engine Prerequisites | OpenMetadata Spark Profiling Setup
description: Learn about the required infrastructure, network connectivity, and setup for using Spark Engine in OpenMetadata.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites
collate: true
---
# Spark Engine Prerequisites
## Required Infrastructure
### Spark Cluster
- **Spark Connect available** (versions 3.5.2 to 3.5.6 supported)
- **Network access** from the pipeline execution environment to the Spark Connect endpoint
- **Network access** from the pipeline execution environment to the OpenMetadata server
### Database Drivers in Spark Cluster
Depending on your source database, ensure the appropriate driver is installed in your Spark cluster:
- **PostgreSQL**: `org.postgresql.Driver`
- **MySQL**: `com.mysql.cj.jdbc.Driver`
{% note %}
The specific driver versions should match your Spark version and database version for optimal compatibility.
{% /note %}
## Network Connectivity
The pipeline execution environment must have:
- **Outbound access** to your Spark Connect endpoint (typically port 15002)
- **Outbound access** to your OpenMetadata server (typically port 8585)
- **Inbound access** from Spark workers to your source database
## Verification Steps
1. **Test Spark Connect**: Verify connectivity from your pipeline environment to Spark Connect
2. **Test OpenMetadata**: Ensure your pipeline environment can reach the OpenMetadata API
3. **Test Database**: Confirm Spark workers can connect to your source database
4. **Verify Drivers**: Check that the appropriate database driver is available in your Spark cluster
## Example Verification Commands
### Test Spark Connect Connectivity
```bash
# Test basic connectivity to Spark Connect
telnet your_spark_connect_host 15002
# Or using curl if available
curl -X GET http://your_spark_connect_host:15002
```
### Test OpenMetadata Connectivity
```bash
# Test OpenMetadata API connectivity
curl -X GET http://your_openmetadata_host:8585/api/v1/version
```
### Test Database Connectivity from Spark
```python
# Test database connectivity using Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DatabaseConnectivityTest") \
    .remote("sc://<SPARK_CONNECT_HOST>:<SPARK_CONNECT_PORT>") \
    .config("spark.jars", "/path/to/your/database/driver.jar") \
    .getOrCreate()

# Test connection to your database
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:your_database_url") \
    .option("dbtable", "your_test_table") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

df.head()
```
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="Partitioning Requirements"
icon="MdOutlineSchema"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning" %}
Learn about the partitioning requirements for Spark Engine.
{% /inlineCallout %}
{% inlineCallout
icon="MdAnalytics"
bold="Configuration"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/configuration" %}
Configure your profiler pipeline to use Spark Engine.
{% /inlineCallout %}
{% /inlineCalloutContainer %}


@@ -0,0 +1,22 @@
---
title: Spark Engine UI Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine through the OpenMetadata UI for distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration
collate: true
---
# Spark Engine UI Configuration
{% note %}
UI configuration for Spark Engine is currently being implemented and will be available in a future release.
{% /note %}
{% inlineCalloutContainer %}
{% inlineCallout
color="violet-70"
bold="External Configuration"
icon="MdAnalytics"
href="/how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration" %}
Configure Spark Engine using YAML files for external workflows.
{% /inlineCallout %}
{% /inlineCalloutContainer %}