
---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---

# Spark Engine External Configuration

## Overview

To configure your profiler pipeline to use the Spark Engine, add the `processingEngine` configuration to your existing YAML file.

{% note %} Before configuring, ensure you have completed the Spark Engine Prerequisites and understand the Partitioning Requirements. {% /note %}

## Step 1: Add Spark Engine Configuration

In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:

```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```

{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}

## Step 2: Add Partition Configuration

In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```
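To see what the commented-out bounds control: when Spark performs a range-partitioned read, it splits the `[lowerBound, upperBound)` interval of the partition column into one WHERE-clause stride per task. The sketch below is a hypothetical illustration of that splitting (not OpenMetadata or Spark source code):

```python
def partition_predicates(column: str, lower: int, upper: int, num_partitions: int) -> list[str]:
    """Build one WHERE-clause predicate per task for a range-partitioned read.

    Illustrative sketch: the first stride also captures NULLs, and the last
    stride is left open-ended so no row beyond the upper bound is missed.
    """
    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

# With the example bounds from the YAML above and four partitions (an assumed count):
for p in partition_predicates("your_partition_column", 0, 10_000_000, 4):
    print(p)
```

Because each stride becomes one task, a partition column whose values are roughly uniformly distributed keeps the tasks balanced; a heavily skewed column leaves most rows in a single stride.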

## Complete Example

### Before (Native Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name

processor:
  type: orm-profiler
  config: {}
```

### After (Spark Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value

processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```

## Required Changes

1. Add `processingEngine` to `sourceConfig.config`
2. Add `sparkTableProfilerConfig` to your table configuration
3. Specify a partition column for Spark processing (otherwise the profiler falls back to the table's primary key, if one exists, or skips the table entirely)

## Run the Pipeline

Use the same command as before:

```bash
metadata profile -c your_profiler_config.yaml
```

The pipeline will now use the Spark Engine instead of the Native Engine for processing.

## Troubleshooting Configuration

### Common Issues

1. **Missing Partition Column**: Ensure you've specified a suitable partition column
2. **Network Connectivity**: Verify Spark Connect and database connectivity
3. **Driver Issues**: Check that the appropriate database drivers are installed in the Spark cluster
4. **Configuration Errors**: Validate YAML syntax and required fields

### Debugging Steps

1. **Check Logs**: Review profiler logs for specific error messages
2. **Test Connectivity**: Verify all network connections are working
3. **Validate Configuration**: Ensure all required fields are properly set
4. **Test with a Small Dataset**: Start with a small table to verify the setup
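For the connectivity check in step 2, a quick TCP probe of the Spark Connect endpoint (the host and port from the `remote: sc://host:port` value in your YAML) can rule out basic network problems before digging into logs. A minimal sketch, assuming the default Spark Connect port 15002:

```python
import socket

def spark_connect_reachable(host: str, port: int = 15002, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the Spark Connect port succeeds.

    Hypothetical helper: host and port come from the `remote` setting in
    your profiler YAML (e.g. sc://your_spark_connect_host:15002). A True
    result only proves the port is open, not that Spark Connect is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

Run it with your own host, e.g. `spark_connect_reachable("your_spark_connect_host")`; a `False` result points at DNS, firewall, or an unstarted Spark Connect server rather than a profiler misconfiguration.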

{% inlineCalloutContainer %} {% inlineCallout color="violet-70" bold="UI Configuration" icon="MdAnalytics" href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %} Configure Spark Engine through the OpenMetadata UI. {% /inlineCallout %} {% /inlineCalloutContainer %}