---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---

# Spark Engine External Configuration

## Overview

To configure your profiler pipeline to use the Spark Engine, add the `processingEngine` configuration to your existing YAML file.

{% note %}
Before configuring, ensure you have completed the [Spark Engine Prerequisites](/how-to-guides/data-quality-observability/profiler/spark-engine/prerequisites) and understand the [Partitioning Requirements](/how-to-guides/data-quality-observability/profiler/spark-engine/partitioning).
{% /note %}

## Step 1: Add Spark Engine Configuration

In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:

```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```

{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}
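
One way to sanity-check the shared path before running the profiler is to write and read a tiny dataset through Spark Connect; if both operations succeed, the workers can use it. This is a minimal sketch, not part of the profiler itself: it assumes `pyspark` with the Spark Connect client is installed on the machine running the check, and the endpoint and path are the placeholders from the snippet above.

```python
from pyspark.sql import SparkSession

# Same endpoint as processingEngine.remote above (placeholder host and port).
spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

# Placeholder: a throwaway location under the same shared root as tempPath.
check_path = "your_path/openmetadata_tmp_check"

# Workers write a small Parquet dataset to the shared path...
spark.range(100).write.mode("overwrite").parquet(check_path)

# ...and read it back; a mismatch or an error means the path is not usable cluster-wide.
assert spark.read.parquet(check_path).count() == 100

spark.stop()
```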
## Step 2: Add Partition Configuration

In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```
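
These options mirror Spark's standard settings for partitioned JDBC reads: the partition column's range is split into strides so that workers can read the table in parallel. As a rough illustration of that mechanism only (not the profiler's internal code), a partitioned read in PySpark looks like the sketch below, with the JDBC URL, credentials, and bounds as placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

# Spark issues numPartitions parallel queries, each covering a stride of the
# partition column's range between lowerBound and upperBound.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database")  # placeholder URL
    .option("dbtable", "your_schema.your_table_name")
    .option("user", "your_user")          # placeholder credentials
    .option("password", "your_password")
    .option("partitionColumn", "your_partition_column")
    .option("lowerBound", "0")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load()
)

print(df.count())
spark.stop()
```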
## Complete Example

### Before (Native Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name

processor:
  type: orm-profiler
  config: {}
```

### After (Spark Engine)

```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value

processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```

## Required Changes

1. **Add `processingEngine`** to `sourceConfig.config`
2. **Add `sparkTableProfilerConfig`** to your table configuration
3. **Specify a partition column** for Spark processing; if none is specified, the profiler falls back to the table's primary key (if any) or skips the table entirely

## Run the Pipeline

Use the same command as before:

```bash
metadata profile -c your_profiler_config.yaml
```

The pipeline will now use the Spark Engine instead of the native engine for processing.
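
If you trigger ingestion from Python rather than the CLI, the same YAML can be run programmatically. The sketch below is a minimal example and assumes the `openmetadata-ingestion` package is installed; the `ProfilerWorkflow` import path has moved between releases, so check the module layout of your installed version.

```python
import yaml

# Import path used by recent openmetadata-ingestion releases; older versions
# expose the class from a different module, so adjust for your installed version.
from metadata.workflow.profiler import ProfilerWorkflow

with open("your_profiler_config.yaml") as f:
    workflow_config = yaml.safe_load(f)

workflow = ProfilerWorkflow.create(workflow_config)
workflow.execute()
workflow.raise_from_status()  # fail loudly if any step reported errors
workflow.print_status()
workflow.stop()
```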
## Troubleshooting Configuration

### Common Issues

1. **Missing Partition Column**: Ensure you have specified a suitable partition column
2. **Network Connectivity**: Verify Spark Connect and database connectivity
3. **Driver Issues**: Check that the appropriate database drivers are installed on the Spark cluster
4. **Configuration Errors**: Validate the YAML syntax and required fields

### Debugging Steps

1. **Check Logs**: Review the profiler logs for specific error messages
2. **Test Connectivity**: Verify all network connections are working (a quick check is sketched below)
3. **Validate Configuration**: Ensure all required fields are properly set
4. **Test with a Small Dataset**: Start with a small table to verify the setup
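
For the connectivity check, a quick way to confirm that the Spark Connect endpoint is reachable and that executors actually run tasks is to submit a trivial job from the machine that executes the profiler. This minimal sketch assumes `pyspark` with the Spark Connect client and uses the placeholder endpoint from the configuration above.

```python
from pyspark.sql import SparkSession

# Same endpoint as processingEngine.remote in the YAML (placeholder).
spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

# A trivial job: proves the gRPC endpoint is reachable and that executors respond.
print(spark.range(10).count())  # expected output: 10

spark.stop()
```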

{% inlineCalloutContainer %}
 {% inlineCallout
    color="violet-70"
    bold="UI Configuration"
    icon="MdAnalytics"
    href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
    Configure Spark Engine through the OpenMetadata UI.
 {% /inlineCallout %}
{% /inlineCalloutContainer %}