---
title: Spark Engine External Configuration | OpenMetadata Spark Profiling
description: Configure Spark Engine using YAML files for external workflows and distributed data profiling.
slug: /how-to-guides/data-quality-observability/profiler/spark-engine/external-configuration
collate: true
---
# Spark Engine External Configuration

## Overview
To configure your profiler pipeline to use the Spark Engine, you need to add the `processingEngine` configuration to your existing YAML file.
{% note %}
Before configuring, ensure you have completed the Spark Engine Prerequisites and understand the Partitioning Requirements.
{% /note %}
## Step 1: Add Spark Engine Configuration
In your existing profiler YAML, add the `processingEngine` section under `sourceConfig.config`:
```yaml
sourceConfig:
  config:
    type: Profiler
    # ... your existing configuration ...
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: your_path
```
{% note %}
**Important**: The `tempPath` must be accessible to all nodes in your Spark cluster. This is typically a shared filesystem path (like S3, HDFS, or a mounted network drive) that all Spark workers can read from and write to.
{% /note %}
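Before wiring this into the pipeline, it can be worth confirming that the Spark Connect endpoint is reachable from the machine running the profiler. Below is a minimal sketch, assuming PySpark 3.4+ with the Spark Connect client installed (`pip install "pyspark[connect]"`); the endpoint is the placeholder from the YAML above:

```python
# Minimal Spark Connect reachability check; the endpoint is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()
print(spark.version)  # prints the server's Spark version if the connection succeeds
```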
## Step 2: Add Partition Configuration
In the `processor.config.tableConfig` section, add the `sparkTableProfilerConfig`:
```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 10000000
```
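Conceptually, these settings drive a partitioned JDBC read on the Spark side: the scan is split into parallel range queries over the partition column. The sketch below is illustrative only, not the profiler's internal code, and the JDBC URL, database, and bounds are hypothetical placeholders:

```python
# Illustrative partitioned JDBC read: Spark issues numPartitions range
# queries over partitionColumn between lowerBound and upperBound.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://your_db_host:5432/your_database")  # hypothetical URL
    .option("dbtable", "your_schema.your_table")
    .option("partitionColumn", "your_partition_column")
    .option("lowerBound", 0)           # min expected value of the partition column
    .option("upperBound", 10_000_000)  # max expected value of the partition column
    .option("numPartitions", 8)        # number of parallel read tasks
    .load()
)
df.printSchema()
```

A roughly uniformly distributed partition column keeps these ranges balanced, which is why a numeric, date, or timestamp column works best.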
## Complete Example

### Before (Native Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
processor:
  type: orm-profiler
  config: {}
```
### After (Spark Engine)
```yaml
sourceConfig:
  config:
    type: Profiler
    schemaFilterPattern:
      includes:
        - ^your_schema$
    tableFilterPattern:
      includes:
        - your_table_name
    processingEngine:
      type: Spark
      remote: sc://your_spark_connect_host:15002
      config:
        tempPath: s3://your_s3_bucket/table
        # extraConfig:
        #   key: value
processor:
  type: orm-profiler
  config:
    tableConfig:
      - fullyQualifiedName: YourService.YourDatabase.YourSchema.YourTable
        sparkTableProfilerConfig:
          partitioning:
            partitionColumn: your_partition_column
            # lowerBound: 0
            # upperBound: 1000000
```
### Required Changes
- Add `processingEngine` to `sourceConfig.config`
- Add `sparkTableProfilerConfig` to your table configuration
- Specify a partition column for Spark processing; otherwise the profiler falls back to the primary key, if one exists, or skips the table entirely
## Run the Pipeline
Use the same command as before:
```bash
metadata profile -c your_profiler_config.yaml
```
The pipeline will now use the Spark Engine instead of the Native Engine for processing.
## Troubleshooting Configuration

### Common Issues
- Missing Partition Column: Ensure you've specified a suitable partition column
- Network Connectivity: Verify Spark Connect and database connectivity
- Driver Issues: Check that the appropriate database drivers are installed in the Spark cluster (a quick probe follows this list)
- Configuration Errors: Validate YAML syntax and required fields
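For driver issues specifically, one hedged approach is to attempt a trivial JDBC query through the same Spark Connect session; the sketch below uses a hypothetical PostgreSQL URL, and a "No suitable driver" (or similar classpath) error usually means the JDBC driver jar is missing from the Spark cluster:

```python
# Trivial JDBC probe: fails with a driver/classpath error if the JDBC
# driver jar is not installed on the Spark cluster. URL is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

try:
    spark.read.format("jdbc") \
        .option("url", "jdbc:postgresql://your_db_host:5432/your_database") \
        .option("query", "SELECT 1") \
        .load() \
        .show()
except Exception as exc:
    print(f"JDBC probe failed: {exc}")
```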
### Debugging Steps
- Check Logs: Review profiler logs for specific error messages
- Test Connectivity: Verify all network connections are working (a smoke test sketch follows this list)
- Validate Configuration: Ensure all required fields are properly set
- Test with Small Dataset: Start with a small table to verify the setup
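For the connectivity and small-dataset checks above, a short end-to-end smoke test can exercise both the Spark Connect endpoint and the shared `tempPath` at once. A sketch, reusing the placeholder endpoint and S3 path from the complete example, and assuming the cluster already has credentials for the bucket:

```python
# Smoke test: connect, then write and read a tiny dataset at tempPath to
# confirm the Spark workers can reach the shared location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://your_spark_connect_host:15002").getOrCreate()

path = "s3://your_s3_bucket/table/_smoke_test"  # placeholder matching tempPath
spark.range(10).write.mode("overwrite").parquet(path)
assert spark.read.parquet(path).count() == 10
print("Spark Connect and tempPath look good")
```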
{% inlineCalloutContainer %}
 {% inlineCallout
  color="violet-70"
  bold="UI Configuration"
  icon="MdAnalytics"
  href="/how-to-guides/data-quality-observability/profiler/spark-engine/ui-configuration" %}
  Configure Spark Engine through the OpenMetadata UI.
 {% /inlineCallout %}
{% /inlineCalloutContainer %}