
# Optional PySpark Support for S3 Source
DataHub's S3 source now supports optional PySpark installation through the `s3-slim` variant. This allows users to choose a lightweight installation when data lake profiling is not needed.
## Overview
The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the `s3-slim` variant provides a ~500MB smaller installation.
**Current implementation status:**
- **S3**: SparkProfiler pattern fully implemented (optional PySpark)
- **ABS**: Not yet implemented (still requires PySpark for profiling)
- **Unity Catalog**: Not affected by this change (uses separate profiling mechanisms)
- **GCS**: Does not support profiling
> **Note:** This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.
## PySpark Version
> **Current Version:** PySpark 3.5.x (3.5.6)
>
> PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.
## Installation Options
### Standard Installation (includes PySpark)
```bash
pip install 'acryl-datahub[s3]' # S3 with PySpark/profiling support
```
### Lightweight Installation (without PySpark)
For installations where you don't need profiling capabilities and want to save ~500MB:
```bash
pip install 'acryl-datahub[s3-slim]' # S3 without profiling (~500MB smaller)
```
**Recommendation:** Use `s3-slim` when profiling is not needed.
The `data-lake-profiling` dependencies (included in the standard `s3` extra by default) are:
- `pyspark~=3.5.6`
- `pydeequ>=1.1.0`
- Profiling dependencies (cachetools)
> **Note:** In a future major release (e.g., DataHub 2.0), the `s3-slim` variant may become the default, and PySpark will be truly optional. This current approach provides backward compatibility while giving users time to adapt.
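To picture how the split fits together, here is a simplified, purely illustrative sketch of how the two extras could be composed in a `setup.py`. The base dependency names below are hypothetical placeholders; only the profiling pins come from the list above:
```python
# setup.py (illustrative sketch only -- not DataHub's actual build configuration)
s3_base = {
    "boto3",        # hypothetical base dependency of the S3 source
    "smart-open",   # hypothetical base dependency
}

data_lake_profiling = {
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
    "cachetools",
}

extras_require = {
    # Slim variant: metadata extraction only, no PySpark/PyDeequ
    "s3-slim": sorted(s3_base),
    # Standard variant: everything in the slim variant plus the profiling stack
    "s3": sorted(s3_base | data_lake_profiling),
}
```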
### What's Included
**S3 source:**
Standard `s3` extra:
- ✅ Metadata extraction (schemas, tables, file listing)
- ✅ Data format detection (Parquet, Avro, CSV, JSON, etc.)
- ✅ Schema inference from files
- ✅ Table and column-level metadata
- ✅ Tags and properties extraction
- ✅ Data profiling (min/max, nulls, distinct counts)
- ✅ Data quality checks (PyDeequ-based)
- Includes: PySpark 3.5.6 + PyDeequ
`s3-slim` variant:
- ✅ All metadata features (same as above)
- ❌ Data profiling disabled
- No PySpark dependencies (~500MB smaller)
## Feature Comparison
| Feature | `s3-slim` | Standard `s3` |
| ------------------------ | ---------------- | -------------------------- |
| **Metadata extraction** | ✅ Full support | ✅ Full support |
| **Schema inference** | ✅ Full support | ✅ Full support |
| **Tags & properties** | ✅ Full support | ✅ Full support |
| **Data profiling** | ❌ Not available | ✅ Full profiling |
| **Installation size** | ~200MB | ~700MB |
| **Install time** | Fast | Slower (PySpark build) |
| **PySpark dependencies** | ❌ None | ✅ PySpark 3.5.6 + PyDeequ |
## Configuration
### With Standard Installation (PySpark included)
When you install `acryl-datahub[s3]`, profiling works out of the box:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true # Works seamlessly with standard installation
      profile_table_level_only: false
```
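If you drive ingestion from Python rather than the `datahub` CLI, the same recipe can be run through the programmatic `Pipeline` API. The sketch below assumes the standard `s3` extra is installed; the `console` sink is used purely for illustration:
```python
# run_s3_ingestion.py -- run the recipe above programmatically (sketch).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/**/*.parquet"}],
                # Profiling requires PySpark, i.e. the standard `s3` extra
                "profiling": {"enabled": True, "profile_table_level_only": False},
            },
        },
        # Sink choice is illustrative; point at your real DataHub sink in practice
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors
```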
### With Slim Installation (no PySpark)
When you install `s3-slim`, disable profiling in your config:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false # Required for s3-slim installation
```
**If you enable profiling with an `s3-slim` installation**, you'll see a clear error message at runtime:
```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```
## Developer Guide
### Implementation Pattern
The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.
**Architecture (currently implemented for S3 only):**
1. **Main source class** (`source.py`) - Contains no PySpark imports at module level
2. **Profiler class** (`profiling.py`) - Encapsulates all PySpark/PyDeequ logic in `SparkProfiler` class
3. **Conditional instantiation** - `SparkProfiler` created only when profiling is enabled
4. **TYPE_CHECKING imports** - Type annotations use TYPE_CHECKING block for optional dependencies
**Key Benefits:**
- ✅ Type safety preserved (mypy passes without issues)
- ✅ Proper code layer separation
- ✅ Works with both standard and `-slim` installations
- ✅ Clear error messages when dependencies missing
- ✅ Pattern can be reused for ABS and other sources
**Example structure:**
```python
# source.py
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Import only for type annotations; no PySpark needed at runtime
    from datahub.ingestion.source.s3.profiling import SparkProfiler


class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            # Deferred import: PySpark is only pulled in when profiling is enabled
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```
```python
# profiling.py
from typing import Any


class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        # Spark session initialization
        ...

    def read_file_spark(self, file: str, ext: str):
        # File reading with Spark
        ...

    def get_table_profile(self, table_data, dataset_urn):
        # Table profiling coordination
        ...
```
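The runtime error shown in the Configuration section comes from guarding the deferred import. One way such a guard could look is sketched below; `_require_pyspark` is a hypothetical helper, not necessarily DataHub's exact implementation:
```python
# profiling.py (sketch) -- fail fast with an actionable message when PySpark is absent.
def _require_pyspark() -> None:
    try:
        import pyspark  # noqa: F401
        import pydeequ  # noqa: F401
    except ImportError as e:
        raise RuntimeError(
            "PySpark is not installed, but is required for S3 profiling. "
            "Please install with: pip install 'acryl-datahub[s3]'"
        ) from e


class SparkProfiler:
    def __init__(self, config, report):
        # Check the optional dependency once, at construction time
        _require_pyspark()
        self.config = config
        self.report = report
```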
For more details, see the [Adding a Metadata Ingestion Source](../metadata-ingestion/adding-source.md#31-using-optional-dependencies-eg-pyspark) guide.
## Troubleshooting
### Error: "PySpark is not installed, but is required for profiling"
**Problem:** You installed a `-slim` variant but have profiling enabled in your config.
**Solutions:**
1. **Recommended:** Use standard installation with PySpark:
```bash
pip uninstall acryl-datahub
pip install 'acryl-datahub[s3]' # For S3 profiling
```
2. **Alternative:** Disable profiling in your recipe:
```yaml
profiling:
  enabled: false
```
### Verifying Installation
Check if PySpark is installed:
```bash
# Check installed packages
pip list | grep pyspark
# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```
Expected output:
- Standard installation (`s3`): Shows `pyspark 3.5.x`
- Slim installation (`s3-slim`): Import fails or package not found
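If you prefer a programmatic check (for example in a deployment health check), the standard-library `importlib.util.find_spec` reports whether PySpark is importable without actually loading it. A small standalone script you could adapt:
```python
# check_pyspark.py -- report whether this environment can support S3 profiling.
import importlib.util


def pyspark_available() -> bool:
    """Return True if PySpark can be imported in the current environment."""
    return importlib.util.find_spec("pyspark") is not None


if __name__ == "__main__":
    if pyspark_available():
        print("PySpark found: the standard 's3' extra (profiling) should work.")
    else:
        print("PySpark not found: this looks like an 's3-slim' install; keep profiling disabled.")
```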
## Migration Guide
### Upgrading from Previous Versions
**No action required!** This change is fully backward compatible:
```bash
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]' # Still includes PySpark by default (profiling supported)
```
**Recommended: Optimize installations**
- **S3 with profiling:** Keep using `acryl-datahub[s3]` (includes PySpark)
- **S3 without profiling:** Switch to `acryl-datahub[s3-slim]` to save ~500MB
```bash
# Recommended installations
pip install 'acryl-datahub[s3]' # S3 with profiling support
pip install 'acryl-datahub[s3-slim]' # S3 metadata only (no profiling)
```
### No Breaking Changes
This implementation maintains full backward compatibility:
- Standard `s3` extra includes PySpark (unchanged behavior)
- All existing recipes and configs continue to work
- New `s3-slim` variant available for users who want smaller installations
- A future major release (e.g., DataHub 2.0) may flip the default, with a migration path provided
## Benefits for DataHub Actions
[DataHub Actions](https://github.com/datahub-project/datahub/tree/master/datahub-actions) depends on `acryl-datahub` and can benefit from `s3-slim` when profiling is not needed:
### Reduced Installation Size
DataHub Actions typically doesn't need data lake profiling capabilities since it focuses on reacting to metadata events, not extracting metadata from data lakes. Use `s3-slim` to reduce footprint:
```bash
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than standard s3 extra
# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: Includes PySpark for profiling capabilities
```
### Faster Deployment
Actions services using `s3-slim` deploy faster in containerized environments:
- **Faster pip install**: No PySpark compilation required
- **Smaller Docker images**: Reduced base image size
- **Quicker cold starts**: Less code to load and initialize
### Fewer Dependency Conflicts
Actions workflows often integrate with other tools (Slack, Teams, email services). Using `s3-slim` reduces:
- Python version constraint conflicts
- Java/Spark runtime conflicts in restricted environments
- Transitive dependency version mismatches
### When Actions Needs Profiling
If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:
```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]' # Includes PySpark by default
```
**Common Actions use cases that DON'T need PySpark:**
- Slack notifications on schema changes
- Propagating tags and terms to downstream systems
- Triggering dbt runs on metadata updates
- Sending emails on data quality failures
- Creating Jira tickets for governance issues
- Updating external catalogs (e.g., Alation, Collibra)
**Rare Actions use cases that MIGHT need PySpark:**
- Custom actions that programmatically trigger S3 profiling
- Actions that directly process data lake files (not typical)
## Benefits Summary
✅ **Backward compatible**: Standard `s3` extra unchanged, existing users unaffected
✅ **Smaller installations**: Save ~500MB with `s3-slim`
✅ **Faster setup**: No PySpark compilation with `s3-slim`
✅ **Flexible deployment**: Choose based on profiling needs
✅ **Type safety maintained**: Refactored with proper code layer separation (mypy passes)
✅ **Clear error messages**: Runtime errors guide users to correct installation
✅ **Actions-friendly**: DataHub Actions benefits from reduced footprint with `s3-slim`
**Key Takeaways:**
- Use `s3` if you need S3 profiling, `s3-slim` if you don't
- Pattern can be applied to other sources (ABS, etc.) in future PRs
- Existing installations continue working without changes