# Optional PySpark Support for S3 Source

DataHub's S3 source now supports optional PySpark installation through the `s3-slim` variant. This allows users to choose a lightweight installation when data lake profiling is not needed.

## Overview

The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the `s3-slim` variant provides a ~500MB smaller installation.

**Current implementation status:**

- ✅ **S3**: SparkProfiler pattern fully implemented (optional PySpark)
- **ABS**: Not yet implemented (still requires PySpark for profiling)
- **Unity Catalog**: Not affected by this change (uses separate profiling mechanisms)
- **GCS**: Does not support profiling

> **Note:** This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.

## PySpark Version

> **Current Version:** PySpark 3.5.x (3.5.6)
>
> PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.

## Installation Options

### Standard Installation (includes PySpark)

```bash
pip install 'acryl-datahub[s3]'  # S3 with PySpark/profiling support
```

### Lightweight Installation (without PySpark)

For installations where you don't need profiling capabilities and want to save ~500MB:

```bash
pip install 'acryl-datahub[s3-slim]'  # S3 without profiling (~500MB smaller)
```

**Recommendation:** Use `s3-slim` when profiling is not needed.

The `data-lake-profiling` dependencies (included in standard `s3` by default):

- `pyspark~=3.5.6`
- `pydeequ>=1.1.0`
- Profiling dependencies (cachetools)

> **Note:** In a future major release (e.g., DataHub 2.0), the `s3-slim` variant may become the default, and PySpark will be truly optional. The current approach provides backward compatibility while giving users time to adapt.

### What's Included

**S3 source:**

Standard `s3` extra:

- ✅ Metadata extraction (schemas, tables, file listing)
- ✅ Schema inference from files
- ✅ Table and column-level metadata
- ✅ Tags and properties extraction
- ✅ Data profiling (min/max, nulls, distinct counts)
- ✅ Data quality checks (PyDeequ-based)
- Includes: PySpark 3.5.6 + PyDeequ

`s3-slim` variant:

- ✅ All metadata features (same as above)
- ❌ Data profiling disabled
- No PySpark dependencies (~500MB smaller)
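If a deployment script needs to decide at runtime whether profiling can be enabled, a lightweight import check is enough. The sketch below is illustrative; the `profiling_available` helper is hypothetical and not part of the DataHub API:

```python
import importlib.util


def profiling_available() -> bool:
    """Return True when the PySpark/PyDeequ stack from the standard `s3` extra is importable."""
    return (
        importlib.util.find_spec("pyspark") is not None
        and importlib.util.find_spec("pydeequ") is not None
    )


# Only enable profiling when the standard (non-slim) installation is present.
profiling_config = {"enabled": profiling_available()}
print(profiling_config)
```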
## Feature Comparison

| Feature                  | `s3-slim`        | Standard `s3`              |
| ------------------------ | ---------------- | -------------------------- |
| **Metadata extraction**  | ✅ Full support  | ✅ Full support            |
| **Schema inference**     | ✅ Full support  | ✅ Full support            |
| **Tags & properties**    | ✅ Full support  | ✅ Full support            |
| **Data profiling**       | ❌ Not available | ✅ Full profiling          |
| **Installation size**    | ~200MB           | ~700MB                     |
| **Install time**         | Fast             | Slower (PySpark build)     |
| **PySpark dependencies** | ❌ None          | ✅ PySpark 3.5.6 + PyDeequ |

## Configuration

### With Standard Installation (PySpark included)

When you install `acryl-datahub[s3]`, profiling works out of the box:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true # Works seamlessly with standard installation
      profile_table_level_only: false
```

### With Slim Installation (no PySpark)

When you install `s3-slim`, disable profiling in your config:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false # Required for s3-slim installation
```

**If you enable profiling with an s3-slim installation**, you'll see a clear error message at runtime:

```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```

## Developer Guide

### Implementation Pattern

The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.

**Architecture (currently implemented for S3 only):**

1. **Main source class** (`source.py`) - Contains no PySpark imports at module level
2. **Profiler class** (`profiling.py`) - Encapsulates all PySpark/PyDeequ logic in the `SparkProfiler` class
3. **Conditional instantiation** - `SparkProfiler` is created only when profiling is enabled
4. **TYPE_CHECKING imports** - Type annotations use a TYPE_CHECKING block for optional dependencies

**Key Benefits:**

- ✅ Type safety preserved (mypy passes without issues)
- ✅ Proper code layer separation
- ✅ Works with both standard and `-slim` installations
- ✅ Clear error messages when dependencies are missing
- ✅ Pattern can be reused for ABS and other sources

**Example structure:**

```python
# source.py
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    from datahub.ingestion.source.s3.profiling import SparkProfiler


class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```

```python
# profiling.py
from typing import Any


class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        ...  # Spark session initialization

    def read_file_spark(self, file: str, ext: str):
        ...  # File reading with Spark

    def get_table_profile(self, table_data, dataset_urn):
        ...  # Table profiling coordination
```

For more details, see the [Adding a Metadata Ingestion Source](../metadata-ingestion/adding-source.md#31-using-optional-dependencies-eg-pyspark) guide.
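One way to surface the runtime error shown above is to guard the PySpark imports at the top of the profiler module, so the failure only happens when profiling is actually requested. The following is a minimal, illustrative sketch of such a guard, not necessarily the exact code in `profiling.py`:

```python
# profiling.py -- illustrative import guard (hypothetical; upstream code may differ)
try:
    import pydeequ  # noqa: F401
    from pyspark.sql import SparkSession  # noqa: F401
except ImportError as e:
    # Only reached when this module is imported, i.e. when profiling is enabled.
    raise RuntimeError(
        "PySpark is not installed, but is required for S3 profiling. "
        "Please install with: pip install 'acryl-datahub[s3]'"
    ) from e
```

Because the main source imports `profiling.py` only when `profiling.enabled` is true, slim installations never hit this import path.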
## Troubleshooting

### Error: "PySpark is not installed, but is required for profiling"

**Problem:** You installed a `-slim` variant but have profiling enabled in your config.

**Solutions:**

1. **Recommended:** Use the standard installation with PySpark:

   ```bash
   pip uninstall acryl-datahub
   pip install 'acryl-datahub[s3]'  # For S3 profiling
   ```

2. **Alternative:** Disable profiling in your recipe:

   ```yaml
   profiling:
     enabled: false
   ```

### Verifying Installation

Check if PySpark is installed:

```bash
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```

Expected output:

- Standard installation (`s3`): Shows `pyspark 3.5.x`
- Slim installation (`s3-slim`): Import fails or package not found
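Beyond checking installed packages, you can confirm that a slim environment runs metadata-only ingestion end to end. Below is a minimal sketch using DataHub's Python `Pipeline` API; the bucket path is a placeholder, and profiling stays disabled so the same recipe works with both `s3` and `s3-slim`:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Metadata-only S3 recipe; no PySpark is needed because profiling is disabled.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/**/*.parquet"}],
                "profiling": {"enabled": False},
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()
```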
## Migration Guide

### Upgrading from Previous Versions

**No action required!** This change is fully backward compatible:

```bash
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]'  # Still includes PySpark by default (profiling supported)
```

**Recommended: Optimize installations**

- **S3 with profiling:** Keep using `acryl-datahub[s3]` (includes PySpark)
- **S3 without profiling:** Switch to `acryl-datahub[s3-slim]` to save ~500MB

```bash
# Recommended installations
pip install 'acryl-datahub[s3]'       # S3 with profiling support
pip install 'acryl-datahub[s3-slim]'  # S3 metadata only (no profiling)
```

### No Breaking Changes

This implementation maintains full backward compatibility:

- Standard `s3` extra includes PySpark (unchanged behavior)
- All existing recipes and configs continue to work
- New `s3-slim` variant available for users who want smaller installations
- A future DataHub 2.0 may flip the defaults, but a migration path is provided

## Benefits for DataHub Actions

[DataHub Actions](https://github.com/datahub-project/datahub/tree/master/datahub-actions) depends on `acryl-datahub` and can benefit from `s3-slim` when profiling is not needed:

### Reduced Installation Size

DataHub Actions typically doesn't need data lake profiling capabilities since it focuses on reacting to metadata events, not extracting metadata from data lakes. Use `s3-slim` to reduce footprint:

```bash
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than standard s3 extra

# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: Includes PySpark for profiling capabilities
```

### Faster Deployment

Actions services using `s3-slim` deploy faster in containerized environments:

- **Faster pip install**: No PySpark compilation required
- **Smaller Docker images**: Reduced base image size
- **Quicker cold starts**: Less code to load and initialize

### Fewer Dependency Conflicts

Actions workflows often integrate with other tools (Slack, Teams, email services). Using `s3-slim` reduces:

- Python version constraint conflicts
- Java/Spark runtime conflicts in restricted environments
- Transitive dependency version mismatches

### When Actions Needs Profiling

If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:

```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]'  # Includes PySpark by default
```

**Common Actions use cases that DON'T need PySpark:**

- Slack notifications on schema changes
- Propagating tags and terms to downstream systems
- Triggering dbt runs on metadata updates
- Sending emails on data quality failures
- Creating Jira tickets for governance issues
- Updating external catalogs (e.g., Alation, Collibra)

**Rare Actions use cases that MIGHT need PySpark:**

- Custom actions that programmatically trigger S3 profiling
- Actions that directly process data lake files (not typical)

## Benefits Summary

- ✅ **Backward compatible**: Standard `s3` extra unchanged, existing users unaffected
- ✅ **Smaller installations**: Save ~500MB with `s3-slim`
- ✅ **Faster setup**: No PySpark compilation with `s3-slim`
- ✅ **Flexible deployment**: Choose based on profiling needs
- ✅ **Type safety maintained**: Refactored with proper code layer separation (mypy passes)
- ✅ **Clear error messages**: Runtime errors guide users to the correct installation
- ✅ **Actions-friendly**: DataHub Actions benefits from the reduced footprint of `s3-slim`

**Key Takeaways:**

- Use `s3` if you need S3 profiling, `s3-slim` if you don't
- The same pattern can be applied to other sources (ABS, etc.) in future PRs
- Existing installations continue working without changes