# Optional PySpark Support for S3 Source

DataHub's S3 source now supports optional PySpark installation through the `s3-slim` variant. This allows users to choose a lightweight installation when data lake profiling is not needed.

## Overview

The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the `s3-slim` variant provides a ~500MB smaller installation.

**Current implementation status:**

- ✅ **S3**: SparkProfiler pattern fully implemented (optional PySpark)
- **ABS**: Not yet implemented (still requires PySpark for profiling)
- **Unity Catalog**: Not affected by this change (uses separate profiling mechanisms)
- **GCS**: Does not support profiling

> **Note:** This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.

## PySpark Version

> **Current Version:** PySpark 3.5.x (3.5.6)
>
> PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.
## Installation Options

### Standard Installation (includes PySpark)

```bash
pip install 'acryl-datahub[s3]'  # S3 with PySpark/profiling support
```

### Lightweight Installation (without PySpark)

For installations where you don't need profiling capabilities and want to save ~500MB:

```bash
pip install 'acryl-datahub[s3-slim]'  # S3 without profiling (~500MB smaller)
```

**Recommendation:** Use `s3-slim` when profiling is not needed.

The `data-lake-profiling` dependencies (included in the standard `s3` extra by default):

- `pyspark~=3.5.6`
- `pydeequ>=1.1.0`
- Profiling dependencies (`cachetools`)

> **Note:** In a future major release (e.g., DataHub 2.0), the `s3-slim` variant may become the default and PySpark will be truly optional. The current approach preserves backward compatibility while giving users time to adapt.
### What's Included

**S3 source:**

Standard `s3` extra:

- ✅ Metadata extraction (schemas, tables, file listing)
- ✅ Data format detection (Parquet, Avro, CSV, JSON, etc.)
- ✅ Schema inference from files
- ✅ Table and column-level metadata
- ✅ Tags and properties extraction
- ✅ Data profiling (min/max, nulls, distinct counts)
- ✅ Data quality checks (PyDeequ-based)
- Includes: PySpark 3.5.6 + PyDeequ

`s3-slim` variant:

- ✅ All metadata features (same as above)
- ❌ Data profiling disabled
- No PySpark dependencies (~500MB smaller)
## Feature Comparison

| Feature                  | `s3-slim`        | Standard `s3`              |
| ------------------------ | ---------------- | -------------------------- |
| **Metadata extraction**  | ✅ Full support  | ✅ Full support            |
| **Schema inference**     | ✅ Full support  | ✅ Full support            |
| **Tags & properties**    | ✅ Full support  | ✅ Full support            |
| **Data profiling**       | ❌ Not available | ✅ Full profiling          |
| **Installation size**    | ~200MB           | ~700MB                     |
| **Install time**         | Fast             | Slower (PySpark build)     |
| **PySpark dependencies** | ❌ None          | ✅ PySpark 3.5.6 + PyDeequ |
## Configuration

### With Standard Installation (PySpark included)

When you install `acryl-datahub[s3]`, profiling works out of the box:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true # Works seamlessly with standard installation
      profile_table_level_only: false
```
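
If you prefer to run the recipe programmatically rather than through the CLI (`datahub ingest -c <recipe>.yaml`), the same configuration can be handed to DataHub's `Pipeline` API. This is a minimal sketch; the `datahub-rest` sink and `http://localhost:8080` server are placeholder assumptions for your environment:

```python
# Programmatic equivalent of the recipe above (sketch; sink values are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/**/*.parquet"}],
                "profiling": {"enabled": True, "profile_table_level_only": False},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```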
### With Slim Installation (no PySpark)

When you install `s3-slim`, disable profiling in your config:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false # Required for s3-slim installation
```

**If you enable profiling with an s3-slim installation**, you'll see a clear error message at runtime:

```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```
## Developer Guide

### Implementation Pattern

The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.

**Architecture (currently implemented for S3 only):**

1. **Main source class** (`source.py`) - Contains no PySpark imports at module level
2. **Profiler class** (`profiling.py`) - Encapsulates all PySpark/PyDeequ logic in the `SparkProfiler` class
3. **Conditional instantiation** - `SparkProfiler` is created only when profiling is enabled
4. **TYPE_CHECKING imports** - Type annotations use a `TYPE_CHECKING` block for optional dependencies

**Key Benefits:**

- ✅ Type safety preserved (mypy passes without issues)
- ✅ Proper code layer separation
- ✅ Works with both standard and `-slim` installations
- ✅ Clear error messages when dependencies are missing
- ✅ Pattern can be reused for ABS and other sources

**Example structure:**

```python
# source.py
if TYPE_CHECKING:
    from datahub.ingestion.source.s3.profiling import SparkProfiler


class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```
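
Call sites then guard on the optional profiler. The snippet below is a self-contained illustration of that guard; `FakeProfiler` and `emit_profile` are hypothetical names used for this sketch, not DataHub's actual API:

```python
# Illustration only: profiling work is attempted only when the profiler exists,
# i.e. PySpark is installed and profiling is enabled in the recipe.
from typing import Optional


class FakeProfiler:
    def get_table_profile(self, table: str) -> str:
        return f"profile of {table}"


class Source:
    def __init__(self, profiling_enabled: bool) -> None:
        self.profiler: Optional[FakeProfiler] = (
            FakeProfiler() if profiling_enabled else None
        )

    def emit_profile(self, table: str) -> Optional[str]:
        if self.profiler is None:  # s3-slim install or profiling disabled
            return None
        return self.profiler.get_table_profile(table)


print(Source(profiling_enabled=False).emit_profile("orders"))  # None
print(Source(profiling_enabled=True).emit_profile("orders"))   # profile of orders
```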
```python
# profiling.py
from typing import Any


class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        # Spark session initialization
        ...

    def read_file_spark(self, file: str, ext: str):
        # File reading with Spark
        ...

    def get_table_profile(self, table_data, dataset_urn):
        # Table profiling coordination
        ...
```
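
The runtime error shown in the Configuration section can come from a lazy import guard inside the profiler. The snippet below sketches that idea; it is not necessarily the exact DataHub implementation:

```python
# Sketch: import PySpark/PyDeequ lazily so that importing the source module never
# requires them; only constructing the profiler (profiling enabled) does.
class SparkProfiler:
    def __init__(self) -> None:
        try:
            import pydeequ  # noqa: F401
            import pyspark  # noqa: F401
        except ImportError as e:
            raise RuntimeError(
                "PySpark is not installed, but is required for S3 profiling. "
                "Please install with: pip install 'acryl-datahub[s3]'"
            ) from e
```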
For more details, see the [Adding a Metadata Ingestion Source](../metadata-ingestion/adding-source.md#31-using-optional-dependencies-eg-pyspark) guide.

## Troubleshooting

### Error: "PySpark is not installed, but is required for profiling"

**Problem:** You installed a `-slim` variant but have profiling enabled in your config.

**Solutions:**

1. **Recommended:** Use the standard installation with PySpark:

   ```bash
   pip uninstall acryl-datahub
   pip install 'acryl-datahub[s3]'  # For S3 profiling
   ```

2. **Alternative:** Disable profiling in your recipe:

   ```yaml
   profiling:
     enabled: false
   ```
### Verifying Installation

Check if PySpark is installed:

```bash
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```

Expected output:

- Standard installation (`s3`): Shows `pyspark 3.5.x`
- Slim installation (`s3-slim`): Import fails or package not found
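
You can also check programmatically from Python; this is a minimal, standard-library-only sketch:

```python
# Detect whether the PySpark-based profiling path is available in this environment.
import importlib.util

if importlib.util.find_spec("pyspark") is None:
    print("PySpark not installed (s3-slim behaviour): profiling unavailable")
else:
    import pyspark

    print(f"PySpark {pyspark.__version__} installed: profiling available")
```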
## Migration Guide

### Upgrading from Previous Versions

**No action required!** This change is fully backward compatible:

```bash
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]'  # Still includes PySpark by default (profiling supported)
```

**Recommended: Optimize installations**

- **S3 with profiling:** Keep using `acryl-datahub[s3]` (includes PySpark)
- **S3 without profiling:** Switch to `acryl-datahub[s3-slim]` to save ~500MB

```bash
# Recommended installations
pip install 'acryl-datahub[s3]'       # S3 with profiling support
pip install 'acryl-datahub[s3-slim]'  # S3 metadata only (no profiling)
```

### No Breaking Changes

This implementation maintains full backward compatibility:

- The standard `s3` extra includes PySpark (unchanged behavior)
- All existing recipes and configs continue to work
- The new `s3-slim` variant is available for users who want smaller installations
- A future DataHub 2.0 may flip the defaults, but a migration path will be provided
## Benefits for DataHub Actions

[DataHub Actions](https://github.com/datahub-project/datahub/tree/master/datahub-actions) depends on `acryl-datahub` and can benefit from `s3-slim` when profiling is not needed.

### Reduced Installation Size

DataHub Actions typically doesn't need data lake profiling capabilities, since it focuses on reacting to metadata events rather than extracting metadata from data lakes. Use `s3-slim` to reduce its footprint:

```bash
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than the standard s3 extra

# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: includes PySpark for profiling capabilities
```

### Faster Deployment

Actions services using `s3-slim` deploy faster in containerized environments:

- **Faster pip install**: No PySpark compilation required
- **Smaller Docker images**: Reduced base image size
- **Quicker cold starts**: Less code to load and initialize

### Fewer Dependency Conflicts

Actions workflows often integrate with other tools (Slack, Teams, email services). Using `s3-slim` reduces:

- Python version constraint conflicts
- Java/Spark runtime conflicts in restricted environments
- Transitive dependency version mismatches

### When Actions Needs Profiling

If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:

```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]'  # Includes PySpark by default
```

**Common Actions use cases that DON'T need PySpark:**

- Slack notifications on schema changes
- Propagating tags and terms to downstream systems
- Triggering dbt runs on metadata updates
- Sending emails on data quality failures
- Creating Jira tickets for governance issues
- Updating external catalogs (e.g., Alation, Collibra)

**Rare Actions use cases that MIGHT need PySpark:**

- Custom actions that programmatically trigger S3 profiling
- Actions that directly process data lake files (not typical)
## Benefits Summary

- ✅ **Backward compatible**: Standard `s3` extra unchanged, existing users unaffected
- ✅ **Smaller installations**: Save ~500MB with `s3-slim`
- ✅ **Faster setup**: No PySpark compilation with `s3-slim`
- ✅ **Flexible deployment**: Choose based on profiling needs
- ✅ **Type safety maintained**: Refactored with proper code layer separation (mypy passes)
- ✅ **Clear error messages**: Runtime errors guide users to the correct installation
- ✅ **Actions-friendly**: DataHub Actions benefits from the reduced footprint of `s3-slim`

**Key Takeaways:**

- Use `s3` if you need S3 profiling, `s3-slim` if you don't
- The pattern can be applied to other sources (ABS, etc.) in future PRs
- Existing installations continue working without changes