datahub/docs/how/elasticsearch-search-client-shim.md
RyanHolstien 3eac7fab5b
feat(search): implement multi-client search engine shim for ES8 support (#14904)
Co-authored-by: Esteban Gutierrez <esteban.gutierrez@acryl.io>
2025-10-06 18:40:00 +00:00

350 lines
8.7 KiB
Markdown

# Elasticsearch & OpenSearch Multi-Client Shim
This guide explains how to use DataHub's multi-client search engine shim to support different versions of Elasticsearch and OpenSearch through a unified interface.
## Overview
DataHub's search client shim provides seamless support for:
- **Elasticsearch 7.17**
- **Elasticsearch 8.17+**
- **OpenSearch 2.x** with full REST high-level client support
This enables smooth migrations between different search engine versions while maintaining backward compatibility with existing DataHub deployments.
## Architecture
### Core Components
The shim consists of several key components:
1. **`SearchClientShim`** - Main abstraction interface
2. **`SearchClientShimFactory`** - Factory for creating appropriate client implementations
3. **Implementation Classes** - Concrete implementations for each search engine:
- `Es7CompatibilitySearchClientShim` - ES 7.17
- `Es8SearchClientShim` - ES 8.17+
- `OpenSearch2SearchClientShim` - OpenSearch 2.x
### Supported Configurations
| Source Engine | Target Engine | Shim Implementation | Status |
| ------------------------ | -------------- | ---------------------------------- | ----------- |
| DataHub → ES 7.17 | ES 7.17 | `Es7CompatibilitySearchClientShim` | ✅ Complete |
| DataHub → ES 8.17+ | ES 8.17+ | `Es8SearchClientShim` | ✅ Complete |
| DataHub → OpenSearch 2.x | OpenSearch 2.x | `OpenSearch2SearchClientShim` | ✅ Complete |
## Configuration
### Environment Variables
Configure the shim using these environment variables:
```bash
# Enable the search client shim (required)
ELASTICSEARCH_SHIM_ENABLED=true
# Specify engine type (or use AUTO_DETECT)
ELASTICSEARCH_SHIM_ENGINE_TYPE=AUTO_DETECT
# Options: AUTO_DETECT, ELASTICSEARCH_7, ELASTICSEARCH_8, OPENSEARCH_2
# Enable auto-detection (recommended)
ELASTICSEARCH_SHIM_AUTO_DETECT=true
```
### application.yaml Configuration
Alternatively, configure via application.yaml:
```yaml
elasticsearch:
host: localhost
port: 9200
username: ${ELASTICSEARCH_USERNAME:#{null}}
password: ${ELASTICSEARCH_PASSWORD:#{null}}
useSSL: false
# Standard Elasticsearch configuration...
# Multi-client shim configuration
shim:
enabled: true # Enable shim
engineType: AUTO_DETECT # or specific type
autoDetectEngine: true # Auto-detect cluster type
```
## Migration Scenarios
### Scenario 1: Elasticsearch 7.17 → Elasticsearch 8.x
This is the most common migration path.
**Step 1: Enable the shim**
```bash
ELASTICSEARCH_SHIM_ENABLED=true
ELASTICSEARCH_SHIM_ENGINE_TYPE=ELASTICSEARCH_8
```
**Step 2: Verify connection**
```bash
# Check logs for successful connection
```
### Scenario 2: Elasticsearch 7.17 → OpenSearch 2.x
Direct migration from Elasticsearch to OpenSearch 2.x.
**Configuration:**
```bash
ELASTICSEARCH_SHIM_ENABLED=true
ELASTICSEARCH_SHIM_ENGINE_TYPE=OPENSEARCH_2
ELASTICSEARCH_SHIM_AUTO_DETECT=true
```
### Scenario 3: Auto-Detection (Recommended)
Let DataHub automatically detect your search engine type:
```bash
ELASTICSEARCH_SHIM_ENABLED=true
ELASTICSEARCH_SHIM_ENGINE_TYPE=AUTO_DETECT
ELASTICSEARCH_SHIM_AUTO_DETECT=true
```
The shim will:
1. Connect to your search cluster
2. Identify the engine type and version
3. Select the appropriate client implementation
## Deployment Guide
### Docker Compose
Update your `docker-compose.yml`:
```yaml
services:
datahub-gms:
environment:
- ELASTICSEARCH_SHIM_ENABLED=true
- ELASTICSEARCH_SHIM_ENGINE_TYPE=AUTO_DETECT
# ... other ES config
```
### Kubernetes
Update your deployment manifests:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: datahub-gms
spec:
template:
spec:
containers:
- name: datahub-gms
env:
- name: ELASTICSEARCH_SHIM_ENABLED
value: "true"
- name: ELASTICSEARCH_SHIM_ENGINE_TYPE
value: "AUTO_DETECT"
# ... other configuration
```
### Helm
Update your `values.yaml`:
```yaml
global:
elasticsearch:
shim:
enabled: true
engineType: "AUTO_DETECT"
autoDetectEngine: true
```
## Validation and Testing
### Verify Shim Configuration
1. **Check logs** for shim initialization:
```bash
docker logs datahub-gms | grep -i "shim\|search"
```
Look for messages like:
```
INFO Creating SearchClientShim for engine type: ELASTICSEARCH_7
INFO Auto-detected search engine type: ELASTICSEARCH_7
```
1. **Test search functionality** in DataHub UI:
- Search for datasets
- Browse data assets
- Check that lineage is working
2. **Monitor performance** during transition:
- Watch for connection errors
- Check response times
- Monitor resource usage
### Common Validation Steps
```bash
# 1. Check DataHub health endpoint
curl http://localhost:8080/health
# 2. Verify search index access
curl -u user:pass "http://elasticsearch:9200/_cat/indices?v"
# 3. Test search functionality
curl -X POST "http://localhost:8080/api/graphql" \
-H "Content-Type: application/json" \
-d '{"query": "{ search(input: {type: DATASET, query: \"*\"}) { total }}"}'
```
## Troubleshooting
### Common Issues
#### 1. Connection Failures
```
ERROR: Unable to connect to search cluster
```
**Solutions:**
- Verify `ELASTICSEARCH_HOST` and `ELASTICSEARCH_PORT`
- Check network connectivity between DataHub and search cluster
- Ensure credentials are correct
- Verify SSL/TLS configuration (ES8 Containers use SSL by default so if you previously weren't this may cause issues)
#### 2. Auto-Detection Failures
```
ERROR: Unable to detect search engine type
```
**Solutions:**
- Manually specify engine type: `ELASTICSEARCH_SHIM_ENGINE_TYPE=ELASTICSEARCH_8`
- Check cluster health: `curl http://elasticsearch:9200/_cluster/health`
- Verify authentication credentials
#### 3. API Compatibility Issues
```
ERROR: Incompatible API version
```
**Solutions:**
- Check Elasticsearch version compatibility
- Review deprecation warnings in ES logs
#### 4. Dependency Issues
```
ERROR: ClassNotFoundException for ES client
```
**Solutions:**
- Ensure correct client dependencies are included in classpath
- Check `build.gradle` for required dependencies
- Rebuild DataHub with appropriate client libraries
### Debug Mode
Enable debug logging to troubleshoot issues:
```bash
# Add to environment
DATAHUB_LOG_LEVEL=DEBUG
ELASTICSEARCH_SHIM_DEBUG=true
```
### Performance Monitoring
Monitor key metrics during migration:
```bash
# Connection pool metrics
curl "http://localhost:8080/actuator/metrics/elasticsearch.connections"
# Search operation metrics
curl "http://localhost:8080/actuator/metrics/elasticsearch.search"
# Error rates
curl "http://localhost:8080/actuator/metrics/elasticsearch.errors"
```
## Best Practices
### Pre-Migration
1. **Backup your data** before changing search engine configuration
2. **Test in staging** with representative data volumes
3. **Monitor resource usage** patterns in current deployment
4. **Document current configuration** for rollback scenarios
### During Migration
1. **Enable auto-detection initially** for smooth transition
2. **Monitor logs closely** for connection and performance issues
3. **Test all search functionality** after configuration changes
### Post-Migration
1. **Update documentation** with new configuration
2. **Monitor performance metrics** for several days
3. **Plan for future upgrades** (ES 8.x native support)
4. **Train team members** on new configuration options
## Future Enhancements
### Planned Features
1. **OpenSearch 3.x support** when available
2. **Enhanced AWS IAM authentication** for all client types
3. **Advanced feature detection** and capability querying
### Contributing
To extend the shim for additional search engines:
1. **Implement `SearchClientShim`** interface
2. **Add engine type** to `SearchEngineType` enum
3. **Update factory logic** in `SearchClientShimFactory`
4. **Add configuration options** to application.yaml
5. **Write tests** and documentation
## Support Matrix
| DataHub Version | ES 7.17 | ES 8.x | OpenSearch 2.x |
| --------------- | ------- | -------- | -------------- |
| 0.3.15+ | ✅ Full | ✅ 8.17+ | ✅ Full |
| Future | ✅ Full | ✅ Full | ✅ Full |
## FAQ
### Q: Can I use the shim with existing deployments?
**A:** Yes, the shim is backward compatible. It is a thin abstraction layer over the existing code
### Q: Can I use multiple search engines simultaneously?
**A:** No, DataHub connects to one search cluster at a time. Use the shim to switch between different engine types.
For additional support, please refer to the [DataHub community forums](https://datahubproject.io/docs/community) or file an issue in the [GitHub repository](https://github.com/datahubproject/datahub).