# DataHub Library Examples

This directory contains examples demonstrating how to use the DataHub Python SDK and metadata emission APIs.

## Structure

Each example is a standalone Python script that demonstrates a specific use case:

- **Create examples**: Show how to create new metadata entities
- **Update examples**: Show how to modify existing metadata
- **Query examples**: Show how to read and query metadata
- **Delete examples**: Show how to remove metadata

## Writing Testable Examples

To ensure examples are maintainable and correct, follow this pattern when writing new examples:

### Pattern Overview

Examples should have two main components:

1. **Testable functions**: Pure functions that take dependencies as parameters and return values/metadata
2. **Main function**: Entry point that creates dependencies and calls the testable functions

### Example Structure

```python
from typing import Optional
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter


def create_entity_metadata(...) -> MetadataChangeProposalWrapper:
    """
    Create metadata for an entity.

    This function is pure and testable - it doesn't have side effects.

    Args:
        ... (all required parameters)

    Returns:
        MetadataChangeProposalWrapper containing the metadata
    """
    # Build and return the MCP
    return MetadataChangeProposalWrapper(...)


def main(emitter: Optional[DatahubRestEmitter] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        emitter: Optional emitter for testing. If not provided, creates a new one.
    """
    emitter = emitter or DatahubRestEmitter(gms_server="http://localhost:8080")

    # Use the testable function
    mcp = create_entity_metadata(...)

    # Emit the metadata
    emitter.emit(mcp)
    print("Successfully created entity")


if __name__ == "__main__":
    main()
```

### For SDK-based Examples

When using the DataHub SDK (`DataHubClient`):

```python
from typing import Optional
from datahub.sdk import DataHubClient


def perform_operation(client: DataHubClient, ...) -> ...:
    """
    Perform an operation using the DataHub client.

    Args:
        client: DataHub client to use
        ...: Other parameters

    Returns:
        Result of the operation
    """
    # Perform the operation
    return result


def main(client: Optional[DataHubClient] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        client: Optional client for testing. If not provided, creates one from env.
    """
    client = client or DataHubClient.from_env()

    result = perform_operation(client, ...)
    print(f"Operation result: {result}")


if __name__ == "__main__":
    main()
```

### Benefits of This Pattern

1. **Testability**: Core logic can be unit tested without needing a running DataHub instance
2. **Reusability**: The testable functions can be imported and used in other code
3. **Clarity**: Separates business logic from infrastructure setup
4. **Flexibility**: Examples can still be run standalone while being testable

### Running Examples

**As standalone scripts:**

```bash
python examples/library/notebook_create.py
```

**In tests:**

```python
from examples.library.notebook_create import create_notebook_metadata

# Unit test
mcp = create_notebook_metadata(...)
assert mcp.entityUrn == "..."

# Integration test
from examples.library.notebook_create import main
main(emitter=test_emitter)  # Inject test emitter
```

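
One possible way to supply the `test_emitter` used above is a `unittest.mock.Mock`. The sketch below is an illustration under stated assumptions: `build_metadata` and `main` are hypothetical stand-ins that mirror the pattern, and a plain dict replaces the real `MetadataChangeProposalWrapper`, so the snippet runs without the DataHub SDK or a server.

```python
from typing import Any, Dict, Optional
from unittest.mock import Mock


def build_metadata(name: str) -> Dict[str, Any]:
    """Hypothetical pure function; a dict stands in for the real MCP object."""
    return {"entity": name, "kind": "notebook"}


def main(emitter: Optional[Any] = None) -> None:
    """Stand-in entry point that emits metadata through the injected emitter."""
    if emitter is None:
        # A real example would construct its default emitter here.
        raise RuntimeError("this sketch has no default emitter; pass one in")
    emitter.emit(build_metadata("sales_report"))


# The Mock records calls instead of contacting a server.
test_emitter = Mock()
main(emitter=test_emitter)

# Check that main() emitted exactly one piece of metadata with the expected content.
test_emitter.emit.assert_called_once()
emitted = test_emitter.emit.call_args.args[0]
assert emitted["entity"] == "sales_report"
```

The stand-in compares against `None` explicitly rather than using `emitter or ...`, so an injected object is never replaced by the default just because it happens to be falsy.
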
## Testing

Examples are tested at two levels:

### Unit Tests

Located in `tests/unit/test_library_examples.py`:

- Test that examples compile and imports resolve
- Test that core functions produce valid metadata structures
- Use mocking to avoid needing a real DataHub instance
- Fast and run on every commit

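
The first check in that list, that examples compile, can be approximated with Python's built-in `compile()` function. The sketch below is an assumption about how such a check might look, not the contents of `test_library_examples.py`, and the helper name `find_syntax_errors` is hypothetical. It only checks syntax; confirming that imports resolve would require actually importing each module.

```python
import tempfile
from pathlib import Path
from typing import Dict


def find_syntax_errors(directory: Path) -> Dict[str, str]:
    """Compile every .py file in `directory`; return {file name: error message}."""
    errors: Dict[str, str] = {}
    for path in sorted(directory.glob("*.py")):
        try:
            # compile() parses the source without executing it.
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            errors[path.name] = str(exc.msg)
    return errors


# Demonstrate on a throwaway directory with one valid and one broken script.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "good_example.py").write_text("print('ok')\n", encoding="utf-8")
    (tmp_dir / "bad_example.py").write_text("def broken(:\n", encoding="utf-8")
    result = find_syntax_errors(tmp_dir)

# Only the broken script is reported.
assert list(result) == ["bad_example.py"]
```
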
### Integration Tests

Located in `tests/integration/library_examples/`:

- Test examples against a real DataHub instance
- Verify end-to-end functionality including reads after writes
- Test that metadata is correctly persisted and retrievable
- Slower, may run less frequently

### Running Tests

```bash
# Run all unit tests for the examples
pytest tests/unit/test_library_examples.py

# Run specific unit tests
pytest tests/unit/test_library_examples.py::test_create_notebook_metadata

# Run integration tests (requires running DataHub)
pytest tests/integration/library_examples/ -m integration

# Run all tests
pytest tests/unit/test_library_examples.py tests/integration/library_examples/
```

## Guidelines

1. **Keep examples simple**: Focus on demonstrating one concept clearly
2. **Use realistic data**: URNs, names, and values should look like real-world usage
3. **Add comments**: Explain non-obvious choices or important details
4. **Follow the pattern**: Use the testable function + main() pattern
5. **Document parameters**: Use clear docstrings with type hints
6. **Handle errors gracefully**: Show proper error handling where relevant
7. **Test your examples**: Add unit tests for new examples

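
For the "realistic data" guideline, it helps to know what a realistic URN looks like. DataHub dataset URNs follow the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. The sketch below builds one with plain string formatting so it runs without the SDK. The helper name `dataset_urn` and the sample values are illustrative only; in real examples the SDK's `make_dataset_urn` helper from `datahub.emitter.mce_builder` is probably the better choice.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string (illustrative helper, not part of the SDK)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# A realistic-looking value: a fully qualified table on a named platform.
urn = dataset_urn("snowflake", "analytics.public.orders")
print(urn)
# prints: urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)
```
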
## Example Categories

### Entity Creation

- `notebook_create.py` - Create a notebook entity
- `data_platform_create.py` - Create a custom data platform
- `glossary_term_create.py` - Create glossary terms

### Metadata Updates

- `dataset_add_term.py` - Add glossary terms to datasets
- `dataset_add_owner.py` - Add ownership information
- `notebook_add_tags.py` - Add tags to notebooks

### Querying Metadata

- `dataset_query_deprecation.py` - Check if a dataset is deprecated
- `search_with_query.py` - Search for entities
- `lineage_column_get.py` - Query column-level lineage

## Getting Help

- [DataHub Documentation](https://datahubproject.io/docs/)
- [Python SDK Reference](https://datahubproject.io/docs/python-sdk/)
- [Metadata Model](https://datahubproject.io/docs/metadata-model/)
- [GitHub Issues](https://github.com/datahub-project/datahub/issues)