DataHub Library Examples
This directory contains examples demonstrating how to use the DataHub Python SDK and metadata emission APIs.
Structure
Each example is a standalone Python script that demonstrates a specific use case:
- Create examples: Show how to create new metadata entities
- Update examples: Show how to modify existing metadata
- Query examples: Show how to read and query metadata
- Delete examples: Show how to remove metadata
Writing Testable Examples
To ensure examples are maintainable and correct, follow this pattern when writing new examples:
Pattern Overview
Examples should have two main components:
- Testable functions: Pure functions that take dependencies as parameters and return values/metadata
- Main function: Entry point that creates dependencies and calls the testable functions
Example Structure
from typing import Optional

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter


def create_entity_metadata(...) -> MetadataChangeProposalWrapper:
    """
    Create metadata for an entity.

    This function is pure and testable - it doesn't have side effects.

    Args:
        ... (all required parameters)

    Returns:
        MetadataChangeProposalWrapper containing the metadata
    """
    # Build and return the MCP
    return MetadataChangeProposalWrapper(...)


def main(emitter: Optional[DatahubRestEmitter] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        emitter: Optional emitter for testing. If not provided, creates a new one.
    """
    emitter = emitter or DatahubRestEmitter(gms_server="http://localhost:8080")

    # Use the testable function
    mcp = create_entity_metadata(...)

    # Emit the metadata
    emitter.emit(mcp)
    print("Successfully created entity")


if __name__ == "__main__":
    main()
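As a concrete illustration of the template above, here is a minimal sketch that runs without DataHub installed. `Proposal` and `RecordingEmitter` are hypothetical stand-ins for `MetadataChangeProposalWrapper` and `DatahubRestEmitter`, and `create_dataset_description` is an illustrative name; a real example would import the DataHub classes instead.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


@dataclass
class Proposal:
    """Hypothetical stand-in for MetadataChangeProposalWrapper."""

    entityUrn: str
    aspect: Dict[str, Any]


class Emitter(Protocol):
    """Anything with an emit() method; stands in for DatahubRestEmitter."""

    def emit(self, item: Proposal) -> None: ...


@dataclass
class RecordingEmitter:
    """Test double that records proposals instead of sending them."""

    emitted: List[Proposal] = field(default_factory=list)

    def emit(self, item: Proposal) -> None:
        self.emitted.append(item)


def create_dataset_description(platform: str, name: str, description: str) -> Proposal:
    """Pure function: builds the proposal and performs no I/O."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)"
    return Proposal(entityUrn=urn, aspect={"description": description})


def main(emitter: Optional[Emitter] = None) -> None:
    # A real example would default to DatahubRestEmitter here.
    emitter = emitter or RecordingEmitter()
    mcp = create_dataset_description(
        "snowflake", "analytics.public.orders", "Customer orders"
    )
    emitter.emit(mcp)
    print(f"Emitted metadata for {mcp.entityUrn}")


if __name__ == "__main__":
    main()
```

A test can call `create_dataset_description` directly to check the URN, or pass its own `RecordingEmitter` to `main` and inspect `emitted` afterwards.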
For SDK-based Examples
When using the DataHub SDK (DataHubClient):
from typing import Optional

from datahub.sdk import DataHubClient


def perform_operation(client: DataHubClient, ...) -> ...:
    """
    Perform an operation using the DataHub client.

    Args:
        client: DataHub client to use
        ...: Other parameters

    Returns:
        Result of the operation
    """
    # Perform the operation
    return result


def main(client: Optional[DataHubClient] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        client: Optional client for testing. If not provided, creates one from env.
    """
    client = client or DataHubClient.from_env()
    result = perform_operation(client, ...)
    print(f"Operation result: {result}")


if __name__ == "__main__":
    main()
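Because the client is passed in as a parameter, a unit test can supply a `unittest.mock.MagicMock` instead of a client connected to a server. The sketch below is illustrative only: `get_dataset_description` is a hypothetical function, and the `client.entities.get(urn)` call and the `description` attribute are assumptions about the SDK that should be checked against the installed version.

```python
from typing import Any, Optional
from unittest.mock import MagicMock


def get_dataset_description(client: Any, dataset_urn: str) -> Optional[str]:
    """Return the description of a dataset, or None if it has none.

    `client` is typed as Any so the sketch runs without DataHub installed;
    a real example would annotate it as DataHubClient.
    """
    # Assumption: the client exposes entities.get(urn) and the returned
    # object has a `description` attribute.
    dataset = client.entities.get(dataset_urn)
    return dataset.description or None


# Unit test: inject a mock instead of a client connected to a server.
fake_client = MagicMock()
fake_client.entities.get.return_value.description = "Customer orders"

urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"
result = get_dataset_description(fake_client, urn)

# The mock records the call, so the test can verify how the client was used.
fake_client.entities.get.assert_called_once_with(urn)
```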
Benefits of This Pattern
- Testability: Core logic can be unit tested without needing a running DataHub instance
- Reusability: The testable functions can be imported and used in other code
- Clarity: Separates business logic from infrastructure setup
- Flexibility: Examples can still be run standalone while being testable
Running Examples
As standalone scripts:
python examples/library/notebook_create.py
In tests:
from examples.library.notebook_create import create_notebook_metadata

# Unit test
mcp = create_notebook_metadata(...)
assert mcp.entityUrn == "..."

# Integration test
from examples.library.notebook_create import main

main(emitter=test_emitter)  # Inject test emitter
Testing
Examples are tested at two levels:
Unit Tests
Located in tests/unit/test_library_examples.py:
- Test that examples compile and imports resolve
- Test that core functions produce valid metadata structures
- Use mocking to avoid needing a real DataHub instance
- Fast, run on every commit
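The "examples compile" check in the list above could be approximated with the standard library alone. The following is a rough sketch, not the project's actual test: `find_compile_errors` is a hypothetical helper, and it only verifies syntax, so confirming that imports resolve would additionally require importing each module in an environment with DataHub installed.

```python
import tempfile
from pathlib import Path
from typing import Dict


def find_compile_errors(directory: Path) -> Dict[str, str]:
    """Compile every .py file in `directory`; return {file name: error}."""
    errors: Dict[str, str] = {}
    for path in sorted(directory.glob("*.py")):
        try:
            # compile() parses the source without executing it.
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            errors[path.name] = f"line {exc.lineno}: {exc.msg}"
    return errors


# Demonstration on a throwaway directory with one valid and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "good_example.py").write_text("x = 1\n", encoding="utf-8")
    (tmp_path / "bad_example.py").write_text("def broken(:\n", encoding="utf-8")
    demo_errors = find_compile_errors(tmp_path)
```

In a real test the function would be pointed at the examples directory and the result asserted to be empty.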
Integration Tests
Located in tests/integration/library_examples/:
- Test examples against a real DataHub instance
- Verify end-to-end functionality including reads after writes
- Test that metadata is correctly persisted and retrievable
- Slower, may run less frequently
Running Tests
# Run all unit tests for the examples
pytest tests/unit/test_library_examples.py
# Run specific unit tests
pytest tests/unit/test_library_examples.py::test_create_notebook_metadata
# Run integration tests (requires running DataHub)
pytest tests/integration/library_examples/ -m integration
# Run all tests
pytest tests/unit/test_library_examples.py tests/integration/library_examples/
Guidelines
- Keep examples simple: Focus on demonstrating one concept clearly
- Use realistic data: URNs, names, and values should look like real-world usage
- Add comments: Explain non-obvious choices or important details
- Follow the pattern: Use the testable function + main() pattern
- Document parameters: Use clear docstrings with type hints
- Handle errors gracefully: Show proper error handling where relevant
- Test your examples: Add unit tests for new examples
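To illustrate the error-handling guideline, here is a small sketch of a `main` that reports an emit failure instead of crashing. Everything in it is an assumption made for illustration: `PrintingEmitter` and `FailingEmitter` are hypothetical stand-ins for a real emitter, `ConnectionError` stands in for whatever exception the real emitter raises, and returning an exit code is one possible design rather than the convention used by the existing examples.

```python
import sys
from typing import Optional, Protocol


class Emitter(Protocol):
    """Anything with an emit() method; stands in for DatahubRestEmitter."""

    def emit(self, item: str) -> None: ...


class PrintingEmitter:
    """Hypothetical default emitter that prints instead of sending."""

    def emit(self, item: str) -> None:
        print(f"emit: {item}")


class FailingEmitter:
    """Hypothetical test double that simulates an unreachable server."""

    def emit(self, item: str) -> None:
        raise ConnectionError("could not reach http://localhost:8080")


def main(emitter: Optional[Emitter] = None) -> int:
    """Emit one item; return 0 on success and 1 on failure."""
    emitter = emitter or PrintingEmitter()
    urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"
    try:
        emitter.emit(urn)
    except ConnectionError as exc:
        # Report the failure and signal it to the caller instead of crashing.
        print(f"Failed to emit metadata for {urn}: {exc}", file=sys.stderr)
        return 1
    print(f"Successfully emitted metadata for {urn}")
    return 0


ok_code = main(emitter=PrintingEmitter())
failed_code = main(emitter=FailingEmitter())
```

Injecting `FailingEmitter` lets a unit test exercise the failure path without a server that is actually down.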
Example Categories
Entity Creation
- notebook_create.py - Create a notebook entity
- data_platform_create.py - Create a custom data platform
- glossary_term_create.py - Create glossary terms
Metadata Updates
- dataset_add_term.py - Add glossary terms to datasets
- dataset_add_owner.py - Add ownership information
- notebook_add_tags.py - Add tags to notebooks
Querying Metadata
- dataset_query_deprecation.py - Check if a dataset is deprecated
- search_with_query.py - Search for entities
- lineage_column_get.py - Query column-level lineage