# DataHub Library Examples

This directory contains examples demonstrating how to use the DataHub Python SDK and metadata emission APIs.

## Structure

Each example is a standalone Python script that demonstrates a specific use case:

- **Create examples**: Show how to create new metadata entities
- **Update examples**: Show how to modify existing metadata
- **Query examples**: Show how to read and query metadata
- **Delete examples**: Show how to remove metadata

## Writing Testable Examples

To ensure examples are maintainable and correct, follow this pattern when writing new examples:

### Pattern Overview

Examples should have two main components:

1. **Testable functions**: Pure functions that take dependencies as parameters and return values or metadata
2. **Main function**: Entry point that creates dependencies and calls the testable functions

### Example Structure

```python
from typing import Optional
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter


def create_entity_metadata(...) -> MetadataChangeProposalWrapper:
    """
    Create metadata for an entity.

    This function is pure and testable - it doesn't have side effects.

    Args:
        ... (all required parameters)

    Returns:
        MetadataChangeProposalWrapper containing the metadata
    """
    # Build and return the MCP
    return MetadataChangeProposalWrapper(...)


def main(emitter: Optional[DatahubRestEmitter] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        emitter: Optional emitter for testing. If not provided, creates a new one.
    """
    emitter = emitter or DatahubRestEmitter(gms_server="http://localhost:8080")

    # Use the testable function
    mcp = create_entity_metadata(...)

    # Emit the metadata
    emitter.emit(mcp)
    print("Successfully created entity")


if __name__ == "__main__":
    main()
```

### For SDK-based Examples

When using the DataHub SDK (`DataHubClient`):

```python
from typing import Optional
from datahub.sdk import DataHubClient


def perform_operation(client: DataHubClient, ...) -> ...:
    """
    Perform an operation using the DataHub client.

    Args:
        client: DataHub client to use
        ...: Other parameters

    Returns:
        Result of the operation
    """
    # Perform the operation
    return result


def main(client: Optional[DataHubClient] = None) -> None:
    """
    Main function demonstrating the example use case.

    Args:
        client: Optional client for testing. If not provided, creates one from env.
    """
    client = client or DataHubClient.from_env()

    result = perform_operation(client, ...)
    print(f"Operation result: {result}")


if __name__ == "__main__":
    main()
```

### Benefits of This Pattern

1. **Testability**: Core logic can be unit tested without needing a running DataHub instance
2. **Reusability**: The testable functions can be imported and used in other code
3. **Clarity**: Separates business logic from infrastructure setup
4. **Flexibility**: Examples can still be run standalone while being testable
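
The testability and flexibility points above can be sketched without a running DataHub instance. The snippet below is a minimal illustration, not code from this directory: `create_notebook_metadata`, `PrintEmitter`, the plain-dict return value, and the URN value are hypothetical stand-ins (a real example would return a `MetadataChangeProposalWrapper` and default to `DatahubRestEmitter`).

```python
from typing import Any, Dict, Optional
from unittest.mock import MagicMock


def create_notebook_metadata(notebook_urn: str, title: str) -> Dict[str, Any]:
    """Hypothetical testable function; a real example would return an MCP."""
    return {"entityUrn": notebook_urn, "aspect": {"title": title}}


class PrintEmitter:
    """Hypothetical default emitter that only prints; stands in for DatahubRestEmitter."""

    def emit(self, record: Dict[str, Any]) -> None:
        print(f"Would emit: {record}")


def main(emitter: Optional[Any] = None) -> None:
    """Entry point: create the dependency only if the caller did not inject one."""
    emitter = emitter or PrintEmitter()
    record = create_notebook_metadata(
        "urn:li:notebook:(querybook,1234)", "Weekly revenue report"
    )
    emitter.emit(record)


# Unit test: the pure function needs no emitter at all.
record = create_notebook_metadata(
    "urn:li:notebook:(querybook,1234)", "Weekly revenue report"
)
assert record["entityUrn"] == "urn:li:notebook:(querybook,1234)"

# Injection test: a mock replaces the emitter, so nothing touches the network.
fake_emitter = MagicMock()
main(emitter=fake_emitter)
fake_emitter.emit.assert_called_once()

# Standalone run: with no argument, main() falls back to its default emitter.
main()
```

Because `main()` accepts the emitter as an optional argument, the same file serves as both a runnable script and an importable module under test.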

### Running Examples

**As standalone scripts:**

```bash
python examples/library/notebook_create.py
```

**In tests:**

```python
from examples.library.notebook_create import create_notebook_metadata, main

# Unit test: call the pure function directly
mcp = create_notebook_metadata(...)
assert mcp.entityUrn == "..."

# Integration test: inject a test emitter (test_emitter comes from the test setup)
main(emitter=test_emitter)
```

## Testing

Examples are tested at two levels:

### Unit Tests

Located in `tests/unit/test_library_examples.py`:

- Test that examples compile and imports resolve
- Test that core functions produce valid metadata structures
- Use mocking to avoid needing a real DataHub instance
- Fast and run on every commit
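
One way to check that example scripts compile is sketched below. It is an illustration only and not necessarily how `test_library_examples.py` does it: `find_compile_errors` is a hypothetical helper, and it checks syntax only, so verifying that imports resolve would need a separate step.

```python
import tempfile
from pathlib import Path
from typing import Dict


def find_compile_errors(examples_dir: Path) -> Dict[str, str]:
    """Return {file name: error message} for every script that fails to compile."""
    errors: Dict[str, str] = {}
    for script in sorted(examples_dir.glob("*.py")):
        try:
            # compile() parses the source without executing it, so the check
            # needs neither a DataHub instance nor the datahub package.
            compile(script.read_text(encoding="utf-8"), str(script), "exec")
        except SyntaxError as exc:
            errors[script.name] = str(exc)
    return errors


# Demonstration on a throwaway directory with one valid and one broken script.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "good_example.py").write_text("print('ok')\n", encoding="utf-8")
    (tmp_path / "bad_example.py").write_text("def broken(:\n", encoding="utf-8")
    compile_errors = find_compile_errors(tmp_path)

assert list(compile_errors) == ["bad_example.py"]
```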

### Integration Tests

Located in `tests/integration/library_examples/`:

- Test examples against a real DataHub instance
- Verify end-to-end functionality including reads after writes
- Test that metadata is correctly persisted and retrievable
- Slower, may run less frequently
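
A read-after-write check can be sketched as a small polling helper. This assumes a read may briefly lag behind the write, which this README does not state; `read_after_write` and `fake_read` are hypothetical names, and the in-memory stand-in replaces a real DataHub read.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_after_write(
    read: Callable[[], Optional[T]],
    attempts: int = 5,
    delay_seconds: float = 0.5,
) -> T:
    """Call read() until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        value = read()
        if value is not None:
            return value
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise AssertionError(f"value not readable after {attempts} attempts")


# Demonstration: an in-memory stand-in that becomes readable on the third read.
calls = {"count": 0}


def fake_read() -> Optional[str]:
    calls["count"] += 1
    return "Weekly revenue report" if calls["count"] >= 3 else None


title = read_after_write(fake_read, attempts=5, delay_seconds=0.0)
assert title == "Weekly revenue report"
```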

### Running Tests

```bash
# Run all unit tests for the examples
pytest tests/unit/test_library_examples.py

# Run specific unit tests
pytest tests/unit/test_library_examples.py::test_create_notebook_metadata

# Run integration tests (requires running DataHub)
pytest tests/integration/library_examples/ -m integration

# Run unit and integration tests together
pytest tests/unit/test_library_examples.py tests/integration/library_examples/
```

## Guidelines

1. **Keep examples simple**: Focus on demonstrating one concept clearly
2. **Use realistic data**: URNs, names, and values should look like real-world usage
3. **Add comments**: Explain non-obvious choices or important details
4. **Follow the pattern**: Use the testable function + main() pattern
5. **Document parameters**: Use clear docstrings with type hints
6. **Handle errors gracefully**: Show proper error handling where relevant
7. **Test your examples**: Add unit tests for new examples
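
As a rough illustration of guidelines 2 and 6, the sketch below validates a realistic-looking URN and turns failures into an exit code instead of a traceback. Everything in it is hypothetical: `create_tag_metadata`, the dict return value, both emitter classes, and the use of `ConnectionError` as a placeholder for whatever a real emitter raises.

```python
import sys
from typing import Any, Dict, List


def create_tag_metadata(dataset_urn: str, tag: str) -> Dict[str, Any]:
    """Hypothetical pure function; validates input before building metadata."""
    if not dataset_urn.startswith("urn:li:dataset:"):
        raise ValueError(f"expected a dataset URN, got {dataset_urn!r}")
    return {"entityUrn": dataset_urn, "tags": [f"urn:li:tag:{tag}"]}


class RecordingEmitter:
    """Hypothetical test emitter that stores records instead of sending them."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def emit(self, record: Dict[str, Any]) -> None:
        self.records.append(record)


class FailingEmitter:
    """Hypothetical emitter that simulates an unreachable server."""

    def emit(self, record: Dict[str, Any]) -> None:
        raise ConnectionError("could not reach server")


def main(emitter: Any, dataset_urn: str) -> int:
    """Return a process exit code rather than letting a traceback reach the user."""
    try:
        emitter.emit(create_tag_metadata(dataset_urn, "pii"))
    except (ValueError, ConnectionError) as exc:
        # ConnectionError is a placeholder; a real emitter may raise another type.
        print(f"Failed to add tag: {exc}", file=sys.stderr)
        return 1
    print(f"Added tag to {dataset_urn}")
    return 0


URN = "urn:li:dataset:(urn:li:dataPlatform:hive,analytics.orders,PROD)"
recorder = RecordingEmitter()
ok_code = main(recorder, URN)
fail_code = main(FailingEmitter(), URN)
bad_urn_code = main(recorder, "analytics.orders")
```

In a script, `sys.exit(main(...))` would pass the returned code on to the shell.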

## Example Categories

### Entity Creation

- `notebook_create.py` - Create a notebook entity
- `data_platform_create.py` - Create a custom data platform
- `glossary_term_create.py` - Create glossary terms

### Metadata Updates

- `dataset_add_term.py` - Add glossary terms to datasets
- `dataset_add_owner.py` - Add ownership information
- `notebook_add_tags.py` - Add tags to notebooks

### Querying Metadata

- `dataset_query_deprecation.py` - Check if a dataset is deprecated
- `search_with_query.py` - Search for entities
- `lineage_column_get.py` - Query column-level lineage

## Getting Help

- [DataHub Documentation](https://datahubproject.io/docs/)
- [Python SDK Reference](https://datahubproject.io/docs/python-sdk/)
- [Metadata Model](https://datahubproject.io/docs/metadata-model/)
- [GitHub Issues](https://github.com/datahub-project/datahub/issues)