diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 0000000000..0507db6578
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,23 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(cd:*)",
+      "Bash(gh pr diff:*)",
+      "Bash(gh pr view:*)",
+      "Bash(git diff:*)",
+      "Bash(grep:*)",
+      "Bash(head:*)",
+      "Bash(sed:*)",
+      "Bash(find:*)",
+      "Bash(rg:*)",
+      "WebFetch(domain:docs.datahub.com)",
+      "Bash(mypy:*)",
+      "Bash(pytest:*)",
+      "Bash(ruff:*)",
+      "Bash(python -m mypy:*)",
+      "Bash(python -m ruff:*)",
+      "Bash(python -m pytest:*)"
+    ],
+    "deny": []
+  }
+}
diff --git a/.gitignore b/.gitignore
index 19909b25fe..9370b73450 100644
--- a/.gitignore
+++ b/.gitignore
@@ -86,6 +86,7 @@ smoke-test/rollback-reports
 coverage*.xml
 .vercel
 .envrc
+**/.claude/settings.local.json
 
 # A long series of binary directories we should ignore
 datahub-frontend/bin/main/
@@ -130,3 +131,4 @@ test-models/bin/
 datahub-executor/
 datahub-integrations-service/
 metadata-ingestion-modules/acryl-cloud
+
diff --git a/CLAUDE.MD b/CLAUDE.MD
deleted file mode 100644
index 5e30a59211..0000000000
--- a/CLAUDE.MD
+++ /dev/null
@@ -1,40 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.
-
-## Coding conventions
-
-- Keep code maintainable. This is not throw-away code. This goes to production.
-- Generate unit tests where appropriate.
-- Do not start generating random scripts to run the code you generated unless asked for.
-- Do not add comments which are redundant given the function names
-
-## Core concept docs
-
- - `docs/what/urn.md` defines what a URN is
-
-## Overall Directory structure
-
-- This is repository for DataHub project.
-- `README.MD` should give some basic information about the project.
-- This is a multi-project gradle project so you will find a lot of `build.gradle` in most folders
-
-### metadata-ingestion module details
-- `metadata-ingestion` contains source and tests for DataHub OSS CLI.
-- `metadata-ingestion/developing.md` contains details about the environment used for testing.
-- `.github/workflows/metadata-ingestion.yml` contains our github workflow that is used in CI.
-- `metadata-ingestion/build.gradle` contains our build.gradle that has gradle tasks defined for this module
-- `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
-- The `.md` files at top level in this folder gives you important information about the concepts of ingestion
-- You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
-- Source code goes in `metadata-ingestion/src/`
-- Tests go in `metadata-ingestion/tests/` (not in `src/`)
-- **Testing conventions for metadata-ingestion**:
-  - Unit tests: `metadata-ingestion/tests/unit/`
-  - Integration tests: `metadata-ingestion/tests/integration/`
-  - Test files should mirror the source directory structure
-  - Use pytest, not unittest
-  - Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
-  - Use regular classes, not `unittest.TestCase`
-  - Import `pytest` in test files
-  - Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000000..0a2b813e2d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,110 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Essential Commands
+
+**Build and test:**
+
+```bash
+./gradlew build                            # Build entire project
+./gradlew check                            # Run all tests and linting
+
+# Note that each directory typically has a build.gradle file, but the available tasks follow similar conventions.
+
+# Java code.
+./gradlew spotlessApply                    # Java code formatting
+
+# Python code.
+./gradlew :metadata-ingestion:testQuick    # Fast Python unit tests
+./gradlew :metadata-ingestion:lint         # Python linting (ruff, mypy)
+./gradlew :metadata-ingestion:lintFix      # Python linting auto-fix (ruff only)
+```
+
+**Development setup:**
+
+```bash
+./gradlew :metadata-ingestion:installDev   # Setup Python environment
+./gradlew quickstartDebug                  # Start full DataHub stack
+cd datahub-web-react && yarn start         # Frontend dev server
+```
+
+## Architecture Overview
+
+DataHub is a **schema-first, event-driven metadata platform** with three core layers:
+
+### Core Services
+
+- **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
+- **Frontend**: React/TypeScript application consuming GraphQL APIs
+- **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
+- **Event Streaming**: Kafka-based real-time metadata change propagation
+
+### Key Modules
+
+- `metadata-models/`: Avro/PDL schemas defining the metadata model
+- `metadata-service/`: Backend services, APIs, and business logic
+- `datahub-web-react/`: Frontend React application
+- `metadata-ingestion/`: Python ingestion framework and CLI
+- `datahub-graphql-core/`: GraphQL schema and resolvers
+
+### Metadata Model Concepts
+
+- **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
+- **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
+- **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
+- **MCE/MCL**: Metadata Change Events/Logs for updates
+- **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
+
+## Development Flow
+
+1. **Schema changes** in `metadata-models/` trigger code generation across all languages
+2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
+3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
+4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs
+
+## Code Standards
+
+### General Principles
+
+- This is production code - maintain high quality
+- Follow existing patterns within each module
+- Generate appropriate unit tests
+- Use type annotations everywhere (Python/TypeScript)
+
+### Language-Specific
+
+- **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
+- **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
+  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
+  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
+  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
+  - **Error Handling**: Robust error handling with layers of protection for known failure points
+- **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library
+
+### Testing Strategy
+
+- Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
+- Java: Tests alongside source in `src/test/`
+- Frontend: Tests in `__tests__/` or `.test.tsx` files
+- Smoke tests go in the `smoke-test/` directory
+
+## Key Documentation
+
+**Essential reading:**
+
+- `docs/architecture/architecture.md` - System architecture overview
+- `docs/modeling/metadata-model.md` - How metadata is modeled
+- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)
+
+**External docs:**
+
+- https://docs.datahub.com/docs/developers - Official developer guide
+- https://demo.datahub.com/ - Live demo environment
+
+## Important Notes
+
+- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
+- All metadata changes flow through the event streaming system
+- GraphQL schema is generated from backend GMS APIs
+- Follow Conventional Commits format for commit messages
diff --git a/metadata-ingestion/CLAUDE.md b/metadata-ingestion/CLAUDE.md
new file mode 100644
index 0000000000..2c1e95586e
--- /dev/null
+++ b/metadata-ingestion/CLAUDE.md
@@ -0,0 +1,95 @@
+# DataHub Metadata Ingestion Development Guide
+
+## Build and Test Commands
+
+**Using Gradle (slow but reliable):**
+
+```bash
+# Development setup from repository root
+../gradlew :metadata-ingestion:installDev                                     # Setup Python environment
+source venv/bin/activate                                                      # Activate virtual environment
+
+# Linting and formatting
+../gradlew :metadata-ingestion:lint                                           # Run ruff + mypy
+../gradlew :metadata-ingestion:lintFix                                        # Auto-fix linting issues
+
+# Testing
+../gradlew :metadata-ingestion:testQuick                                      # Fast unit tests
+../gradlew :metadata-ingestion:testFull                                       # All tests
+../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_file.py  # Single test
+```
+
+**Direct Python commands (when venv is activated):**
+
+```bash
+# Linting
+ruff format src/ tests/
+ruff check src/ tests/
+mypy src/ tests/
+
+# Testing
+pytest -vv                                              # Run all tests
+pytest -m 'not integration'                             # Unit tests only
+pytest -m 'integration'                                 # Integration tests
+pytest tests/path/to/file.py                            # Single test file
+pytest tests/path/to/file.py::TestClass                 # Single test class
+pytest tests/path/to/file.py::TestClass::test_method   # Single test
+```
+
+## Directory Structure
+
+- `src/datahub/`: Source code for the DataHub CLI and ingestion framework
+- `tests/`: All tests (NOT in `src/` directory)
+- `tests/unit/`: Unit tests
+- `tests/integration/`: Integration tests
+- `scripts/`: Build and development scripts
+- `examples/`: Example ingestion configurations
+- `developing.md`: Detailed development environment information
+
+## Code Style Guidelines
+
+- **Formatting**: Uses ruff, 88 character line length
+- **Imports**: Sorted with ruff.lint.isort, no relative imports
+- **Types**: Always use type annotations, prefer Protocol for interfaces
+  - Avoid `Any` type - use specific types (`Dict[str, int]`, `TypedDict`, or typevars)
+  - Use `isinstance` checks instead of `hasattr`
+  - Prefer `assert isinstance(...)` over `cast`
+- **Data Structures**: Use dataclasses/pydantic for internal data representation
+  - Return dataclasses instead of tuples from methods
+  - Centralize utility functions to avoid code duplication
+- **Naming**: Descriptive names, match source system terminology in configs
+- **Error Handling**: Validators throw only ValueError/TypeError/AssertionError
+  - Add robust error handling with layers of protection for known failure points
+- **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`
+- **Documentation**: All configs need descriptions
+- **Dependencies**: Avoid version pinning, use ranges with comments
+- **Architecture**: Avoid tall inheritance hierarchies, prefer mixins
+
+## Testing Conventions
+
+- **Location**: Tests go in `tests/` directory alongside `src/`, NOT in `src/`
+- **Structure**: Test files should mirror the source directory structure
+- **Framework**: Use pytest, not unittest
+- **Assertions**: Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
+- **Classes**: Use regular classes, not `unittest.TestCase`
+- **Imports**: Import `pytest` in test files
+- **Naming**: Test files should be named `test_*.py`
+- **Categories**:
+  - Unit tests: `tests/unit/` - fast, no external dependencies
+  - Integration tests: `tests/integration/` - may use Docker/external services
+
+## Configuration Guidelines (Pydantic)
+
+- **Naming**: Match terminology of the source system (e.g., `account_id` for Snowflake, not `host_port`)
+- **Descriptions**: All configs must have descriptions
+- **Patterns**: Use AllowDenyPatterns for filtering, named `*_pattern`
+- **Defaults**: Set reasonable defaults, avoid config-driven filtering that should be automatic
+- **Validation**: Single pydantic validator per validation concern
+- **Security**: Use `SecretStr` for passwords, auth tokens, etc.
+- **Deprecation**: Use `pydantic_removed_field` helper for field deprecations
+
+## Key Files
+
+- `src/datahub/emitter/mcp_builder.py`: Examples of defining various aspect types
+- `setup.py`, `pyproject.toml`, `setup.cfg`: Code style and dependency configuration
+- `.github/workflows/metadata-ingestion.yml`: CI workflow configuration
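
The testing conventions added in `metadata-ingestion/CLAUDE.md` above translate into test files like the following minimal sketch. The file name and the `normalize_name` helper are hypothetical, defined inline only so the example is self-contained and runnable; they are not part of the DataHub codebase.

```python
# Hypothetical file: tests/unit/test_normalize_name.py
import pytest


def normalize_name(name: str) -> str:
    # Hypothetical helper under test, defined inline so the sketch is self-contained.
    if not name:
        raise ValueError("name must be non-empty")
    return name.strip().lower()


class TestNormalizeName:  # plain class, not unittest.TestCase
    def test_lowercases_and_strips(self) -> None:
        assert normalize_name("  My.Table ") == "my.table"

    def test_rejects_empty_name(self) -> None:
        with pytest.raises(ValueError):
            normalize_name("")
```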
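
The pydantic configuration guidelines can be sketched the same way. `ExampleSourceConfig` and its fields are invented for illustration; `ConfigModel` and `AllowDenyPattern` are assumed to be importable from `datahub.configuration.common` as in current DataHub releases, so verify the import paths against the checked-out code.

```python
from pydantic import Field, SecretStr

from datahub.configuration.common import AllowDenyPattern, ConfigModel


class ExampleSourceConfig(ConfigModel):
    """Hypothetical connector config illustrating the guidelines; not a real DataHub source."""

    # Match the source system's own terminology (e.g. `account_id`, not `host_port`).
    account_id: str = Field(description="Account identifier of the source system.")

    # SecretStr keeps credentials out of logs and reprs.
    password: SecretStr = Field(description="Password used to authenticate.")

    # AllowDenyPattern-based filtering, named `*_pattern`, with a permissive default.
    table_pattern: AllowDenyPattern = Field(
        default=AllowDenyPattern.allow_all(),
        description="Regex patterns for tables to include or exclude.",
    )
```

Defaulting `table_pattern` to `AllowDenyPattern.allow_all()` keeps filtering opt-in, in line with the "reasonable defaults" guideline.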
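
Finally, the `mcp_builder.py` entry under Key Files points at the aspect-construction pattern used throughout the ingestion framework. A minimal emission sketch, assuming the standard Python emitter APIs and a locally running GMS at the quickstart default `http://localhost:8080`, might look like this:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Build the URN identifying the entity the aspect attaches to.
dataset_urn = make_dataset_urn(platform="mysql", name="db.table", env="PROD")

# Wrap a single aspect (DatasetProperties) in a MetadataChangeProposal.
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DatasetPropertiesClass(description="Example table used for illustration."),
)

# Emit over REST to a locally running GMS (started e.g. via `./gradlew quickstartDebug`).
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(mcp)
```

The same wrapper type carries any aspect class generated from `metadata-models/`, which is why one proposal shape covers all entity/aspect combinations.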