chore: add claude configs (#13983)

2025-10-18 20:34:14 +00:00 · 2025-07-07 20:46:49 -04:00 · 2025-07-07 20:46:49 -04:00 · eb349f7b1d
commit eb349f7b1d
parent 6d2796a1c1
5 changed files with 230 additions and 40 deletions
--- a/.claude/settings.json
+++ b/.claude/settings.json
@@ -0,0 +1,23 @@
 {
  "permissions": {
    "allow": [
      "Bash(cd:*)",
      "Bash(gh pr diff:*)",
      "Bash(gh pr view:*)",
      "Bash(git diff:*)",
      "Bash(grep:*)",
      "Bash(head:*)",
      "Bash(sed:*)",
      "Bash(find:*)",
      "Bash(rg:*)",
      "WebFetch(domain:docs.datahub.com)",
      "Bash(mypy:*)",
      "Bash(pytest:*)",
      "Bash(ruff:*)",
      "Bash(python -m mypy:*)",
      "Bash(python -m ruff:*)",
      "Bash(python -m pytest:*)"
    ],
    "deny": []
  }
 }
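As a rough illustration of how the `allow` entries above might be read, the sketch below treats a rule such as `Bash(git diff:*)` as a command-prefix match. This is an assumption made for illustration: the helper names `bash_rule_prefix` and `is_allowed` are hypothetical, and the actual matching performed by the tool may differ (for example, for compound shell commands).

```python
from typing import List, Optional


def bash_rule_prefix(rule: str) -> Optional[str]:
    """Return the command prefix of a rule like 'Bash(git diff:*)', else None."""
    if rule.startswith("Bash(") and rule.endswith(":*)"):
        return rule[len("Bash("):-len(":*)")]
    return None


def is_allowed(command: str, allow: List[str]) -> bool:
    """Assumed semantics: a command is allowed if it equals a rule's prefix
    or starts with that prefix followed by a space."""
    for rule in allow:
        prefix = bash_rule_prefix(rule)
        if prefix is None:
            continue  # Non-Bash rules (e.g. WebFetch) are ignored in this sketch.
        if command == prefix or command.startswith(prefix + " "):
            return True
    return False
```

Under these assumptions, `git diff HEAD~1` would match `Bash(git diff:*)`, while `git push` would match none of the listed rules.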
--- a/.gitignore
+++ b/.gitignore
@@ -86,6 +86,7 @@ smoke-test/rollback-reports
 coverage*.xml
 .vercel
 .envrc
 **/.claude/settings.local.json
 # A long series of binary directories we should ignore
 datahub-frontend/bin/main/
@@ -130,3 +131,4 @@ test-models/bin/
 datahub-executor/
 datahub-integrations-service/
 metadata-ingestion-modules/acryl-cloud
--- a/CLAUDE.MD
+++ b/CLAUDE.MD
@@ -1,40 +0,0 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.
 ## Coding conventions
 - Keep code maintainable. This is not throw-away code. This goes to production. 
 - Generate unit tests where appropriate. 
 - Do not start generating random scripts to run the code you generated unless asked for.
 - Do not add comments which are redundant given the function names
 ## Core concept docs
 - `docs/what/urn.md` defines what a URN is
 ## Overall Directory structure
 - This is the repository for the DataHub project.
 - `README.MD` should give some basic information about the project.
 - This is a multi-project Gradle build, so you will find a `build.gradle` file in most folders.
 ### metadata-ingestion module details
 - `metadata-ingestion` contains source and tests for DataHub OSS CLI.
 - `metadata-ingestion/developing.md` contains details about the environment used for testing.
 - `.github/workflows/metadata-ingestion.yml` contains our github workflow that is used in CI.
 - `metadata-ingestion/build.gradle` contains our build.gradle that has gradle tasks defined for this module
 - `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
 - The `.md` files at the top level of this folder give you important information about the concepts of ingestion
 - You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
 - Source code goes in `metadata-ingestion/src/`
 - Tests go in `metadata-ingestion/tests/` (not in `src/`)
 - **Testing conventions for metadata-ingestion**:
    - Unit tests: `metadata-ingestion/tests/unit/`
    - Integration tests: `metadata-ingestion/tests/integration/`
    - Test files should mirror the source directory structure
    - Use pytest, not unittest
    - Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
    - Use regular classes, not `unittest.TestCase`
    - Import `pytest` in test files
    - Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,110 @@
 # CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 ## Essential Commands
 **Build and test:**
 ```bash
 ./gradlew build           # Build entire project
 ./gradlew check           # Run all tests and linting
 # Each directory typically has its own build.gradle file; the available tasks follow similar conventions.
 # Java code.
 ./gradlew spotlessApply   # Java code formatting
 # Python code.
 ./gradlew :metadata-ingestion:testQuick     # Fast Python unit tests
 ./gradlew :metadata-ingestion:lint          # Python linting (ruff, mypy)
 ./gradlew :metadata-ingestion:lintFix       # Python linting auto-fix (ruff only)
 ```
 **Development setup:**
 ```bash
 ./gradlew :metadata-ingestion:installDev               # Setup Python environment
 ./gradlew quickstartDebug                              # Start full DataHub stack
 cd datahub-web-react && yarn start                     # Frontend dev server
 ```
 ## Architecture Overview
 DataHub is a **schema-first, event-driven metadata platform** built around four core services:
 ### Core Services
 - **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
 - **Frontend**: React/TypeScript application consuming GraphQL APIs
 - **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
 - **Event Streaming**: Kafka-based real-time metadata change propagation
 ### Key Modules
 - `metadata-models/`: Avro/PDL schemas defining the metadata model
 - `metadata-service/`: Backend services, APIs, and business logic
 - `datahub-web-react/`: Frontend React application
 - `metadata-ingestion/`: Python ingestion framework and CLI
 - `datahub-graphql-core/`: GraphQL schema and resolvers
 ### Metadata Model Concepts
 - **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
 - **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
 - **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
 - **MCE/MCL**: Metadata Change Events/Logs for updates
 - **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
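To make the URN shape above concrete, here is a minimal sketch that splits a URN into its entity type and key parts using only the standard library. `ParsedUrn` and `parse_urn` are hypothetical names introduced for illustration; this is not the URN parser shipped with DataHub, and it assumes the `urn:li:<entityType>:<key>` layout shown in the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParsedUrn:
    entity_type: str
    key_parts: List[str]


def parse_urn(urn: str) -> ParsedUrn:
    """Split 'urn:li:<entityType>:<key>' into the entity type and key parts."""
    prefix = "urn:li:"
    if not urn.startswith(prefix):
        raise ValueError(f"not a urn:li: URN: {urn!r}")
    entity_type, sep, key = urn[len(prefix):].partition(":")
    if not sep or not key:
        raise ValueError(f"URN has no key: {urn!r}")
    if not (key.startswith("(") and key.endswith(")")):
        return ParsedUrn(entity_type=entity_type, key_parts=[key])

    # Tuple key: split on top-level commas only, because a part may itself
    # be a nested URN that contains parentheses and commas.
    parts: List[str] = []
    current = ""
    depth = 0
    for ch in key[1:-1]:
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current += ch
    parts.append(current)
    return ParsedUrn(entity_type=entity_type, key_parts=parts)
```

For the dataset URN above, this yields the entity type `dataset` and three key parts: the platform URN, the table name, and the environment.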
 ## Development Flow
 1. **Schema changes** in `metadata-models/` trigger code generation across all languages
 2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
 3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
 4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs
 ## Code Standards
 ### General Principles
 - This is production code - maintain high quality
 - Follow existing patterns within each module
 - Generate appropriate unit tests
 - Use type annotations everywhere (Python/TypeScript)
 ### Language-Specific
 - **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
 - **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
  - **Error Handling**: Robust error handling with layers of protection for known failure points
 - **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library
 ### Testing Strategy
 - Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
 - Java: Tests alongside source in `src/test/`
 - Frontend: Tests in `__tests__/` or `.test.tsx` files
 - Smoke tests go in the `smoke-test/` directory
 ## Key Documentation
 **Essential reading:**
 - `docs/architecture/architecture.md` - System architecture overview
 - `docs/modeling/metadata-model.md` - How metadata is modeled
 - `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)
 **External docs:**
 - https://docs.datahub.com/docs/developers - Official developer guide
 - https://demo.datahub.com/ - Live demo environment
 ## Important Notes
 - Entity Registry is defined in YAML, not code (`entity-registry.yml`)
 - All metadata changes flow through the event streaming system
 - GraphQL schema is generated from backend GMS APIs
 - Follow Conventional Commits format for commit messages
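The commit-message note above can be checked mechanically. The sketch below is a simplified approximation of the Conventional Commits subject-line shape (`type(scope)!: description`), not a full implementation of the specification; the lowercase-only type and the helper name `is_conventional` are assumptions made for illustration.

```python
import re

# Simplified pattern: lowercase type, optional (scope), optional "!",
# then ": " and a non-empty description.
CONVENTIONAL_COMMIT = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<subject>\S.*)$"
)


def is_conventional(subject_line: str) -> bool:
    """Return True if the first line of a commit message fits the pattern."""
    return CONVENTIONAL_COMMIT.match(subject_line) is not None
```

The title of this commit, `chore: add claude configs (#13983)`, passes this check, while a subject such as `Add claude configs` does not.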
--- a/metadata-ingestion/CLAUDE.md
+++ b/metadata-ingestion/CLAUDE.md
@@ -0,0 +1,95 @@
 # DataHub Metadata Ingestion Development Guide
 ## Build and Test Commands
 **Using Gradle (slow but reliable):**
 ```bash
 # Development setup (run from metadata-ingestion/; the Gradle wrapper lives in the repository root)
 ../gradlew :metadata-ingestion:installDev   # Setup Python environment
 source venv/bin/activate                    # Activate virtual environment
 # Linting and formatting
 ../gradlew :metadata-ingestion:lint         # Run ruff + mypy
 ../gradlew :metadata-ingestion:lintFix      # Auto-fix linting issues
 # Testing
 ../gradlew :metadata-ingestion:testQuick                           # Fast unit tests
 ../gradlew :metadata-ingestion:testFull                            # All tests
 ../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_file.py  # Single test
 ```
 **Direct Python commands (when venv is activated):**
 ```bash
 # Linting
 ruff format src/ tests/
 ruff check src/ tests/
 mypy src/ tests/
 # Testing
 pytest -vv                                 # Run all tests
 pytest -m 'not integration'                # Unit tests only
 pytest -m 'integration'                    # Integration tests
 pytest tests/path/to/file.py               # Single test file
 pytest tests/path/to/file.py::TestClass    # Single test class
 pytest tests/path/to/file.py::TestClass::test_method  # Single test
 ```
 ## Directory Structure
 - `src/datahub/`: Source code for the DataHub CLI and ingestion framework
 - `tests/`: All tests (NOT in `src/` directory)
 - `tests/unit/`: Unit tests
 - `tests/integration/`: Integration tests
 - `scripts/`: Build and development scripts
 - `examples/`: Example ingestion configurations
 - `developing.md`: Detailed development environment information
 ## Code Style Guidelines
 - **Formatting**: Uses ruff, 88 character line length
 - **Imports**: Sorted with ruff.lint.isort, no relative imports
 - **Types**: Always use type annotations, prefer Protocol for interfaces
  - Avoid `Any` type - use specific types (`Dict[str, int]`, `TypedDict`, or typevars)
  - Use `isinstance` checks instead of `hasattr`
  - Prefer `assert isinstance(...)` over `cast`
 - **Data Structures**: Use dataclasses/pydantic for internal data representation
  - Return dataclasses instead of tuples from methods
  - Centralize utility functions to avoid code duplication
 - **Naming**: Descriptive names, match source system terminology in configs
 - **Error Handling**: Validators raise only ValueError/TypeError/AssertionError
  - Add robust error handling with layers of protection for known failure points
 - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`
 - **Documentation**: All configs need descriptions
 - **Dependencies**: Avoid version pinning, use ranges with comments
 - **Architecture**: Avoid tall inheritance hierarchies, prefer mixins
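Several of the guidelines above can be seen together in one small sketch: a `Protocol` for the interface, a dataclass return value instead of a tuple, specific types instead of `Any`, and keyword-only (named) arguments. All names here (`TableLister`, `TableCounts`, `InMemoryLister`, `count_tables`) are hypothetical and do not come from the DataHub codebase.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class TableLister(Protocol):
    """Interface for anything that can list the table names in a schema."""

    def list_tables(self, schema: str) -> List[str]: ...


@dataclass
class TableCounts:
    """Returned instead of a bare (schema, total, by_prefix) tuple."""

    schema: str
    total: int
    by_prefix: Dict[str, int]


class InMemoryLister:
    """Satisfies TableLister structurally; no inheritance needed."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self._tables = tables

    def list_tables(self, schema: str) -> List[str]:
        return self._tables.get(schema, [])


def count_tables(*, lister: TableLister, schema: str) -> TableCounts:
    """Count tables in a schema, grouped by the prefix before the first '_'."""
    tables = lister.list_tables(schema)
    by_prefix: Dict[str, int] = {}
    for name in tables:
        prefix = name.split("_", 1)[0]
        by_prefix[prefix] = by_prefix.get(prefix, 0) + 1
    return TableCounts(schema=schema, total=len(tables), by_prefix=by_prefix)
```

Callers then read `result.total` or `result.by_prefix` by name, which stays readable as fields are added, whereas tuple positions would silently shift.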
 ## Testing Conventions
 - **Location**: Tests go in `tests/` directory alongside `src/`, NOT in `src/`
 - **Structure**: Test files should mirror the source directory structure
 - **Framework**: Use pytest, not unittest
 - **Assertions**: Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
 - **Classes**: Use regular classes, not `unittest.TestCase`
 - **Imports**: Import `pytest` in test files
 - **Naming**: Test files should be named `test_*.py`
 - **Categories**:
  - Unit tests: `tests/unit/` - fast, no external dependencies
  - Integration tests: `tests/integration/` - may use Docker/external services
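A test file following these conventions might look like the sketch below: a regular class, plain `assert` statements, and no `unittest.TestCase`. The function under test, `dedupe_preserving_order`, is hypothetical and defined inline so the example is self-contained; a real test file would import the code under test from `src/` and would also import `pytest`, which is omitted here so the sketch runs with the standard library alone.

```python
from typing import List, Set


def dedupe_preserving_order(items: List[str]) -> List[str]:
    """Hypothetical function under test: drop repeats, keep first occurrences."""
    seen: Set[str] = set()
    result: List[str] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


class TestDedupePreservingOrder:
    """Plain class; pytest collects it because the name starts with 'Test'."""

    def test_removes_duplicates(self) -> None:
        assert dedupe_preserving_order(["a", "b", "a"]) == ["a", "b"]

    def test_empty_input(self) -> None:
        assert dedupe_preserving_order([]) == []
```

Saved as a `test_*.py` file under `tests/unit/`, a module like this would be picked up by the `pytest` commands listed earlier.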
 ## Configuration Guidelines (Pydantic)
 - **Naming**: Match terminology of the source system (e.g., `account_id` for Snowflake, not `host_port`)
 - **Descriptions**: All configs must have descriptions
 - **Patterns**: Use AllowDenyPatterns for filtering, named `*_pattern`
 - **Defaults**: Set reasonable defaults, avoid config-driven filtering that should be automatic
 - **Validation**: Single pydantic validator per validation concern
 - **Security**: Use `SecretStr` for passwords, auth tokens, etc.
 - **Deprecation**: Use `pydantic_removed_field` helper for field deprecations
 ## Key Files
 - `src/datahub/emitter/mcp_builder.py`: Examples of defining various aspect types
 - `setup.py`, `pyproject.toml`, `setup.cfg`: Code style and dependency configuration
 - `.github/workflows/metadata-ingestion.yml`: CI workflow configuration