chore: add claude configs (#13983)

2025-10-15 02:47:19 +00:00 · 2025-07-07 20:46:49 -04:00 · eb349f7b1d
commit eb349f7b1d
parent 6d2796a1c1
5 changed files with 230 additions and 40 deletions
--- a/.claude/settings.json
+++ b/.claude/settings.json
@@ -0,0 +1,23 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(cd:*)",
+      "Bash(gh pr diff:*)",
+      "Bash(gh pr view:*)",
+      "Bash(git diff:*)",
+      "Bash(grep:*)",
+      "Bash(head:*)",
+      "Bash(sed:*)",
+      "Bash(find:*)",
+      "Bash(rg:*)",
+      "WebFetch(domain:docs.datahub.com)",
+      "Bash(mypy:*)",
+      "Bash(pytest:*)",
+      "Bash(ruff:*)",
+      "Bash(python -m mypy:*)",
+      "Bash(python -m ruff:*)",
+      "Bash(python -m pytest:*)"
+    ],
+    "deny": []
+  }
+}
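The `allow` entries above pair a tool name with a command pattern. The following is only a rough sketch of how such entries could be interpreted, assuming that a trailing `:*` means "this command prefix, with any arguments". The diff does not specify the matching rules, so `PermissionRule`, `parse_rule`, `bash_rule_allows`, and `is_allowed` are hypothetical names and not part of any real tool's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PermissionRule:
    """Hypothetical parsed form of an entry such as 'Bash(git diff:*)'."""

    tool: str
    spec: str


def parse_rule(rule: str) -> Optional[PermissionRule]:
    """Split 'Tool(spec)' into its parts; return None if the text is malformed."""
    if not rule.endswith(")") or "(" not in rule:
        return None
    tool, _, spec = rule[:-1].partition("(")
    return PermissionRule(tool=tool, spec=spec)


def bash_rule_allows(rule: str, command: str) -> bool:
    """Assumed semantics: 'prefix:*' allows the prefix alone or prefix + args."""
    parsed = parse_rule(rule)
    if parsed is None or parsed.tool != "Bash":
        return False
    if parsed.spec.endswith(":*"):
        prefix = parsed.spec[: -len(":*")]
        return command == prefix or command.startswith(prefix + " ")
    return command == parsed.spec


def is_allowed(command: str, rules: List[str]) -> bool:
    """Return True if any Bash rule in the list covers the command."""
    return any(bash_rule_allows(rule=rule, command=command) for rule in rules)


# A subset of the entries from the settings file above.
ALLOW = ["Bash(git diff:*)", "Bash(pytest:*)", "WebFetch(domain:docs.datahub.com)"]
```

Under these assumed semantics, `is_allowed("git diff HEAD~1", ALLOW)` is true while `is_allowed("git push origin main", ALLOW)` is false, which matches the apparent intent of allowing read-only and lint/test commands.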
--- a/.gitignore
+++ b/.gitignore
@@ -86,6 +86,7 @@ smoke-test/rollback-reports
 coverage*.xml
 .vercel
 .envrc
+**/.claude/settings.local.json

 # A long series of binary directories we should ignore
 datahub-frontend/bin/main/
@@ -130,3 +131,4 @@ test-models/bin/
 datahub-executor/
 datahub-integrations-service/
 metadata-ingestion-modules/acryl-cloud
+
--- a/CLAUDE.MD
+++ b/CLAUDE.MD
@@ -1,40 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.
-
-## Coding conventions
-
- Keep code maintainable. This is not throw-away code. This goes to production. 
- Generate unit tests where appropriate. 
- Do not start generating random scripts to run the code you generated unless asked for.
- Do not add comments which are redundant given the function names
-
-## Core concept docs
-
- - `docs/what/urn.md` defines what a URN is
-
-## Overall Directory structure
-
- This is repository for DataHub project.
- `README.MD` should give some basic information about the project.
- This is a multi-project gradle project so you will find a lot of `build.gradle` in most folders
-
-### metadata-ingestion module details
- `metadata-ingestion` contains source and tests for DataHub OSS CLI.
- `metadata-ingestion/developing.md` contains details about the environment used for testing.
- `.github/workflows/metadata-ingestion.yml` contains our github workflow that is used in CI.
- `metadata-ingestion/build.gradle` contains our build.gradle that has gradle tasks defined for this module
- `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
- The `.md` files at top level in this folder gives you important information about the concepts of ingestion
- You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
- Source code goes in `metadata-ingestion/src/`
- Tests go in `metadata-ingestion/tests/` (not in `src/`)
- **Testing conventions for metadata-ingestion**:
-    - Unit tests: `metadata-ingestion/tests/unit/`
-    - Integration tests: `metadata-ingestion/tests/integration/`
-    - Test files should mirror the source directory structure
-    - Use pytest, not unittest
-    - Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
-    - Use regular classes, not `unittest.TestCase`
-    - Import `pytest` in test files
-    - Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,110 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Essential Commands
+
+**Build and test:**
+
+```bash
+./gradlew build           # Build entire project
+./gradlew check           # Run all tests and linting
+
+# Note: most directories have their own build.gradle file, and the available tasks follow similar conventions.
+
+# Java code.
+./gradlew spotlessApply   # Java code formatting
+
+# Python code.
+./gradlew :metadata-ingestion:testQuick     # Fast Python unit tests
+./gradlew :metadata-ingestion:lint          # Python linting (ruff, mypy)
+./gradlew :metadata-ingestion:lintFix       # Python linting auto-fix (ruff only)
+```
+
+**Development setup:**
+
+```bash
+./gradlew :metadata-ingestion:installDev               # Setup Python environment
+./gradlew quickstartDebug                              # Start full DataHub stack
+cd datahub-web-react && yarn start                     # Frontend dev server
+```
+
+## Architecture Overview
+
+DataHub is a **schema-first, event-driven metadata platform** with three core layers:
+
+### Core Services
+
+- **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
+- **Frontend**: React/TypeScript application consuming GraphQL APIs
+- **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
+- **Event Streaming**: Kafka-based real-time metadata change propagation
+
+### Key Modules
+
+- `metadata-models/`: Avro/PDL schemas defining the metadata model
+- `metadata-service/`: Backend services, APIs, and business logic
+- `datahub-web-react/`: Frontend React application
+- `metadata-ingestion/`: Python ingestion framework and CLI
+- `datahub-graphql-core/`: GraphQL schema and resolvers
+
+### Metadata Model Concepts
+
+- **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
+- **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
+- **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
+- **MCE/MCL**: Metadata Change Events/Logs for updates
+- **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
+
+## Development Flow
+
+1. **Schema changes** in `metadata-models/` trigger code generation across all languages
+2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
+3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
+4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs
+
+## Code Standards
+
+### General Principles
+
+- This is production code - maintain high quality
+- Follow existing patterns within each module
+- Generate appropriate unit tests
+- Use type annotations everywhere (Python/TypeScript)
+
+### Language-Specific
+
+- **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
+- **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
+  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
+  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
+  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
+  - **Error Handling**: Robust error handling with layers of protection for known failure points
+- **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library
+
+### Testing Strategy
+
+- Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
+- Java: Tests alongside source in `src/test/`
+- Frontend: Tests in `__tests__/` or `.test.tsx` files
+- Smoke tests go in the `smoke-test/` directory
+
+## Key Documentation
+
+**Essential reading:**
+
+- `docs/architecture/architecture.md` - System architecture overview
+- `docs/modeling/metadata-model.md` - How metadata is modeled
+- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)
+
+**External docs:**
+
+- https://docs.datahub.com/docs/developers - Official developer guide
+- https://demo.datahub.com/ - Live demo environment
+
+## Important Notes
+
+- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
+- All metadata changes flow through the event streaming system
+- GraphQL schema is generated from backend GMS APIs
+- Follow Conventional Commits format for commit messages
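The URN format listed under "Metadata Model Concepts" in the new `CLAUDE.md` can be illustrated with a small sketch. It uses plain string handling and deliberately does not use the DataHub Python library; `SimpleDatasetUrn` and `parse_dataset_urn` are hypothetical names, and the parser assumes the three-part `(platform, name, env)` layout shown in the example URN.

```python
from dataclasses import dataclass

DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


@dataclass(frozen=True)
class SimpleDatasetUrn:
    """Hypothetical stand-in for a dataset URN; not the DataHub library class."""

    platform: str
    name: str
    env: str

    def to_string(self) -> str:
        platform_urn = f"{PLATFORM_PREFIX}{self.platform}"
        return f"{DATASET_PREFIX}{platform_urn},{self.name},{self.env})"


def parse_dataset_urn(urn: str) -> SimpleDatasetUrn:
    """Parse 'urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)'."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn}")
    body = urn[len(DATASET_PREFIX) : -1]
    # The platform URN comes first; the environment is the last comma-separated part.
    platform_urn, first_sep, rest = body.partition(",")
    name, last_sep, env = rest.rpartition(",")
    if not first_sep or not last_sep or not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"Malformed dataset URN: {urn}")
    return SimpleDatasetUrn(
        platform=platform_urn[len(PLATFORM_PREFIX) :], name=name, env=env
    )


example = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)")
```

Here `example.platform` is `"mysql"`, `example.name` is `"db.table"`, and `example.env` is `"PROD"`, and `example.to_string()` reproduces the original string. Real code in this repository would presumably use the URN classes that ship with the ingestion framework instead of hand-rolled parsing.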
--- a/metadata-ingestion/CLAUDE.md
+++ b/metadata-ingestion/CLAUDE.md
@@ -0,0 +1,95 @@
+# DataHub Metadata Ingestion Development Guide
+
+## Build and Test Commands
+
+**Using Gradle (slow but reliable):**
+
+```bash
+# Development setup (run from the metadata-ingestion directory)
+../gradlew :metadata-ingestion:installDev   # Setup Python environment
+source venv/bin/activate                    # Activate virtual environment
+
+# Linting and formatting
+../gradlew :metadata-ingestion:lint         # Run ruff + mypy
+../gradlew :metadata-ingestion:lintFix      # Auto-fix linting issues
+
+# Testing
+../gradlew :metadata-ingestion:testQuick                           # Fast unit tests
+../gradlew :metadata-ingestion:testFull                            # All tests
+../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_file.py  # Single test
+```
+
+**Direct Python commands (when venv is activated):**
+
+```bash
+# Linting
+ruff format src/ tests/
+ruff check src/ tests/
+mypy src/ tests/
+
+# Testing
+pytest -vv                                 # Run all tests
+pytest -m 'not integration'                # Unit tests only
+pytest -m 'integration'                    # Integration tests
+pytest tests/path/to/file.py               # Single test file
+pytest tests/path/to/file.py::TestClass    # Single test class
+pytest tests/path/to/file.py::TestClass::test_method  # Single test
+```
+
+## Directory Structure
+
+- `src/datahub/`: Source code for the DataHub CLI and ingestion framework
+- `tests/`: All tests (NOT in `src/` directory)
+- `tests/unit/`: Unit tests
+- `tests/integration/`: Integration tests
+- `scripts/`: Build and development scripts
+- `examples/`: Example ingestion configurations
+- `developing.md`: Detailed development environment information
+
+## Code Style Guidelines
+
+- **Formatting**: Uses ruff, 88 character line length
+- **Imports**: Sorted with ruff.lint.isort, no relative imports
+- **Types**: Always use type annotations, prefer Protocol for interfaces
+  - Avoid `Any` type - use specific types (`Dict[str, int]`, `TypedDict`, or typevars)
+  - Use `isinstance` checks instead of `hasattr`
+  - Prefer `assert isinstance(...)` over `cast`
+- **Data Structures**: Use dataclasses/pydantic for internal data representation
+  - Return dataclasses instead of tuples from methods
+  - Centralize utility functions to avoid code duplication
+- **Naming**: Descriptive names, match source system terminology in configs
+- **Error Handling**: Validators raise only ValueError/TypeError/AssertionError
+  - Add robust error handling with layers of protection for known failure points
+- **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`
+- **Documentation**: All configs need descriptions
+- **Dependencies**: Avoid version pinning, use ranges with comments
+- **Architecture**: Avoid tall inheritance hierarchies, prefer mixins
+
+## Testing Conventions
+
+- **Location**: Tests go in `tests/` directory alongside `src/`, NOT in `src/`
+- **Structure**: Test files should mirror the source directory structure
+- **Framework**: Use pytest, not unittest
+- **Assertions**: Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
+- **Classes**: Use regular classes, not `unittest.TestCase`
+- **Imports**: Import `pytest` in test files
+- **Naming**: Test files should be named `test_*.py`
+- **Categories**:
+  - Unit tests: `tests/unit/` - fast, no external dependencies
+  - Integration tests: `tests/integration/` - may use Docker/external services
+
+## Configuration Guidelines (Pydantic)
+
+- **Naming**: Match terminology of the source system (e.g., `account_id` for Snowflake, not `host_port`)
+- **Descriptions**: All configs must have descriptions
+- **Patterns**: Use `AllowDenyPattern` for filtering, named `*_pattern`
+- **Defaults**: Set reasonable defaults, avoid config-driven filtering that should be automatic
+- **Validation**: Single pydantic validator per validation concern
+- **Security**: Use `SecretStr` for passwords, auth tokens, etc.
+- **Deprecation**: Use `pydantic_removed_field` helper for field deprecations
+
+## Key Files
+
+- `src/datahub/emitter/mcp_builder.py`: Examples of defining various aspect types
+- `setup.py`, `pyproject.toml`, `setup.cfg`: Code style and dependency configuration
+- `.github/workflows/metadata-ingestion.yml`: CI workflow configuration
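Several of the "Code Style Guidelines" and "Testing Conventions" in `metadata-ingestion/CLAUDE.md` can be shown together in a short sketch: returning a dataclass instead of a tuple, using `isinstance` checks in place of `cast`, passing named arguments, and writing tests as plain classes with `assert`. All names here (`TableStats`, `summarize_tables`, `column_names`, `TestSummarizeTables`) are hypothetical and do not come from the repository. The `import pytest` line that the conventions ask for is left out only to keep the sketch standard-library-only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableStats:
    """Named result type, used in place of a (table_count, column_count) tuple."""

    table_count: int
    column_count: int


def summarize_tables(columns_by_table: Dict[str, List[str]]) -> TableStats:
    """Count tables and columns in a mapping of table name to column names."""
    return TableStats(
        table_count=len(columns_by_table),
        column_count=sum(len(columns) for columns in columns_by_table.values()),
    )


def column_names(raw: object) -> List[str]:
    """Narrow an untyped value with isinstance checks rather than typing.cast."""
    assert isinstance(raw, list)
    names: List[str] = []
    for item in raw:
        if not isinstance(item, str):
            raise TypeError(f"Expected a column name string, got {type(item).__name__}")
        names.append(item)
    return names


class TestSummarizeTables:
    """Plain test class in the pytest style: no unittest.TestCase, bare asserts."""

    def test_counts_tables_and_columns(self) -> None:
        stats = summarize_tables(
            columns_by_table={"users": ["id", "email"], "orders": ["id"]}
        )
        assert stats == TableStats(table_count=2, column_count=3)
```

In the repository layout described above, the test class would live under `tests/unit/` in a `test_*.py` file that mirrors the source path, and the function under test would live under `src/datahub/`.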