chore: add claude configs (#13983)

Harshal Sheth 2025-07-07 20:46:49 -04:00 committed by GitHub
parent 6d2796a1c1
commit eb349f7b1d
5 changed files with 230 additions and 40 deletions

23
.claude/settings.json Normal file

@ -0,0 +1,23 @@
{
"permissions": {
"allow": [
"Bash(cd:*)",
"Bash(gh pr diff:*)",
"Bash(gh pr view:*)",
"Bash(git diff:*)",
"Bash(grep:*)",
"Bash(head:*)",
"Bash(sed:*)",
"Bash(find:*)",
"Bash(rg:*)",
"WebFetch(domain:docs.datahub.com)",
"Bash(mypy:*)",
"Bash(pytest:*)",
"Bash(ruff:*)",
"Bash(python -m mypy:*)",
"Bash(python -m ruff:*)",
"Bash(python -m pytest:*)"
],
"deny": []
}
}
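
Each entry in the `allow` list above follows a `Tool(specifier)` shape. The sketch below is one way to inspect such a file and list which shell commands it pre-approves. The `PermissionRule` class and `parse_rule` helper are hypothetical names introduced for illustration. They do not reproduce how Claude Code itself matches rules, and the inlined JSON is a shortened copy of the file above.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches "Tool(specifier)" entries such as "Bash(git diff:*)".
RULE_RE = re.compile(r"^(?P<tool>\w+)\((?P<specifier>.*)\)$")


@dataclass(frozen=True)
class PermissionRule:
    """Hypothetical container for one parsed allow/deny entry."""

    tool: str
    specifier: str


def parse_rule(entry: str) -> Optional[PermissionRule]:
    """Split one entry into tool name and specifier; None if it does not fit."""
    match = RULE_RE.match(entry)
    if match is None:
        return None
    return PermissionRule(tool=match["tool"], specifier=match["specifier"])


# Shortened copy of the settings file, inlined so the sketch is self-contained.
SETTINGS = json.loads(
    """
    {
      "permissions": {
        "allow": [
          "Bash(gh pr diff:*)",
          "WebFetch(domain:docs.datahub.com)",
          "Bash(python -m pytest:*)"
        ],
        "deny": []
      }
    }
    """
)

rules: List[PermissionRule] = []
for entry in SETTINGS["permissions"]["allow"]:
    rule = parse_rule(entry)
    if rule is not None:
        rules.append(rule)

# Drop the trailing ":*" suffix to recover the command prefix of each Bash rule.
bash_commands = [r.specifier.rsplit(":", 1)[0] for r in rules if r.tool == "Bash"]
```

Here `bash_commands` holds the command prefixes (`gh pr diff`, `python -m pytest`), which makes it easy to review what a settings file lets the agent run without prompting.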

2
.gitignore vendored

@ -86,6 +86,7 @@ smoke-test/rollback-reports
coverage*.xml
.vercel
.envrc
**/.claude/settings.local.json
# A long series of binary directories we should ignore
datahub-frontend/bin/main/
@ -130,3 +131,4 @@ test-models/bin/
datahub-executor/
datahub-integrations-service/
metadata-ingestion-modules/acryl-cloud

View File

@ -1,40 +0,0 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.
## Coding conventions
- Keep code maintainable. This is not throw-away code. This goes to production.
- Generate unit tests where appropriate.
- Do not start generating random scripts to run the code you generated unless asked for.
- Do not add comments which are redundant given the function names
## Core concept docs
- `docs/what/urn.md` defines what a URN is
## Overall Directory structure
- This is the repository for the DataHub project.
- `README.MD` gives some basic information about the project.
- This is a multi-project Gradle build, so most folders contain a `build.gradle` file.
### metadata-ingestion module details
- `metadata-ingestion` contains source and tests for DataHub OSS CLI.
- `metadata-ingestion/developing.md` contains details about the environment used for testing.
- `.github/workflows/metadata-ingestion.yml` contains our github workflow that is used in CI.
- `metadata-ingestion/build.gradle` contains our build.gradle that has gradle tasks defined for this module
- `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
- The `.md` files at the top level of this folder give important information about the concepts of ingestion
- You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
- Source code goes in `metadata-ingestion/src/`
- Tests go in `metadata-ingestion/tests/` (not in `src/`)
- **Testing conventions for metadata-ingestion**:
  - Unit tests: `metadata-ingestion/tests/unit/`
  - Integration tests: `metadata-ingestion/tests/integration/`
  - Test files should mirror the source directory structure
  - Use pytest, not unittest
  - Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
  - Use regular classes, not `unittest.TestCase`
  - Import `pytest` in test files
- Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files

110
CLAUDE.md Normal file

@ -0,0 +1,110 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Essential Commands
**Build and test:**
```bash
./gradlew build # Build entire project
./gradlew check # Run all tests and linting
# Note that each directory typically has a build.gradle file, but the available tasks follow similar conventions.
# Java code.
./gradlew spotlessApply # Java code formatting
# Python code.
./gradlew :metadata-ingestion:testQuick # Fast Python unit tests
./gradlew :metadata-ingestion:lint # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix # Python linting auto-fix (ruff only)
```
**Development setup:**
```bash
./gradlew :metadata-ingestion:installDev # Setup Python environment
./gradlew quickstartDebug # Start full DataHub stack
cd datahub-web-react && yarn start # Frontend dev server
```
## Architecture Overview
DataHub is a **schema-first, event-driven metadata platform** built from four core services:
### Core Services
- **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
- **Frontend**: React/TypeScript application consuming GraphQL APIs
- **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
- **Event Streaming**: Kafka-based real-time metadata change propagation
### Key Modules
- `metadata-models/`: Avro/PDL schemas defining the metadata model
- `metadata-service/`: Backend services, APIs, and business logic
- `datahub-web-react/`: Frontend React application
- `metadata-ingestion/`: Python ingestion framework and CLI
- `datahub-graphql-core/`: GraphQL schema and resolvers
### Metadata Model Concepts
- **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
- **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
- **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
- **MCE/MCL**: Metadata Change Events/Logs for updates
- **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
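
To make the URN shape above concrete, here is a minimal sketch that splits the example dataset URN into its platform, name, and environment using only string operations. `DatasetUrnParts` and `parse_dataset_urn` are hypothetical names for illustration. This is not the DataHub SDK's URN API, and it assumes the three-part dataset layout shown in the example.

```python
from dataclasses import dataclass

DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


@dataclass(frozen=True)
class DatasetUrnParts:
    """Hypothetical result type: the three components of a dataset URN."""

    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrnParts:
    """Split a dataset URN of the form shown above into its parts."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn!r}")
    inner = urn[len(DATASET_PREFIX):-1]
    # Take the platform from the front and the env from the back, so a
    # dataset name that itself contains commas stays intact.
    platform_urn, rest = inner.split(",", 1)
    name, env = rest.rsplit(",", 1)
    if not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"Unexpected platform component: {platform_urn!r}")
    return DatasetUrnParts(
        platform=platform_urn[len(PLATFORM_PREFIX):],
        name=name,
        env=env,
    )


parts = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)")
```

For the example URN, `parts` carries the platform `mysql`, the name `db.table`, and the environment `PROD`.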
## Development Flow
1. **Schema changes** in `metadata-models/` trigger code generation across all languages
2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs
## Code Standards
### General Principles
- This is production code - maintain high quality
- Follow existing patterns within each module
- Generate appropriate unit tests
- Use type annotations everywhere (Python/TypeScript)
### Language-Specific
- **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
- **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
  - **Error Handling**: Robust error handling with layers of protection for known failure points
- **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library
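
As an illustration of the Python points above (specific types instead of `Any`, a dataclass return value instead of a tuple, named arguments), consider the following sketch. `TableStats` and `summarize_tables` are invented for this example and do not exist in the DataHub codebase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableStats:
    """Returned instead of an anonymous (int, dict) tuple."""

    table_count: int
    columns_per_table: Dict[str, int]


def summarize_tables(*, schemas: Dict[str, List[str]]) -> TableStats:
    """Count columns per table; the bare `*` forces callers to name arguments."""
    columns_per_table = {table: len(columns) for table, columns in schemas.items()}
    return TableStats(
        table_count=len(columns_per_table),
        columns_per_table=columns_per_table,
    )


stats = summarize_tables(schemas={"db.users": ["id", "email"], "db.orders": ["id"]})
```

Callers then read `stats.table_count` and `stats.columns_per_table` by name, which stays readable if more fields are added later.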
### Testing Strategy
- Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
- Java: Tests alongside source in `src/test/`
- Frontend: Tests in `__tests__/` or `.test.tsx` files
- Smoke tests go in the `smoke-test/` directory
## Key Documentation
**Essential reading:**
- `docs/architecture/architecture.md` - System architecture overview
- `docs/modeling/metadata-model.md` - How metadata is modeled
- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)
**External docs:**
- https://docs.datahub.com/docs/developers - Official developer guide
- https://demo.datahub.com/ - Live demo environment
## Important Notes
- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
- All metadata changes flow through the event streaming system
- GraphQL schema is generated from backend GMS APIs
- Follow Conventional Commits format for commit messages

View File

@ -0,0 +1,95 @@
# DataHub Metadata Ingestion Development Guide
## Build and Test Commands
**Using Gradle (slow but reliable):**
```bash
# Development setup (run from the metadata-ingestion/ directory; note the ../gradlew path)
../gradlew :metadata-ingestion:installDev # Setup Python environment
source venv/bin/activate # Activate virtual environment
# Linting and formatting
../gradlew :metadata-ingestion:lint # Run ruff + mypy
../gradlew :metadata-ingestion:lintFix # Auto-fix linting issues
# Testing
../gradlew :metadata-ingestion:testQuick # Fast unit tests
../gradlew :metadata-ingestion:testFull # All tests
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_file.py # Single test
```
**Direct Python commands (when venv is activated):**
```bash
# Linting
ruff format src/ tests/
ruff check src/ tests/
mypy src/ tests/
# Testing
pytest -vv # Run all tests
pytest -m 'not integration' # Unit tests only
pytest -m 'integration' # Integration tests
pytest tests/path/to/file.py # Single test file
pytest tests/path/to/file.py::TestClass # Single test class
pytest tests/path/to/file.py::TestClass::test_method # Single test
```
## Directory Structure
- `src/datahub/`: Source code for the DataHub CLI and ingestion framework
- `tests/`: All tests (NOT in `src/` directory)
- `tests/unit/`: Unit tests
- `tests/integration/`: Integration tests
- `scripts/`: Build and development scripts
- `examples/`: Example ingestion configurations
- `developing.md`: Detailed development environment information
## Code Style Guidelines
- **Formatting**: Uses ruff, 88 character line length
- **Imports**: Sorted with ruff.lint.isort, no relative imports
- **Types**: Always use type annotations, prefer Protocol for interfaces
  - Avoid `Any` type - use specific types (`Dict[str, int]`, `TypedDict`, or typevars)
  - Use `isinstance` checks instead of `hasattr`
  - Prefer `assert isinstance(...)` over `cast`
- **Data Structures**: Use dataclasses/pydantic for internal data representation
  - Return dataclasses instead of tuples from methods
  - Centralize utility functions to avoid code duplication
- **Naming**: Descriptive names, match source system terminology in configs
- **Error Handling**: Validators throw only ValueError/TypeError/AssertionError
  - Add robust error handling with layers of protection for known failure points
- **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`
- **Documentation**: All configs need descriptions
- **Dependencies**: Avoid version pinning, use ranges with comments
- **Architecture**: Avoid tall inheritance hierarchies, prefer mixins
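
A short sketch of two of the typing guidelines above: a `Protocol` as the interface, and `assert isinstance(...)` in place of `cast`. All names here (`Closeable`, `FakeConnection`, `close_all`, `read_port`) are hypothetical and serve only to illustrate the style.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class Closeable(Protocol):
    """Structural interface: anything with a close() method qualifies."""

    def close(self) -> None: ...


@dataclass
class FakeConnection:
    """Satisfies Closeable without inheriting from it."""

    closed: bool = False

    def close(self) -> None:
        self.closed = True


def close_all(*, resources: List[Closeable]) -> int:
    """Close every resource and report how many were closed."""
    for resource in resources:
        resource.close()
    return len(resources)


def read_port(*, raw_config: Dict[str, object]) -> int:
    """Narrow an untyped value with a runtime check rather than cast()."""
    port = raw_config["port"]
    # Unlike cast(), this fails loudly if the value has the wrong type.
    assert isinstance(port, int), f"port must be an int, got {type(port).__name__}"
    return port


connection = FakeConnection()
closed_count = close_all(resources=[connection])
port = read_port(raw_config={"port": 3306})
```

The `assert isinstance` form gives the type checker the same narrowing that `cast` would, while also verifying the assumption at runtime.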
## Testing Conventions
- **Location**: Tests go in `tests/` directory alongside `src/`, NOT in `src/`
- **Structure**: Test files should mirror the source directory structure
- **Framework**: Use pytest, not unittest
- **Assertions**: Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
- **Classes**: Use regular classes, not `unittest.TestCase`
- **Imports**: Import `pytest` in test files
- **Naming**: Test files should be named `test_*.py`
- **Categories**:
  - Unit tests: `tests/unit/` - fast, no external dependencies
  - Integration tests: `tests/integration/` - may use Docker/external services
## Configuration Guidelines (Pydantic)
- **Naming**: Match terminology of the source system (e.g., `account_id` for Snowflake, not `host_port`)
- **Descriptions**: All configs must have descriptions
- **Patterns**: Use AllowDenyPatterns for filtering, named `*_pattern`
- **Defaults**: Set reasonable defaults, avoid config-driven filtering that should be automatic
- **Validation**: Single pydantic validator per validation concern
- **Security**: Use `SecretStr` for passwords, auth tokens, etc.
- **Deprecation**: Use `pydantic_removed_field` helper for field deprecations
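
The `*_pattern` convention above can be pictured with a simplified stand-in. The sketch below is not DataHub's `AllowDenyPattern` and uses a standard-library dataclass rather than pydantic, purely to stay self-contained. It assumes that deny rules take precedence over allow rules and that the default allows everything. The real helper may differ in matching details such as case sensitivity.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SimpleAllowDeny:
    """Hypothetical, simplified stand-in for an allow/deny regex filter."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # Assumption: a deny match always wins over an allow match.
        if any(re.match(pattern, name) for pattern in self.deny):
            return False
        return any(re.match(pattern, name) for pattern in self.allow)


@dataclass(frozen=True)
class ExampleSourceConfig:
    """Hypothetical config; the filter field follows the `*_pattern` naming."""

    table_pattern: SimpleAllowDeny = field(default_factory=SimpleAllowDeny)


config = ExampleSourceConfig(
    table_pattern=SimpleAllowDeny(allow=[r"analytics\..*"], deny=[r".*_tmp$"])
)
kept = [
    name
    for name in ["analytics.orders", "analytics.orders_tmp", "staging.orders"]
    if config.table_pattern.allowed(name)
]
```

With these patterns only `analytics.orders` is kept, while a default `ExampleSourceConfig()` lets every table through. That mirrors the guideline that defaults should be reasonable without extra configuration.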
## Key Files
- `src/datahub/emitter/mcp_builder.py`: Examples of defining various aspect types
- `setup.py`, `pyproject.toml`, `setup.cfg`: Code style and dependency configuration
- `.github/workflows/metadata-ingestion.yml`: CI workflow configuration