mirror of https://github.com/datahub-project/datahub.git (synced 2025-08-26 18:15:59 +00:00)
chore: add claude configs (#13983)
parent 6d2796a1c1
commit eb349f7b1d
23
.claude/settings.json
Normal file
@@ -0,0 +1,23 @@
{
  "permissions": {
    "allow": [
      "Bash(cd:*)",
      "Bash(gh pr diff:*)",
      "Bash(gh pr view:*)",
      "Bash(git diff:*)",
      "Bash(grep:*)",
      "Bash(head:*)",
      "Bash(sed:*)",
      "Bash(find:*)",
      "Bash(rg:*)",
      "WebFetch(domain:docs.datahub.com)",
      "Bash(mypy:*)",
      "Bash(pytest:*)",
      "Bash(ruff:*)",
      "Bash(python -m mypy:*)",
      "Bash(python -m ruff:*)",
      "Bash(python -m pytest:*)"
    ],
    "deny": []
  }
}
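Each entry in the `allow` list above has the shape `Tool(specifier)`. The following is a minimal sketch of reading such a file and splitting the entries into their two parts. It only splits the strings and makes no claim about how the consuming tool matches them. The names `PermissionRule`, `parse_rule`, and `load_allow_rules` are hypothetical, and the inline JSON is a shortened copy of the file in this commit.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Optional

# Shortened copy of .claude/settings.json from this commit.
SETTINGS_JSON = """
{
  "permissions": {
    "allow": [
      "Bash(git diff:*)",
      "Bash(pytest:*)",
      "WebFetch(domain:docs.datahub.com)"
    ],
    "deny": []
  }
}
"""

# Hypothetical pattern: a tool name followed by a parenthesised specifier.
RULE_RE = re.compile(r"^(?P<tool>\w+)\((?P<specifier>.*)\)$")


@dataclass(frozen=True)
class PermissionRule:
    tool: str
    specifier: str


def parse_rule(rule: str) -> Optional[PermissionRule]:
    """Split 'Tool(specifier)' into its parts; return None if the shape differs."""
    match = RULE_RE.match(rule)
    if match is None:
        return None
    return PermissionRule(tool=match.group("tool"), specifier=match.group("specifier"))


def load_allow_rules(settings_text: str) -> List[PermissionRule]:
    """Parse the JSON text and return the well-formed entries of permissions.allow."""
    settings = json.loads(settings_text)
    parsed = (parse_rule(rule) for rule in settings["permissions"]["allow"])
    return [rule for rule in parsed if rule is not None]


rules = load_allow_rules(SETTINGS_JSON)
bash_commands = [rule.specifier for rule in rules if rule.tool == "Bash"]
```

Here `bash_commands` collects only the specifiers of the `Bash` entries, which is a quick way to review which shell commands a settings file pre-approves.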
2
.gitignore
vendored
@@ -86,6 +86,7 @@ smoke-test/rollback-reports
coverage*.xml
.vercel
.envrc
**/.claude/settings.local.json

# A long series of binary directories we should ignore
datahub-frontend/bin/main/
@@ -130,3 +131,4 @@ test-models/bin/
datahub-executor/
datahub-integrations-service/
metadata-ingestion-modules/acryl-cloud

40
CLAUDE.MD
@@ -1,40 +0,0 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.

## Coding conventions

- Keep code maintainable. This is not throw-away code. This goes to production.
- Generate unit tests where appropriate.
- Do not start generating random scripts to run the code you generated unless asked for.
- Do not add comments which are redundant given the function names

## Core concept docs

- `docs/what/urn.md` defines what a URN is

## Overall Directory structure

- This is repository for DataHub project.
- `README.MD` should give some basic information about the project.
- This is a multi-project gradle project so you will find a lot of `build.gradle` in most folders

### metadata-ingestion module details
- `metadata-ingestion` contains source and tests for DataHub OSS CLI.
- `metadata-ingestion/developing.md` contains details about the environment used for testing.
- `.github/workflows/metadata-ingestion.yml` contains our github workflow that is used in CI.
- `metadata-ingestion/build.gradle` contains our build.gradle that has gradle tasks defined for this module
- `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
- The `.md` files at top level in this folder gives you important information about the concepts of ingestion
- You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
- Source code goes in `metadata-ingestion/src/`
- Tests go in `metadata-ingestion/tests/` (not in `src/`)
- **Testing conventions for metadata-ingestion**:
  - Unit tests: `metadata-ingestion/tests/unit/`
  - Integration tests: `metadata-ingestion/tests/integration/`
  - Test files should mirror the source directory structure
  - Use pytest, not unittest
  - Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
  - Use regular classes, not `unittest.TestCase`
  - Import `pytest` in test files
  - Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files
110
CLAUDE.md
Normal file
@@ -0,0 +1,110 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Essential Commands

**Build and test:**

```bash
./gradlew build           # Build entire project
./gradlew check           # Run all tests and linting

# Note that each directory typically has a build.gradle file, but the available tasks follow similar conventions.

# Java code.
./gradlew spotlessApply   # Java code formatting

# Python code.
./gradlew :metadata-ingestion:testQuick   # Fast Python unit tests
./gradlew :metadata-ingestion:lint        # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix     # Python linting auto-fix (ruff only)
```

**Development setup:**

```bash
./gradlew :metadata-ingestion:installDev   # Setup Python environment
./gradlew quickstartDebug                  # Start full DataHub stack
cd datahub-web-react && yarn start         # Frontend dev server
```

## Architecture Overview

DataHub is a **schema-first, event-driven metadata platform** with three core layers:

### Core Services

- **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
- **Frontend**: React/TypeScript application consuming GraphQL APIs
- **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
- **Event Streaming**: Kafka-based real-time metadata change propagation

### Key Modules

- `metadata-models/`: Avro/PDL schemas defining the metadata model
- `metadata-service/`: Backend services, APIs, and business logic
- `datahub-web-react/`: Frontend React application
- `metadata-ingestion/`: Python ingestion framework and CLI
- `datahub-graphql-core/`: GraphQL schema and resolvers

### Metadata Model Concepts

- **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
- **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
- **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
- **MCE/MCL**: Metadata Change Events/Logs for updates
- **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)

## Development Flow

1. **Schema changes** in `metadata-models/` trigger code generation across all languages
2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs

## Code Standards

### General Principles

- This is production code - maintain high quality
- Follow existing patterns within each module
- Generate appropriate unit tests
- Use type annotations everywhere (Python/TypeScript)

### Language-Specific

- **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
- **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
  - **Error Handling**: Robust error handling with layers of protection for known failure points
- **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library

### Testing Strategy

- Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
- Java: Tests alongside source in `src/test/`
- Frontend: Tests in `__tests__/` or `.test.tsx` files
- Smoke tests go in the `smoke-test/` directory

## Key Documentation

**Essential reading:**

- `docs/architecture/architecture.md` - System architecture overview
- `docs/modeling/metadata-model.md` - How metadata is modeled
- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)

**External docs:**

- https://docs.datahub.com/docs/developers - Official developer guide
- https://demo.datahub.com/ - Live demo environment

## Important Notes

- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
- All metadata changes flow through the event streaming system
- GraphQL schema is generated from backend GMS APIs
- Follow Conventional Commits format for commit messages
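The new CLAUDE.md gives `urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)` as an example URN and asks Python code to return dataclasses rather than tuples. As an illustration of both points, here is a minimal sketch that splits that example into named fields. The regex, `DatasetUrnParts`, and `parse_dataset_urn` are hypothetical and cover only this dataset shape. Production code would presumably rely on the URN helpers shipped with the DataHub Python package instead.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for the dataset URN shape quoted in CLAUDE.md:
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
DATASET_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:(?P<platform>[^,]+),"
    r"(?P<name>.+),(?P<env>[^,)]+)\)$"
)


@dataclass(frozen=True)
class DatasetUrnParts:
    """Named fields instead of an anonymous (platform, name, env) tuple."""

    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrnParts:
    """Split a dataset URN into its parts, raising ValueError on other input."""
    match = DATASET_URN_RE.match(urn)
    if match is None:
        raise ValueError(f"Not a dataset URN: {urn!r}")
    return DatasetUrnParts(
        platform=match.group("platform"),
        name=match.group("name"),
        env=match.group("env"),
    )


parts = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)")
```

Callers read `parts.platform` or `parts.env` by name, which is the readability gain the "dataclasses over tuples" guideline is after.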
95
metadata-ingestion/CLAUDE.md
Normal file
@@ -0,0 +1,95 @@
# DataHub Metadata Ingestion Development Guide

## Build and Test Commands

**Using Gradle (slow but reliable):**

```bash
# Development setup from repository root
../gradlew :metadata-ingestion:installDev   # Setup Python environment
source venv/bin/activate                    # Activate virtual environment

# Linting and formatting
../gradlew :metadata-ingestion:lint         # Run ruff + mypy
../gradlew :metadata-ingestion:lintFix      # Auto-fix linting issues

# Testing
../gradlew :metadata-ingestion:testQuick    # Fast unit tests
../gradlew :metadata-ingestion:testFull     # All tests
../gradlew :metadata-ingestion:testSingle -PtestFile=tests/unit/test_file.py   # Single test
```

**Direct Python commands (when venv is activated):**

```bash
# Linting
ruff format src/ tests/
ruff check src/ tests/
mypy src/ tests/

# Testing
pytest -vv                                             # Run all tests
pytest -m 'not integration'                            # Unit tests only
pytest -m 'integration'                                # Integration tests
pytest tests/path/to/file.py                           # Single test file
pytest tests/path/to/file.py::TestClass                # Single test class
pytest tests/path/to/file.py::TestClass::test_method   # Single test
```

## Directory Structure

- `src/datahub/`: Source code for the DataHub CLI and ingestion framework
- `tests/`: All tests (NOT in `src/` directory)
  - `tests/unit/`: Unit tests
  - `tests/integration/`: Integration tests
- `scripts/`: Build and development scripts
- `examples/`: Example ingestion configurations
- `developing.md`: Detailed development environment information

## Code Style Guidelines

- **Formatting**: Uses ruff, 88 character line length
- **Imports**: Sorted with ruff.lint.isort, no relative imports
- **Types**: Always use type annotations, prefer Protocol for interfaces
  - Avoid `Any` type - use specific types (`Dict[str, int]`, `TypedDict`, or typevars)
  - Use `isinstance` checks instead of `hasattr`
  - Prefer `assert isinstance(...)` over `cast`
- **Data Structures**: Use dataclasses/pydantic for internal data representation
  - Return dataclasses instead of tuples from methods
  - Centralize utility functions to avoid code duplication
- **Naming**: Descriptive names, match source system terminology in configs
- **Error Handling**: Validators throw only ValueError/TypeError/AssertionError
  - Add robust error handling with layers of protection for known failure points
- **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`
- **Documentation**: All configs need descriptions
- **Dependencies**: Avoid version pinning, use ranges with comments
- **Architecture**: Avoid tall inheritance hierarchies, prefer mixins

## Testing Conventions

- **Location**: Tests go in `tests/` directory alongside `src/`, NOT in `src/`
- **Structure**: Test files should mirror the source directory structure
- **Framework**: Use pytest, not unittest
- **Assertions**: Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
- **Classes**: Use regular classes, not `unittest.TestCase`
- **Imports**: Import `pytest` in test files
- **Naming**: Test files should be named `test_*.py`
- **Categories**:
  - Unit tests: `tests/unit/` - fast, no external dependencies
  - Integration tests: `tests/integration/` - may use Docker/external services

## Configuration Guidelines (Pydantic)

- **Naming**: Match terminology of the source system (e.g., `account_id` for Snowflake, not `host_port`)
- **Descriptions**: All configs must have descriptions
- **Patterns**: Use AllowDenyPatterns for filtering, named `*_pattern`
- **Defaults**: Set reasonable defaults, avoid config-driven filtering that should be automatic
- **Validation**: Single pydantic validator per validation concern
- **Security**: Use `SecretStr` for passwords, auth tokens, etc.
- **Deprecation**: Use `pydantic_removed_field` helper for field deprecations

## Key Files

- `src/datahub/emitter/mcp_builder.py`: Examples of defining various aspect types
- `setup.py`, `pyproject.toml`, `setup.cfg`: Code style and dependency configuration
- `.github/workflows/metadata-ingestion.yml`: CI workflow configuration
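The "Types" guidelines in metadata-ingestion/CLAUDE.md prefer `isinstance` checks over `hasattr` and `assert isinstance(...)` over `cast`. A minimal sketch of what that could look like follows. `TableRef`, `ViewRef`, `describe`, and `first_view_definition` are hypothetical names invented for this illustration and do not come from the DataHub codebase.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TableRef:
    name: str


@dataclass
class ViewRef:
    name: str
    definition: str


def describe(ref: Union[TableRef, ViewRef]) -> str:
    # An explicit isinstance check states which type is expected and lets
    # type checkers narrow `ref` to ViewRef inside the branch.
    if isinstance(ref, ViewRef):
        return f"view {ref.name}: {ref.definition}"
    return f"table {ref.name}"


def first_view_definition(refs: List[Union[TableRef, ViewRef]]) -> str:
    first = refs[0]
    # Unlike typing.cast, this is verified at runtime: a TableRef here fails
    # with AssertionError instead of a later AttributeError.
    assert isinstance(first, ViewRef)
    return first.definition
```

For example, `describe(TableRef(name="orders"))` takes the fallback branch, while passing a list that starts with a `TableRef` to `first_view_definition` fails at the assertion.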