# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Essential Commands

**Build and test:**

```bash
./gradlew build           # Build entire project
./gradlew check           # Run all tests and linting

# Note that each directory typically has a build.gradle file, but the available tasks follow similar conventions.

# Java code.
./gradlew spotlessApply   # Java code formatting

# Python code.
./gradlew :metadata-ingestion:testQuick     # Fast Python unit tests
./gradlew :metadata-ingestion:lint          # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix       # Python linting auto-fix (ruff only)
```

**Development setup:**

```bash
./gradlew :metadata-ingestion:installDev               # Setup Python environment
./gradlew quickstartDebug                              # Start full DataHub stack
cd datahub-web-react && yarn start                     # Frontend dev server
```

## Architecture Overview

DataHub is a **schema-first, event-driven metadata platform** with three core layers:

### Core Services

- **GMS (Generalized Metadata Service)**: Java/Spring backend handling metadata storage and REST/GraphQL APIs
- **Frontend**: React/TypeScript application consuming GraphQL APIs
- **Ingestion Framework**: Python CLI and connectors for extracting metadata from data sources
- **Event Streaming**: Kafka-based real-time metadata change propagation

### Key Modules

- `metadata-models/`: Avro/PDL schemas defining the metadata model
- `metadata-service/`: Backend services, APIs, and business logic
- `datahub-web-react/`: Frontend React application
- `metadata-ingestion/`: Python ingestion framework and CLI
- `datahub-graphql-core/`: GraphQL schema and resolvers

Most of the non-frontend modules are written in Java. The modules written in Python are:

- `metadata-ingestion/`
- `datahub-actions/`
- `metadata-ingestion-modules/airflow-plugin/`
- `metadata-ingestion-modules/gx-plugin/`
- `metadata-ingestion-modules/dagster-plugin/`
- `metadata-ingestion-modules/prefect-plugin/`

Each Python module has a Gradle setup similar to `metadata-ingestion/` (documented above).

### Metadata Model Concepts

- **Entities**: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
- **Aspects**: Metadata facets (Ownership, Schema, Documentation, etc.)
- **URNs**: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
- **MCE/MCL**: Metadata Change Events/Logs for updates
- **Entity Registry**: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
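As a rough illustration of the URN format listed above, the sketch below assembles and splits a dataset URN using plain string handling. The names `DatasetUrnParts`, `build_dataset_urn`, and `parse_dataset_urn` are hypothetical and exist only for this example; they are not DataHub APIs. The parser assumes the dataset name contains no commas or parentheses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetUrnParts:
    """Hypothetical container for the three parts of a dataset URN."""

    platform: str
    name: str
    env: str


def build_dataset_urn(*, platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: assemble a dataset URN string from its parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def parse_dataset_urn(*, urn: str) -> DatasetUrnParts:
    """Hypothetical helper: split a dataset URN back into its parts.

    Assumes the dataset name contains no commas or parentheses.
    """
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn}")
    # Inner text looks like "urn:li:dataPlatform:mysql,db.table,PROD".
    platform_urn, name, env = urn[len(prefix):-1].split(",")
    platform = platform_urn.rsplit(":", 1)[-1]
    return DatasetUrnParts(platform=platform, name=name, env=env)
```

Inside the repository, prefer whatever URN utilities the module you are editing already uses over hand-rolled string handling like this.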

## Development Flow

1. **Schema changes** in `metadata-models/` trigger code generation across all languages
2. **Backend changes** in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
3. **Frontend changes** in `datahub-web-react/` consume GraphQL APIs
4. **Ingestion changes** in `metadata-ingestion/` emit metadata to backend APIs

## Code Standards

### General Principles

- This is production code - maintain high quality
- Follow existing patterns within each module
- Generate appropriate unit tests
- Use type annotations everywhere (Python/TypeScript)

### Language-Specific

- **Java**: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
- **Python**: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - **Type Safety**: Everything must have type annotations, avoid `Any` type, use specific types (`Dict[str, int]`, `TypedDict`)
  - **Data Structures**: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - **Code Quality**: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
  - **Error Handling**: Robust error handling with layers of protection for known failure points
- **TypeScript**: Use Prettier formatting, strict types (no `any`), React Testing Library
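The Python points above can be sketched in a few lines. This is a minimal illustration, not code from the repository: `ColumnSummary`, `summarize_columns`, and `parse_row_count` are hypothetical names. It shows specific types instead of `Any`, a dataclass return instead of a tuple, keyword-only (named) arguments, and defensive handling of a known failure point.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ColumnSummary:
    """Returned instead of a bare (total, nullable) tuple so call sites stay readable."""

    total_columns: int
    nullable_columns: int


def summarize_columns(*, nullable_by_column: Dict[str, bool]) -> ColumnSummary:
    """Count columns and how many of them are nullable."""
    nullable = sum(1 for is_nullable in nullable_by_column.values() if is_nullable)
    return ColumnSummary(
        total_columns=len(nullable_by_column),
        nullable_columns=nullable,
    )


def parse_row_count(*, raw_value: Optional[str]) -> Optional[int]:
    """Known failure point: a source may report a missing or malformed count.

    Returns None rather than raising, so one bad value does not abort a run.
    """
    if raw_value is None:
        return None
    try:
        return int(raw_value)
    except ValueError:
        return None
```

The `*` in each signature forces callers to use named arguments, which keeps call sites such as `parse_row_count(raw_value="42")` self-describing.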

### Code Comments

Only add comments that provide real value beyond what the code already expresses.

**Do NOT** add comments for:

- Obvious operations (`# Get user by ID`, `// Create connection`)
- What the code does when it's self-evident (`# Loop through items`, `// Set variable to true`)
- Restating parameter names or return types already in signatures
- Basic language constructs (`# Import modules`, `// End of function`)

**DO** add comments for:

- **Why** something is done, especially non-obvious business logic or workarounds
- **Context** about external constraints, API quirks, or domain knowledge
- **Warnings** about gotchas, performance implications, or side effects
- **References** to tickets, RFCs, or external documentation that explain decisions
- **Complex algorithms** or mathematical formulas that aren't immediately clear
- **Temporary solutions** with TODOs and context for future improvements

Examples:

```python
# Good: Explains WHY and provides context
# Use a 30-second timeout because Snowflake's query API can hang indefinitely
# on large result sets. See issue #12345.
connection_timeout = 30

# Bad: Restates what's obvious from code
# Set connection timeout to 30 seconds
connection_timeout = 30
```

### Testing Strategy

- Python: Tests go in the `tests/` directory alongside `src/`, use `assert` statements
- Java: Tests alongside source in `src/test/`
- Frontend: Tests in `__tests__/` or `.test.tsx` files
- Smoke tests go in the `smoke-test/` directory
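For the Python convention above, a test might look like the following sketch. The function under test, `normalize_platform_name`, is hypothetical and defined inline only to keep the example self-contained; in the repository it would live under `src/` and the tests under `tests/`.

```python
def normalize_platform_name(*, raw_name: str) -> str:
    """Hypothetical function under test: trim whitespace and lowercase."""
    return raw_name.strip().lower()


def test_normalize_platform_name_strips_and_lowercases() -> None:
    # Plain `assert` statements; pytest reports the compared values on failure.
    assert normalize_platform_name(raw_name="  MySQL ") == "mysql"


def test_normalize_platform_name_keeps_clean_input() -> None:
    assert normalize_platform_name(raw_name="snowflake") == "snowflake"
```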

#### Security Testing: Configuration Property Classification

**Critical test**: `metadata-io/src/test/java/com/linkedin/metadata/system_info/collectors/PropertiesCollectorConfigurationTest.java`

This test prevents sensitive data leaks by requiring explicit classification of all configuration properties as either sensitive (redacted) or non-sensitive (visible in system info).

**When adding new configuration properties**: The test will fail with clear instructions on which classification list to add your property to. Refer to the test file's comprehensive documentation for template syntax and examples.

This is a mandatory security guardrail - never disable or skip this test.

### Commits

- Follow Conventional Commits format for commit messages
- Breaking Changes: Always update `docs/how/updating-datahub.md` for breaking changes. Write entries for non-technical audiences, reference the PR number, and focus on what users need to change rather than internal implementation details
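As a rough guide to the header shape that Conventional Commits expects (`type(scope): description`, with the scope optional), the sketch below checks a commit subject line against a simplified pattern. The function `is_conventional_header` and the regular expression are assumptions made for illustration; they are not part of the repository's tooling and do not cover the full specification (for example, commit bodies and footers are ignored).

```python
import re

# Simplified header pattern: lowercase type, optional (scope), optional "!"
# marking a breaking change, then ": " and a non-empty description.
_HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional_header(*, header: str) -> bool:
    """Hypothetical check: does the subject line match the simplified pattern?"""
    return _HEADER_PATTERN.match(header) is not None
```

Under this pattern, `feat(ingestion): add retry to connector` and `docs: fix typo` are accepted, while `Fixed a bug` is rejected.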

## Key Documentation

**Essential reading:**

- `docs/architecture/architecture.md` - System architecture overview
- `docs/modeling/metadata-model.md` - How metadata is modeled
- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)

**External docs:**

- https://docs.datahub.com/docs/developers - Official developer guide
- https://demo.datahub.com/ - Live demo environment

## Important Notes

- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
- All metadata changes flow through the event streaming system
- GraphQL schema is generated from backend GMS APIs