CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Essential Commands
Build and test:
./gradlew build # Build entire project
./gradlew check # Run all tests and linting
# Note: each directory typically has its own build.gradle file, and the available tasks follow similar conventions across directories.
# Java code.
./gradlew spotlessApply # Java code formatting
# Python code.
./gradlew :metadata-ingestion:testQuick # Fast Python unit tests
./gradlew :metadata-ingestion:lint # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix # Python linting auto-fix (ruff only)
Development setup:
./gradlew :metadata-ingestion:installDev # Setup Python environment
./gradlew quickstartDebug # Start full DataHub stack
cd datahub-web-react && yarn start # Frontend dev server
Architecture Overview
DataHub is a schema-first, event-driven metadata platform. Its core services and key modules are listed below:
Core Services
- GMS (Generalized Metadata Service): Java/Spring backend handling metadata storage and REST/GraphQL APIs
- Frontend: React/TypeScript application consuming GraphQL APIs
- Ingestion Framework: Python CLI and connectors for extracting metadata from data sources
- Event Streaming: Kafka-based real-time metadata change propagation
Key Modules
- metadata-models/: Avro/PDL schemas defining the metadata model
- metadata-service/: Backend services, APIs, and business logic
- datahub-web-react/: Frontend React application
- metadata-ingestion/: Python ingestion framework and CLI
- datahub-graphql-core/: GraphQL schema and resolvers
Most of the non-frontend modules are written in Java. The modules written in Python are:
- metadata-ingestion/
- datahub-actions/
- metadata-ingestion-modules/airflow-plugin/
- metadata-ingestion-modules/gx-plugin/
- metadata-ingestion-modules/dagster-plugin/
- metadata-ingestion-modules/prefect-plugin/
Each Python module has a Gradle setup similar to that of metadata-ingestion/ (documented above).
Metadata Model Concepts
- Entities: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
- Aspects: Metadata facets (Ownership, Schema, Documentation, etc.)
- URNs: Unique identifiers, e.g. urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)
- MCE/MCL: Metadata Change Events/Logs for updates
- Entity Registry: YAML config defining entity-aspect relationships (metadata-models/src/main/resources/entity-registry.yml)
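To make the URN structure concrete, here is a minimal sketch that splits the example dataset URN above into its platform, name, and environment. The `DatasetUrnParts` class and `parse_dataset_urn` function are hypothetical names invented for illustration, not part of the DataHub SDK; real code should prefer whatever URN helpers the repository already provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetUrnParts:
    """Hypothetical container for the three fields of a dataset URN."""

    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrnParts:
    """Split a dataset URN of the form shown above into its fields."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    # Peel the platform off the front and the environment off the back,
    # so a comma inside the dataset name would not break the split.
    platform_urn, _, rest = body.partition(",")
    name, _, env = rest.rpartition(",")
    platform_prefix = "urn:li:dataPlatform:"
    if not platform_urn.startswith(platform_prefix) or not name or not env:
        raise ValueError(f"Malformed dataset URN: {urn!r}")
    return DatasetUrnParts(
        platform=platform_urn[len(platform_prefix):], name=name, env=env
    )


parts = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)")
print(parts)  # DatasetUrnParts(platform='mysql', name='db.table', env='PROD')
```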
Development Flow
- Schema changes in metadata-models/ trigger code generation across all languages
- Backend changes in metadata-service/ and other Java modules expose new REST/GraphQL APIs
- Frontend changes in datahub-web-react/ consume GraphQL APIs
- Ingestion changes in metadata-ingestion/ emit metadata to backend APIs
Code Standards
General Principles
- This is production code - maintain high quality
- Follow existing patterns within each module
- Generate appropriate unit tests
- Use type annotations everywhere (Python/TypeScript)
Language-Specific
- Java: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
- Python: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - Type Safety: Everything must have type annotations, avoid the Any type, use specific types (Dict[str, int], TypedDict)
  - Data Structures: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - Code Quality: Avoid global state, use named arguments, don't re-export in __init__.py, refactor repetitive code
  - Error Handling: Robust error handling with layers of protection for known failure points
- TypeScript: Use Prettier formatting, strict types (no any), React Testing Library
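The Python guidelines above can be illustrated with a short sketch. Every name in it (`TableStats`, `StatsSummary`, `summarize_stats`) is hypothetical and chosen only for the example. It shows a TypedDict instead of an untyped dict, a dataclass return value instead of a tuple, and a keyword-only parameter that forces named arguments at the call site.

```python
from dataclasses import dataclass
from typing import Dict, List, TypedDict


class TableStats(TypedDict):
    """Hypothetical shape of one raw stats record."""

    row_count: int
    column_count: int


@dataclass(frozen=True)
class StatsSummary:
    """Returned instead of a bare (int, int) tuple so each field is named."""

    total_rows: int
    widest_table_columns: int


def summarize_stats(*, stats_by_table: Dict[str, TableStats]) -> StatsSummary:
    """Aggregate per-table stats; the bare * makes the argument keyword-only."""
    records: List[TableStats] = list(stats_by_table.values())
    return StatsSummary(
        total_rows=sum(record["row_count"] for record in records),
        # default=0 keeps max() from raising on an empty mapping.
        widest_table_columns=max(
            (record["column_count"] for record in records), default=0
        ),
    )


summary = summarize_stats(
    stats_by_table={
        "db.users": {"row_count": 10, "column_count": 4},
        "db.orders": {"row_count": 32, "column_count": 7},
    }
)
print(summary.total_rows, summary.widest_table_columns)  # 42 7
```

Callers read `summary.total_rows` rather than `summary[0]`, which is the main reason to return a dataclass over a tuple.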
Code Comments
Only add comments that provide real value beyond what the code already expresses.
Do NOT add comments for:
- Obvious operations (# Get user by ID, // Create connection)
- What the code does when it's self-evident (# Loop through items, // Set variable to true)
- Restating parameter names or return types already in signatures
- Basic language constructs (# Import modules, // End of function)
DO add comments for:
- Why something is done, especially non-obvious business logic or workarounds
- Context about external constraints, API quirks, or domain knowledge
- Warnings about gotchas, performance implications, or side effects
- References to tickets, RFCs, or external documentation that explain decisions
- Complex algorithms or mathematical formulas that aren't immediately clear
- Temporary solutions with TODOs and context for future improvements
Examples:
# Good: Explains WHY and provides context
# Use a 30-second timeout because Snowflake's query API can hang indefinitely
# on large result sets. See issue #12345.
connection_timeout = 30
# Bad: Restates what's obvious from code
# Set connection timeout to 30 seconds
connection_timeout = 30
Testing Strategy
- Python: Tests go in the tests/ directory alongside src/, use assert statements
- Java: Tests alongside source in src/test/
- Frontend: Tests in __tests__/ or .test.tsx files
- Smoke tests go in the smoke-test/ directory
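As a rough illustration of the Python convention above, a pytest-style test module might look like the sketch below. The function under test, `normalize_table_name`, is hypothetical and is defined inline only to keep the example self-contained. In the repository it would live under src/ and the tests under tests/.

```python
def normalize_table_name(raw_name: str) -> str:
    """Hypothetical function under test: trim whitespace and lowercase."""
    return raw_name.strip().lower()


def test_normalize_table_name_strips_and_lowercases() -> None:
    # Plain assert statements are enough; pytest reports the compared values on failure.
    assert normalize_table_name("  DB.Users ") == "db.users"


def test_normalize_table_name_is_idempotent() -> None:
    once = normalize_table_name("DB.Users")
    assert normalize_table_name(once) == once
```

pytest collects functions whose names start with `test_`, so no test class or `unittest` assertion methods are needed.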
Security Testing: Configuration Property Classification
Critical test: metadata-io/src/test/java/com/linkedin/metadata/system_info/collectors/PropertiesCollectorConfigurationTest.java
This test prevents sensitive data leaks by requiring explicit classification of all configuration properties as either sensitive (redacted) or non-sensitive (visible in system info).
When adding new configuration properties: The test will fail with clear instructions on which classification list to add your property to. Refer to the test file's comprehensive documentation for template syntax and examples.
This is a mandatory security guardrail - never disable or skip this test.
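The idea behind this guardrail can be sketched in a few lines of Python. This is a conceptual illustration only: the real test is the Java class named above, and the property names, list names, and functions below are all hypothetical. The sketch assumes each property must appear in exactly one of two classification sets and that unclassified properties are treated as a failure instead of being shown by default.

```python
from typing import Dict, FrozenSet, List

# Hypothetical classification lists; the real ones live in the Java test.
SENSITIVE_PROPERTIES: FrozenSet[str] = frozenset({"example.db.password"})
NON_SENSITIVE_PROPERTIES: FrozenSet[str] = frozenset({"example.db.host"})


def find_unclassified(property_names: List[str]) -> List[str]:
    """Return the properties that appear in neither classification list."""
    known = SENSITIVE_PROPERTIES | NON_SENSITIVE_PROPERTIES
    return sorted(name for name in property_names if name not in known)


def redact(properties: Dict[str, str]) -> Dict[str, str]:
    """Mask sensitive values and refuse to emit anything unclassified."""
    unclassified = find_unclassified(list(properties))
    if unclassified:
        raise ValueError(f"Unclassified properties: {unclassified}")
    return {
        name: ("***" if name in SENSITIVE_PROPERTIES else value)
        for name, value in properties.items()
    }


visible = redact({"example.db.host": "localhost", "example.db.password": "hunter2"})
print(visible)  # {'example.db.host': 'localhost', 'example.db.password': '***'}
```

Failing closed on unknown properties is what forces every new property to be classified before it can reach the system info output.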
Commits
- Follow Conventional Commits format for commit messages
- Breaking Changes: Always update docs/how/updating-datahub.md for breaking changes. Write entries for non-technical audiences, reference the PR number, and focus on what users need to change rather than internal implementation details
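For reference, a Conventional Commits header has the shape `type(scope)!: description`, where the scope and the `!` breaking-change marker are optional. The sketch below checks only that first line. The `CommitHeader` class and `parse_commit_header` function are hypothetical helpers, not tooling this repository is known to ship, and the regular expression is a simplification of the specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified header pattern: type, optional (scope), optional !, then ": description".
_HEADER = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


@dataclass(frozen=True)
class CommitHeader:
    """Hypothetical parsed form of a commit header line."""

    type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_commit_header(header: str) -> Optional[CommitHeader]:
    """Return the parsed header, or None if it does not match the pattern."""
    match = _HEADER.match(header)
    if match is None:
        return None
    return CommitHeader(
        type=match["type"],
        scope=match["scope"],  # None when no scope was given
        breaking=match["breaking"] is not None,
        description=match["description"],
    )


print(parse_commit_header("feat(ingestion): add example connector"))
print(parse_commit_header("updated stuff"))  # None
```

A header with `breaking=True` is a reasonable cue that docs/how/updating-datahub.md needs an entry, although breaking changes can also be declared in a commit footer, which this sketch does not inspect.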
Key Documentation
Essential reading:
- docs/architecture/architecture.md - System architecture overview
- docs/modeling/metadata-model.md - How metadata is modeled
- docs/what-is-datahub/datahub-concepts.md - Core concepts (URNs, entities, etc.)
External docs:
- https://docs.datahub.com/docs/developers - Official developer guide
- https://demo.datahub.com/ - Live demo environment
Important Notes
- Entity Registry is defined in YAML, not code (entity-registry.yml)
- All metadata changes flow through the event streaming system
- GraphQL schema is generated from backend GMS APIs