CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Essential Commands

Build and test:

./gradlew build           # Build entire project
./gradlew check           # Run all tests and linting

# Each directory typically has its own build.gradle file, and the available tasks follow similar conventions across directories.

# Java code.
./gradlew spotlessApply   # Java code formatting

# Python code.
./gradlew :metadata-ingestion:testQuick     # Fast Python unit tests
./gradlew :metadata-ingestion:lint          # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix       # Python linting auto-fix (ruff only)

Development setup:

./gradlew :metadata-ingestion:installDev               # Setup Python environment
./gradlew quickstartDebug                              # Start full DataHub stack
cd datahub-web-react && yarn start                     # Frontend dev server

Architecture Overview

DataHub is a schema-first, event-driven metadata platform built from the following core services and modules:

Core Services

GMS (Generalized Metadata Service): Java/Spring backend handling metadata storage and REST/GraphQL APIs
Frontend: React/TypeScript application consuming GraphQL APIs
Ingestion Framework: Python CLI and connectors for extracting metadata from data sources
Event Streaming: Kafka-based real-time metadata change propagation

Key Modules

metadata-models/: Avro/PDL schemas defining the metadata model
metadata-service/: Backend services, APIs, and business logic
datahub-web-react/: Frontend React application
metadata-ingestion/: Python ingestion framework and CLI
datahub-graphql-core/: GraphQL schema and resolvers

Most of the non-frontend modules are written in Java. The modules written in Python are:

metadata-ingestion/
datahub-actions/
metadata-ingestion-modules/airflow-plugin/
metadata-ingestion-modules/gx-plugin/
metadata-ingestion-modules/dagster-plugin/
metadata-ingestion-modules/prefect-plugin/

Each Python module has a Gradle setup similar to that of metadata-ingestion/ (documented above).

Metadata Model Concepts

Entities: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
Aspects: Metadata facets (Ownership, Schema, Documentation, etc.)
URNs: Unique identifiers (urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD))
MCE/MCL: Metadata Change Events/Logs for updates
Entity Registry: YAML config defining entity-aspect relationships (metadata-models/src/main/resources/entity-registry.yml)
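
To make the URN format concrete, here is a minimal sketch that splits the dataset URN shown above into its parts. DatasetUrnParts and parse_dataset_urn are hypothetical names used only for illustration; they are not part of the DataHub SDK, and the sketch assumes the three-field dataset form (platform, name, environment) from the example.

```python
from dataclasses import dataclass

_PREFIX = "urn:li:dataset:("


@dataclass(frozen=True)
class DatasetUrnParts:
    """Hypothetical container for the three fields of a dataset URN."""

    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrnParts:
    """Split urn:li:dataset:(<platform urn>,<name>,<env>) into its fields."""
    if not (urn.startswith(_PREFIX) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn!r}")
    inner = urn[len(_PREFIX):-1]
    # The platform URN is the first field and the environment is the last;
    # whatever sits between them is treated as the dataset name.
    platform_urn, first_sep, rest = inner.partition(",")
    name, last_sep, env = rest.rpartition(",")
    if not (first_sep and last_sep and platform_urn and name and env):
        raise ValueError(f"Expected three comma-separated fields in {urn!r}")
    return DatasetUrnParts(
        platform=platform_urn.rsplit(":", 1)[-1],
        name=name,
        env=env,
    )


parts = parse_dataset_urn(
    urn="urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)"
)
```

Returning a frozen dataclass instead of a tuple follows the Python guidance later in this file. Real code should prefer whatever URN helpers the module already provides over a hand-rolled parser like this one.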

Development Flow

Schema changes in metadata-models/ trigger code generation across all languages
Backend changes in metadata-service/ and other Java modules expose new REST/GraphQL APIs
Frontend changes in datahub-web-react/ consume GraphQL APIs
Ingestion changes in metadata-ingestion/ emit metadata to backend APIs

Code Standards

General Principles

This is production code - maintain high quality
Follow existing patterns within each module
Generate appropriate unit tests
Use type annotations everywhere (Python/TypeScript)

Language-Specific

Java: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
Python: Use ruff for linting/formatting, pytest for testing, pydantic for configs
  - Type Safety: Everything must have type annotations, avoid Any type, use specific types (Dict[str, int], TypedDict)
  - Data Structures: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
  - Code Quality: Avoid global state, use named arguments, don't re-export in __init__.py, refactor repetitive code
  - Error Handling: Robust error handling with layers of protection for known failure points
TypeScript: Use Prettier formatting, strict types (no any), React Testing Library
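
The Python guidelines above might look like this in practice. This is a sketch, not code from the repository: RowCountSummary and summarize_row_counts are hypothetical names, and the table names are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RowCountSummary:
    """Returned instead of a bare (int, str) tuple so callers get named fields."""

    total_rows: int
    largest_table: str


def summarize_row_counts(row_counts: Dict[str, int]) -> RowCountSummary:
    """Aggregate per-table row counts into a typed summary."""
    # Guard the known failure point explicitly rather than letting max() raise
    # a less informative error on an empty mapping.
    if not row_counts:
        raise ValueError("row_counts must not be empty")
    largest_table = max(row_counts, key=lambda table: row_counts[table])
    return RowCountSummary(
        total_rows=sum(row_counts.values()),
        largest_table=largest_table,
    )


# Named arguments keep the call site readable.
summary = summarize_row_counts(row_counts={"db.orders": 120, "db.users": 45})
```

The specific Dict[str, int] annotation, the dataclass return value, and the named argument at the call site each correspond to one of the bullets above.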

Code Comments

Only add comments that provide real value beyond what the code already expresses.

Do NOT add comments for:

Obvious operations (# Get user by ID, // Create connection)
What the code does when it's self-evident (# Loop through items, // Set variable to true)
Restating parameter names or return types already in signatures
Basic language constructs (# Import modules, // End of function)

DO add comments for:

Why something is done, especially non-obvious business logic or workarounds
Context about external constraints, API quirks, or domain knowledge
Warnings about gotchas, performance implications, or side effects
References to tickets, RFCs, or external documentation that explain decisions
Complex algorithms or mathematical formulas that aren't immediately clear
Temporary solutions with TODOs and context for future improvements

Examples:

# Good: Explains WHY and provides context
# Use a 30-second timeout because Snowflake's query API can hang indefinitely
# on large result sets. See issue #12345.
connection_timeout = 30

# Bad: Restates what's obvious from code
# Set connection timeout to 30 seconds
connection_timeout = 30

Testing Strategy

Python: Tests go in the tests/ directory alongside src/, use assert statements
Java: Tests alongside source in src/test/
Frontend: Tests in __tests__/ or .test.tsx files
Smoke tests go in the smoke-test/ directory
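
As an illustration of the Python convention above, a test file might look like the following sketch. The file location and the normalize_table_name helper are hypothetical; in a real module the helper would live under src/ and be imported into the test rather than defined next to it.

```python
# Hypothetical location: tests/unit/test_normalize.py


def normalize_table_name(raw: str) -> str:
    """Stand-in for the code under test: trims whitespace and lower-cases."""
    return raw.strip().lower()


def test_normalize_table_name_strips_and_lowercases() -> None:
    # Plain assert statements; pytest rewrites them to report useful diffs.
    assert normalize_table_name(raw="  DB.Orders ") == "db.orders"


def test_normalize_table_name_is_idempotent() -> None:
    once = normalize_table_name(raw="DB.Orders")
    assert normalize_table_name(raw=once) == once
```

pytest collects functions whose names start with test_, so no test class or unittest-style assertion methods are needed.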

Security Testing: Configuration Property Classification

Critical test: metadata-io/src/test/java/com/linkedin/metadata/system_info/collectors/PropertiesCollectorConfigurationTest.java

This test prevents sensitive data leaks by requiring explicit classification of all configuration properties as either sensitive (redacted) or non-sensitive (visible in system info).

When you add a new configuration property, the test fails until the property is classified, and the failure message says which classification list to add it to. Refer to the test file's documentation for template syntax and examples.

This is a mandatory security guardrail - never disable or skip this test.
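
The idea behind this guardrail can be sketched in a few lines. This is not the actual Java test: the property names, the two sets, and find_unclassified are hypothetical, and the sketch ignores the template syntax that the real test supports.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical classification lists; the real ones live in the Java test file.
SENSITIVE: FrozenSet[str] = frozenset({"datahub.example.clientSecret"})
NON_SENSITIVE: FrozenSet[str] = frozenset({"datahub.example.port"})


def find_unclassified(property_names: Iterable[str]) -> List[str]:
    """Return, in sorted order, every property that appears in neither list."""
    classified = SENSITIVE | NON_SENSITIVE
    return sorted(name for name in property_names if name not in classified)


# A newly added property that nobody has classified yet is reported,
# which is the condition under which a guardrail test of this kind fails.
unclassified = find_unclassified(
    property_names=["datahub.example.port", "datahub.example.newToken"]
)
```

Because an unclassified property is an error rather than a default, a new secret cannot become visible in system info just because someone forgot to mark it sensitive.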

Commits

Follow Conventional Commits format for commit messages
Breaking Changes: Always update docs/how/updating-datahub.md for breaking changes. Write entries for non-technical audiences, reference the PR number, and focus on what users need to change rather than internal implementation details
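
For reference, a Conventional Commits header has the shape type(optional scope)!: description, where the ! marks a breaking change. The sketch below checks only that general shape. The regular expression, the is_conventional_header name, and the sample messages are illustrative assumptions; the sketch does not reproduce any commit check the repository may run, and it does not restrict which types are accepted.

```python
import re

# type, optional (scope), optional "!", then ": " and a non-empty description.
_HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional_header(header: str) -> bool:
    """Return True if the first line of a commit message matches the shape."""
    return _HEADER.match(header) is not None


# Made-up sample messages mapped to the result the check should give.
examples = {
    "feat(ingest): add retry to the connector": True,
    "fix!: drop deprecated config field": True,
    "Updated some stuff": False,
}
results = {header: is_conventional_header(header=header) for header in examples}
```

A header with the ! marker, like the second sample, is the kind of change that also needs an entry in docs/how/updating-datahub.md.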

Key Documentation

Essential reading:

docs/architecture/architecture.md - System architecture overview
docs/modeling/metadata-model.md - How metadata is modeled
docs/what-is-datahub/datahub-concepts.md - Core concepts (URNs, entities, etc.)

External docs:

https://docs.datahub.com/docs/developers - Official developer guide
https://demo.datahub.com/ - Live demo environment

Important Notes

Entity Registry is defined in YAML, not code (entity-registry.yml)
All metadata changes flow through the event streaming system
GraphQL schema is generated from backend GMS APIs
