CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Essential Commands

Build and test:

./gradlew build           # Build entire project
./gradlew check           # Run all tests and linting

# Note: each module directory typically has its own build.gradle file, but the available tasks follow similar conventions.

# Java code
./gradlew spotlessApply                    # Java code formatting

# Python code
./gradlew :metadata-ingestion:testQuick    # Fast Python unit tests
./gradlew :metadata-ingestion:lint         # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix      # Python linting auto-fix (ruff only)

If you are using git worktrees, exclude the generateGitPropertiesGlobal task, as it can cause git-related failures when running any Gradle command:

./gradlew ... -x generateGitPropertiesGlobal

IMPORTANT: Verifying Python code changes:

  • ALWAYS use ./gradlew :metadata-ingestion:lintFix to verify Python code changes
  • NEVER use python3 -m py_compile - it doesn't catch style issues or type errors
  • lintFix runs ruff formatting and fixing automatically, ensuring code quality
  • For smoke-test changes, the lintFix command will also check those files

Development setup:

./gradlew :metadata-ingestion:installDev   # Set up Python environment
./gradlew quickstartDebug                  # Start full DataHub stack
cd datahub-web-react && yarn start         # Frontend dev server

Architecture Overview

DataHub is a schema-first, event-driven metadata platform with three core layers:

Core Services

  • GMS (Generalized Metadata Service): Java/Spring backend handling metadata storage and REST/GraphQL APIs
  • Frontend: React/TypeScript application consuming GraphQL APIs
  • Ingestion Framework: Python CLI and connectors for extracting metadata from data sources
  • Event Streaming: Kafka-based real-time metadata change propagation

Key Modules

  • metadata-models/: Avro/PDL schemas defining the metadata model
  • metadata-service/: Backend services, APIs, and business logic
  • datahub-web-react/: Frontend React application
  • metadata-ingestion/: Python ingestion framework and CLI
  • datahub-graphql-core/: GraphQL schema and resolvers

Most of the non-frontend modules are written in Java. The modules written in Python are:

  • metadata-ingestion/
  • datahub-actions/
  • metadata-ingestion-modules/airflow-plugin/
  • metadata-ingestion-modules/gx-plugin/
  • metadata-ingestion-modules/dagster-plugin/
  • metadata-ingestion-modules/prefect-plugin/

Each Python module has a Gradle setup similar to metadata-ingestion/ (documented above).

Metadata Model Concepts

  • Entities: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
  • Aspects: Metadata facets (Ownership, Schema, Documentation, etc.)
  • URNs: Unique identifiers (urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD))
  • MCE/MCL: Metadata Change Events/Logs for updates
  • Entity Registry: YAML config defining entity-aspect relationships (metadata-models/src/main/resources/entity-registry.yml)
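
To make these concepts concrete, here is a minimal sketch of emitting an aspect with the Python SDK (assuming a GMS instance reachable at http://localhost:8080; the description text is illustrative):

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Build the entity URN, wrap an aspect in a change proposal, and emit it to GMS.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="mysql", name="db.table", env="PROD"),
    aspect=DatasetPropertiesClass(description="Example description"),
)
emitter.emit(mcp)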

Validation Architecture

IMPORTANT: Validation must work across all APIs (GraphQL, OpenAPI, Rest.li).

  • Never add validation in API-specific layers (GraphQL resolvers, REST controllers) - this only protects one API
  • Always implement AspectPayloadValidators in metadata-io/src/main/java/com/linkedin/metadata/aspect/validation/
  • Register as Spring beans in SpringStandardPluginConfiguration.java
  • Follow existing patterns: See SystemPolicyValidator.java and PolicyFieldTypeValidator.java as examples

Development Flow

  1. Schema changes in metadata-models/ trigger code generation across all languages
  2. Backend changes in metadata-service/ and other Java modules expose new REST/GraphQL APIs
  3. Frontend changes in datahub-web-react/ consume GraphQL APIs
  4. Ingestion changes in metadata-ingestion/ emit metadata to backend APIs

Code Standards

General Principles

  • This is production code - maintain high quality
  • Follow existing patterns within each module
  • Generate appropriate unit tests
  • Use type annotations everywhere (Python/TypeScript)

Language-Specific

  • Java: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
  • Python: Use ruff for linting/formatting, pytest for testing, pydantic for configs (see the sketch after this list)
    • Type Safety: Everything must have type annotations, avoid Any type, use specific types (Dict[str, int], TypedDict)
    • Data Structures: Prefer dataclasses/pydantic for internal data, return dataclasses over tuples
    • Code Quality: Avoid global state, use named arguments, don't re-export in __init__.py, refactor repetitive code
    • Error Handling: Robust error handling with layers of protection for known failure points
  • TypeScript: Use Prettier formatting, strict types (no any), React Testing Library
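
A minimal sketch of these Python conventions (all class, field, and function names below are illustrative, not from the codebase):

from dataclasses import dataclass
from typing import Dict

from pydantic import BaseModel


class WarehouseConfig(BaseModel):  # pydantic config with full type annotations
    account_id: str
    connection_timeout_secs: int = 30


@dataclass
class TableProfile:  # returned instead of a bare tuple, so callers get named fields
    row_count: int
    column_types: Dict[str, str]


def profile_table(config: WarehouseConfig, table_name: str) -> TableProfile:
    # Named arguments and specific types throughout; no Any, no global state.
    return TableProfile(row_count=0, column_types={})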

Code Comments

Only add comments that provide real value beyond what the code already expresses.

Do NOT add comments for:

  • Obvious operations (# Get user by ID, // Create connection)
  • What the code does when it's self-evident (# Loop through items, // Set variable to true)
  • Restating parameter names or return types already in signatures
  • Basic language constructs (# Import modules, // End of function)

DO add comments for:

  • Why something is done, especially non-obvious business logic or workarounds
  • Context about external constraints, API quirks, or domain knowledge
  • Warnings about gotchas, performance implications, or side effects
  • References to tickets, RFCs, or external documentation that explain decisions
  • Complex algorithms or mathematical formulas that aren't immediately clear
  • Temporary solutions with TODOs and context for future improvements

Examples:

# Good: Explains WHY and provides context
# Use a 30-second timeout because Snowflake's query API can hang indefinitely
# on large result sets. See issue #12345.
connection_timeout = 30

# Bad: Restates what's obvious from code
# Set connection timeout to 30 seconds
connection_timeout = 30

Testing Strategy

  • Python: Tests go in the tests/ directory alongside src/; use plain assert statements (see the example after this list)
  • Java: Tests alongside source in src/test/
  • Frontend: Tests in __tests__/ or .test.tsx files
  • Smoke tests go in the smoke-test/ directory
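
For instance, a minimal pytest-style test using plain asserts (make_dataset_urn is the real helper; the expected string assumes its current URN format):

from datahub.emitter.mce_builder import make_dataset_urn


def test_make_dataset_urn() -> None:
    # Plain asserts; pytest reports the compared values on failure.
    urn = make_dataset_urn(platform="mysql", name="db.table", env="PROD")
    assert urn == "urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)"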

Testing Principles: Focus on Value Over Coverage

IMPORTANT: Quality over quantity. Avoid AI-generated test anti-patterns that create maintenance burden without providing real value.

Focus on behavior, not implementation:

  • ✅ Test what the code does (business logic, edge cases that occur in production)
  • ❌ Don't test how it does it (implementation details, private fields via reflection)
  • ❌ Don't test third-party libraries work correctly (Spring, Micrometer, Kafka clients, etc.)
  • ❌ Don't test Java/Python language features (synchronized methods are thread-safe, @Nonnull parameters reject nulls)

Avoid these specific anti-patterns:

  • ❌ Testing null inputs on @Nonnull/@NonNull annotated parameters
  • ❌ Verifying exact error message wording (creates brittleness during refactoring)
  • ❌ Testing every possible input variation (case sensitivity × whitespace × special chars = maintenance nightmare)
  • ❌ Using reflection to verify private implementation details
  • ❌ Redundant concurrency testing on synchronized methods
  • ❌ Testing obvious getter/setter behavior without business logic
  • ❌ Testing Lombok-generated code (@Data, @Builder, @Value classes) - you're testing Lombok's code generator, not your logic
  • ❌ Testing that annotations exist on classes - if required annotations are missing, the framework/compiler will fail at startup, not in your tests

Appropriate test scope:

  • Simple utilities (enums, string parsing, formatters): ~50-100 lines of focused tests
    • Happy path for each method
    • One example of invalid input per method
    • Edge cases likely to occur in production
  • Complex business logic: Test proportional to risk and complexity
    • Integration points and system boundaries
    • Security-critical operations
    • Error handling for realistic failure scenarios
  • Warning sign: If tests are 5x+ the size of implementation, reconsider scope

Examples of low-value tests to avoid:

// ❌ BAD: Testing @Nonnull contract (framework's job)
@Test
public void testNullParameterThrowsException() {
  assertThrows(NullPointerException.class,
      () -> service.process(null)); // parameter is @Nonnull
}

// ❌ BAD: Testing Lombok-generated code
@Test
public void testBuilderSetsAllFields() {
  MyConfig config = MyConfig.builder()
      .field1("value1")
      .field2("value2")
      .build();
  assertEquals(config.getField1(), "value1");
  assertEquals(config.getField2(), "value2");
}

// ❌ BAD: Testing that annotations exist
@Test
public void testConfigurationAnnotations() {
  assertNotNull(MyConfig.class.getAnnotation(Configuration.class));
  assertNotNull(MyConfig.class.getAnnotation(ComponentScan.class));
}
// If @Configuration is missing, Spring won't load the context - you don't need a test for this

// ❌ BAD: Exact error message (brittle)
assertEquals(exception.getMessage(),
    "Unsupported database type 'oracle'. Only PostgreSQL and MySQL variants are supported.");

// ❌ BAD: Redundant variations
assertEquals(DatabaseType.fromString("postgresql"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString("PostgreSQL"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString("POSTGRESQL"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString(" postgresql "), DatabaseType.POSTGRES);
// ... 10 more case/whitespace variations

// ✅ GOOD: Focused behavioral test
@Test
public void testFromString_ValidInputsCaseInsensitive() {
  assertEquals(DatabaseType.fromString("postgresql"), DatabaseType.POSTGRES);
  assertEquals(DatabaseType.fromString("POSTGRESQL"), DatabaseType.POSTGRES);
  assertEquals(DatabaseType.fromString(" postgresql "), DatabaseType.POSTGRES);
}

@Test
public void testFromString_InvalidInputThrows() {
  assertThrows(IllegalArgumentException.class,
      () -> DatabaseType.fromString("oracle"));
}

// ✅ GOOD: Testing YOUR custom validation logic on a Lombok class
@Test
public void testCustomValidation() {
  assertThrows(IllegalArgumentException.class,
      () -> MyConfig.builder().field1("invalid").build().validate());
}

When in doubt: Ask "Does this test protect against a realistic regression?" If not, skip it.

Security Testing: Configuration Property Classification

Critical test: metadata-io/src/test/java/com/linkedin/metadata/system_info/collectors/PropertiesCollectorConfigurationTest.java

This test prevents sensitive data leaks by requiring explicit classification of all configuration properties as either sensitive (redacted) or non-sensitive (visible in system info).

When adding new configuration properties: The test will fail with clear instructions on which classification list to add your property to. Refer to the test file's comprehensive documentation for template syntax and examples.

This is a mandatory security guardrail - never disable or skip this test.

Commits

  • Follow Conventional Commits format for commit messages
  • Breaking Changes: Always update docs/how/updating-datahub.md for breaking changes. Write entries for non-technical audiences, reference the PR number, and focus on what users need to change rather than internal implementation details
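
For example, a Conventional Commits message (type, optional scope, then description; the scope and description below are illustrative):

fix(ingestion): handle empty schema responses from the mysql source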

Key Documentation

Essential reading:

  • docs/architecture/architecture.md - System architecture overview
  • docs/modeling/metadata-model.md - How metadata is modeled
  • docs/what-is-datahub/datahub-concepts.md - Core concepts (URNs, entities, etc.)

Important Notes

  • Entity Registry is defined in YAML, not code (entity-registry.yml)
  • All metadata changes flow through the event streaming system
  • GraphQL schema is generated from backend GMS APIs