
DataProcess

DEPRECATED: This entity is deprecated and should not be used for new implementations.

Use dataFlow and dataJob instead.

The dataProcess entity was an early attempt to model data processing tasks, but it has been superseded by the more robust and flexible dataFlow and dataJob entities, which better represent the hierarchical nature of modern data pipelines.

Deprecation Notice

The dataProcess entity was deprecated to provide a clearer separation between:

  • DataFlow: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
  • DataJob: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)

This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.

Why was it deprecated?

The original dataProcess entity had several limitations:

  1. No hierarchical structure: It couldn't represent the relationship between a pipeline and its constituent tasks
  2. Limited orchestrator support: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
  3. Unclear semantics: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
  4. Poor lineage modeling: Without task-level granularity, lineage relationships were less precise

The new dataFlow and dataJob model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.

Identity (Historical Reference)

DataProcess entities were identified by three components:

  • Name: The process name (typically an ETL job name)
  • Orchestrator: The workflow management platform (e.g., airflow, azkaban)
  • Origin (Fabric): The environment where the process runs (PROD, DEV, etc.)

The URN structure was:

urn:li:dataProcess:(<name>,<orchestrator>,<origin>)

Example URNs

urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
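
For illustration, such a URN can be assembled with plain string formatting from its three key components. The helper below is a local sketch for reference only, not an SDK function.

```python
# Assemble a (deprecated) dataProcess URN from its three key components.
# Local helper for illustration only -- not an SDK function.
def build_data_process_urn(name: str, orchestrator: str, origin: str) -> str:
    return f"urn:li:dataProcess:({name},{orchestrator},{origin})"

print(build_data_process_urn("customer_etl_job", "airflow", "PROD"))
# urn:li:dataProcess:(customer_etl_job,airflow,PROD)
```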

Important Capabilities (Historical Reference)

DataProcessInfo Aspect

The dataProcessInfo aspect captured inputs and outputs of the process:

  • Inputs: Array of dataset URNs consumed by the process
  • Outputs: Array of dataset URNs produced by the process

This established basic lineage through "Consumes" relationships with datasets.
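
A minimal sketch of the aspect's shape, assuming your Python SDK version still ships the generated DataProcessInfoClass; the dataset URNs are placeholders.

```python
from datahub.metadata.schema_classes import DataProcessInfoClass

# Historical shape of the dataProcessInfo aspect: dataset URNs consumed
# and produced by the process (placeholder URNs shown).
data_process_info = DataProcessInfoClass(
    inputs=["urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"],
)
```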

Common Aspects

Like other entities, dataProcess supported:

  • Ownership: Assigning owners to processes
  • Status: Marking processes as removed
  • Global Tags: Categorization and classification
  • Institutional Memory: Links to documentation

Migration Guide

When to use DataFlow vs DataJob

Use DataFlow when representing:

  • Airflow DAGs
  • dbt projects
  • Prefect flows
  • Dagster pipelines
  • Azkaban workflows
  • Any container of related data processing tasks

Use DataJob when representing:

  • Airflow tasks within a DAG
  • dbt models within a project
  • Prefect tasks within a flow
  • Dagster ops/assets within a pipeline
  • Individual processing steps

Use both together (see the sketch after this list):

  • Create a DataFlow for the pipeline
  • Create DataJobs for each task within that pipeline
  • Link DataJobs to their parent DataFlow
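
The sketch below shows this pattern with the low-level Python emitter: one DataFlow plus one DataJob linked to it via flowUrn. The localhost endpoint, the pipeline/task names, and the "COMMAND" job type are placeholder assumptions.

```python
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed endpoint

# Parent pipeline (DataFlow)
flow_urn = make_data_flow_urn(
    orchestrator="airflow", flow_id="customer_pipeline", cluster="prod"
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="customer_pipeline"),
    )
)

# Child task (DataJob), linked to its parent flow via flowUrn
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_pipeline",
    job_id="extract_customers",
    cluster="prod",
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInfoClass(name="extract_customers", type="COMMAND", flowUrn=flow_urn),
    )
)
```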

Conceptual Mapping

| DataProcess Concept | New Model Equivalent | Notes |
|---------------------|----------------------|-------|
| Process with tasks | DataFlow + DataJobs | Split into two entities |
| Process name | DataFlow flowId | Becomes the parent identifier |
| Single-step process | DataFlow + 1 DataJob | Still requires both entities |
| Orchestrator | DataFlow orchestrator | Same concept, better modeling |
| Origin/Fabric | DataFlow cluster | Often matches environment |
| Inputs/Outputs | DataJob dataJobInputOutput | Moved to job level for precision |

Migration Steps

To migrate from dataProcess to dataFlow/dataJob:

  1. Identify your process structure: Determine if your dataProcess represents a pipeline (has multiple steps) or a single task

  2. Create a DataFlow: This represents the overall pipeline/workflow

    • Use the same orchestrator value
    • Use the process name as the flow ID
    • Use a cluster identifier (often matches the origin/fabric)
  3. Create DataJob(s): Create one or more jobs within the flow

    • For single-step processes: create one job named after the process
    • For multi-step processes: create a job for each step
    • Link each job to its parent DataFlow
  4. Migrate lineage: Move input/output dataset relationships from the process level to the job level (see the sketch after these steps)

  5. Migrate metadata: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata)
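
As a sketch of step 4, lineage recorded in the old dataProcessInfo aspect can be re-emitted as a dataJobInputOutput aspect on the new job. The endpoint, dataset URNs, and job identity below are placeholder assumptions.

```python
from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed endpoint

# Inputs/outputs previously stored on the dataProcess entity (placeholders)
old_inputs = ["urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"]
old_outputs = ["urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"]

# Re-attach the lineage at the job level for more precise tracking
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_pipeline",
    job_id="transform_customers",
    cluster="prod",
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInputOutputClass(inputDatasets=old_inputs, outputDatasets=old_outputs),
    )
)
```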

Migration Examples

Example 1: Simple single-task process

Old dataProcess:

urn:li:dataProcess:(daily_report,airflow,PROD)

New structure:

DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)

Example 2: Multi-step ETL pipeline

Old dataProcess:

urn:li:dataProcess:(customer_pipeline,airflow,PROD)

New structure:

DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)

Code Examples

Querying Existing DataProcess Entities

If you need to query existing dataProcess entities for migration purposes:

Python SDK: Query a dataProcess entity
{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}

Instead of using dataProcess, create the modern equivalent:

Python SDK: Create DataFlow and DataJob to replace dataProcess
{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}

Complete Migration Example

Python SDK: Full migration from dataProcess to dataFlow/dataJob
{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}

Integration Points

Historical Usage

The dataProcess entity was previously used by:

  1. Early ingestion connectors: Original Airflow, Azkaban connectors before they migrated to dataFlow/dataJob
  2. Custom integrations: User-built integrations that haven't been updated
  3. Legacy metadata: Historical data in existing DataHub instances

Modern Replacements

All modern DataHub connectors use dataFlow and dataJob:

  • Airflow: DAGs → DataFlow, Tasks → DataJob
  • dbt: Projects → DataFlow, Models → DataJob
  • Prefect: Flows → DataFlow, Tasks → DataJob
  • Dagster: Pipelines → DataFlow, Ops/Assets → DataJob
  • Fivetran: Connectors → DataFlow, Sync operations → DataJob
  • AWS Glue: Jobs → DataFlow, Steps → DataJob
  • Azure Data Factory: Pipelines → DataFlow, Activities → DataJob

DataProcessInstance

Note that dataProcessInstance is NOT deprecated. It represents a specific execution/run of either:

  • A dataJob (recommended)
  • A legacy dataProcess (for backward compatibility)

DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.

Notable Exceptions

Timeline for Removal

  • Deprecated: Early 2021 (with introduction of dataFlow/dataJob)
  • Status: Still exists in the entity registry for backward compatibility
  • Current State: No active ingestion sources create dataProcess entities
  • Removal: No specific timeline, maintained for existing data

Reading Existing Data

The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.

While it is technically possible to create new dataProcess entities, doing so is strongly discouraged. All new integrations should use dataFlow and dataJob.

Upgrade Path

There is no automatic migration tool. Organizations with significant dataProcess data should:

  1. Use the Python SDK to query existing dataProcess entities (see the sketch after these steps)
  2. Create equivalent dataFlow and dataJob entities
  3. Preserve URN mappings for lineage continuity
  4. Consider soft-deleting old dataProcess entities once migration is verified
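
A starting point for steps 1 and 2 might look like the sketch below, assuming a recent Python SDK version that exposes DataHubGraph.get_urns_by_filter for the dataProcess entity type; the server URL is a placeholder.

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DataProcessInfoClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # assumed endpoint

# Enumerate legacy dataProcess entities and inspect their recorded lineage
for urn in graph.get_urns_by_filter(entity_types=["dataProcess"]):
    info = graph.get_aspect(entity_urn=urn, aspect_type=DataProcessInfoClass)
    if info:
        print(urn, info.inputs, info.outputs)
```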

GraphQL API

The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.

Additional Resources