
DataProcess

DEPRECATED: This entity is deprecated and should not be used for new implementations.

Use dataFlow and dataJob instead.

The dataProcess entity was an early attempt to model data processing tasks, but it has been superseded by the more robust and flexible dataFlow and dataJob entities, which better represent the hierarchical nature of modern data pipelines.

Deprecation Notice

The dataProcess entity was deprecated to provide a clearer separation between:

  • DataFlow: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
  • DataJob: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)

This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.

Why was it deprecated?

The original dataProcess entity had several limitations:

  1. No hierarchical structure: It couldn't represent the relationship between a pipeline and its constituent tasks
  2. Limited orchestrator support: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
  3. Unclear semantics: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
  4. Poor lineage modeling: Without task-level granularity, lineage relationships were less precise

The new dataFlow and dataJob model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.

Identity (Historical Reference)

DataProcess entities were identified by three components:

  • Name: The process name (typically an ETL job name)
  • Orchestrator: The workflow management platform (e.g., airflow, azkaban)
  • Origin (Fabric): The environment where the process runs (PROD, DEV, etc.)

The URN structure was:

urn:li:dataProcess:(<name>,<orchestrator>,<origin>)

Example URNs

urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
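
For illustration, such a URN can be assembled with plain string formatting from its three key components. The helper below is a local sketch for reference only, not an SDK function.

```python
# Assemble a (deprecated) dataProcess URN from its three key components.
# Local helper for illustration only -- not an SDK function.
def build_data_process_urn(name: str, orchestrator: str, origin: str) -> str:
    return f"urn:li:dataProcess:({name},{orchestrator},{origin})"

print(build_data_process_urn("customer_etl_job", "airflow", "PROD"))
# urn:li:dataProcess:(customer_etl_job,airflow,PROD)
```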

Important Capabilities (Historical Reference)

DataProcessInfo Aspect

The dataProcessInfo aspect captured inputs and outputs of the process:

  • Inputs: Array of dataset URNs consumed by the process
  • Outputs: Array of dataset URNs produced by the process

This established basic lineage through "Consumes" relationships with datasets.
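
A minimal sketch of the aspect's shape, assuming your Python SDK version still ships the generated DataProcessInfoClass; the dataset URNs are placeholders.

```python
from datahub.metadata.schema_classes import DataProcessInfoClass

# Historical shape of the dataProcessInfo aspect: dataset URNs consumed
# and produced by the process (placeholder URNs shown).
data_process_info = DataProcessInfoClass(
    inputs=["urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"],
)
```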

Common Aspects

Like other entities, dataProcess supported:

  • Ownership: Assigning owners to processes
  • Status: Marking processes as removed
  • Global Tags: Categorization and classification
  • Institutional Memory: Links to documentation

Migration Guide

When to use DataFlow vs DataJob

Use DataFlow when representing:

  • Airflow DAGs
  • dbt projects
  • Prefect flows
  • Dagster pipelines
  • Azkaban workflows
  • Any container of related data processing tasks

Use DataJob when representing:

  • Airflow tasks within a DAG
  • dbt models within a project
  • Prefect tasks within a flow
  • Dagster ops/assets within a pipeline
  • Individual processing steps

Use both together (see the sketch after this list):

  • Create a DataFlow for the pipeline
  • Create DataJobs for each task within that pipeline
  • Link DataJobs to their parent DataFlow
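
The sketch below shows this pattern with the low-level Python emitter: one DataFlow plus one DataJob linked to it via flowUrn. The localhost endpoint, the pipeline/task names, and the "COMMAND" job type are placeholder assumptions.

```python
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed endpoint

# Parent pipeline (DataFlow)
flow_urn = make_data_flow_urn(
    orchestrator="airflow", flow_id="customer_pipeline", cluster="prod"
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="customer_pipeline"),
    )
)

# Child task (DataJob), linked to its parent flow via flowUrn
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_pipeline",
    job_id="extract_customers",
    cluster="prod",
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInfoClass(name="extract_customers", type="COMMAND", flowUrn=flow_urn),
    )
)
```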

Conceptual Mapping

| DataProcess Concept | New Model Equivalent | Notes |
|---------------------|----------------------|-------|
| Process with tasks | DataFlow + DataJobs | Split into two entities |
| Process name | DataFlow flowId | Becomes the parent identifier |
| Single-step process | DataFlow + 1 DataJob | Still requires both entities |
| Orchestrator | DataFlow orchestrator | Same concept, better modeling |
| Origin/Fabric | DataFlow cluster | Often matches environment |
| Inputs/Outputs | DataJob dataJobInputOutput | Moved to job level for precision |

Migration Steps

To migrate from dataProcess to dataFlow/dataJob:

  1. Identify your process structure: Determine if your dataProcess represents a pipeline (has multiple steps) or a single task

  2. Create a DataFlow: This represents the overall pipeline/workflow

    • Use the same orchestrator value
    • Use the process name as the flow ID
    • Use a cluster identifier (often matches the origin/fabric)
  3. Create DataJob(s): Create one or more jobs within the flow

    • For single-step processes: create one job named after the process
    • For multi-step processes: create a job for each step
    • Link each job to its parent DataFlow
  4. Migrate lineage: Move input/output dataset relationships from the process level to the job level (see the sketch after these steps)

  5. Migrate metadata: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata)
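
As a sketch of step 4, lineage recorded in the old dataProcessInfo aspect can be re-emitted as a dataJobInputOutput aspect on the new job. The endpoint, dataset URNs, and job identity below are placeholder assumptions.

```python
from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed endpoint

# Inputs/outputs previously stored on the dataProcess entity (placeholders)
old_inputs = ["urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"]
old_outputs = ["urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"]

# Re-attach the lineage at the job level for more precise tracking
job_urn = make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_pipeline",
    job_id="transform_customers",
    cluster="prod",
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInputOutputClass(inputDatasets=old_inputs, outputDatasets=old_outputs),
    )
)
```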

Migration Examples

Example 1: Simple single-task process

Old dataProcess:

urn:li:dataProcess:(daily_report,airflow,PROD)

New structure:

DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)

Example 2: Multi-step ETL pipeline

Old dataProcess:

urn:li:dataProcess:(customer_pipeline,airflow,PROD)

New structure:

DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)

Code Examples

Querying Existing DataProcess Entities

If you need to query existing dataProcess entities for migration purposes:

Python SDK: Query a dataProcess entity
{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}

Instead of using dataProcess, create the modern equivalent:

Python SDK: Create DataFlow and DataJob to replace dataProcess
{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}

Complete Migration Example

Python SDK: Full migration from dataProcess to dataFlow/dataJob
{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}

Integration Points

Historical Usage

The dataProcess entity was previously used by:

  1. Early ingestion connectors: Original Airflow, Azkaban connectors before they migrated to dataFlow/dataJob
  2. Custom integrations: User-built integrations that haven't been updated
  3. Legacy metadata: Historical data in existing DataHub instances

Modern Replacements

All modern DataHub connectors use dataFlow and dataJob:

  • Airflow: DAGs → DataFlow, Tasks → DataJob
  • dbt: Projects → DataFlow, Models → DataJob
  • Prefect: Flows → DataFlow, Tasks → DataJob
  • Dagster: Pipelines → DataFlow, Ops/Assets → DataJob
  • Fivetran: Connectors → DataFlow, Sync operations → DataJob
  • AWS Glue: Jobs → DataFlow, Steps → DataJob
  • Azure Data Factory: Pipelines → DataFlow, Activities → DataJob

DataProcessInstance

Note that dataProcessInstance is NOT deprecated. It represents a specific execution/run of either:

  • A dataJob (recommended)
  • A legacy dataProcess (for backward compatibility)

DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.

Notable Exceptions

Timeline for Removal

  • Deprecated: Early 2021 (with introduction of dataFlow/dataJob)
  • Status: Still exists in the entity registry for backward compatibility
  • Current State: No active ingestion sources create dataProcess entities
  • Removal: No specific timeline, maintained for existing data

Reading Existing Data

The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.

While it is technically possible to create new dataProcess entities, doing so is strongly discouraged. All new integrations should use dataFlow and dataJob.

Upgrade Path

There is no automatic migration tool. Organizations with significant dataProcess data should:

  1. Use the Python SDK to query existing dataProcess entities (see the sketch after these steps)
  2. Create equivalent dataFlow and dataJob entities
  3. Preserve URN mappings for lineage continuity
  4. Consider soft-deleting old dataProcess entities once migration is verified
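
A starting point for steps 1 and 2 might look like the sketch below, assuming a recent Python SDK version that exposes DataHubGraph.get_urns_by_filter for the dataProcess entity type; the server URL is a placeholder.

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DataProcessInfoClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # assumed endpoint

# Enumerate legacy dataProcess entities and inspect their recorded lineage
for urn in graph.get_urns_by_filter(entity_types=["dataProcess"]):
    info = graph.get_aspect(entity_urn=urn, aspect_type=DataProcessInfoClass)
    if info:
        print(urn, info.inputs, info.outputs)
```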

GraphQL API

The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.

Additional Resources