# DataProcess

> **DEPRECATED**: This entity is deprecated and should not be used for new implementations. Use dataFlow and dataJob instead.

The dataProcess entity was an early attempt to model data processing tasks, but it has been superseded by the more robust and flexible dataFlow and dataJob entities, which better represent the hierarchical nature of modern data pipelines.
## Deprecation Notice
The dataProcess entity was deprecated to provide a clearer separation between:
- DataFlow: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
- DataJob: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)
This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.
## Why was it deprecated?
The original dataProcess entity had several limitations:
- No hierarchical structure: It couldn't represent the relationship between a pipeline and its constituent tasks
- Limited orchestrator support: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
- Unclear semantics: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
- Poor lineage modeling: Without task-level granularity, lineage relationships were less precise
The new dataFlow and dataJob model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.
## Identity (Historical Reference)
DataProcess entities were identified by three components:
- Name: The process name (typically an ETL job name)
- Orchestrator: The workflow management platform (e.g., airflow, azkaban)
- Origin (Fabric): The environment where the process runs (PROD, DEV, etc.)
The URN structure was:

```
urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
```

### Example URNs

```
urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
```
## Important Capabilities (Historical Reference)

### DataProcessInfo Aspect
The dataProcessInfo aspect captured inputs and outputs of the process:
- Inputs: Array of dataset URNs consumed by the process
- Outputs: Array of dataset URNs produced by the process
This established basic lineage relationships through "Consumes" relationships with datasets.
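For illustration, here is what the aspect looked like when built with the Python SDK's generated classes. This is a minimal sketch for reading and understanding legacy records only, not for new writes; the dataset URNs are hypothetical.

```python
# A sketch of the deprecated dataProcessInfo aspect, using the generated
# DataProcessInfoClass. For understanding legacy records only; do not
# create new dataProcess metadata. The dataset URNs are hypothetical.
from datahub.metadata.schema_classes import DataProcessInfoClass

legacy_info = DataProcessInfoClass(
    inputs=["urn:li:dataset:(urn:li:dataPlatform:hive,raw.orders,PROD)"],
    outputs=["urn:li:dataset:(urn:li:dataPlatform:hive,agg.daily_orders,PROD)"],
)

# Each URN in `inputs` produced a "Consumes" relationship from the process
# to that dataset; `outputs` established the producing side of the lineage.
print(legacy_info.inputs, legacy_info.outputs)
```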
### Common Aspects
Like other entities, dataProcess supported the following common aspects (see the sketch after this list):
- Ownership: Assigning owners to processes
- Status: Marking processes as removed
- Global Tags: Categorization and classification
- Institutional Memory: Links to documentation
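Since these are shared aspects, the same pattern applies to the replacement entities. A minimal sketch of assigning ownership, assuming a DataHub instance reachable at http://localhost:8080; the flow URN and owner are hypothetical:

```python
# A sketch of attaching an Ownership aspect. The same pattern covers tags
# (GlobalTagsClass) and documentation links (InstitutionalMemoryClass).
# The server address, flow URN, and owner below are hypothetical.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        # Target the modern dataFlow entity rather than a legacy dataProcess.
        entityUrn="urn:li:dataFlow:(airflow,customer_pipeline,prod)",
        aspect=OwnershipClass(
            owners=[
                OwnerClass(
                    owner="urn:li:corpuser:jdoe",
                    type=OwnershipTypeClass.DATAOWNER,
                )
            ]
        ),
    )
)
```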
## Migration Guide

### When to use DataFlow vs DataJob
Use DataFlow when representing:
- Airflow DAGs
- dbt projects
- Prefect flows
- Dagster pipelines
- Azkaban workflows
- Any container of related data processing tasks
Use DataJob when representing:
- Airflow tasks within a DAG
- dbt models within a project
- Prefect tasks within a flow
- Dagster ops/assets within a pipeline
- Individual processing steps
Use both together (a sketch follows this list):
- Create a DataFlow for the pipeline
- Create DataJobs for each task within that pipeline
- Link DataJobs to their parent DataFlow
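A minimal sketch of that pattern with the Python SDK, assuming a DataHub instance at http://localhost:8080; the orchestrator, pipeline name, and task names are hypothetical:

```python
# A sketch: one DataFlow for the pipeline, one DataJob per task, each job
# linked to its parent flow. Server address and all names are hypothetical.
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# The DataFlow represents the whole pipeline (e.g., an Airflow DAG).
flow_urn = make_data_flow_urn(
    orchestrator="airflow", flow_id="customer_pipeline", cluster="prod"
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="customer_pipeline"),
    )
)

# Each DataJob represents one task; its URN embeds the parent flow, and
# DataJobInfo.flowUrn records the link explicitly.
for task in ["extract_customers", "transform_customers", "load_customers"]:
    job_urn = make_data_job_urn(
        orchestrator="airflow",
        flow_id="customer_pipeline",
        job_id=task,
        cluster="prod",
    )
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=job_urn,
            aspect=DataJobInfoClass(name=task, type="COMMAND", flowUrn=flow_urn),
        )
    )
```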
### Conceptual Mapping
| DataProcess Concept | New Model Equivalent | Notes |
|---|---|---|
| Process with tasks | DataFlow + DataJobs | Split into two entities |
| Process name | DataFlow flowId | Becomes the parent identifier |
| Single-step process | DataFlow + 1 DataJob | Still requires both entities |
| Orchestrator | DataFlow orchestrator | Same concept, better modeling |
| Origin/Fabric | DataFlow cluster | Often matches environment |
| Inputs/Outputs | DataJob dataJobInputOutput | Moved to job level for precision |
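To make the last row concrete: lineage that previously lived in dataProcessInfo moves to the dataJobInputOutput aspect on a specific job. A minimal sketch; the server address and all URNs are hypothetical:

```python
# A sketch of job-level lineage via the dataJobInputOutput aspect.
# Server address and URNs are hypothetical.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
job_urn = (
    "urn:li:dataJob:"
    "(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)"
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInputOutputClass(
            inputDatasets=[
                "urn:li:dataset:(urn:li:dataPlatform:hive,raw.customers,PROD)"
            ],
            outputDatasets=[
                "urn:li:dataset:(urn:li:dataPlatform:hive,clean.customers,PROD)"
            ],
        ),
    )
)
```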
### Migration Steps

To migrate from dataProcess to dataFlow/dataJob (a URN-translation sketch follows this list):

1. **Identify your process structure**: Determine if your dataProcess represents a pipeline (has multiple steps) or a single task.
2. **Create a DataFlow**: This represents the overall pipeline/workflow.
   - Use the same orchestrator value
   - Use the process name as the flow ID
   - Use a cluster identifier (often matches the origin/fabric)
3. **Create DataJob(s)**: Create one or more jobs within the flow.
   - For single-step processes: create one job named after the process
   - For multi-step processes: create a job for each step
   - Link each job to its parent DataFlow
4. **Migrate lineage**: Move input/output dataset relationships from the process level to the job level.
5. **Migrate metadata**: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata).
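The URN translation itself is mechanical. A sketch, assuming the legacy URN exactly follows the urn:li:dataProcess:(<name>,<orchestrator>,<origin>) shape and that a single default task named <name>_task is acceptable; multi-step pipelines need per-step job IDs of your own choosing:

```python
# A sketch of the mechanical URN translation. Assumes the legacy URN has
# exactly the shape urn:li:dataProcess:(<name>,<orchestrator>,<origin>);
# the "<name>_task" default job ID is a convention, not a requirement.
from typing import Tuple


def translate_data_process_urn(process_urn: str) -> Tuple[str, str]:
    """Return (dataFlow URN, default dataJob URN) for a legacy dataProcess URN."""
    prefix = "urn:li:dataProcess:("
    if not (process_urn.startswith(prefix) and process_urn.endswith(")")):
        raise ValueError(f"Not a dataProcess URN: {process_urn}")
    name, orchestrator, origin = process_urn[len(prefix):-1].split(",")
    flow_urn = f"urn:li:dataFlow:({orchestrator},{name},{origin.lower()})"
    job_urn = f"urn:li:dataJob:({flow_urn},{name}_task)"
    return flow_urn, job_urn


# Matches Example 1 below.
assert translate_data_process_urn(
    "urn:li:dataProcess:(daily_report,airflow,PROD)"
) == (
    "urn:li:dataFlow:(airflow,daily_report,prod)",
    "urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)",
)
```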
### Migration Examples

#### Example 1: Simple single-task process

Old dataProcess:

```
urn:li:dataProcess:(daily_report,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)
```

#### Example 2: Multi-step ETL pipeline

Old dataProcess:

```
urn:li:dataProcess:(customer_pipeline,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)
```
## Code Examples

### Querying Existing DataProcess Entities
If you need to query existing dataProcess entities for migration purposes:
Python SDK: Query a dataProcess entity
{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}
### Creating Equivalent DataFlow and DataJob (Recommended)
Instead of using dataProcess, create the modern equivalent:
Python SDK: Create DataFlow and DataJob to replace dataProcess
{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}
### Complete Migration Example
Python SDK: Full migration from dataProcess to dataFlow/dataJob
{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}
## Integration Points

### Historical Usage
The dataProcess entity was previously used by:
- Early ingestion connectors: Original Airflow, Azkaban connectors before they migrated to dataFlow/dataJob
- Custom integrations: User-built integrations that haven't been updated
- Legacy metadata: Historical data in existing DataHub instances
### Modern Replacements
All modern DataHub connectors use dataFlow and dataJob:
- Airflow: DAGs → DataFlow, Tasks → DataJob
- dbt: Projects → DataFlow, Models → DataJob
- Prefect: Flows → DataFlow, Tasks → DataJob
- Dagster: Pipelines → DataFlow, Ops/Assets → DataJob
- Fivetran: Connectors → DataFlow, Sync operations → DataJob
- AWS Glue: Jobs → DataFlow, Steps → DataJob
- Azure Data Factory: Pipelines → DataFlow, Activities → DataJob
### DataProcessInstance
Note that dataProcessInstance is NOT deprecated. It represents a specific execution/run of either:
- A dataJob (recommended)
- A legacy dataProcess (for backward compatibility)
DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.
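A minimal sketch of recording one run of a DataJob with the generated aspect classes. The instance ID, job URN, timestamps, and server address are hypothetical, and the field names follow the generated schema_classes module rather than any documented recipe:

```python
# A sketch of one run of a DataJob, recorded as a dataProcessInstance.
# All URNs, IDs, and the server address are hypothetical.
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DataProcessInstancePropertiesClass,
    DataProcessInstanceRelationshipsClass,
    DataProcessInstanceRunEventClass,
    DataProcessRunStatusClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
now = int(time.time() * 1000)
dpi_urn = "urn:li:dataProcessInstance:customer_pipeline_2024_01_01"
job_urn = (
    "urn:li:dataJob:"
    "(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)"
)

for aspect in [
    # Basic properties of this particular run.
    DataProcessInstancePropertiesClass(
        name="transform_customers run 2024-01-01",
        created=AuditStampClass(time=now, actor="urn:li:corpuser:datahub"),
    ),
    # Tie the run to the DataJob template it instantiates.
    DataProcessInstanceRelationshipsClass(
        parentTemplate=job_urn, upstreamInstances=[]
    ),
    # Mark the run as started; a matching COMPLETE event would follow later.
    DataProcessInstanceRunEventClass(
        timestampMillis=now, status=DataProcessRunStatusClass.STARTED
    ),
]:
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dpi_urn, aspect=aspect))
```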
## Notable Exceptions

### Timeline for Removal
- Deprecated: Early 2021 (with introduction of dataFlow/dataJob)
- Status: Still exists in the entity registry for backward compatibility
- Current State: No active ingestion sources create dataProcess entities
- Removal: No specific timeline, maintained for existing data
### Reading Existing Data
The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.
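For example, fetching the legacy aspect through the Python graph client takes a few lines; a sketch assuming a DataHub instance at http://localhost:8080 and a hypothetical URN:

```python
# A sketch of reading a legacy dataProcess aspect for inspection or
# migration. The server address and URN are hypothetical.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DataProcessInfoClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
info = graph.get_aspect(
    entity_urn="urn:li:dataProcess:(customer_etl_job,airflow,PROD)",
    aspect_type=DataProcessInfoClass,
)
if info is not None:
    print("inputs:", info.inputs)
    print("outputs:", info.outputs)
```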
### No New Writes Recommended
While technically possible to create new dataProcess entities, it is strongly discouraged. All new integrations should use dataFlow and dataJob.
### Upgrade Path
There is no automatic migration tool. Organizations with significant dataProcess data should:
- Use the Python SDK to query existing dataProcess entities
- Create equivalent dataFlow and dataJob entities
- Preserve URN mappings for lineage continuity
- Consider soft-deleting old dataProcess entities once migration is verified (see the sketch below)
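A minimal sketch of that final soft-delete step, setting the status aspect's removed flag on the legacy entity; the URN and server address are hypothetical:

```python
# A sketch of soft-deleting a migrated dataProcess entity by setting the
# status aspect's removed flag. URN and server address are hypothetical.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataProcess:(customer_etl_job,airflow,PROD)",
        aspect=StatusClass(removed=True),
    )
)
```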
### GraphQL API
The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.
