# DataProcess

> **DEPRECATED**: This entity is deprecated and should not be used for new implementations.
>
> **Use [dataFlow](./dataFlow.md) and [dataJob](./dataJob.md) instead.**
>
> The `dataProcess` entity was an early attempt to model data processing tasks but has been superseded by the more robust and flexible `dataFlow` and `dataJob` entities, which better represent the hierarchical nature of modern data pipelines.

## Deprecation Notice

The `dataProcess` entity was deprecated to provide a clearer separation between:

- **DataFlow**: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
- **DataJob**: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)

This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.

### Why was it deprecated?

The original `dataProcess` entity had several limitations:

1. **No hierarchical structure**: It couldn't represent the relationship between a pipeline and its constituent tasks
2. **Limited orchestrator support**: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
3. **Unclear semantics**: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
4. **Poor lineage modeling**: Without task-level granularity, lineage relationships were less precise

The new `dataFlow` and `dataJob` model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.

## Identity (Historical Reference)

DataProcess entities were identified by three components:

- **Name**: The process name (typically an ETL job name)
- **Orchestrator**: The workflow management platform (e.g., `airflow`, `azkaban`)
- **Origin (Fabric)**: The environment where the process runs (PROD, DEV, etc.)

The URN structure was:

```
urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
```

### Example URNs

```
urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
```

## Important Capabilities (Historical Reference)

### DataProcessInfo Aspect

The `dataProcessInfo` aspect captured the inputs and outputs of the process:

- **Inputs**: Array of dataset URNs consumed by the process
- **Outputs**: Array of dataset URNs produced by the process

This established basic lineage through "Consumes" relationships with datasets.
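In the replacement model, this same input/output information lives on the DataJob through its `dataJobInputOutput` aspect. The sketch below shows roughly what that looks like with the DataHub Python emitter; the server URL and all URN values are placeholders, and the class and helper names reflect the SDK's generated `schema_classes` and `mce_builder` modules (verify against your SDK version).

```python
# A minimal sketch: expressing what dataProcessInfo used to capture
# (input and output datasets) on a DataJob via the dataJobInputOutput aspect.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

# The job that replaces the old process (all identifiers are illustrative).
datajob_urn = builder.make_data_job_urn(
    orchestrator="airflow",
    flow_id="customer_etl_job",
    job_id="customer_etl_task",
    cluster="prod",
)

# Input/output lineage, now attached at the job level.
lineage = DataJobInputOutputClass(
    inputDatasets=[builder.make_dataset_urn("hive", "raw.customers", "PROD")],
    outputDatasets=[builder.make_dataset_urn("hive", "clean.customers", "PROD")],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=datajob_urn, aspect=lineage))
```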
### Common Aspects

Like other entities, dataProcess supported:

- **Ownership**: Assigning owners to processes
- **Status**: Marking processes as removed
- **Global Tags**: Categorization and classification
- **Institutional Memory**: Links to documentation

## Migration Guide

### When to use DataFlow vs DataJob

**Use DataFlow when representing:**

- Airflow DAGs
- dbt projects
- Prefect flows
- Dagster pipelines
- Azkaban workflows
- Any container of related data processing tasks

**Use DataJob when representing:**

- Airflow tasks within a DAG
- dbt models within a project
- Prefect tasks within a flow
- Dagster ops/assets within a pipeline
- Individual processing steps

**Use both together:**

- Create a DataFlow for the pipeline
- Create DataJobs for each task within that pipeline
- Link DataJobs to their parent DataFlow

### Conceptual Mapping

| DataProcess Concept | New Model Equivalent       | Notes                             |
| ------------------- | -------------------------- | --------------------------------- |
| Process with tasks  | DataFlow + DataJobs        | Split into two entities           |
| Process name        | DataFlow flowId            | Becomes the parent identifier     |
| Single-step process | DataFlow + 1 DataJob       | Still requires both entities      |
| Orchestrator        | DataFlow orchestrator      | Same concept, better modeling     |
| Origin/Fabric       | DataFlow cluster           | Often matches environment         |
| Inputs/Outputs      | DataJob dataJobInputOutput | Moved to job level for precision  |

### Migration Steps

To migrate from `dataProcess` to `dataFlow`/`dataJob`:

1. **Identify your process structure**: Determine if your dataProcess represents a pipeline (has multiple steps) or a single task
2. **Create a DataFlow**: This represents the overall pipeline/workflow
   - Use the same orchestrator value
   - Use the process name as the flow ID
   - Use a cluster identifier (often matches the origin/fabric)
3. **Create DataJob(s)**: Create one or more jobs within the flow
   - For single-step processes: create one job named after the process
   - For multi-step processes: create a job for each step
   - Link each job to its parent DataFlow
4. **Migrate lineage**: Move input/output dataset relationships from the process level to the job level
5. **Migrate metadata**: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata)

### Migration Examples

**Example 1: Simple single-task process**

Old dataProcess:

```
urn:li:dataProcess:(daily_report,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)
```

**Example 2: Multi-step ETL pipeline**

Old dataProcess:

```
urn:li:dataProcess:(customer_pipeline,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob:  urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)
```

## Code Examples

### Querying Existing DataProcess Entities

If you need to query existing dataProcess entities for migration purposes:
Python SDK: Query a dataProcess entity

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}
```
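Conceptually, such a query reads the legacy `dataProcessInfo` aspect for a given dataProcess URN so that its inputs and outputs can be carried over. A minimal sketch, assuming the `DataHubGraph` client and the generated `DataProcessInfoClass` are available in your SDK version (the URN and server address are placeholders):

```python
# A sketch of reading a legacy dataProcess entity's dataProcessInfo aspect
# so that its lineage can be migrated to a DataJob.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DataProcessInfoClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Legacy URN format: urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
process_urn = "urn:li:dataProcess:(customer_etl_job,airflow,PROD)"

info = graph.get_aspect(entity_urn=process_urn, aspect_type=DataProcessInfoClass)
if info is not None:
    print("inputs:", info.inputs)
    print("outputs:", info.outputs)
```

The returned inputs and outputs are what you would re-attach to the new DataJob's `dataJobInputOutput` aspect.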
### Creating Equivalent DataFlow and DataJob (Recommended)

Instead of using dataProcess, create the modern equivalent:
Python SDK: Create DataFlow and DataJob to replace dataProcess

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}
```
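At the aspect level, the replacement amounts to emitting `dataFlowInfo` for the flow and `dataJobInfo` for each task, with the job's `flowUrn` pointing at its parent. A condensed sketch using the low-level REST emitter; the `type="COMMAND"` value, identifiers, and server address are illustrative, and the class signatures should be checked against your SDK version:

```python
# A sketch of emitting the modern replacement for a dataProcess:
# one DataFlow plus one DataJob linked to it.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInfoClass

flow_urn = builder.make_data_flow_urn(
    orchestrator="airflow", flow_id="daily_report", cluster="prod"
)
job_urn = builder.make_data_job_urn(
    orchestrator="airflow", flow_id="daily_report", job_id="daily_report_task", cluster="prod"
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# The pipeline-level entity.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=flow_urn,
    aspect=DataFlowInfoClass(name="daily_report"),
))

# The task-level entity, linked to its parent flow.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=job_urn,
    aspect=DataJobInfoClass(name="daily_report_task", type="COMMAND", flowUrn=flow_urn),
))
```

The `dataJobInputOutput` sketch shown earlier can then be emitted against the same job URN to carry over the process's lineage.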
### Complete Migration Example
Python SDK: Full migration from dataProcess to dataFlow/dataJob

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}
```
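Once a migration has been verified, the old dataProcess entity can be soft-deleted (hidden from search and the UI while its metadata is retained) by setting the `status` aspect's `removed` flag. A minimal sketch; the URN and server address are placeholders:

```python
# A sketch: soft-delete a legacy dataProcess entity after migration is verified.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataProcess:(customer_etl_job,airflow,PROD)",
    aspect=StatusClass(removed=True),
))
```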
## Integration Points

### Historical Usage

The dataProcess entity was previously used by:

1. **Early ingestion connectors**: Original Airflow and Azkaban connectors, before they migrated to dataFlow/dataJob
2. **Custom integrations**: User-built integrations that haven't been updated
3. **Legacy metadata**: Historical data in existing DataHub instances

### Modern Replacements

All modern DataHub connectors use dataFlow and dataJob:

- **Airflow**: DAGs → DataFlow, Tasks → DataJob
- **dbt**: Projects → DataFlow, Models → DataJob
- **Prefect**: Flows → DataFlow, Tasks → DataJob
- **Dagster**: Pipelines → DataFlow, Ops/Assets → DataJob
- **Fivetran**: Connectors → DataFlow, Sync operations → DataJob
- **AWS Glue**: Jobs → DataFlow, Steps → DataJob
- **Azure Data Factory**: Pipelines → DataFlow, Activities → DataJob

### DataProcessInstance

Note that `dataProcessInstance` is **NOT deprecated**. It represents a specific execution/run of either:

- A dataJob (recommended)
- A legacy dataProcess (for backward compatibility)

DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.

## Notable Exceptions

### Timeline for Removal

- **Deprecated**: Early 2021 (with the introduction of dataFlow/dataJob)
- **Status**: Still exists in the entity registry for backward compatibility
- **Current State**: No active ingestion sources create dataProcess entities
- **Removal**: No specific timeline; maintained for existing data

### Reading Existing Data

The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.

### No New Writes Recommended

While it is technically possible to create new dataProcess entities, doing so is **strongly discouraged**. All new integrations should use dataFlow and dataJob.

### Upgrade Path

There is no automatic migration tool. Organizations with significant dataProcess data should:

1. Use the Python SDK to query existing dataProcess entities
2. Create equivalent dataFlow and dataJob entities
3. Preserve URN mappings for lineage continuity
4. Consider soft-deleting old dataProcess entities once migration is verified

### GraphQL API

The dataProcess entity is only minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.

## Additional Resources

- [DataFlow Entity Documentation](./dataFlow.md)
- [DataJob Entity Documentation](./dataJob.md)
- [Lineage Documentation](../../../features/feature-guides/lineage.md)
- [Airflow Integration Guide](../../../lineage/airflow.md)