# DataProcess

> **DEPRECATED**: This entity is deprecated and should not be used for new implementations.
>
> **Use [dataFlow](./dataFlow.md) and [dataJob](./dataJob.md) instead.**
>
> The `dataProcess` entity was an early attempt to model data processing tasks, but it has been superseded by the more robust and flexible `dataFlow` and `dataJob` entities, which better represent the hierarchical nature of modern data pipelines.
## Deprecation Notice

The `dataProcess` entity was deprecated to provide a clearer separation between:

- **DataFlow**: Represents the overall pipeline/workflow (e.g., an Airflow DAG, dbt project, Spark application)
- **DataJob**: Represents individual tasks within a pipeline (e.g., an Airflow task, dbt model, Spark job)

This two-level hierarchy better matches how modern orchestration systems organize data processing work and provides more flexibility for lineage tracking, ownership assignment, and operational monitoring.
### Why was it deprecated?

The original `dataProcess` entity had several limitations:

1. **No hierarchical structure**: It couldn't represent the relationship between a pipeline and its constituent tasks
2. **Limited orchestrator support**: The flat structure didn't map well to DAG-based orchestration platforms like Airflow, Prefect, or Dagster
3. **Unclear semantics**: It was ambiguous whether a dataProcess represented a whole pipeline or a single task
4. **Poor lineage modeling**: Without task-level granularity, lineage relationships were less precise

The new `dataFlow` and `dataJob` model addresses these limitations by providing a clear parent-child relationship that mirrors real-world data processing architectures.
## Identity (Historical Reference)

DataProcess entities were identified by three components:

- **Name**: The process name (typically an ETL job name)
- **Orchestrator**: The workflow management platform (e.g., `airflow`, `azkaban`)
- **Origin (Fabric)**: The environment where the process runs (PROD, DEV, etc.)

The URN structure was:

```
urn:li:dataProcess:(<name>,<orchestrator>,<origin>)
```

### Example URNs

```
urn:li:dataProcess:(customer_etl_job,airflow,PROD)
urn:li:dataProcess:(sales_aggregation,azkaban,DEV)
```
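If a migration script needs to reference legacy entities, the deprecated URN format shown above can be assembled with plain string formatting. The snippet below is a minimal sketch; the process and orchestrator names are hypothetical examples, not values from your instance.

```python
# Minimal sketch: building a legacy dataProcess URN by hand for migration scripts.
# The process name and orchestrator below are hypothetical examples.
def make_legacy_data_process_urn(name: str, orchestrator: str, origin: str = "PROD") -> str:
    """Assemble the deprecated URN: urn:li:dataProcess:(<name>,<orchestrator>,<origin>)."""
    return f"urn:li:dataProcess:({name},{orchestrator},{origin})"


print(make_legacy_data_process_urn("customer_etl_job", "airflow"))
# -> urn:li:dataProcess:(customer_etl_job,airflow,PROD)
```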
## Important Capabilities (Historical Reference)

### DataProcessInfo Aspect

The `dataProcessInfo` aspect captured the inputs and outputs of the process:

- **Inputs**: Array of dataset URNs consumed by the process
- **Outputs**: Array of dataset URNs produced by the process

This established basic lineage through "Consumes" relationships with datasets.
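In the current model, the same input/output information lives on the `dataJobInputOutput` aspect of a DataJob. The following is a minimal sketch of emitting that aspect with the Python SDK; the pipeline, task, and dataset names and the local GMS endpoint are illustrative assumptions.

```python
# Minimal sketch: job-level lineage, the modern replacement for dataProcessInfo inputs/outputs.
# Pipeline/task/dataset names and the GMS endpoint are hypothetical.
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInputOutputClass

job_urn = make_data_job_urn(
    orchestrator="airflow", flow_id="customer_etl_job", job_id="transform_customers", cluster="prod"
)

# Inputs/outputs are attached to the task (DataJob), not to the pipeline as a whole.
lineage = DataJobInputOutputClass(
    inputDatasets=[make_dataset_urn(platform="hive", name="raw.customers", env="PROD")],
    outputDatasets=[make_dataset_urn(platform="hive", name="clean.customers", env="PROD")],
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=lineage))
```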
### Common Aspects

Like other entities, dataProcess supported:

- **Ownership**: Assigning owners to processes
- **Status**: Marking processes as removed
- **Global Tags**: Categorization and classification
- **Institutional Memory**: Links to documentation
## Migration Guide

### When to use DataFlow vs DataJob

**Use DataFlow when representing:**

- Airflow DAGs
- dbt projects
- Prefect flows
- Dagster pipelines
- Azkaban workflows
- Any container of related data processing tasks

**Use DataJob when representing:**

- Airflow tasks within a DAG
- dbt models within a project
- Prefect tasks within a flow
- Dagster ops/assets within a pipeline
- Individual processing steps

**Use both together:**

- Create a DataFlow for the pipeline
- Create DataJobs for each task within that pipeline
- Link DataJobs to their parent DataFlow (see the sketch after this list)
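The following is a minimal sketch of that pattern using the DataHub Python emitter: one DataFlow for the pipeline and one DataJob per task, each job linked to its parent flow. The orchestrator, pipeline and task names, and the local GMS endpoint are illustrative assumptions.

```python
# Minimal sketch: one DataFlow (the pipeline) plus DataJobs (its tasks), linked by URN.
# Orchestrator, pipeline, task names, and the GMS endpoint are hypothetical.
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInfoClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Parent pipeline.
flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="customer_pipeline", cluster="prod")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="customer_pipeline"),
    )
)

# Child tasks, each pointing back at the parent flow via its URN.
for task in ["extract_customers", "transform_customers", "load_customers"]:
    job_urn = make_data_job_urn(
        orchestrator="airflow", flow_id="customer_pipeline", job_id=task, cluster="prod"
    )
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=job_urn,
            aspect=DataJobInfoClass(name=task, type="COMMAND", flowUrn=flow_urn),
        )
    )
```

The job URN embeds the flow URN, and the `flowUrn` field on `dataJobInfo` is what ties each task to its parent pipeline.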
### Conceptual Mapping

| DataProcess Concept | New Model Equivalent       | Notes                            |
| ------------------- | -------------------------- | -------------------------------- |
| Process with tasks  | DataFlow + DataJobs        | Split into two entities          |
| Process name        | DataFlow flowId            | Becomes the parent identifier    |
| Single-step process | DataFlow + 1 DataJob       | Still requires both entities     |
| Orchestrator        | DataFlow orchestrator      | Same concept, better modeling    |
| Origin/Fabric       | DataFlow cluster           | Often matches environment        |
| Inputs/Outputs      | DataJob dataJobInputOutput | Moved to job level for precision |
### Migration Steps

To migrate from `dataProcess` to `dataFlow`/`dataJob`:

1. **Identify your process structure**: Determine whether your dataProcess represents a pipeline (has multiple steps) or a single task

2. **Create a DataFlow**: This represents the overall pipeline/workflow

   - Use the same orchestrator value
   - Use the process name as the flow ID
   - Use a cluster identifier (often matches the origin/fabric)

3. **Create DataJob(s)**: Create one or more jobs within the flow

   - For single-step processes: create one job named after the process
   - For multi-step processes: create a job for each step
   - Link each job to its parent DataFlow

4. **Migrate lineage**: Move input/output dataset relationships from the process level to the job level

5. **Migrate metadata**: Transfer ownership, tags, and documentation to the appropriate entity (typically the DataFlow for pipeline-level metadata, or specific DataJobs for task-level metadata); a sketch of transferring ownership follows this list
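As an illustration of step 5, here is a minimal sketch of re-attaching an owner to the replacement DataFlow. The pipeline name, owner, and GMS endpoint are assumptions made for the example.

```python
# Minimal sketch: re-attaching ownership to the replacement DataFlow (step 5).
# The pipeline name, owner, and GMS endpoint are hypothetical.
from datahub.emitter.mce_builder import make_data_flow_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OwnerClass, OwnershipClass, OwnershipTypeClass

flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="customer_pipeline", cluster="prod")

# Carry over the owner that previously lived on the dataProcess entity.
ownership = OwnershipClass(
    owners=[OwnerClass(owner=make_user_urn("jdoe"), type=OwnershipTypeClass.DATAOWNER)]
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(MetadataChangeProposalWrapper(entityUrn=flow_urn, aspect=ownership))
```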
### Migration Examples

**Example 1: Simple single-task process**

Old dataProcess:

```
urn:li:dataProcess:(daily_report,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,daily_report,prod)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,daily_report,prod),daily_report_task)
```

**Example 2: Multi-step ETL pipeline**

Old dataProcess:

```
urn:li:dataProcess:(customer_pipeline,airflow,PROD)
```

New structure:

```
DataFlow: urn:li:dataFlow:(airflow,customer_pipeline,prod)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),extract_customers)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),transform_customers)
DataJob: urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),load_customers)
```
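Rather than assembling the new URNs by hand, they can be produced with the SDK's URN builders. A minimal sketch, reusing the pipeline and task names from Example 2:

```python
# Minimal sketch: producing the Example 2 URNs with the SDK's URN builders.
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn

flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="customer_pipeline", cluster="prod")
print(flow_urn)  # urn:li:dataFlow:(airflow,customer_pipeline,prod)

for task in ["extract_customers", "transform_customers", "load_customers"]:
    # urn:li:dataJob:(urn:li:dataFlow:(airflow,customer_pipeline,prod),<task>)
    print(make_data_job_urn(orchestrator="airflow", flow_id="customer_pipeline", job_id=task, cluster="prod"))
```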
## Code Examples

### Querying Existing DataProcess Entities

If you need to query existing dataProcess entities for migration purposes:

<details>
<summary>Python SDK: Query a dataProcess entity</summary>

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_query_deprecated.py show_path_as_comment }}
```

</details>

### Creating Equivalent DataFlow and DataJob (Recommended)

Instead of using dataProcess, create the modern equivalent:

<details>
<summary>Python SDK: Create DataFlow and DataJob to replace dataProcess</summary>

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_migrate_to_flow_job.py show_path_as_comment }}
```

</details>

### Complete Migration Example

<details>
<summary>Python SDK: Full migration from dataProcess to dataFlow/dataJob</summary>

```python
{{ inline /metadata-ingestion/examples/library/dataprocess_full_migration.py show_path_as_comment }}
```

</details>
## Integration Points

### Historical Usage

The dataProcess entity was previously used by:

1. **Early ingestion connectors**: The original Airflow and Azkaban connectors, before they migrated to dataFlow/dataJob
2. **Custom integrations**: User-built integrations that haven't been updated
3. **Legacy metadata**: Historical data in existing DataHub instances

### Modern Replacements

All modern DataHub connectors use dataFlow and dataJob:

- **Airflow**: DAGs → DataFlow, Tasks → DataJob
- **dbt**: Projects → DataFlow, Models → DataJob
- **Prefect**: Flows → DataFlow, Tasks → DataJob
- **Dagster**: Pipelines → DataFlow, Ops/Assets → DataJob
- **Fivetran**: Connectors → DataFlow, Sync operations → DataJob
- **AWS Glue**: Jobs → DataFlow, Steps → DataJob
- **Azure Data Factory**: Pipelines → DataFlow, Activities → DataJob
### DataProcessInstance

Note that `dataProcessInstance` is **NOT deprecated**. It represents a specific execution/run of either:

- A dataJob (recommended)
- A legacy dataProcess (for backward compatibility)

DataProcessInstance continues to be used for tracking pipeline run history, status, and runtime information.
## Notable Exceptions

### Timeline for Removal

- **Deprecated**: Early 2021 (with the introduction of dataFlow/dataJob)
- **Status**: Still exists in the entity registry for backward compatibility
- **Current State**: No active ingestion sources create dataProcess entities
- **Removal**: No specific timeline; maintained for existing data
### Reading Existing Data

The dataProcess entity remains readable through all DataHub APIs for backward compatibility. Existing dataProcess entities in your instance will continue to function and display in the UI.

### No New Writes Recommended

While it is technically possible to create new dataProcess entities, doing so is **strongly discouraged**. All new integrations should use dataFlow and dataJob.
### Upgrade Path

There is no automatic migration tool. Organizations with significant dataProcess data should:

1. Use the Python SDK to query existing dataProcess entities
2. Create equivalent dataFlow and dataJob entities
3. Preserve URN mappings for lineage continuity
4. Consider soft-deleting old dataProcess entities once the migration is verified (see the sketch after this list)
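A minimal sketch of steps 1 and 4 is shown below, assuming a DataHub instance at `http://localhost:8080`, a token in the `DATAHUB_TOKEN` environment variable, and that your instance still indexes the legacy `dataProcess` entity type; the soft delete is performed by writing a `status` aspect with `removed=True`.

```python
# Minimal sketch: enumerate legacy dataProcess entities and soft-delete them (upgrade path steps 1 and 4).
# Server address and token handling are assumptions; verify the new dataFlow/dataJob entities first.
import os

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import StatusClass

graph = DataHubGraph(
    DatahubClientConfig(server="http://localhost:8080", token=os.environ.get("DATAHUB_TOKEN"))
)

# Step 1: find the legacy entities (assumes the dataProcess type is still searchable in your instance).
legacy_urns = list(graph.get_urns_by_filter(entity_types=["dataProcess"]))

# Step 4: soft-delete each one by writing a status aspect with removed=True.
for urn in legacy_urns:
    graph.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=StatusClass(removed=True)))
```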
### GraphQL API

The dataProcess entity is minimally exposed in the GraphQL API. Modern GraphQL queries and mutations focus on dataFlow and dataJob entities.
## Additional Resources

- [DataFlow Entity Documentation](./dataFlow.md)
- [DataJob Entity Documentation](./dataJob.md)
- [Lineage Documentation](../../../features/feature-guides/lineage.md)
- [Airflow Integration Guide](../../../lineage/airflow.md)