docs: add sdk entity guides (#13870)
This commit is contained in:
parent a3688f78e7
commit 827a2308cd
@ -862,6 +862,10 @@ module.exports = {
      "docs/api/tutorials/domains",
      "docs/api/tutorials/forms",
      "docs/api/tutorials/lineage",
      "docs/api/tutorials/container",
      "docs/api/tutorials/dashboard-chart",
      "docs/api/tutorials/dataflow-datajob",
      "docs/api/tutorials/mlmodel-mlmodelgroup",
      {
        type: "doc",
        id: "docs/api/tutorials/ml",

docs/api/tutorials/container.md (new file, 39 lines)
@ -0,0 +1,39 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Container

## Why Would You Use Containers?

The Container entity represents a logical grouping of entities such as datasets, data processing instances, or even other containers. It helps users organize and manage metadata in a hierarchical structure, making it easier to navigate and understand the relationships between entities.

#### How is a Container related to other entities?

1. **Parent-Child Relationship**: Containers can contain other entities such as datasets, charts, dashboards, data jobs, ML models, and even other containers (nested containers). For example, a dataset can have a container aspect that links it to the schema or folder (container) it belongs to.

2. **Hierarchical Organization**: Containers can be nested, forming a hierarchy (e.g., a database container contains schema containers, which in turn contain table containers). This enables a folder-like browsing experience in the DataHub UI.

3. **Relationships in the Metadata Model**: Many entities (datasets, data jobs, ML models, etc.) have a container aspect that links them to their parent container. Containers themselves can have a parentContainer aspect for nesting.

### Goal Of This Guide

This guide will show you how to:

- Create a container

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).

## Create Container

<Tabs>
<TabItem value="python" label="Python" default>

```python
{{ inline /metadata-ingestion/examples/library/create_container.py show_path_as_comment }}
```

</TabItem>
</Tabs>
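
The container's URN is generated from the fields of its `ContainerKey`, so upserting the same key again updates the same container rather than creating a duplicate. Below is a minimal sketch (assuming the connection settings used by `DataHubClient.from_env()` are configured, as in the example above) that also reads the container back after the upsert:

```python
from datahub.emitter.mcp_builder import ContainerKey
from datahub.sdk import Container, DataHubClient

client = DataHubClient.from_env()

# the key's fields (platform + name) determine the container's URN
key = ContainerKey(platform="mlflow", name="airline_forecast_experiment")

container = Container(container_key=key, display_name="Airline Forecast Experiment")
print("Container URN:", container.urn)

client.entities.upsert(container)

# fetch the same container back from DataHub using its URN
container_entity = client.entities.get(container.urn)
print("Fetched container:", container_entity.urn)
```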

docs/api/tutorials/dashboard-chart.md (new file, 77 lines)
@ -0,0 +1,77 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Dashboard & Chart

## Why Would You Use Dashboards and Charts?

The Dashboard and Chart entities are used to represent visualizations of data, typically in the context of business intelligence or analytics platforms. They allow users to create, manage, and share visual representations of data insights.

### Goal Of This Guide

This guide will show you how to:

- Create a dashboard and a chart.
- Link the dashboard to the chart or to another dashboard.
- Read dashboard and chart entities.

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).

## Create Chart

```python
{{ inline /metadata-ingestion/examples/library/create_chart.py show_path_as_comment }}
```

### Link Chart with Datasets

You can associate datasets with a chart by passing dataset URNs in the `input_datasets` parameter. This creates lineage between the chart and the datasets, so you can track the data sources the chart uses.

```python
{{ inline /metadata-ingestion/examples/library/create_chart_complex.py show_path_as_comment }}
```
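
For reference, a minimal standalone sketch of the same idea; the fuller example is inlined above, and the HDFS dataset used here is illustrative rather than part of the sample data:

```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import Chart, DataHubClient

client = DataHubClient.from_env()

# link the chart to an upstream dataset via input_datasets
chart = Chart(
    name="example_chart",
    platform="looker",
    description="looker chart for production",
    input_datasets=[
        DatasetUrn(platform="hdfs", name="dataset1", env="PROD"),
    ],
)

client.entities.upsert(chart)
```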

## Create Dashboard

```python
{{ inline /metadata-ingestion/examples/library/create_dashboard.py show_path_as_comment }}
```

### Link Dashboard with Charts, Dashboards, and Datasets

You can associate charts, other dashboards, and datasets with a dashboard by passing their URNs in the `charts`, `dashboards`, and `input_datasets` parameters, respectively. This creates lineage between the dashboard and the associated entities.

```python
{{ inline /metadata-ingestion/examples/library/create_dashboard_complex.py show_path_as_comment }}
```
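
A compact sketch of the same pattern, assuming the linked chart already exists (for example, the one created above); the dataset URN is again illustrative:

```python
from datahub.metadata.urns import DatasetUrn
from datahub.sdk import Chart, Dashboard, DataHubClient

client = DataHubClient.from_env()

# an existing chart to attach to the dashboard
chart = Chart(name="example_chart", platform="looker")
client.entities.upsert(chart)

# link the dashboard to the chart and to an upstream dataset
dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    description="looker dashboard for production",
    charts=[chart.urn],
    input_datasets=[
        DatasetUrn(platform="hdfs", name="dataset1", env="PROD"),
    ],
)

client.entities.upsert(dashboard)
```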

## Read Chart

```python
{{ inline /metadata-ingestion/examples/library/read_chart.py show_path_as_comment }}
```

#### Expected Output

```python
>> Chart name: example_chart
>> Chart platform: urn:li:dataPlatform:looker
>> Chart description: looker chart for production
```

## Read Dashboard

```python
{{ inline /metadata-ingestion/examples/library/read_dashboard.py show_path_as_comment }}
```

#### Expected Output

```python
>> Dashboard name: example_dashboard
>> Dashboard platform: urn:li:dataPlatform:looker
>> Dashboard description: looker dashboard for production
```

docs/api/tutorials/dataflow-datajob.md (new file, 78 lines)
@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# DataFlow & DataJob

## Why Would You Use DataFlow and DataJob?

The DataFlow and DataJob entities are used to represent data processing pipelines and jobs within a data ecosystem. They allow users to define, manage, and monitor the flow of data through the various stages of processing, from ingestion to transformation and storage.

### Goal Of This Guide

This guide will show you how to:

- Create a DataFlow.
- Create a DataJob and associate it with a DataFlow.

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).

## Create DataFlow

<Tabs>
<TabItem value="python" label="Python" default>

```python
{{ inline /metadata-ingestion/examples/library/create_dataflow.py show_path_as_comment }}
```

</TabItem>
</Tabs>

## Create DataJob

A DataJob must be associated with a DataFlow. You can create a DataJob by providing either the DataFlow object or the DataFlow URN along with its platform instance.

<Tabs>
<TabItem value="datajob-with-flow-object" label="Create DataJob with a DataFlow Object" default>

```python
{{ inline /metadata-ingestion/examples/library/create_datajob.py show_path_as_comment }}
```

</TabItem>
<TabItem value="datajob-with-flow-urn" label="Create DataJob with DataFlow URN">

```python
{{ inline /metadata-ingestion/examples/library/create_datajob_with_flow_urn.py show_path_as_comment }}
```

</TabItem>
</Tabs>
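
Both approaches describe the same DataJob; they differ only in how the parent flow is referenced. A condensed sketch of the two options side by side, reusing the flow and job names from the examples above:

```python
from datahub.metadata.urns import DataFlowUrn
from datahub.sdk import DataFlow, DataHubClient, DataJob

client = DataHubClient.from_env()

# Option 1: pass the DataFlow object; the job inherits its platform and platform instance
dataflow = DataFlow(platform="airflow", name="example_dag", platform_instance="PROD")
datajob = DataJob(name="example_datajob", flow=dataflow)

# Option 2: pass the DataFlow URN and supply the platform instance explicitly
datajob_from_urn = DataJob(
    name="example_datajob",
    flow_urn=DataFlowUrn(orchestrator="airflow", flow_id="example_dag", cluster="PROD"),
    platform_instance="PROD",
)

client.entities.upsert(dataflow)
client.entities.upsert(datajob)
```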

## Read DataFlow

```python
{{ inline /metadata-ingestion/examples/library/read_dataflow.py show_path_as_comment }}
```

#### Example Output

```python
>> DataFlow name: example_dataflow
>> DataFlow platform: urn:li:dataPlatform:airflow
>> DataFlow description: airflow pipeline for production
```

## Read DataJob

```python
{{ inline /metadata-ingestion/examples/library/read_datajob.py show_path_as_comment }}
```

#### Example Output

```python
>> DataJob name: example_datajob
>> DataJob Flow URN: urn:li:dataFlow:(airflow,PROD.example_dag,PROD)
>> DataJob description: example datajob
```

docs/api/tutorials/mlmodel-mlmodelgroup.md (new file, 78 lines)
@ -0,0 +1,78 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# MLModel & MLModelGroup

## Why Would You Use MLModel and MLModelGroup?

The MLModel and MLModelGroup entities are used to represent machine learning models and their associated groups within a metadata ecosystem. They allow users to define, manage, and monitor machine learning models, including their versions, configurations, and performance metrics.

### Goal Of This Guide

This guide will show you how to:

- Create an MLModel or MLModelGroup.
- Associate an MLModel with an MLModelGroup.
- Read MLModel and MLModelGroup entities.

## Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data.
For detailed steps, please refer to the [DataHub Quickstart Guide](/docs/quickstart.md).

## Create MLModelGroup

You can create an MLModelGroup by providing attributes such as its name, platform, and other metadata.

```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel_group.py show_path_as_comment }}
```

## Create MLModel

You can create an MLModel by providing attributes such as its name, platform, and other metadata.

```python
{{ inline /metadata-ingestion/examples/library/create_mlmodel.py show_path_as_comment }}
```

Note that you can associate an MLModel with an MLModelGroup by providing the group URN when creating the MLModel.

You can also set the MLModelGroup later by updating the MLModel entity, as shown below.

```python
{{ inline /metadata-ingestion/examples/library/add_mlgroup_to_mlmodel.py show_path_as_comment }}
```

## Read MLModelGroup

You can read an MLModelGroup by providing the group URN.

```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel_group.py show_path_as_comment }}
```

#### Expected Output

```python
>> Model Group Name: My Recommendations Model Group
>> Model Group Description: A group for recommendations models
>> Model Group Custom Properties: {'owner': 'John Doe', 'team': 'recommendations', 'domain': 'marketing'}
```

## Read MLModel

You can read an MLModel by providing the model URN.

```python
{{ inline /metadata-ingestion/examples/library/read_mlmodel.py show_path_as_comment }}
```

#### Expected Output

```python
>> Model Name: My Recommendations Model
>> Model Description: A model for recommending products to users
>> Model Group: urn:li:mlModelGroup:(urn:li:dataPlatform:mlflow,my-recommendations-model,PROD)
>> Model Hyper Parameters: [MLHyperParamClass({'name': 'learning_rate', 'description': None, 'value': '0.01', 'createdAt': None}), MLHyperParamClass({'name': 'num_epochs', 'description': None, 'value': '100', 'createdAt': None}), MLHyperParamClass({'name': 'batch_size', 'description': None, 'value': '32', 'createdAt': None})]
```

metadata-ingestion/examples/library/create_container.py (new file, 13 lines)
@ -0,0 +1,13 @@
from datahub.emitter.mcp_builder import ContainerKey
from datahub.sdk import Container, DataHubClient

client = DataHubClient.from_env()

# the container's URN is generated from the fields of its ContainerKey
container = Container(
    container_key=ContainerKey(platform="mlflow", name="airline_forecast_experiment"),
    display_name="Airline Forecast Experiment",
)

client.entities.upsert(container)

metadata-ingestion/examples/library/create_datajob_with_flow_urn.py (new file, 24 lines)
@ -0,0 +1,24 @@
from datahub.metadata.urns import DataFlowUrn, DatasetUrn
from datahub.sdk import DataHubClient, DataJob

client = DataHubClient.from_env()

# the datajob inherits its platform from the flow URN; since only the URN is
# provided (not a DataFlow object), the platform instance is passed explicitly
datajob = DataJob(
    name="example_datajob",
    flow_urn=DataFlowUrn(
        orchestrator="airflow",
        flow_id="example_dag",
        cluster="PROD",
    ),
    platform_instance="PROD",
    inlets=[
        DatasetUrn(platform="hdfs", name="dataset1", env="PROD"),
    ],
    outlets=[
        DatasetUrn(platform="hdfs", name="dataset2", env="PROD"),
    ],
)

client.entities.upsert(datajob)

metadata-ingestion/examples/library/read_chart.py (new file, 19 lines)
@ -0,0 +1,19 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Chart, DataHubClient

client = DataHubClient.from_env()

chart = Chart(
    name="example_chart",
    platform="looker",
    description="looker chart for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)

client.entities.upsert(chart)

# fetch the chart back from DataHub and read its metadata
chart_entity = client.entities.get(chart.urn)

print("Chart name:", chart_entity.name)
print("Chart platform:", chart_entity.platform)
print("Chart description:", chart_entity.description)

metadata-ingestion/examples/library/read_dashboard.py (new file, 18 lines)
@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import Dashboard, DataHubClient

client = DataHubClient.from_env()

dashboard = Dashboard(
    name="example_dashboard",
    platform="looker",
    description="looker dashboard for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)

client.entities.upsert(dashboard)

# fetch the dashboard back from DataHub and read its metadata
dashboard_entity = client.entities.get(dashboard.urn)
print("Dashboard name:", dashboard_entity.name)
print("Dashboard platform:", dashboard_entity.platform)
print("Dashboard description:", dashboard_entity.description)

metadata-ingestion/examples/library/read_dataflow.py (new file, 18 lines)
@ -0,0 +1,18 @@
from datahub.metadata.urns import TagUrn
from datahub.sdk import DataFlow, DataHubClient

client = DataHubClient.from_env()

dataflow = DataFlow(
    name="example_dataflow",
    platform="airflow",
    description="airflow pipeline for production",
    tags=[TagUrn(name="production"), TagUrn(name="data_engineering")],
)

client.entities.upsert(dataflow)

# fetch the dataflow back from DataHub and read its metadata
dataflow_entity = client.entities.get(dataflow.urn)
print("DataFlow name:", dataflow_entity.name)
print("DataFlow platform:", dataflow_entity.platform)
print("DataFlow description:", dataflow_entity.description)

metadata-ingestion/examples/library/read_datajob.py (new file, 25 lines)
@ -0,0 +1,25 @@
from datahub.sdk import DataFlow, DataHubClient, DataJob

client = DataHubClient.from_env()

dataflow = DataFlow(
    platform="airflow",
    name="example_dag",
    platform_instance="PROD",
)

# datajob will inherit the platform and platform instance from the flow
datajob = DataJob(
    name="example_datajob",
    description="example datajob",
    flow=dataflow,
)

client.entities.upsert(dataflow)
client.entities.upsert(datajob)

# fetch the datajob back from DataHub and read its metadata
datajob_entity = client.entities.get(datajob.urn)

print("DataJob name:", datajob_entity.name)
print("DataJob Flow URN:", datajob_entity.flow_urn)
print("DataJob description:", datajob_entity.description)