datahub/mlflow_post.md at adding-application-entity-sidebar

yujunjun/datahub

mirror of https://github.com/datahub-project/datahub.git synced 2025-07-07 09:11:47 +00:00

Andrew Sikowitz d138a64a6a

ci(graphql,workflows): Format .md, .graphql, and workflow .yml files via prettier (#13220 )

2025-04-16 16:55:51 -07:00

Auth Configuration

You can configure the MLflow source to authenticate with the MLflow server using the username and password configuration options.

source:
  type: mlflow
  config:
    tracking_uri: "http://127.0.0.1:5000"
    username: <username>
    password: <password>
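To keep credentials out of the recipe file, the same configuration can be assembled in code. The following is a minimal sketch, not part of the MLflow source itself: the helper `build_mlflow_recipe` and the environment variable names `MLFLOW_SOURCE_USERNAME` and `MLFLOW_SOURCE_PASSWORD` are hypothetical names chosen for illustration.

```python
import os


def build_mlflow_recipe(tracking_uri: str = "http://127.0.0.1:5000") -> dict:
    """Return a recipe dict mirroring the YAML above, with credentials read from the environment."""
    # Hypothetical variable names for this sketch; use whatever your deployment defines.
    username = os.environ.get("MLFLOW_SOURCE_USERNAME")
    password = os.environ.get("MLFLOW_SOURCE_PASSWORD")
    if not username or not password:
        raise RuntimeError(
            "Set MLFLOW_SOURCE_USERNAME and MLFLOW_SOURCE_PASSWORD before building the recipe"
        )
    return {
        "source": {
            "type": "mlflow",
            "config": {
                "tracking_uri": tracking_uri,
                "username": username,
                "password": password,
            },
        }
    }
```

The returned dict has the same shape as the YAML recipe, so it can be serialized or extended with a sink section before being handed to an ingestion run.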

Dataset Lineage

You can map MLflow run datasets to specific DataHub platforms using the source_mapping_to_platform configuration option. Each key is an MLflow dataset source type, and each value is the DataHub platform to associate with datasets from that source.

Example:

source_mapping_to_platform:
  huggingface: snowflake # Maps Hugging Face datasets to Snowflake platform
  http: s3 # Maps HTTP data sources to s3 platform

By default, DataHub will attempt to connect lineage with existing datasets based on the platform and name, but will not create new datasets if they don't exist.

To enable automatic dataset creation and lineage mapping, use the materialize_dataset_inputs option:

materialize_dataset_inputs: true # Creates new datasets if they don't exist

You can configure these options independently:

# Only map to existing datasets
materialize_dataset_inputs: false
source_mapping_to_platform:
  huggingface: snowflake # Maps Hugging Face datasets to Snowflake platform
  pytorch: snowflake # Maps PyTorch datasets to Snowflake platform

# Create new datasets and map platforms
materialize_dataset_inputs: true
source_mapping_to_platform:
  huggingface: snowflake
  pytorch: snowflake
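The interaction between the two options can be modeled as a small decision function. This is an illustrative sketch of the behavior described above, not DataHub's implementation: the function `plan_lineage`, its return labels, and the rule that unmapped source types are skipped are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def plan_lineage(
    source_type: str,
    dataset_name: str,
    source_mapping_to_platform: Dict[str, str],
    existing_datasets: Set[Tuple[str, str]],
    materialize_dataset_inputs: bool,
) -> str:
    """Return "link", "create_and_link", or "skip" for one MLflow run dataset."""
    platform = source_mapping_to_platform.get(source_type)
    if platform is None:
        # Assumption for this sketch: a source type with no mapping is not linked.
        return "skip"
    if (platform, dataset_name) in existing_datasets:
        # A dataset with this platform and name already exists, so lineage attaches to it.
        return "link"
    if materialize_dataset_inputs:
        # No existing dataset: create one, then attach lineage.
        return "create_and_link"
    # Default behavior: never create datasets that do not already exist.
    return "skip"
```

For example, with `{"huggingface": "snowflake"}` as the mapping and no existing datasets, a Hugging Face input yields `"skip"` when `materialize_dataset_inputs` is false and `"create_and_link"` when it is true.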
