OpenMetadata/data-profiler.md at fd4efb0c1e7d1e55a0e55a1bec0a1be17953b51d

mirror of https://github.com/open-metadata/OpenMetadata.git synced 2025-12-04 11:33:07 +00:00

Pere Miquel Brull 613fd331e0

MINOR - Clean up configs & add auto classification docs (#18907 )

* MINOR - Clean up configs & add auto classification docs

* deprecation notice

2024-12-04 09:32:25 +01:00


Data Profiler

The Data Profiler workflow uses the orm-profiler processor.

After running a Metadata Ingestion workflow, we can run the Data Profiler workflow. The serviceName must be the same one that was used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.

1. Define the YAML Config

This is a sample config for the profiler:

{% codePreview %}

{% codeInfoContainer %}

Source Configuration - Source Config

You can find all the definitions and types for the sourceConfig here.

{% codeInfo srNumber=14 %}

profileSample: Percentage of the data, or number of rows, on which to run the profiler and tests.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

threadCount: Number of threads to use during metric computations.

{% /codeInfo %}

{% codeInfo srNumber=18 %}

timeoutSeconds: Profiler timeout, in seconds.

{% /codeInfo %}

{% codeInfo srNumber=19 %}

databaseFilterPattern: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

schemaFilterPattern: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

tableFilterPattern: Regex to only fetch tables that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

Processor Configuration

Choose the orm-profiler processor. Its config can also be updated to define tests in the YAML itself instead of through the UI:

tableConfig: Lets you set profiler configuration at the table level. {% /codeInfo %}

{% codeInfo srNumber=23 %}

Sink Configuration

To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest. {% /codeInfo %}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

source:
  type: {% $connector %}
  serviceName: {% $connector %}
  sourceConfig:
    config:
      type: Profiler

      # profileSample: 85

      # threadCount: 5

      # timeoutSeconds: 43200

      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4

      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4

      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4

processor:
  type: orm-profiler
  config: {}  # Remove braces if adding properties
    # tableConfig:
    #   - fullyQualifiedName: <table fqn>
    #     profileSample: <number between 0 and 99> # default will be 100 if omitted
    #     profileQuery: <query to use for sampling data for the profiler>
    #     columnConfig:
    #       excludeColumns:
    #         - <column name>
    #       includeColumns:
    #         - columnName: <column name>
    #           metrics:
    #             - MEAN
    #             - MEDIAN
    #             - ...
    #     partitionConfig:
    #       enablePartitioning: <set to true to use partitioning>
    #       partitionColumnName: <partition column name>
    #       partitionIntervalType: <TIME-UNIT, INTEGER-RANGE, INGESTION-TIME, COLUMN-VALUE>
    #       # Pick one of the variations shown below
    #       # ----'TIME-UNIT' or 'INGESTION-TIME'-------
    #       partitionInterval: <partition interval>
    #       partitionIntervalUnit: <YEAR, MONTH, DAY, HOUR>
    #       # ------------'INTEGER-RANGE'---------------
    #       partitionIntegerRangeStart: <integer>
    #       partitionIntegerRangeEnd: <integer>
    #       # -----------'COLUMN-VALUE'----------------
    #       partitionValues:
    #         - <value>
    #         - <value>

sink:
  type: metadata-rest
  config: {}

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}
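To illustrate how the includes and excludes lists in the filter patterns above might interact, here is a minimal Python sketch. It is not OpenMetadata's implementation: the helper `keep_name` is hypothetical, and the semantics are an assumption (a name is kept when it matches at least one include pattern, or no includes are given, and matches no exclude pattern). Check the OpenMetadata source for the exact precedence rules.

```python
import re
from typing import Iterable, Optional


def keep_name(
    name: str,
    includes: Optional[Iterable[str]] = None,
    excludes: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical helper: return True if `name` passes the filter pattern.

    Assumed semantics (not confirmed by this page): keep a name when it
    matches at least one include regex (or there are no includes) and it
    matches no exclude regex. Regexes are matched from the start of the name.
    """
    include_list = list(includes or [])
    exclude_list = list(excludes or [])
    if include_list and not any(re.match(p, name) for p in include_list):
        return False
    return not any(re.match(p, name) for p in exclude_list)


# Example input modeled on the tableFilterPattern in the sample config.
tables = ["table1", "table2", "table3", "orders"]
kept = [t for t in tables if keep_name(t, includes=["table.*"], excludes=["table3"])]
print(kept)  # ['table1', 'table2']
```

Because the entries are regular expressions, a plain value such as table1 would also match table10 under these assumptions; anchor the pattern (for example `^table1$`) when an exact name is intended.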

You can learn more about how to configure and run the Profiler workflow to extract profiler data and execute Data Quality tests here.
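The partitionConfig comments in the sample config list different keys for each partitionIntervalType. The sketch below encodes that mapping so a config can be checked before it is run. The helper `missing_partition_fields` is hypothetical, and it assumes that partitionColumnName is always needed because the sample lists it before the per-type keys; the OpenMetadata schema may be more permissive.

```python
from typing import Dict, List

# Keys each partitionIntervalType expects, read off the commented sample config.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "TIME-UNIT": ["partitionInterval", "partitionIntervalUnit"],
    "INGESTION-TIME": ["partitionInterval", "partitionIntervalUnit"],
    "INTEGER-RANGE": ["partitionIntegerRangeStart", "partitionIntegerRangeEnd"],
    "COLUMN-VALUE": ["partitionValues"],
}


def missing_partition_fields(partition_config: dict) -> List[str]:
    """Hypothetical helper: return the keys still missing from a partitionConfig."""
    interval_type = partition_config.get("partitionIntervalType")
    if interval_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown partitionIntervalType: {interval_type!r}")
    # Assumption: partitionColumnName is required for every interval type.
    required = ["partitionColumnName"] + REQUIRED_FIELDS[interval_type]
    return [key for key in required if key not in partition_config]


# Example: a TIME-UNIT partition that forgot its interval unit.
# "created_at" is a made-up column name used only for illustration.
example = {
    "enablePartitioning": True,
    "partitionColumnName": "created_at",
    "partitionIntervalType": "TIME-UNIT",
    "partitionInterval": 1,
}
print(missing_partition_fields(example))  # ['partitionIntervalUnit']
```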

2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

metadata profile -c <path-to-yaml>

Note that instead of running ingest, we now use the profile command to select the Profiler workflow.
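If the profiler needs to be triggered from a script or scheduler, one option is to wrap the same CLI call with Python's standard subprocess module. This is a minimal sketch, not an official integration: the function names are hypothetical, and it assumes the `metadata` CLI is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_profile_command(config_path: str) -> List[str]:
    """Build the argument list for `metadata profile -c <path-to-yaml>`."""
    return ["metadata", "profile", "-c", config_path]


def run_profiler(config_path: str) -> int:
    """Run the Profiler workflow through the CLI and return its exit code.

    Assumes the `metadata` executable is available on the PATH.
    """
    if not Path(config_path).is_file():
        raise FileNotFoundError(f"Config not found: {config_path}")
    completed = subprocess.run(build_profile_command(config_path), check=False)
    return completed.returncode
```

Calling `run_profiler("profiler.yaml")` (a placeholder file name) would then run the workflow and return the CLI's exit code, which the caller can check to decide whether to retry or alert.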

{% tilesContainer %}

{% tile title="Data Profiler" description="Find more information about the Data Profiler here" link="/how-to-guides/data-quality-observability/profiler/workflow" / %}

{% /tilesContainer %}
