Pere Miquel Brull 613fd331e0
MINOR - Clean up configs & add auto classification docs (#18907)
* MINOR - Clean up configs & add auto classification docs

* deprecation notice
2024-12-04 09:32:25 +01:00

Data Profiler

The Data Profiler workflow uses the orm-profiler processor.

After running a Metadata Ingestion workflow, we can run the Data Profiler workflow. The serviceName must be the same one that was used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.

1. Define the YAML Config

This is a sample config for the profiler:

{% codePreview %}

{% codeInfoContainer %}

Source Configuration - Source Config

You can find all the definitions and types for the sourceConfig here.

{% codeInfo srNumber=14 %}

profileSample: Percentage of data or number of rows on which we want to run the profiler and tests.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

threadCount: Number of threads to use during metric computations.

{% /codeInfo %}

{% codeInfo srNumber=18 %}

timeoutSeconds: Profiler timeout, in seconds.

{% /codeInfo %}

{% codeInfo srNumber=19 %}

databaseFilterPattern: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

schemaFilterPattern: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

tableFilterPattern: Regex to only fetch tables that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

Processor Configuration

Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:

tableConfig: Allows you to set up configuration at the table level. {% /codeInfo %}

{% codeInfo srNumber=23 %}

Sink Configuration

To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest. {% /codeInfo %}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

source:
  type: {% $connector %}
  serviceName: {% $connector %}
  sourceConfig:
    config:
      type: Profiler
      # profileSample: 85
      # threadCount: 5
      # timeoutSeconds: 43200
      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4
      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4
      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4
processor:
  type: orm-profiler
  config: {}  # Remove braces if adding properties
    # tableConfig:
    #   - fullyQualifiedName: <table fqn>
    #     profileSample: <number between 0 and 99> # default will be 100 if omitted
    #     profileQuery: <query to use for sampling data for the profiler>
    #     columnConfig:
    #       excludeColumns:
    #         - <column name>
    #       includeColumns:
    #         - columnName: <column name>
    #           metrics:
    #             - MEAN
    #             - MEDIAN
    #             - ...
    #     partitionConfig:
    #       enablePartitioning: <set to true to use partitioning>
    #       partitionColumnName: <partition column name>
    #       partitionIntervalType: <TIME-UNIT, INTEGER-RANGE, INGESTION-TIME, COLUMN-VALUE>
    #       Pick one of the variations shown below
    #       ----'TIME-UNIT' or 'INGESTION-TIME'-------
    #       partitionInterval: <partition interval>
    #       partitionIntervalUnit: <YEAR, MONTH, DAY, HOUR>
    #       ------------'INTEGER-RANGE'---------------
    #       partitionIntegerRangeStart: <integer>
    #       partitionIntegerRangeEnd: <integer>
    #       -----------'COLUMN-VALUE'----------------
    #       partitionValues:
    #         - <value>
    #         - <value>

sink:
  type: metadata-rest
  config: {}

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}
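As an illustration of the source options described above, here is a minimal sketch of a filled-in source section. The connector type (mysql), the service name (local_mysql), and the schema and table patterns are hypothetical placeholders, not values from this guide. The numeric values are the ones from the commented sample. The fragment covers only the source block, so the processor, sink, and workflow configuration still need to be added as shown in the sample config.

```yaml
source:
  type: mysql                  # hypothetical connector type
  serviceName: local_mysql     # hypothetical; must match the name used in Metadata Ingestion
  sourceConfig:
    config:
      type: Profiler
      profileSample: 85        # sample value from the config above
      threadCount: 5           # threads used during metric computations
      timeoutSeconds: 43200    # 43200 seconds = 12 hours
      schemaFilterPattern:
        includes:
          - ^sales$            # hypothetical: only the schema named exactly "sales"
      tableFilterPattern:
        excludes:
          - .*_tmp$            # hypothetical: skip tables whose names end in "_tmp"
```

Since the filter patterns are regular expressions, anchoring them with ^ and $ is one way to keep a pattern from matching more names than intended.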

  • You can learn more about how to configure and run the Profiler Workflow to extract Profiler data and run Data Quality tests here
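To make the table-level options of the orm-profiler processor more concrete, the following sketch fills in the commented tableConfig template with two entries, each using one of the partitioning variations. All fully qualified names and column names are hypothetical. The sketch also assumes that metrics belongs to the same list item as columnName, so each included column carries its own list of metrics.

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      # Entry 1: time-based partitioning (TIME-UNIT variation)
      - fullyQualifiedName: local_mysql.default.sales.orders   # hypothetical table FQN
        profileSample: 50
        columnConfig:
          includeColumns:
            - columnName: amount          # hypothetical column
              metrics:
                - MEAN
                - MEDIAN
        partitionConfig:
          enablePartitioning: true
          partitionColumnName: created_at # hypothetical column
          partitionIntervalType: TIME-UNIT
          partitionInterval: 1
          partitionIntervalUnit: DAY
      # Entry 2: integer range partitioning (INTEGER-RANGE variation)
      - fullyQualifiedName: local_mysql.default.sales.customers # hypothetical table FQN
        partitionConfig:
          enablePartitioning: true
          partitionColumnName: customer_id # hypothetical column
          partitionIntervalType: INTEGER-RANGE
          partitionIntegerRangeStart: 1
          partitionIntegerRangeEnd: 10000
```

Note that config no longer uses the empty braces once properties are added under it.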

2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

metadata profile -c <path-to-yaml>

Note that instead of running ingest, we now use the profile command to select the Profiler workflow.

{% tilesContainer %}

{% tile title="Data Profiler" description="Find more information about the Data Profiler here" link="/how-to-guides/data-quality-observability/profiler/workflow" / %}

{% /tilesContainer %}