Data Profiler
The Data Profiler workflow uses the orm-profiler processor.
After running a Metadata Ingestion workflow, we can run the Data Profiler workflow.
The serviceName must be the same as the one used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.
1. Define the YAML Config
This is a sample config for the profiler:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Source Config
You can find all the definitions and types for the sourceConfig here.
{% codeInfo srNumber=14 %}
profileSample: Percentage of data or number of rows on which to run the profiler and tests.
{% /codeInfo %}
{% codeInfo srNumber=15 %}
threadCount: Number of threads to use during metric computations.
{% /codeInfo %}
{% codeInfo srNumber=18 %}
timeoutSeconds: Profiler timeout, in seconds.
{% /codeInfo %}
{% codeInfo srNumber=19 %}
databaseFilterPattern: Regex to only fetch databases that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=20 %}
schemaFilterPattern: Regex to only fetch schemas that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=21 %}
tableFilterPattern: Regex to only fetch tables that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=22 %}
Processor Configuration
Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:
tableConfig: tableConfig allows you to set up some configuration at the table level.
{% /codeInfo %}
{% codeInfo srNumber=23 %}
Sink Configuration
To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest.
{% /codeInfo %}
{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: {% $connector %}
  serviceName: {% $connector %}
  sourceConfig:
    config:
      type: Profiler
      # profileSample: 85
      # threadCount: 5
      # timeoutSeconds: 43200
      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4
      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4
      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4
processor:
  type: orm-profiler
  config: {}  # Remove braces if adding properties
    # tableConfig:
    #   - fullyQualifiedName: <table fqn>
    #     profileSample: <number between 0 and 99> # default will be 100 if omitted
    #     profileQuery: <query to use for sampling data for the profiler>
    #     columnConfig:
    #       excludeColumns:
    #         - <column name>
    #       includeColumns:
    #         - columnName: <column name>
    #           metrics:
    #             - MEAN
    #             - MEDIAN
    #             - ...
    #     partitionConfig:
    #       enablePartitioning: <set to true to use partitioning>
    #       partitionColumnName: <partition column name>
    #       partitionIntervalType: <TIME-UNIT, INTEGER-RANGE, INGESTION-TIME, COLUMN-VALUE>
    #       Pick one of the variations shown below
    #       ----'TIME-UNIT' or 'INGESTION-TIME'-------
    #       partitionInterval: <partition interval>
    #       partitionIntervalUnit: <YEAR, MONTH, DAY, HOUR>
    #       ------------'INTEGER-RANGE'---------------
    #       partitionIntegerRangeStart: <integer>
    #       partitionIntegerRangeEnd: <integer>
    #       -----------'COLUMN-VALUE'----------------
    #       partitionValues:
    #         - <value>
    #         - <value>
sink:
  type: metadata-rest
  config: {}
{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
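As a rough illustration of how the `includes` and `excludes` lists in `databaseFilterPattern`, `schemaFilterPattern`, and `tableFilterPattern` might combine, here is a minimal Python sketch. The helper `is_selected` is hypothetical and not part of OpenMetadata. It assumes a name is kept when it matches at least one `includes` regex (or no `includes` are given) and matches no `excludes` regex; the matching rules actually used by the ingestion framework may differ.

```python
import re
from typing import Optional, Sequence


def is_selected(
    name: str,
    includes: Optional[Sequence[str]] = None,
    excludes: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical helper: decide whether `name` passes a filter pattern.

    Assumed semantics (not confirmed by the docs):
    - any matching `excludes` regex rejects the name;
    - if `includes` is given, at least one regex must match;
    - with no `includes`, everything not excluded is kept.
    """
    if excludes and any(re.search(p, name) for p in excludes):
        return False
    if includes:
        return any(re.search(p, name) for p in includes)
    return True


# Illustrative table names only.
tables = ["table1", "table2", "table3", "orders"]
selected = [
    t for t in tables
    if is_selected(t, includes=["^table[0-9]$"], excludes=["^table3$"])
]
print(selected)
```

Anchoring the patterns with `^` and `$` avoids accidental partial matches, which is worth checking whenever a filter selects more than expected.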
- To learn more about how to configure and run the Profiler Workflow to extract profiler data and run Data Quality tests, see the Data Profiler guide linked at the end of this page.
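The commented `partitionConfig` block in the sample config expects a different set of keys depending on `partitionIntervalType`. The following Python sketch captures that mapping using only the key names shown in the sample. The helper `missing_partition_keys` and the `created_at` column are hypothetical, and this does not reflect how OpenMetadata validates the config internally.

```python
from typing import Any, Dict, List

# Keys that accompany each partitionIntervalType, as listed in the sample config.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "TIME-UNIT": ["partitionInterval", "partitionIntervalUnit"],
    "INGESTION-TIME": ["partitionInterval", "partitionIntervalUnit"],
    "INTEGER-RANGE": ["partitionIntegerRangeStart", "partitionIntegerRangeEnd"],
    "COLUMN-VALUE": ["partitionValues"],
}


def missing_partition_keys(partition_config: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return the keys missing for the chosen interval type."""
    interval_type = partition_config.get("partitionIntervalType")
    if interval_type not in REQUIRED_KEYS:
        raise ValueError(f"Unknown partitionIntervalType: {interval_type!r}")
    return [k for k in REQUIRED_KEYS[interval_type] if k not in partition_config]


example = {
    "enablePartitioning": True,
    "partitionColumnName": "created_at",  # illustrative column name
    "partitionIntervalType": "TIME-UNIT",
    "partitionInterval": 1,
    "partitionIntervalUnit": "DAY",
}
print(missing_partition_keys(example))
```

An empty list means the example carries every key the sample config lists for its interval type.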
2. Run with the CLI
After saving the YAML config, we will run the command the same way we did for the metadata ingestion:
metadata profile -c <path-to-yaml>
Note that instead of running ingest, we now use the profile command to select the Profiler workflow.
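If the workflow needs to be triggered from a script rather than typed by hand, one option is to wrap the same command with Python's standard `subprocess` module. This is only a sketch: the function names and the `profiler.yaml` path are placeholders, and it assumes the `metadata` CLI is installed and available on `PATH`.

```python
import subprocess
from typing import List


def build_profile_command(config_path: str) -> List[str]:
    """Build the argument list for `metadata profile -c <path-to-yaml>`."""
    return ["metadata", "profile", "-c", config_path]


def run_profiler(config_path: str) -> int:
    """Run the Profiler workflow and return the process exit code.

    Assumes the `metadata` CLI is installed and on PATH; otherwise
    subprocess.run raises FileNotFoundError.
    """
    completed = subprocess.run(build_profile_command(config_path), check=False)
    return completed.returncode


if __name__ == "__main__":
    # "profiler.yaml" is a placeholder path, not a file from the docs.
    print(build_profile_command("profiler.yaml"))
```

Passing the arguments as a list avoids shell quoting issues when the config path contains spaces.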
{% tilesContainer %}
{% tile title="Data Profiler" description="Find more information about the Data Profiler here" link="/how-to-guides/data-quality-observability/profiler/workflow" / %}
{% /tilesContainer %}