Auto Classification
The Auto Classification workflow uses the orm-profiler processor.
After running a Metadata Ingestion workflow, we can run the Auto Classification workflow.
The serviceName must be the same one that was used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.
1. Define the YAML Config
This is a sample config for the Auto Classification Workflow:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Source Config
You can find all the definitions and types for the sourceConfig here.
{% codeInfo srNumber=14 %}
storeSampleData: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.
{% /codeInfo %}
{% codeInfo srNumber=15 %}
enableAutoClassification: Optional configuration to automatically tag columns that might contain sensitive information.
{% /codeInfo %}
{% codeInfo srNumber=18 %}
confidence: Set the confidence value at which you want a column to be tagged as PII. The value ranges from 0 to 100. A higher number yields fewer false positives but more false negatives. A lower number yields more false positives but fewer false negatives.
{% /codeInfo %}
{% codeInfo srNumber=19 %}
databaseFilterPattern: Regex to only fetch databases that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=20 %}
schemaFilterPattern: Regex to only fetch schemas that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=21 %}
tableFilterPattern: Regex to only fetch tables that match the pattern.
{% /codeInfo %}
{% codeInfo srNumber=22 %}
Processor Configuration
Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:
tableConfig: tableConfig allows you to set up some configuration at the table level.
{% /codeInfo %}
{% codeInfo srNumber=23 %}
Sink Configuration
To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest.
{% /codeInfo %}
{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: {% $connector %}
  serviceName: {% $connector %}
  sourceConfig:
    config:
      type: AutoClassification
      # storeSampleData: true
      # enableAutoClassification: true
      # confidence: 80
      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4
      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4
      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
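To illustrate how the optional settings combine, the fragment below is a minimal sketch of a source section with the classification options switched on. The connector type `mysql`, the service name `my_mysql_service`, and every regex pattern are hypothetical placeholders rather than values from this guide; replace them with the ones from your own Metadata Ingestion workflow.

```yaml
source:
  type: mysql                     # assumption: example connector type
  serviceName: my_mysql_service   # hypothetical; must match the name used in Metadata Ingestion
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true            # ingest sample data for each table
      enableAutoClassification: true   # tag columns that might contain sensitive information
      confidence: 80                   # 0 to 100; higher means fewer false positives
      schemaFilterPattern:
        includes:
          - '^sales$'                  # hypothetical schema name
      tableFilterPattern:
        includes:
          - '^customer_.*'             # hypothetical: tables starting with customer_
        excludes:
          - '.*_tmp$'                  # hypothetical: skip tables ending in _tmp
```

This shows only the source section; the processor, sink, and workflow configuration stay as in the sample config above. Because the filter values are regular expressions, anchoring them with `^` and `$` keeps a pattern such as `sales` from also matching `sales_archive`.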
2. Run with the CLI
After saving the YAML config, we run the command the same way we did for the metadata ingestion:
metadata classify -c <path-to-yaml>
Note that instead of running ingest, we now use the classify command to select the Auto Classification workflow.