OpenMetadata/auto-classification.md at fd4efb0c1e7d1e55a0e55a1bec0a1be17953b51d

mirror of https://github.com/open-metadata/OpenMetadata.git synced 2025-12-07 13:05:09 +00:00

Pere Miquel Brull 613fd331e0

MINOR - Clean up configs & add auto classification docs (#18907 )

* MINOR - Clean up configs & add auto classification docs

* deprecation notice

2024-12-04 09:32:25 +01:00

Auto Classification

The Auto Classification workflow uses the orm-profiler processor.

After running a Metadata Ingestion workflow, we can run the Auto Classification workflow. The serviceName must be the same one that was used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.
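The serviceName requirement can be illustrated with a pair of config fragments. This is a minimal sketch: the service name mysql_prod and the connector type mysql are hypothetical, the two fragments belong to two separate files, and the connection and workflow settings are left out.

```yaml
# Fragment of the Metadata Ingestion config (run first).
# "mysql_prod" is a hypothetical service name.
source:
  type: mysql
  serviceName: mysql_prod
  sourceConfig:
    config:
      type: DatabaseMetadata
---
# Fragment of the Auto Classification config (run afterwards).
# serviceName matches the one above, so the connection details
# already stored on the server can be looked up by name.
source:
  type: mysql
  serviceName: mysql_prod
  sourceConfig:
    config:
      type: AutoClassification
```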

1. Define the YAML Config

This is a sample config for the Auto Classification Workflow:

{% codePreview %}

{% codeInfoContainer %}

Source Configuration - Source Config

You can find all the definitions and types for the sourceConfig here.

{% codeInfo srNumber=14 %}

storeSampleData: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

enableAutoClassification: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=18 %}

confidence: Confidence threshold at which a column is tagged as PII. The value ranges from 0 to 100. A higher number yields fewer false positives but more false negatives. A lower number yields more false positives but fewer false negatives.

{% /codeInfo %}

{% codeInfo srNumber=19 %}

databaseFilterPattern: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

schemaFilterPattern: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

tableFilterPattern: Regex to only fetch tables that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

Processor Configuration

Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:

tableConfig: Allows you to set up configuration at the table level. {% /codeInfo %}

{% codeInfo srNumber=23 %}

Sink Configuration

To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest. {% /codeInfo %}

{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}

source:
  type: {% $connector %}
  serviceName: {% $connector %}
  sourceConfig:
    config:
      type: AutoClassification

      # storeSampleData: true

      # enableAutoClassification: true

      # confidence: 80

      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4

      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4

      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4

processor:
  type: orm-profiler
  config: {}

sink:
  type: metadata-rest
  config: {}

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}
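As an illustration, the commented-out options in the sample config could be filled in as follows. This is a sketch rather than a recommended setup: the connector type, the service name, and the regex patterns are hypothetical, and the workflowConfig section is omitted here even though a real run still needs it.

```yaml
source:
  type: mysql                 # hypothetical connector type
  serviceName: mysql_prod     # hypothetical; must match the Metadata Ingestion service
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: false
      enableAutoClassification: true
      confidence: 80
      schemaFilterPattern:
        includes:
          - '^sales_.*'       # hypothetical pattern: schemas starting with "sales_"
      tableFilterPattern:
        excludes:
          - '.*_tmp$'         # hypothetical pattern: tables ending in "_tmp"

processor:
  type: orm-profiler
  config: {}

sink:
  type: metadata-rest
  config: {}

# workflowConfig: omitted in this sketch
```

Under these assumptions, the workflow would consider only schemas whose names match ^sales_.*, skip tables whose names match .*_tmp$, and use a confidence threshold of 80 when tagging columns as PII.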

2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

metadata classify -c <path-to-yaml>

Note that instead of running ingest, we now use the classify command to select the Auto Classification workflow.
