OpenMetadata/openmetadata-docs/content/v1.6.x/how-to-guides/data-governance/classification/Auto Classification/external-workflow.md

---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter**                | **Description**                                                                 | **Type**  | **Default Value**      |
|-------------------------------|---------------------------------------------------------------------------------|-----------|-------------------------|
| `type`                       | Specifies the pipeline type.                                                    | String    | `AutoClassification`    |
| `classificationFilterPattern`| Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object    | N/A                     |
| `schemaFilterPattern`         | Regex to fetch schemas matching the specified pattern.                         | Object    | N/A                     |
| `tableFilterPattern`          | Regex to include or exclude tables matching the specified pattern.             | Object    | N/A                     |
| `databaseFilterPattern`       | Regex to fetch databases matching the specified pattern.                       | Object    | N/A                     |
| `includeViews`                | Option to include or exclude views during metadata ingestion.                  | Boolean   | `true`                  |
| `useFqnForFiltering`          | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean   | `false`                 |
| `storeSampleData`             | Option to enable or disable storing sample data for each table.                | Boolean   | `true`                  |
| `enableAutoClassification`    | Enables automatic tagging of columns that might contain sensitive information. | Boolean   | `false`                 |
| `confidence`                  | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number    | `80`                    |
| `sampleDataCount`             | Number of sample rows to ingest when `storeSampleData` is enabled.             | Integer   | `50`                    |

## Key Parameters Explained

### `enableAutoClassification`
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
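
The trade-off above can be illustrated with a minimal sketch. It assumes each column receives a detection score on the same 0 to 100 scale as `confidence` and is tagged when the score meets the threshold. The scores, column names, and the `columns_to_tag` helper are hypothetical and are not part of OpenMetadata.

```python
from typing import Dict, List

# Hypothetical per-column detection scores on a 0-100 scale (made-up values).
scores = {
    "email": 95.0,
    "full_name": 85.0,
    "notes": 72.0,
    "order_id": 10.0,
}


def columns_to_tag(scores: Dict[str, float], confidence: float) -> List[str]:
    """Return the columns whose score meets or exceeds the confidence threshold."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    return sorted(name for name, score in scores.items() if score >= confidence)


strict = columns_to_tag(scores, confidence=90)   # fewer tags, fewer false positives
default = columns_to_tag(scores, confidence=80)  # the documented default value
lenient = columns_to_tag(scores, confidence=70)  # more tags, more false positives
```

With these made-up scores, raising the threshold to `90` tags only `email`, while lowering it to `70` also picks up the free-text `notes` column, which may or may not hold sensitive data.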

### `storeSampleData`
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.
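
The following sketch shows why a pattern written for raw table names may stop matching once `useFqnForFiltering` is enabled. It assumes patterns are matched from the start of the name, as Python's `re.match` does. The helpers `is_selected` and `table_selected` are hypothetical illustrations, not OpenMetadata functions, and the real matching rules may differ.

```python
import re
from typing import List, Optional


def is_selected(name: str,
                includes: Optional[List[str]] = None,
                excludes: Optional[List[str]] = None) -> bool:
    """Hypothetical helper: drop a name matching any exclude pattern,
    otherwise keep it if no includes are given or one of them matches."""
    includes = includes or []
    excludes = excludes or []
    if any(re.match(pattern, name) for pattern in excludes):
        return False
    if includes:
        return any(re.match(pattern, name) for pattern in includes)
    return True


def table_selected(table_name: str, fqn: str, use_fqn_for_filtering: bool,
                   includes: Optional[List[str]] = None,
                   excludes: Optional[List[str]] = None) -> bool:
    """Pick the string the patterns are tested against, then apply the filter."""
    target = fqn if use_fqn_for_filtering else table_name
    return is_selected(target, includes, excludes)


fqn = "local_bigquery.hello-world-1234.super_schema.abc"

# Raw-name filtering: the pattern is tested against "abc".
raw_match = table_selected("abc", fqn, False, includes=["^abc$"])
# FQN filtering: the same pattern is tested against the full dotted name.
fqn_miss = table_selected("abc", fqn, True, includes=["^abc$"])
# An FQN-aware pattern accounts for the service, database, and schema prefix.
fqn_match = table_selected("abc", fqn, True, includes=[r".*\.super_schema\.abc$"])
```

Under these assumptions, `^abc$` selects the table only when raw names are used. With FQN filtering, the pattern has to cover the `service_name.db_name.schema_name.` prefix as well.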

## Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes: 
          - hello-world-1234
      schemaFilterPattern:
        includes: 
          - super_schema
      tableFilterPattern:
        includes: 
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
#  loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

## Workflow Execution

### To Execute the Auto Classification Workflow:

1. **Create a Pipeline**  
   - Define the Auto Classification workflow configuration as shown in the sample YAML above.

2. **Run the Ingestion Pipeline**  
   - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**  
   - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.
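
Before triggering the pipeline, a small pre-flight check on the `sourceConfig.config` block can catch obvious mistakes. The sketch below is a hypothetical helper (`check_source_config` is not an OpenMetadata function) that only checks the types, the default, and the range listed in the parameter table. It takes an already parsed dictionary and does not replace the validation the ingestion framework performs.

```python
from typing import Any, Dict, List

BOOLEAN_KEYS = ("includeViews", "useFqnForFiltering",
                "storeSampleData", "enableAutoClassification")


def check_source_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if config.get("type") != "AutoClassification":
        problems.append("type must be 'AutoClassification'")

    confidence = config.get("confidence", 80)  # documented default
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 100:
        problems.append("confidence must be a number between 0 and 100")

    for key in BOOLEAN_KEYS:
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key} must be a boolean")
    return problems


# Mirrors the sourceConfig.config block of the sample workflow.
good = {
    "type": "AutoClassification",
    "storeSampleData": True,
    "enableAutoClassification": True,
    "tableFilterPattern": {"includes": ["abc"]},
}
# Deliberately broken values for illustration.
bad = {"type": "Profiler", "confidence": 120, "storeSampleData": "yes"}

good_problems = check_source_config(good)
bad_problems = check_source_config(bad)
```

Here `good_problems` is empty, while `bad_problems` reports the wrong pipeline type, the out-of-range `confidence`, and the non-boolean `storeSampleData`.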

### Expected Outcomes

- **Automatic Tagging:**  
  Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged when the detection confidence meets the configured `confidence` threshold.

- **Enhanced Visibility:** 
  Gain improved visibility and classification of sensitive data within your databases.

- **Sample Data Integration:**  
  Store sample data to provide better insights during profiling and testing workflows.

Docs: Auto Classification Doc Addition (#19148) Co-authored-by: Rounak Dhillon <rounakdhillon@Rounaks-MacBook-Air.local> 2025-01-02 13:30:54 +05:30