---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto-classification/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter** | **Description** | **Type** | **Default Value** |
|---|---|---|---|
| `type` | Specifies the pipeline type. | String | `AutoClassification` |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
| `schemaFilterPattern` | Regex to fetch schemas matching the specified pattern. | Object | N/A |
| `tableFilterPattern` | Regex to include or exclude tables matching the specified pattern. | Object | N/A |
| `databaseFilterPattern` | Regex to fetch databases matching the specified pattern. | Object | N/A |
| `includeViews` | Option to include or exclude views during metadata ingestion. | Boolean | `true` |
| `useFqnForFiltering` | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | `false` |
| `storeSampleData` | Option to enable or disable storing sample data for each table. | Boolean | `true` |
| `enableAutoClassification` | Enables automatic tagging of columns that might contain sensitive information. | Boolean | `false` |
| `confidence` | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | `80` |
| `sampleDataCount` | Number of sample rows to ingest when Store Sample Data is enabled. | Integer | `50` |

## Key Parameters Explained

### `enableAutoClassification`

- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`

- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.

### `storeSampleData`

- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`

- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.

## Auto Classification Workflow Execution

To execute the **Auto Classification Workflow**, follow the steps below:

### 1. Install the Required Python Package

Ensure you have the correct OpenMetadata ingestion package installed, including the **PII Processor** module:

```bash
pip install "openmetadata-ingestion[pii-processor]"
```

### 2. Define and Execute the Python Workflow

Instead of running a YAML configuration through the CLI, use the `AutoClassificationWorkflow` class from OpenMetadata to trigger the ingestion process programmatically. The configuration passed to the workflow has the same structure as the sample YAML in the next section.
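A minimal sketch of the programmatic run described above. It assumes that `AutoClassificationWorkflow` is importable from `metadata.workflow.classification` (as in recent releases of `openmetadata-ingestion`) and that the service is already registered in OpenMetadata, so the `serviceConnection` block is left out; if that does not hold for your setup, add the connection details as in the sample configuration. The helper names `build_config` and `run_auto_classification` are hypothetical and only used for illustration.

```python
from typing import Any, Dict


def build_config(service_name: str, host_port: str, jwt_token: str) -> Dict[str, Any]:
    """Build a workflow config dict with the same layout as the YAML sample.

    Hypothetical helper: the values below are the documented defaults or
    placeholders, not a recommended production configuration.
    """
    return {
        "source": {
            "type": "bigquery",
            "serviceName": service_name,
            # Assumption: the service connection is already stored in
            # OpenMetadata, so no serviceConnection block is given here.
            "sourceConfig": {
                "config": {
                    "type": "AutoClassification",
                    "storeSampleData": True,
                    "enableAutoClassification": True,
                    "confidence": 80,
                    "sampleDataCount": 50,
                }
            },
        },
        "processor": {"type": "orm-profiler", "config": {}},
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": host_port,
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": jwt_token},
            }
        },
    }


def run_auto_classification(config: Dict[str, Any]) -> None:
    """Create and run the workflow from a config dict."""
    # Imported lazily so build_config can be used without the package installed.
    # Assumption: this import path matches your installed ingestion version.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()  # raise if any step reported a failure
    workflow.print_status()
    workflow.stop()
```

Calling `run_auto_classification(build_config("local_bigquery", "http://localhost:8585/api", "<jwt-token>"))` from a scheduler task would then trigger one run. Keeping the configuration in a plain dict makes it easy to load the same content from a YAML file instead.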
### Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

### 3. Expected Outcome

- Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
- Improves metadata enrichment and enhances data governance practices.
- Provides visibility into sensitive data across databases.

This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.

{% partial file="/v1.7/connectors/yaml/auto-classification.md" variables={connector: "snowflake"} /%}

## Workflow Execution

### To Execute the Auto Classification Workflow:

1. **Create a Pipeline**
   - Configure the Auto Classification JSON as demonstrated in the provided configuration example.
2. **Run the Ingestion Pipeline**
   - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.
3. **Validate Results**
   - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.

### Expected Outcomes

- **Automatic Tagging:** Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.
- **Enhanced Visibility:** Gain improved visibility and classification of sensitive data within your databases.
- **Sample Data Integration:** Store sample data to provide better insights during profiling and testing workflows.
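The **Validate Results** step can also be scripted rather than checked only in the UI. The sketch below assumes you already have a table entity as a JSON-like dict (for example, retrieved from the OpenMetadata API with columns and tags included) and that each column carries a `tags` list whose entries have a `tagFQN` field. The helper name `tagged_columns` and the example payload are hypothetical and hand-written for illustration, not output from a real run.

```python
from typing import Any, Dict, List


def tagged_columns(table: Dict[str, Any], classification: str = "PII") -> Dict[str, List[str]]:
    """Return {column name: [tag FQNs]} for tags under one classification.

    Assumption: the table dict has the shape
    {"columns": [{"name": ..., "tags": [{"tagFQN": ...}, ...]}, ...]}.
    """
    prefix = classification + "."
    result: Dict[str, List[str]] = {}
    for column in table.get("columns", []):
        # "tags" may be missing or None on untagged columns.
        matches = [
            tag.get("tagFQN", "")
            for tag in column.get("tags") or []
            if tag.get("tagFQN", "").startswith(prefix)
        ]
        if matches:
            result[column["name"]] = matches
    return result


# Hypothetical payload used only to show the expected input shape.
example_table = {
    "name": "abc",
    "columns": [
        {"name": "id", "tags": []},
        {"name": "email", "tags": [{"tagFQN": "PII.Sensitive"}]},
        {"name": "notes"},
    ],
}

print(tagged_columns(example_table))
```

Comparing this mapping before and after a run, or across different `confidence` values, gives a quick view of which columns the workflow tagged.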