---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto-classification/external-workflow
---
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
## Pipeline Configuration Parameters

| **Parameter**                 | **Description**                                                                                 | **Type** | **Default Value**    |
|-------------------------------|-------------------------------------------------------------------------------------------------|----------|----------------------|
| `type`                        | Specifies the pipeline type.                                                                    | String   | `AutoClassification` |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns.        | Object   | N/A                  |
| `schemaFilterPattern`         | Regex to fetch schemas matching the specified pattern.                                          | Object   | N/A                  |
| `tableFilterPattern`          | Regex to include or exclude tables matching the specified pattern.                              | Object   | N/A                  |
| `databaseFilterPattern`       | Regex to fetch databases matching the specified pattern.                                        | Object   | N/A                  |
| `includeViews`                | Option to include or exclude views during metadata ingestion.                                   | Boolean  | `true`               |
| `useFqnForFiltering`          | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean  | `false`              |
| `storeSampleData`             | Option to enable or disable storing sample data for each table.                                 | Boolean  | `true`               |
| `enableAutoClassification`    | Enables automatic tagging of columns that might contain sensitive information.                  | Boolean  | `false`              |
| `confidence`                  | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100.                  | Number   | `80`                 |
| `sampleDataCount`             | Number of sample rows to ingest when `storeSampleData` is enabled.                              | Integer  | `50`                 |
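
As a rough illustration of how these parameters fit together, the sketch below builds the `sourceConfig.config` section as a plain Python dictionary, starting from the defaults listed in the table. The helper name `build_source_config` and its validation rules are assumptions made for this example; they are not part of the OpenMetadata API.

```python
from typing import Any, Dict

# Defaults copied from the parameter table above.
DEFAULTS: Dict[str, Any] = {
    "type": "AutoClassification",
    "includeViews": True,
    "useFqnForFiltering": False,
    "storeSampleData": True,
    "enableAutoClassification": False,
    "confidence": 80,
    "sampleDataCount": 50,
}

# Filter parameters have no default (N/A in the table) and are optional.
FILTER_KEYS = {
    "classificationFilterPattern",
    "schemaFilterPattern",
    "tableFilterPattern",
    "databaseFilterPattern",
}


def build_source_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS) - FILTER_KEYS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    if not 0 <= config["confidence"] <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if config["sampleDataCount"] < 0:
        raise ValueError("sampleDataCount must not be negative")
    return config


config = build_source_config(
    enableAutoClassification=True,
    confidence=85,
    tableFilterPattern={"includes": ["abc"]},
)
print(config["confidence"], config["storeSampleData"])  # 85 True
```

Keeping the defaults in one place makes it easier to see which values a given pipeline actually overrides.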
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
## Key Parameters Explained

### `enableAutoClassification`

- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`

- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
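
To make the trade-off concrete, the sketch below filters a set of made-up detection scores with two different thresholds. The column names and scores are invented for illustration, and the comparison `score >= confidence` is an assumption about how a threshold of this kind is typically applied, not a description of the PII processor's internals.

```python
from typing import Dict, List

# Hypothetical per-column scores (0-100), invented for this example.
scores: Dict[str, float] = {
    "email": 97.0,
    "full_name": 88.0,
    "notes": 74.0,
    "order_id": 35.0,
}


def columns_to_tag(scores: Dict[str, float], confidence: float) -> List[str]:
    """Return the columns whose score meets the threshold, in insertion order."""
    return [name for name, score in scores.items() if score >= confidence]


print(columns_to_tag(scores, confidence=90))  # ['email']
print(columns_to_tag(scores, confidence=70))  # ['email', 'full_name', 'notes']
```

With the stricter threshold only the strongest match is tagged; the looser one also picks up `notes`, which may or may not hold sensitive data.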
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
### `storeSampleData`

- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`

- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.
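
The difference between the two modes can be sketched with Python's `re` module. The function `is_included` is a hypothetical helper, and matching with `re.match` is an assumption made for this example; check the ingestion framework for the exact matching rules it applies.

```python
import re
from typing import List


def is_included(name: str, fqn: str, includes: List[str], use_fqn_for_filtering: bool) -> bool:
    """Test the include patterns against the FQN or the raw name (hypothetical helper)."""
    target = fqn if use_fqn_for_filtering else name
    return any(re.match(pattern, target) for pattern in includes)


fqn = "local_bigquery.hello-world-1234.super_schema.abc"
name = "abc"

# A pattern written for the raw table name matches only when FQN filtering is off.
print(is_included(name, fqn, ["^abc$"], use_fqn_for_filtering=False))  # True
print(is_included(name, fqn, ["^abc$"], use_fqn_for_filtering=True))   # False

# With FQN filtering on, the pattern has to account for service, database, and schema.
print(is_included(name, fqn, [r".*\.super_schema\.abc$"], use_fqn_for_filtering=True))  # True
```

In practice this means patterns usually need to be rewritten when `useFqnForFiltering` is switched on.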
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
## Auto Classification Workflow Execution

To execute the **Auto Classification Workflow**, follow the steps below:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
### 1. Install the Required Python Package

Ensure you have the correct OpenMetadata ingestion package installed, including the **PII Processor** module:

```bash
pip install "openmetadata-ingestion[pii-processor]"
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
### 2. Define and Execute the Python Workflow

Instead of relying only on a YAML configuration file, use the `AutoClassificationWorkflow` class from OpenMetadata to trigger the ingestion process programmatically.
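
A minimal sketch of what this can look like is shown below. It assembles a configuration dictionary with the same top-level keys as the sample configuration on this page (the `serviceConnection` block is omitted for brevity) and passes it to the workflow. The helper names are hypothetical, and the import path `metadata.workflow.classification` is an assumption that should be verified against the installed version of `openmetadata-ingestion`.

```python
import json
from typing import Any, Dict


def build_workflow_config(service_name: str, host_port: str, jwt_token: str) -> Dict[str, Any]:
    """Assemble a workflow config dict (hypothetical helper, placeholder values)."""
    return {
        "source": {
            "type": "bigquery",
            "serviceName": service_name,
            "sourceConfig": {
                "config": {
                    "type": "AutoClassification",
                    "storeSampleData": True,
                    "enableAutoClassification": True,
                }
            },
        },
        "processor": {"type": "orm-profiler", "config": {}},
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": host_port,
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": jwt_token},
            }
        },
    }


def run_auto_classification(config: Dict[str, Any]) -> None:
    """Create and run the workflow from a config dict."""
    # Imported lazily so this module loads even without the ingestion package.
    # Assumed import path; verify it against your installed version.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.stop()


if __name__ == "__main__":
    config = build_workflow_config("local_bigquery", "http://localhost:8585/api", "<jwt-token>")
    print(json.dumps(config["source"]["sourceConfig"], indent=2))
    # run_auto_classification(config)  # uncomment once the server and credentials are in place
```

Building the configuration in code makes it straightforward to inject secrets such as the JWT token from environment variables or a secrets manager rather than storing them in a file.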
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
#### Sample Auto Classification Workflow YAML
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
### 3. Expected Outcome

- Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
- Improves metadata enrichment and enhances data governance practices.
- Provides visibility into sensitive data across databases.

This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
{% partial file="/v1.9/connectors/yaml/auto-classification.md" variables={connector: "snowflake"} /%}
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
## Workflow Execution

### To Execute the Auto Classification Workflow:
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
1. **Create a Pipeline**
   - Configure the Auto Classification JSON as demonstrated in the provided configuration example.

2. **Run the Ingestion Pipeline**
   - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**
   - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.
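
Besides checking the UI, the validation step can be scripted once a table's metadata has been fetched as JSON. The sketch below works on a hand-written sample; the payload shape (a `columns` list whose entries carry `tags` with a `tagFQN` field) and the `PII.` prefix are assumptions loosely modeled on a table entity, and `tagged_columns` is a hypothetical helper.

```python
from typing import Any, Dict, List


def tagged_columns(table: Dict[str, Any], prefix: str = "PII.") -> Dict[str, List[str]]:
    """Map each column name to its tags that start with the given prefix."""
    result: Dict[str, List[str]] = {}
    for column in table.get("columns", []):
        tags = [
            tag["tagFQN"]
            for tag in column.get("tags") or []
            if tag.get("tagFQN", "").startswith(prefix)
        ]
        if tags:
            result[column["name"]] = tags
    return result


# Hand-written sample data, not real API output.
sample_table = {
    "name": "abc",
    "columns": [
        {"name": "email", "tags": [{"tagFQN": "PII.Sensitive", "labelType": "Automated"}]},
        {"name": "order_id", "tags": []},
    ],
}

print(tagged_columns(sample_table))  # {'email': ['PII.Sensitive']}
```

A check like this can run after each scheduled pipeline execution to flag tables where no sensitive columns were tagged.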
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
### Expected Outcomes

- **Automatic Tagging:** Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.

- **Enhanced Visibility:** Gain improved visibility and classification of sensitive data within your databases.

- **Sample Data Integration:** Store sample data to provide better insights during profiling and testing workflows.