---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter**                 | **Description**                                                                                 | **Type**  | **Default Value**       |
|-------------------------------|-------------------------------------------------------------------------------------------------|-----------|-------------------------|
| `type`                        | Specifies the pipeline type.                                                                    | String    | `AutoClassification`    |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns.        | Object    | N/A                     |
| `schemaFilterPattern`         | Regex to fetch schemas matching the specified pattern.                                          | Object    | N/A                     |
| `tableFilterPattern`          | Regex to include or exclude tables matching the specified pattern.                              | Object    | N/A                     |
| `databaseFilterPattern`       | Regex to fetch databases matching the specified pattern.                                        | Object    | N/A                     |
| `includeViews`                | Option to include or exclude views during metadata ingestion.                                   | Boolean   | `true`                  |
| `useFqnForFiltering`          | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean   | `false`                 |
| `storeSampleData`             | Option to enable or disable storing sample data for each table.                                 | Boolean   | `true`                  |
| `enableAutoClassification`    | Enables automatic tagging of columns that might contain sensitive information.                  | Boolean   | `false`                 |
| `confidence`                  | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100.                  | Number    | `80`                    |
| `sampleDataCount`             | Number of sample rows to ingest when `storeSampleData` is enabled.                              | Integer   | `50`                    |

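The filter pattern parameters in the table are objects rather than bare regex strings. As a minimal sketch, assuming the `includes` and `excludes` lists that OpenMetadata filter patterns accept, a `sourceConfig` fragment could combine them as follows. The schema and table name patterns are hypothetical and only illustrate the shape.

```yaml
sourceConfig:
  config:
    type: AutoClassification
    includeViews: false          # process tables only, skip views
    schemaFilterPattern:
      includes:
        - '^sales_.*'            # hypothetical: only schemas whose name starts with "sales_"
    tableFilterPattern:
      excludes:
        - '.*_tmp$'              # hypothetical: skip tables whose name ends with "_tmp"
```

Quoting the patterns keeps YAML from interpreting regex characters as its own syntax.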
## Key Parameters Explained

### `enableAutoClassification`
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- When enabled, the workflow applies pattern recognition and tags columns based on predefined criteria.

### `confidence`
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.

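The trade-off above comes down to a single key. A minimal sketch, assuming `confidence` sits next to `enableAutoClassification` under `sourceConfig.config` as the parameter table suggests:

```yaml
sourceConfig:
  config:
    type: AutoClassification
    enableAutoClassification: true
    confidence: 90   # stricter than the default of 80: fewer false positives, more missed columns
```

Lowering the value toward `70` shifts the balance the other way, so it may be worth reviewing the applied tags after each change.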
### `storeSampleData`
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

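The two settings work as a pair. The fragment below is an illustrative sketch; the value `100` is arbitrary and only shows how to override the default of `50`.

```yaml
sourceConfig:
  config:
    type: AutoClassification
    storeSampleData: true
    sampleDataCount: 100   # illustrative value; the default is 50
```

Setting `storeSampleData` to `false` is the option to consider when sample rows should not be persisted, for example for tables that hold sensitive values.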
### `useFqnForFiltering`
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.

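With FQN filtering turned on, a pattern has to account for the service, database, and schema segments as well as the table name. A minimal sketch, reusing the names from this guide's sample workflow; the explicit `^`/`$` anchors and escaped dots are a defensive assumption, since the exact matching semantics are not described here:

```yaml
sourceConfig:
  config:
    type: AutoClassification
    useFqnForFiltering: true
    tableFilterPattern:
      includes:
        # Matched against service.database.schema.table, so literal dots are escaped
        - '^local_bigquery\.hello-world-1234\.super_schema\.abc$'
```

With `useFqnForFiltering: false`, the equivalent pattern would target only the raw table name, for example `'^abc$'`.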
## Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

## Workflow Execution

### To Execute the Auto Classification Workflow

1. **Create a Pipeline**
   - Configure the Auto Classification workflow as demonstrated in the sample YAML configuration above.

2. **Run the Ingestion Pipeline**
   - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**
   - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.

### Expected Outcomes

- **Automatic Tagging:**
  Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged when detection meets the configured `confidence` level.

- **Enhanced Visibility:**
  Gain improved visibility and classification of sensitive data within your databases.

- **Sample Data Integration:**
  Store sample data to provide better insights during profiling and testing workflows.