
title | slug |
---|---|
External Auto Classification Workflow | /how-to-guides/data-governance/classification/auto-classification/external-workflow |
Auto Classification Workflow Configuration
The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the Service Classification Pipeline JSON.
Pipeline Configuration Parameters
Parameter | Description | Type | Default Value |
---|---|---|---|
type | Specifies the pipeline type. | String | AutoClassification |
classificationFilterPattern | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
schemaFilterPattern | Regex to fetch schemas matching the specified pattern. | Object | N/A |
tableFilterPattern | Regex to exclude tables matching the specified pattern. | Object | N/A |
databaseFilterPattern | Regex to fetch databases matching the specified pattern. | Object | N/A |
includeViews | Option to include or exclude views during metadata ingestion. | Boolean | true |
useFqnForFiltering | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | false |
storeSampleData | Option to enable or disable storing sample data for each table. | Boolean | true |
enableAutoClassification | Enables automatic tagging of columns that might contain sensitive information. | Boolean | false |
confidence | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | 80 |
sampleDataCount | Number of sample rows to ingest when storeSampleData is enabled. | Integer | 50 |
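As a rough illustration, the documented defaults can be mirrored in a small Python structure that produces the `sourceConfig.config` section. The class name `AutoClassificationConfig` and its `to_dict` helper are hypothetical and are not part of the OpenMetadata API; only the field names, defaults, and the 0 to 100 confidence range come from the table above.

```python
from dataclasses import dataclass, asdict


@dataclass
class AutoClassificationConfig:
    """Hypothetical mirror of the documented defaults; not an OpenMetadata class."""

    type: str = "AutoClassification"
    includeViews: bool = True
    useFqnForFiltering: bool = False
    storeSampleData: bool = True
    enableAutoClassification: bool = False
    confidence: float = 80
    sampleDataCount: int = 50

    def __post_init__(self) -> None:
        # The table documents confidence as a value between 0 and 100.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")

    def to_dict(self) -> dict:
        # Plain dict that can be placed under sourceConfig.config.
        return asdict(self)


# Override only the values that differ from the defaults.
config = AutoClassificationConfig(enableAutoClassification=True, confidence=90)
source_config = {"config": config.to_dict()}
```

The filter pattern parameters are left out of this sketch because they have no default value in the table.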
Key Parameters Explained
enableAutoClassification
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.
confidence
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
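The trade-off between a high and a low threshold can be sketched with a toy filter. This is not how OpenMetadata scores columns internally: the function `columns_to_tag`, the 0 to 100 score scale, and the example scores are all assumptions made up for illustration.

```python
from typing import Dict, List


def columns_to_tag(scores: Dict[str, float], confidence: float) -> List[str]:
    """Return the columns whose score meets the confidence threshold.

    `scores` maps column name -> hypothetical detection score on a 0-100 scale.
    """
    return sorted(name for name, score in scores.items() if score >= confidence)


# Made-up scores for illustration only.
example_scores = {"email": 95.0, "full_name": 82.0, "notes": 72.0, "order_id": 10.0}

# A strict threshold keeps only the most certain match.
strict = columns_to_tag(example_scores, confidence=90)

# A lenient threshold also tags weaker candidates such as free-text notes.
lenient = columns_to_tag(example_scores, confidence=70)
```

With these invented scores, the strict threshold skips `full_name` (a likely miss), while the lenient one also tags `notes` (a possible false positive).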
storeSampleData
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.
useFqnForFiltering
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.
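The practical effect of this flag is that a pattern written for a raw table name may stop matching once it is evaluated against the FQN. The sketch below shows that with Python's `re` module; the helper `is_included` is hypothetical, and the use of `re.match` is an assumption of this example, not a statement about the ingestion framework's internals.

```python
import re
from typing import List


def is_included(includes: List[str], table_name: str, fqn: str, use_fqn: bool) -> bool:
    """Check a table against `includes` regexes, using the FQN or the raw name."""
    target = fqn if use_fqn else table_name
    return any(re.match(pattern, target) for pattern in includes)


fqn = "local_bigquery.hello-world-1234.super_schema.abc"

# Anchored raw-name pattern: matches the table name "abc" ...
raw_match = is_included(["^abc$"], "abc", fqn, use_fqn=False)

# ... but no longer matches once the FQN is the filter target.
fqn_miss = is_included(["^abc$"], "abc", fqn, use_fqn=True)

# A pattern written with the FQN in mind matches again.
fqn_match = is_included([r".*\.super_schema\.abc$"], "abc", fqn, use_fqn=True)
```

When switching `useFqnForFiltering` to `true`, existing patterns are worth reviewing for this reason.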
Auto Classification Workflow Execution
To execute the Auto Classification Workflow, follow the steps below:
1. Install the Required Python Package
Ensure you have the correct OpenMetadata ingestion package installed, including the PII Processor module:
pip install "openmetadata-ingestion[pii-processor]"
2. Define and Execute the Python Workflow
Use the AutoClassificationWorkflow from the OpenMetadata ingestion package to trigger the process programmatically. The workflow takes a configuration with the same structure as the sample YAML below.
Sample Auto Classification Workflow YAML
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc
processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
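A minimal sketch of running a configuration like the one above from Python follows. It builds a trimmed-down version of the sample as a dict (the `serviceConnection` block is omitted for brevity and would need to be added as in the sample) and hands it to `AutoClassificationWorkflow`. The helper names `build_workflow_config` and `run_auto_classification` are hypothetical, and the import path `metadata.workflow.classification` and the method sequence are assumptions that should be checked against the installed `openmetadata-ingestion` version.

```python
from typing import Any, Dict


def build_workflow_config(jwt_token: str) -> Dict[str, Any]:
    """Build a trimmed-down version of the sample configuration as a dict."""
    return {
        "source": {
            "type": "bigquery",
            "serviceName": "local_bigquery",
            "sourceConfig": {
                "config": {
                    "type": "AutoClassification",
                    "storeSampleData": True,
                    "enableAutoClassification": True,
                    "tableFilterPattern": {"includes": ["abc"]},
                }
            },
        },
        "processor": {"type": "orm-profiler", "config": {}},
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": "http://localhost:8585/api",
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": jwt_token},
            }
        },
    }


def run_auto_classification(config: Dict[str, Any]) -> None:
    """Run the workflow; requires openmetadata-ingestion[pii-processor]."""
    # Imported lazily so this module loads even without the package installed.
    # Assumed import path; verify it against your installed version.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()  # surface failures as exceptions
    workflow.print_status()
    workflow.stop()


config = build_workflow_config(jwt_token="<your-jwt-token>")
# run_auto_classification(config)  # uncomment once the package is installed
```

Keeping the token as a function argument, instead of hard-coding it in the file, makes it easier to read it from an environment variable or a secrets store.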
3. Expected Outcome
- Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
- Improves metadata enrichment and enhances data governance practices.
- Provides visibility into sensitive data across databases.

This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.
{% partial file="/v1.7/connectors/yaml/auto-classification.md" variables={connector: "snowflake"} /%}
Workflow Execution
To execute the Auto Classification Workflow:
- Create a Pipeline
  - Configure the Auto Classification JSON as demonstrated in the provided configuration example.
- Run the Ingestion Pipeline
  - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.
- Validate Results
  - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.
Expected Outcomes
- Automatic Tagging: Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.
- Enhanced Visibility: Gain improved visibility and classification of sensitive data within your databases.
- Sample Data Integration: Store sample data to provide better insights during profiling and testing workflows.