mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-23 23:13:57 +00:00
455 lines
29 KiB
Markdown
455 lines
29 KiB
Markdown
# Classification
|
|
|
|
The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default.
|
|
|
|
## Config details
|
|
|
|
Note that a `.` is used to denote nested fields in the YAML recipe.
|
|
|
|
| Field | Required | Type | Description | Default |
|
|
| ------------------------- | -------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
|
|
| enabled | | boolean | Whether classification should be used to auto-detect glossary terms | False |
|
|
| sample_size | | int | Number of sample values used for classification. | 100 |
|
|
| max_workers | | int | Number of worker processes to use for classification. Set to 1 to disable. | Number of CPU cores |
|
|
| info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
|
|
| classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. | [{'type': 'datahub', 'config': None}] |
|
|
| table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.\*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
|
|
| table_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] |
|
|
| table_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
|
|
| table_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
|
|
| column_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
|
|
| column_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] |
|
|
| column_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
|
|
| column_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
|
|
|
|
## DataHub Classifier
|
|
|
|
DataHub Classifier is the default classifier implementation, which uses [acryl-datahub-classify](https://pypi.org/project/acryl-datahub-classify/) library to predict info types.
|
|
|
|
### Config Details
|
|
|
|
| Field | Required | Type | Description | Default |
|
|
| ------------------------------------------------------ | ------------------------------------------------------ | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
| confidence_level_threshold | | number | | 0.68 |
|
|
| strip_exclusion_formatting | | bool | A flag that determines whether the exclusion list uses exact matching or format stripping (case-insensitivity, punctuation removal, and special character removal). | True |
|
|
| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered, along with any custom infotypes configured in `info_types_config`. | None |
|
|
| info_types_config | Configuration details for infotypes | Dict[str, InfoTypeConfig] | | See [reference_input.py](https://github.com/acryldata/datahub-classify/blob/main/datahub-classify/src/datahub_classify/reference_input.py) for default configuration. |
|
|
| info_types_config.`key`.prediction_factors_and_weights | ❓ (required if info_types_config.`key` is set) | Dict[str,number] | Factors and their weights to consider when predicting info types | |
|
|
| info_types_config.`key`.exclude_name | | list[string] | Optional list of names to exclude from classification. | None |
|
|
| info_types_config.`key`.name | | NameFactorConfig (see below for fields) | | |
|
|
| info_types_config.`key`.name.regex | | Array of string | List of regex patterns the column name follows for the info type | ['.*'] |
|
|
| info_types_config.`key`.description | | DescriptionFactorConfig (see below for fields) | | |
|
|
| info_types_config.`key`.description.regex | | Array of string | List of regex patterns the column description follows for the info type | ['.*'] |
|
|
| info_types_config.`key`.datatype | | DataTypeFactorConfig (see below for fields) | | |
|
|
| info_types_config.`key`.datatype.type | | Array of string | List of data types for the info type | ['.*'] |
|
|
| info_types_config.`key`.values | | ValuesFactorConfig (see below for fields) | | |
|
|
| info_types_config.`key`.values.prediction_type | ❓ (required if info_types_config.`key`.values is set) | string | | None |
|
|
| info_types_config.`key`.values.regex | | Array of string | List of regex patterns the column value follows for the info type | None |
|
|
| info_types_config.`key`.values.library | | Array of string | Library used for prediction | None |
|
|
| minimum_values_threshold | | number | Minimum number of non-null column values required to process `values` prediction factor. | 50 |
|
|
| |
|
|
|
|
### Supported infotypes
|
|
|
|
- `Email_Address`
|
|
- `Gender`
|
|
- `Credit_Debit_Card_Number`
|
|
- `Phone_Number`
|
|
- `Street_Address`
|
|
- `Full_Name`
|
|
- `Age`
|
|
- `IBAN`
|
|
- `US_Social_Security_Number`
|
|
- `Vehicle_Identification_Number`
|
|
- `IP_Address_v4`
|
|
- `IP_Address_v6`
|
|
- `US_Driving_License_Number`
|
|
- `Swift_Code`
|
|
- Regex based Custom InfoTypes
|
|
|
|
## Supported sources
|
|
|
|
- All SQL sources
|
|
|
|
## Future Work
|
|
|
|
- Classification for nested columns (struct, array type)
|
|
|
|
## Examples
|
|
|
|
### Basic
|
|
|
|
```yml
|
|
source:
|
|
type: snowflake
|
|
config:
|
|
env: PROD
|
|
# Coordinates
|
|
account_id: account_name
|
|
warehouse: "COMPUTE_WH"
|
|
|
|
# Credentials
|
|
username: user
|
|
password: pass
|
|
role: "sysadmin"
|
|
|
|
# Options
|
|
top_n_queries: 10
|
|
email_domain: mycompany.com
|
|
|
|
classification:
|
|
enabled: True
|
|
classifiers:
|
|
- type: datahub
|
|
```
|
|
|
|
### Advanced Configuration: Customizing configuration for supported info types
|
|
|
|
```yml
|
|
source:
|
|
type: snowflake
|
|
config:
|
|
env: PROD
|
|
# Coordinates
|
|
account_id: account_name
|
|
warehouse: "COMPUTE_WH"
|
|
|
|
# Credentials
|
|
username: user
|
|
password: pass
|
|
role: "sysadmin"
|
|
|
|
# Options
|
|
top_n_queries: 10
|
|
email_domain: mycompany.com
|
|
|
|
classification:
|
|
enabled: True
|
|
info_type_to_term:
|
|
Email_Address: "Email"
|
|
classifiers:
|
|
- type: datahub
|
|
config:
|
|
confidence_level_threshold: 0.7
|
|
info_types_config:
|
|
Email_Address:
|
|
prediction_factors_and_weights:
|
|
name: 0.4
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.6
|
|
name:
|
|
regex:
|
|
- "^.*mail.*id.*$"
|
|
- "^.*id.*mail.*$"
|
|
- "^.*mail.*add.*$"
|
|
- "^.*add.*mail.*$"
|
|
- email
|
|
- mail
|
|
description:
|
|
regex:
|
|
- "^.*mail.*id.*$"
|
|
- "^.*mail.*add.*$"
|
|
- email
|
|
- mail
|
|
datatype:
|
|
type:
|
|
- str
|
|
values:
|
|
prediction_type: regex
|
|
regex:
|
|
- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}"
|
|
library: []
|
|
Gender:
|
|
prediction_factors_and_weights:
|
|
name: 0.4
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.6
|
|
name:
|
|
regex:
|
|
- "^.*gender.*$"
|
|
- "^.*sex.*$"
|
|
- gender
|
|
- sex
|
|
description:
|
|
regex:
|
|
- "^.*gender.*$"
|
|
- "^.*sex.*$"
|
|
- gender
|
|
- sex
|
|
datatype:
|
|
type:
|
|
- int
|
|
- str
|
|
values:
|
|
prediction_type: regex
|
|
regex:
|
|
- male
|
|
- female
|
|
- man
|
|
- woman
|
|
- m
|
|
- f
|
|
- w
|
|
- men
|
|
- women
|
|
library: []
|
|
Credit_Debit_Card_Number:
|
|
prediction_factors_and_weights:
|
|
name: 0.4
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.6
|
|
name:
|
|
regex:
|
|
- "^.*card.*number.*$"
|
|
- "^.*number.*card.*$"
|
|
- "^.*credit.*card.*$"
|
|
- "^.*debit.*card.*$"
|
|
description:
|
|
regex:
|
|
- "^.*card.*number.*$"
|
|
- "^.*number.*card.*$"
|
|
- "^.*credit.*card.*$"
|
|
- "^.*debit.*card.*$"
|
|
datatype:
|
|
type:
|
|
- str
|
|
- int
|
|
values:
|
|
prediction_type: regex
|
|
regex:
|
|
- "^4[0-9]{12}(?:[0-9]{3})?$"
|
|
- "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
|
|
- "^3[47][0-9]{13}$"
|
|
- "^3(?:0[0-5]|[68][0-9])[0-9]{11}$"
|
|
- "^6(?:011|5[0-9]{2})[0-9]{12}$"
|
|
- "^(?:2131|1800|35\\d{3})\\d{11}$"
|
|
- "^(6541|6556)[0-9]{12}$"
|
|
- "^389[0-9]{11}$"
|
|
- "^63[7-9][0-9]{13}$"
|
|
- "^9[0-9]{15}$"
|
|
- "^(6304|6706|6709|6771)[0-9]{12,15}$"
|
|
- "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$"
|
|
- "^(62[0-9]{14,17})$"
|
|
- "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$"
|
|
- "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$"
|
|
- "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$"
|
|
library: []
|
|
Phone_Number:
|
|
prediction_factors_and_weights:
|
|
name: 0.4
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.6
|
|
name:
|
|
regex:
|
|
- ".*phone.*(num|no).*"
|
|
- ".*(num|no).*phone.*"
|
|
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
|
|
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
|
|
- ".*mobile.*(num|no).*"
|
|
- ".*(num|no).*mobile.*"
|
|
- ".*telephone.*(num|no).*"
|
|
- ".*(num|no).*telephone.*"
|
|
- ".*cell.*(num|no).*"
|
|
- ".*(num|no).*cell.*"
|
|
- ".*contact.*(num|no).*"
|
|
- ".*(num|no).*contact.*"
|
|
- ".*landline.*(num|no).*"
|
|
- ".*(num|no).*landline.*"
|
|
- ".*fax.*(num|no).*"
|
|
- ".*(num|no).*fax.*"
|
|
- phone
|
|
- telephone
|
|
- landline
|
|
- mobile
|
|
- tel
|
|
- fax
|
|
- cell
|
|
- contact
|
|
description:
|
|
regex:
|
|
- ".*phone.*(num|no).*"
|
|
- ".*(num|no).*phone.*"
|
|
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
|
|
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
|
|
- ".*mobile.*(num|no).*"
|
|
- ".*(num|no).*mobile.*"
|
|
- ".*telephone.*(num|no).*"
|
|
- ".*(num|no).*telephone.*"
|
|
- ".*cell.*(num|no).*"
|
|
- ".*(num|no).*cell.*"
|
|
- ".*contact.*(num|no).*"
|
|
- ".*(num|no).*contact.*"
|
|
- ".*landline.*(num|no).*"
|
|
- ".*(num|no).*landline.*"
|
|
- ".*fax.*(num|no).*"
|
|
- ".*(num|no).*fax.*"
|
|
- phone
|
|
- telephone
|
|
- landline
|
|
- mobile
|
|
- tel
|
|
- fax
|
|
- cell
|
|
- contact
|
|
datatype:
|
|
type:
|
|
- int
|
|
- str
|
|
values:
|
|
prediction_type: library
|
|
regex: []
|
|
library:
|
|
- phonenumbers
|
|
Street_Address:
|
|
prediction_factors_and_weights:
|
|
name: 0.5
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.5
|
|
name:
|
|
regex:
|
|
- ".*street.*add.*"
|
|
- ".*add.*street.*"
|
|
- ".*full.*add.*"
|
|
- ".*add.*full.*"
|
|
- ".*mail.*add.*"
|
|
- ".*add.*mail.*"
|
|
- add[^a-z]+
|
|
- address
|
|
- street
|
|
description:
|
|
regex:
|
|
- ".*street.*add.*"
|
|
- ".*add.*street.*"
|
|
- ".*full.*add.*"
|
|
- ".*add.*full.*"
|
|
- ".*mail.*add.*"
|
|
- ".*add.*mail.*"
|
|
- add[^a-z]+
|
|
- address
|
|
- street
|
|
datatype:
|
|
type:
|
|
- str
|
|
values:
|
|
prediction_type: library
|
|
regex: []
|
|
library:
|
|
- spacy
|
|
Full_Name:
|
|
prediction_factors_and_weights:
|
|
name: 0.3
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.7
|
|
name:
|
|
regex:
|
|
- ".*person.*name.*"
|
|
- ".*name.*person.*"
|
|
- ".*user.*name.*"
|
|
- ".*name.*user.*"
|
|
- ".*full.*name.*"
|
|
- ".*name.*full.*"
|
|
- fullname
|
|
- name
|
|
- person
|
|
- user
|
|
description:
|
|
regex:
|
|
- ".*person.*name.*"
|
|
- ".*name.*person.*"
|
|
- ".*user.*name.*"
|
|
- ".*name.*user.*"
|
|
- ".*full.*name.*"
|
|
- ".*name.*full.*"
|
|
- fullname
|
|
- name
|
|
- person
|
|
- user
|
|
datatype:
|
|
type:
|
|
- str
|
|
values:
|
|
prediction_type: library
|
|
regex: []
|
|
library:
|
|
- spacy
|
|
Age:
|
|
prediction_factors_and_weights:
|
|
name: 0.65
|
|
description: 0
|
|
datatype: 0
|
|
values: 0.35
|
|
name:
|
|
regex:
|
|
- age[^a-z]+.*
|
|
- ".*[^a-z]+age"
|
|
- ".*[^a-z]+age[^a-z]+.*"
|
|
- age
|
|
description:
|
|
regex:
|
|
- age[^a-z]+.*
|
|
- ".*[^a-z]+age"
|
|
- ".*[^a-z]+age[^a-z]+.*"
|
|
- age
|
|
datatype:
|
|
type:
|
|
- int
|
|
values:
|
|
prediction_type: library
|
|
regex: []
|
|
library:
|
|
- rule_based_logic
|
|
```
|
|
|
|
### Advanced Configuration: Specifying Custom InfoType
|
|
|
|
```yml
|
|
source:
|
|
type: snowflake
|
|
config:
|
|
env: PROD
|
|
# Coordinates
|
|
account_id: account_name
|
|
warehouse: "COMPUTE_WH"
|
|
|
|
# Credentials
|
|
username: user
|
|
password: pass
|
|
role: "sysadmin"
|
|
|
|
# Options
|
|
top_n_queries: 10
|
|
email_domain: mycompany.com
|
|
|
|
classification:
|
|
enabled: True
|
|
classifiers:
|
|
- type: datahub
|
|
config:
|
|
confidence_level_threshold: 0.7
|
|
minimum_values_threshold: 10
|
|
info_types_config:
|
|
CloudRegion:
|
|
prediction_factors_and_weights:
|
|
name: 0
|
|
description: 0
|
|
datatype: 0
|
|
values: 1
|
|
values:
|
|
prediction_type: regex
|
|
regex:
|
|
- "(af|ap|ca|eu|me|sa|us)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)-\\d+"
|
|
library: []
|
|
```
|
|
|
|
## Additional Resources
|
|
|
|
### DataHub Blog
|
|
|
|
- [PII Classification just got easier with DataHub](https://medium.com/datahub-project/pii-classification-just-got-easier-with-datahub-6bab2b63abcb)
|