# Classification

The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default.

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Type | Description | Default |
| --- | --- | --- | --- | --- |
| enabled | | boolean | Whether classification should be used to auto-detect glossary terms. | False |
| sample_size | | int | Number of sample values used for classification. | 100 |
| max_workers | | int | Number of worker processes to use for classification. Set to 1 to disable. | Number of CPU cores |
| info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
| classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence. | [{'type': 'datahub', 'config': None}] |
| table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in the parent config. Specify a regex to match the entire table name in `database.schema.table` format, e.g. to match all tables starting with customer in the Customer database and public schema, use the regex `Customer.public.customer.*`. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| table_pattern.allow | | Array of string | List of regex patterns to include in ingestion. | ['.*'] |
| table_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
| table_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
| column_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter columns for classification. This is used in combination with other patterns in the parent config. Specify a regex to match the column name in `database.schema.table.column` format. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| column_pattern.allow | | Array of string | List of regex patterns to include in ingestion. | ['.*'] |
| column_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
| column_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |

## DataHub Classifier

DataHub Classifier is the default classifier implementation, which uses the [acryl-datahub-classify](https://pypi.org/project/acryl-datahub-classify/) library to predict info types.

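
As a minimal sketch of how the top-level options from the config details table fit together with this default classifier, the fragment below (placed under `source.config` in a recipe) restricts classification to the tables named in the `table_pattern` description and skips one column. The `sample_size` value and the `internal_notes` column name are illustrative assumptions, not recommended settings.

```yml
classification:
  enabled: True
  sample_size: 200 # assumption: illustrative value; the documented default is 100
  table_pattern:
    allow:
      # tables starting with "customer" in the Customer database, public schema
      - "Customer.public.customer.*"
  column_pattern:
    deny:
      # hypothetical column, written in database.schema.table.column format
      - ".*\\.internal_notes"
  classifiers:
    - type: datahub
```

Because `ignoreCase` defaults to True for both patterns, the allow entry should not need separate variants for differently cased table names.
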
### Config Details

| Field | Required | Type | Description | Default |
| --- | --- | --- | --- | --- |
| confidence_level_threshold | | number | | 0.68 |
| strip_exclusion_formatting | | bool | A flag that determines whether the exclusion list uses exact matching or format stripping (case-insensitivity, punctuation removal, and special character removal). | True |
| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered, along with any custom infotypes configured in `info_types_config`. | None |
| info_types_config | | Dict[str, InfoTypeConfig] | Configuration details for infotypes. | See [reference_input.py](https://github.com/acryldata/datahub-classify/blob/main/datahub-classify/src/datahub_classify/reference_input.py) for default configuration. |
| info_types_config.`key`.prediction_factors_and_weights | ❓ (required if info_types_config.`key` is set) | Dict[str,number] | Factors and their weights to consider when predicting info types. | |
| info_types_config.`key`.exclude_name | | list[string] | Optional list of names to exclude from classification. | None |
| info_types_config.`key`.name | | NameFactorConfig (see below for fields) | | |
| info_types_config.`key`.name.regex | | Array of string | List of regex patterns the column name follows for the info type. | ['.*'] |
| info_types_config.`key`.description | | DescriptionFactorConfig (see below for fields) | | |
| info_types_config.`key`.description.regex | | Array of string | List of regex patterns the column description follows for the info type. | ['.*'] |
| info_types_config.`key`.datatype | | DataTypeFactorConfig (see below for fields) | | |
| info_types_config.`key`.datatype.type | | Array of string | List of data types for the info type. | ['.*'] |
| info_types_config.`key`.values | | ValuesFactorConfig (see below for fields) | | |
| info_types_config.`key`.values.prediction_type | ❓ (required if info_types_config.`key`.values is set) | string | | None |
| info_types_config.`key`.values.regex | | Array of string | List of regex patterns the column value follows for the info type. | None |
| info_types_config.`key`.values.library | | Array of string | Library used for prediction. | None |
| minimum_values_threshold | | number | Minimum number of non-null column values required to process the `values` prediction factor. | 50 |

### Supported infotypes

- `Email_Address`
- `Gender`
- `Credit_Debit_Card_Number`
- `Phone_Number`
- `Street_Address`
- `Full_Name`
- `Age`
- `IBAN`
- `US_Social_Security_Number`
- `Vehicle_Identification_Number`
- `IP_Address_v4`
- `IP_Address_v6`
- `US_Driving_License_Number`
- `Swift_Code`
- Regex based Custom InfoTypes

## Supported sources

- All SQL sources

## Future Work

- Classification for nested columns (struct, array type)

## Examples

### Basic

```yml
source:
  type: snowflake
  config:
    env: PROD
    # Coordinates
    account_id: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10
    email_domain: mycompany.com

    classification:
      enabled: True
      classifiers:
        - type: datahub
```

### Advanced Configuration: Customizing configuration for supported info types

```yml
source:
  type: snowflake
  config:
    env: PROD
    # Coordinates
    account_id: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10
    email_domain: mycompany.com

    classification:
      enabled: True
      info_type_to_term:
        Email_Address: "Email"
      classifiers:
        - type: datahub
          config:
            confidence_level_threshold: 0.7
            info_types_config:
              Email_Address:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*mail.*id.*$"
                    - "^.*id.*mail.*$"
                    - "^.*mail.*add.*$"
                    - "^.*add.*mail.*$"
                    - email
                    - mail
                description:
                  regex:
                    - "^.*mail.*id.*$"
                    - "^.*mail.*add.*$"
                    - email
                    - mail
                datatype:
                  type:
                    - str
                values:
                  prediction_type: regex
                  regex:
                    - "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}"
                  library: []
              Gender:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*gender.*$"
                    - "^.*sex.*$"
                    - gender
                    - sex
                description:
                  regex:
                    - "^.*gender.*$"
                    - "^.*sex.*$"
                    - gender
                    - sex
                datatype:
                  type:
                    - int
                    - str
                values:
                  prediction_type: regex
                  regex:
                    - male
                    - female
                    - man
                    - woman
                    - m
                    - f
                    - w
                    - men
                    - women
                  library: []
              Credit_Debit_Card_Number:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*card.*number.*$"
                    - "^.*number.*card.*$"
                    - "^.*credit.*card.*$"
                    - "^.*debit.*card.*$"
                description:
                  regex:
                    - "^.*card.*number.*$"
                    - "^.*number.*card.*$"
                    - "^.*credit.*card.*$"
                    - "^.*debit.*card.*$"
                datatype:
                  type:
                    - str
                    - int
                values:
                  prediction_type: regex
                  regex:
                    - "^4[0-9]{12}(?:[0-9]{3})?$"
                    - "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
                    - "^3[47][0-9]{13}$"
                    - "^3(?:0[0-5]|[68][0-9])[0-9]{11}$"
                    - "^6(?:011|5[0-9]{2})[0-9]{12}$"
                    - "^(?:2131|1800|35\\d{3})\\d{11}$"
                    - "^(6541|6556)[0-9]{12}$"
                    - "^389[0-9]{11}$"
                    - "^63[7-9][0-9]{13}$"
                    - "^9[0-9]{15}$"
                    - "^(6304|6706|6709|6771)[0-9]{12,15}$"
                    - "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$"
                    - "^(62[0-9]{14,17})$"
                    - "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$"
                    - "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$"
                    - "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$"
                  library: []
              Phone_Number:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - ".*phone.*(num|no).*"
                    - ".*(num|no).*phone.*"
                    - ".*[^a-z]+ph[^a-z]+.*(num|no).*"
                    - ".*(num|no).*[^a-z]+ph[^a-z]+.*"
                    - ".*mobile.*(num|no).*"
                    - ".*(num|no).*mobile.*"
                    - ".*telephone.*(num|no).*"
                    - ".*(num|no).*telephone.*"
                    - ".*cell.*(num|no).*"
                    - ".*(num|no).*cell.*"
                    - ".*contact.*(num|no).*"
                    - ".*(num|no).*contact.*"
                    - ".*landline.*(num|no).*"
                    - ".*(num|no).*landline.*"
                    - ".*fax.*(num|no).*"
                    - ".*(num|no).*fax.*"
                    - phone
                    - telephone
                    - landline
                    - mobile
                    - tel
                    - fax
                    - cell
                    - contact
                description:
                  regex:
                    - ".*phone.*(num|no).*"
                    - ".*(num|no).*phone.*"
                    - ".*[^a-z]+ph[^a-z]+.*(num|no).*"
                    - ".*(num|no).*[^a-z]+ph[^a-z]+.*"
                    - ".*mobile.*(num|no).*"
                    - ".*(num|no).*mobile.*"
                    - ".*telephone.*(num|no).*"
                    - ".*(num|no).*telephone.*"
                    - ".*cell.*(num|no).*"
                    - ".*(num|no).*cell.*"
                    - ".*contact.*(num|no).*"
                    - ".*(num|no).*contact.*"
                    - ".*landline.*(num|no).*"
                    - ".*(num|no).*landline.*"
                    - ".*fax.*(num|no).*"
                    - ".*(num|no).*fax.*"
                    - phone
                    - telephone
                    - landline
                    - mobile
                    - tel
                    - fax
                    - cell
                    - contact
                datatype:
                  type:
                    - int
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - phonenumbers
              Street_Address:
                prediction_factors_and_weights:
                  name: 0.5
                  description: 0
                  datatype: 0
                  values: 0.5
                name:
                  regex:
                    - ".*street.*add.*"
                    - ".*add.*street.*"
                    - ".*full.*add.*"
                    - ".*add.*full.*"
                    - ".*mail.*add.*"
                    - ".*add.*mail.*"
                    - add[^a-z]+
                    - address
                    - street
                description:
                  regex:
                    - ".*street.*add.*"
                    - ".*add.*street.*"
                    - ".*full.*add.*"
                    - ".*add.*full.*"
                    - ".*mail.*add.*"
                    - ".*add.*mail.*"
                    - add[^a-z]+
                    - address
                    - street
                datatype:
                  type:
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - spacy
              Full_Name:
                prediction_factors_and_weights:
                  name: 0.3
                  description: 0
                  datatype: 0
                  values: 0.7
                name:
                  regex:
                    - ".*person.*name.*"
                    - ".*name.*person.*"
                    - ".*user.*name.*"
                    - ".*name.*user.*"
                    - ".*full.*name.*"
                    - ".*name.*full.*"
                    - fullname
                    - name
                    - person
                    - user
                description:
                  regex:
                    - ".*person.*name.*"
                    - ".*name.*person.*"
                    - ".*user.*name.*"
                    - ".*name.*user.*"
                    - ".*full.*name.*"
                    - ".*name.*full.*"
                    - fullname
                    - name
                    - person
                    - user
                datatype:
                  type:
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - spacy
              Age:
                prediction_factors_and_weights:
                  name: 0.65
                  description: 0
                  datatype: 0
                  values: 0.35
                name:
                  regex:
                    - age[^a-z]+.*
                    - ".*[^a-z]+age"
                    - ".*[^a-z]+age[^a-z]+.*"
                    - age
                description:
                  regex:
                    - age[^a-z]+.*
                    - ".*[^a-z]+age"
                    - ".*[^a-z]+age[^a-z]+.*"
                    - age
                datatype:
                  type:
                    - int
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - rule_based_logic
```

### Advanced Configuration: Specifying Custom InfoType

```yml
source:
  type: snowflake
  config:
    env: PROD
    # Coordinates
    account_id: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10
    email_domain: mycompany.com

    classification:
      enabled: True
      classifiers:
        - type: datahub
          config:
            confidence_level_threshold: 0.7
            minimum_values_threshold: 10
            info_types_config:
              CloudRegion:
                prediction_factors_and_weights:
                  name: 0
                  description: 0
                  datatype: 0
                  values: 1
                values:
                  prediction_type: regex
                  regex:
                    - "(af|ap|ca|eu|me|sa|us)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)-\\d+"
                  library: []
```

## Additional Resources

### DataHub Blog

- [PII Classification just got easier with DataHub](https://medium.com/datahub-project/pii-classification-just-got-easier-with-datahub-6bab2b63abcb)