diff --git a/openmetadata-docs/content/v1.0.0/connectors/ingestion/workflows/profiler/auto_tagging.md b/openmetadata-docs/content/v1.0.0/connectors/ingestion/workflows/profiler/auto_tagging.md
index c31e5cf34ee..bce1de71b67 100644
--- a/openmetadata-docs/content/v1.0.0/connectors/ingestion/workflows/profiler/auto_tagging.md
+++ b/openmetadata-docs/content/v1.0.0/connectors/ingestion/workflows/profiler/auto_tagging.md
@@ -3,7 +3,7 @@ title: Lineage Ingestion
 slug: /connectors/ingestion/auto_tagging
 ---
 
-## Auto PII Tagging
+# Auto PII Tagging
 
 Auto PII tagging for Sensitive/NonSensitive at the column level is performed based on the two approaches described below.
 
@@ -12,9 +12,36 @@ PII Tagging is only available during `Profiler Ingestion`.
 
 {% /note %}
 
-### During Profiler Ingestion
-- If sample data ingestion is enabled, during the profiler workflow OpenMetadata will infer the status of a column (PII or not) based on its content.
-- **Column Name Scanning** we pass the column name through a regex, which identifies credit card, email, etc.
-- **Sample Data Scanning** If the column status (PII or not) cannot be parsed with regex, we will use [presidio](https://microsoft.github.io/presidio/) library, which allows OpenMetadata to determine the PII status.
-- `confidence` parameter is passed to determine the minimum score required to tag the column.
+## Tagging logic
+1. **Column Name Scanner**: We validate the column names of the table against a set of regex rules that help us identify
+   common English patterns to identify email addresses, SSN, bank accounts, etc.
+2. **Entity Recognition**: If the sample data ingestion is enabled, we'll validate the sample rows against an Entity
+   Recognition engine that will bring up any sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/).
+   In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.
+
+Note that if a column is already tagged as `PII`, we will ignore its execution.
+
+## Troubleshooting
+
+### SSL: CERTIFICATE_VERIFY_FAILED
+
+If you see an error similar to:
+
+```
+Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
+Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
+(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
+get local issuer certificate (_ssl.c:1129)')))
+```
+
+This is a scenario that we identified on some corporate Windows laptops. The bottom-line here is that the profiler
+is trying to download the Entity Recognition model but having certificate issues when trying the request.
+
+A solution here is to manually download the model on the ingestion container / Airflow host by running:
+
+```
+pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
+```
+
+If using Docker, you might want to customize the `openmetadata-ingestion` image to have this command run there by default.
 
diff --git a/openmetadata-docs/content/v1.1.0-snapshot/connectors/ingestion/workflows/profiler/auto_tagging.md b/openmetadata-docs/content/v1.1.0-snapshot/connectors/ingestion/workflows/profiler/auto_tagging.md
index c31e5cf34ee..0cd8ab80bd8 100644
--- a/openmetadata-docs/content/v1.1.0-snapshot/connectors/ingestion/workflows/profiler/auto_tagging.md
+++ b/openmetadata-docs/content/v1.1.0-snapshot/connectors/ingestion/workflows/profiler/auto_tagging.md
@@ -3,7 +3,7 @@ title: Lineage Ingestion
 slug: /connectors/ingestion/auto_tagging
 ---
 
-## Auto PII Tagging
+# Auto PII Tagging
 
 Auto PII tagging for Sensitive/NonSensitive at the column level is performed based on the two approaches described below.
 
@@ -12,9 +12,36 @@ PII Tagging is only available during `Profiler Ingestion`.
 
 {% /note %}
 
-### During Profiler Ingestion
-- If sample data ingestion is enabled, during the profiler workflow OpenMetadata will infer the status of a column (PII or not) based on its content.
-- **Column Name Scanning** we pass the column name through a regex, which identifies credit card, email, etc.
-- **Sample Data Scanning** If the column status (PII or not) cannot be parsed with regex, we will use [presidio](https://microsoft.github.io/presidio/) library, which allows OpenMetadata to determine the PII status.
-- `confidence` parameter is passed to determine the minimum score required to tag the column.
+## Tagging logic
+1. **Column Name Scanner**: We validate the column names of the table against a set of regex rules that help us identify
+   common English patterns to identify email addresses, SSN, bank accounts, etc.
+2. **Entity Recognition**: If the sample data ingestion is enabled, we'll validate the sample rows against an Entity
+   Recognition engine that will bring up any sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/).
+   In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.
+
+Note that if a column is already tagged as `PII`, we will ignore its execution.
+
+## Troubleshooting
+
+### SSL: CERTIFICATE_VERIFY_FAILED
+
+If you see an error similar to:
+
+```
+Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
+Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
+(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
+get local issuer certificate (_ssl.c:1129)')))
+```
+
+This is a scenario that we identified on some corporate Windows laptops. The bottom-line here is that the profiler
+is trying to download the Entity Recognition model but having certificate issues when trying the request.
+
+A solution here is to manually download the model on the ingestion container / Airflow host by running:
+
+```
+pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
+```
+
+If using Docker, you might want to customize the `openmetadata-ingestion` image to have this command run there by default.
 
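
Note on the documented tagging logic: the two-step flow described in the patch (column name scan first, entity recognition on sample rows second, skipping columns that already carry a `PII` tag) can be illustrated with a minimal Python sketch. This is not OpenMetadata's implementation. The `NAME_PATTERNS` rules, the `tag_column` function, and the convention that recognizer scores and `confidence` share a 0 to 1 scale are all assumptions made for illustration, and the entity recognition step is represented only by a list of precomputed scores rather than a real engine call.

```python
import re
from typing import Iterable, Optional

# Hypothetical name rules for illustration; the real rule set is not shown in the patch.
NAME_PATTERNS = {
    "email": re.compile(r"e[-_ ]?mail", re.IGNORECASE),
    "ssn": re.compile(r"(^|_)ssn($|_)|social[-_ ]?security", re.IGNORECASE),
    "bank_account": re.compile(r"bank[-_ ]?account|iban", re.IGNORECASE),
}


def tag_column(
    name: str,
    existing_tags: Iterable[str],
    sample_scores: Iterable[float],
    confidence: float = 0.8,  # assumed 0..1 scale, matching the scores
) -> Optional[str]:
    """Return "PII.Sensitive" if the column should be tagged, else None."""
    # Columns that already carry a PII tag are skipped entirely.
    if any(tag.startswith("PII") for tag in existing_tags):
        return None

    # Step 1: column name scanner (regex match on the name alone).
    if any(pattern.search(name) for pattern in NAME_PATTERNS.values()):
        return "PII.Sensitive"

    # Step 2: entity recognition on sample data, stood in for here by
    # precomputed scores; tag only if some score reaches the threshold.
    if any(score >= confidence for score in sample_scores):
        return "PII.Sensitive"

    return None
```

For example, `tag_column("notes", [], [0.4, 0.9])` tags the column because one sample score reaches the default threshold, while raising `confidence` above every score leaves it untagged. How scores are aggregated across sample rows is a choice made for this sketch, not something the patch specifies.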