Auto tagging Windows troubleshooting (#12173)

Pere Miquel Brull 2023-06-27 13:55:19 +02:00 committed by GitHub
parent 6ca50225a1
commit 397fc364a8
2 changed files with 66 additions and 12 deletions


@@ -3,7 +3,7 @@ title: Lineage Ingestion
slug: /connectors/ingestion/auto_tagging
---
## Auto PII Tagging
# Auto PII Tagging
Auto PII tagging classifies columns as Sensitive/NonSensitive based on the two approaches described below.
@@ -12,9 +12,36 @@ PII Tagging is only available during `Profiler Ingestion`.
{% /note %}
### During Profiler Ingestion
- If sample data ingestion is enabled, OpenMetadata will infer during the profiler workflow whether a column is PII based on its content.
- **Column Name Scanning**: We pass the column name through a regex that identifies credit cards, emails, etc.
- **Sample Data Scanning**: If the column's status (PII or not) cannot be determined with the regex, we use the [presidio](https://microsoft.github.io/presidio/) library, which allows OpenMetadata to determine the PII status.
- The `confidence` parameter determines the minimum score required to tag the column.
## Tagging logic
1. **Column Name Scanner**: We validate the table's column names against a set of regex rules that match common
English patterns for email addresses, SSNs, bank accounts, etc. (see the first sketch after this list).
2. **Entity Recognition**: If sample data ingestion is enabled, we validate the sample rows against an Entity
Recognition engine that flags any sensitive information from the list of [supported entities](https://microsoft.github.io/presidio/supported_entities/).
Here, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive` (see the second sketch below).

Note that if a column is already tagged as `PII`, it is skipped.
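To make the Column Name Scanner step concrete, here is a minimal, self-contained sketch of the idea. The rule names and regex patterns below are illustrative only and are not OpenMetadata's actual rule set.
```
import re
from typing import Optional

# Illustrative rules only -- OpenMetadata ships its own, more complete rule set.
COLUMN_NAME_RULES = {
    "PII.Sensitive": [
        re.compile(r".*(e[-_]?mail).*", re.IGNORECASE),
        re.compile(r".*(ssn|social[-_ ]?security).*", re.IGNORECASE),
        re.compile(r".*(credit[-_ ]?card|card[-_ ]?number).*", re.IGNORECASE),
    ],
}

def scan_column_name(column_name: str) -> Optional[str]:
    """Return a PII tag if the column name matches any rule, otherwise None."""
    for tag, patterns in COLUMN_NAME_RULES.items():
        if any(pattern.match(column_name) for pattern in patterns):
            return tag
    return None

print(scan_column_name("user_email"))   # PII.Sensitive
print(scan_column_name("order_total"))  # None
```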
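The Entity Recognition step can be reproduced standalone with the presidio analyzer (`pip install presidio-analyzer`; it also needs a spaCy English model installed). The sample values and the 0.8 threshold below are illustrative and are not OpenMetadata's internal defaults.
```
# Requires `pip install presidio-analyzer` and a spaCy English model
# (presidio loads en_core_web_lg by default).
from presidio_analyzer import AnalyzerEngine

# Illustrative threshold, analogous in spirit to the workflow's `confidence` parameter.
CONFIDENCE = 0.8

analyzer = AnalyzerEngine()

def column_is_sensitive(sample_values):
    """Flag the column if any sampled value is recognized above the threshold."""
    for value in sample_values:
        results = analyzer.analyze(text=str(value), language="en")
        if any(result.score >= CONFIDENCE for result in results):
            return True
    return False

# Emails and credit card numbers are among presidio's supported entities.
print(column_is_sensitive(["john.doe@example.com", "4111 1111 1111 1111"]))  # True
print(column_is_sensitive(["hello", "world"]))                               # False
```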
## Troubleshooting
### SSL: CERTIFICATE_VERIFY_FAILED
If you see an error similar to:
```
Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
get local issuer certificate (_ssl.c:1129)')))
```
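Before changing anything, you can confirm that the failure is a certificate-verification problem on the host itself with a quick diagnostic probe (not part of OpenMetadata; the URL is taken from the error message above, and `requests` is assumed to be available in the environment):
```
# Diagnostic probe only -- not part of OpenMetadata.
# If `requests` is missing in your environment: pip install requests
import requests

URL = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

try:
    response = requests.get(URL, timeout=10)
    print("Certificate verification OK:", response.status_code)
except requests.exceptions.SSLError as exc:
    print("Same certificate problem as the profiler:", exc)
```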
This is a scenario we have identified on some corporate Windows laptops. The bottom line is that the profiler
tries to download the Entity Recognition model but runs into certificate issues when making the request.
One solution is to manually download the model on the ingestion container or Airflow host by running:
```
pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
```
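Once installed, you can sanity-check that the model resolves locally; run this in the same environment that executes the profiler (the version printed depends on the package you installed):
```
# If this loads without any network access, the model is installed locally.
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.meta["name"], nlp.meta["version"])  # e.g. core_web_md 3.5.0
```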
If you are using Docker, you might want to customize the `openmetadata-ingestion` image so that this command runs there by default.
