mirror of
https://github.com/open-metadata/OpenMetadata.git
synced 2025-08-22 07:58:06 +00:00
Auto tagging Windows troubleshooting (#12173)
This commit is contained in:
parent
6ca50225a1
commit
397fc364a8
@@ -3,7 +3,7 @@ title: Lineage Ingestion
slug: /connectors/ingestion/auto_tagging
---

## Auto PII Tagging
# Auto PII Tagging

Auto PII tagging for Sensitive/NonSensitive at the column level is performed based on the two approaches described below.

@@ -12,9 +12,36 @@ PII Tagging is only available during `Profiler Ingestion`.
{% /note %}

### During Profiler Ingestion
- If sample data ingestion is enabled, OpenMetadata infers during the profiler workflow whether a column contains PII, based on its content.
- **Column Name Scanning**: we pass the column name through a regex that identifies credit cards, emails, etc.
- **Sample Data Scanning**: if the column's PII status cannot be determined with the regex, we use the [presidio](https://microsoft.github.io/presidio/) library to determine it from the sample data.
- The `confidence` parameter sets the minimum score required to tag the column.
## Tagging logic

1. **Column Name Scanner**: We validate the table's column names against a set of regex rules that match common English
   patterns for email addresses, SSNs, bank accounts, etc.
2. **Entity Recognition**: If sample data ingestion is enabled, we validate the sample rows against an Entity
   Recognition engine that flags any sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/).
   In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.

Note that if a column is already tagged as `PII`, we skip it and do not run the tagging logic on it again.

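The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenMetadata's implementation: the regex patterns, the function names (`scan_column_name`, `decide_tag`), and the idea that entity scores and `confidence` share a 0 to 1 scale are all hypothetical. The entity scores are passed in as plain numbers, standing in for whatever the Entity Recognition engine returns for the sample rows.

```python
import re
from typing import Iterable, Optional

# Hypothetical patterns for illustration only; not OpenMetadata's actual rule set.
SENSITIVE_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_ ]?mail", re.IGNORECASE),
    "SSN": re.compile(r"(^|_)ssn($|_)|social[-_ ]?security", re.IGNORECASE),
    "BANK_ACCOUNT": re.compile(r"iban|bank[-_ ]?account", re.IGNORECASE),
}


def scan_column_name(column_name: str) -> Optional[str]:
    """Step 1: return the label of the first pattern matching the column name."""
    for label, pattern in SENSITIVE_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return label
    return None


def decide_tag(
    column_name: str,
    entity_scores: Iterable[float],
    confidence: float,
    already_tagged: bool = False,
) -> Optional[str]:
    """Return which step would trigger a tag: "column_name", "sample_data", or None.

    Assumes `entity_scores` and `confidence` are on the same scale.
    """
    # Columns that already carry a PII tag are skipped entirely.
    if already_tagged:
        return None
    # Step 1: column name scanning.
    if scan_column_name(column_name) is not None:
        return "column_name"
    # Step 2: entity recognition on sample data, gated by the minimum score.
    if any(score >= confidence for score in entity_scores):
        return "sample_data"
    return None
```

For example, `decide_tag("order_total", [0.4, 0.9], confidence=0.8)` falls through the name scan and is decided by the sample data scores, while raising `confidence` above 0.9 would leave the column untagged.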
## Troubleshooting

### SSL: CERTIFICATE_VERIFY_FAILED

If you see an error similar to:

```
Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
get local issuer certificate (_ssl.c:1129)')))
```

This is a scenario that we identified on some corporate Windows laptops. The bottom line is that the profiler
is trying to download the Entity Recognition model, but the request fails certificate verification.

A workaround is to manually download and install the model on the ingestion container / Airflow host by running:

```
pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
```

If you are using Docker, you might want to customize the `openmetadata-ingestion` image so that this command runs there by default.
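
To confirm that the manual install took effect, a quick check like the one below may help. It is a minimal sketch that relies on spaCy models installed with pip being regular Python packages, so the model from the command above should be importable as `en_core_web_md`. The helper name `is_package_installed` is hypothetical, and the check assumes it runs with the same Python interpreter as the ingestion workflow.

```python
import importlib.util

# Package name of the model installed by the pip command above.
MODEL_PACKAGE = "en_core_web_md"


def is_package_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_package_installed(MODEL_PACKAGE):
        print(f"{MODEL_PACKAGE} is installed")
    else:
        print(f"{MODEL_PACKAGE} is missing; install it with the pip command above")
```

Running this inside the ingestion container or on the Airflow host shows whether the model is present before the profiler workflow is triggered again.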