* Refactor presidio utils
Extract the spaCy model functionality from the analyzer-building function
* Added a new `TagClassifier`
This classifier uses tags to dynamically build presidio `RecognizerRegistry`s
* Added a new `TagProcessor`
This processor uses `TagClassifier` to label a column based on the tags' recognizers
* Create `TagProcessor` based on workflow configuration
* Create decorator to apply threshold to recognizers
This lets us apply thresholds to recognizer results without subclassing or keeping a map between the Presidio recognizer and its recognizer configuration
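A minimal sketch of the idea, assuming a recognizer exposes an `analyze` method whose results carry a `score`. The names `with_threshold`, `Result` and `ToyRecognizer` are hypothetical stand-ins for illustration, not the code added in this commit:

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Result:
    """Stand-in for a recognizer result; only the score matters here."""

    entity_type: str
    score: float


def with_threshold(recognizer: Any, threshold: float) -> Any:
    """Wrap `recognizer.analyze` so results scoring below `threshold` are dropped.

    Hypothetical helper: the instance is patched in place, so no subclass and
    no external recognizer-to-configuration map is needed.
    """
    original = recognizer.analyze  # bound method of this instance

    def analyze(*args: Any, **kwargs: Any) -> List[Result]:
        return [r for r in original(*args, **kwargs) if r.score >= threshold]

    recognizer.analyze = analyze
    return recognizer


class ToyRecognizer:
    """Toy recognizer that always returns two fixed results."""

    def analyze(self, text: str) -> List[Result]:
        return [Result("EMAIL", 0.4), Result("EMAIL", 0.9)]


recognizer = with_threshold(ToyRecognizer(), threshold=0.6)
kept = recognizer.analyze("a@b.co")  # only the 0.9 result survives
```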
* Fix broken test
* Add `reason` property to `TagLabel`
This records which score was used to select the entity
* Build `TagLabel`s with `reason`
* Increase `PIIProcessor._tolerance`
This lets us correctly filter out low scores from classifiers while keeping the normalization that filters out confusing outcomes.
E.g., for an output with scores 0.3, 0.7 and 0.75, the 0.3 is filtered out first, and the other two are then discarded because both are relatively high, so neither is a clear winner.
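The two-stage behaviour can be illustrated with a short sketch. This is an assumption-laden illustration, not the `PIIProcessor` implementation: `pick_label`, `min_score` and `min_share` are hypothetical names, the labels are made up, and the commit does not say how `_tolerance` maps onto either cutoff.

```python
from typing import Dict, Optional


def pick_label(
    scores: Dict[str, float],
    min_score: float = 0.5,  # hypothetical cutoff for "low" scores
    min_share: float = 0.7,  # hypothetical share the winner needs after normalization
) -> Optional[str]:
    """Return the winning label, or None when the outcome is empty or ambiguous."""
    # Stage 1: drop low scores.
    kept = {label: score for label, score in scores.items() if score >= min_score}
    total = sum(kept.values())
    if total == 0:
        return None
    # Stage 2: normalize the survivors and require a clear winner.
    best = max(kept, key=kept.get)
    return best if kept[best] / total >= min_share else None


ambiguous = pick_label({"NAME": 0.3, "EMAIL": 0.7, "PHONE": 0.75})
clear = pick_label({"NAME": 0.3, "EMAIL": 0.9})
```

With these assumed cutoffs, the first call drops 0.3 and then rejects both remaining labels because neither dominates, while the second call returns the single surviving label.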
* Make database and DAO changes needed to persist `TagLabel.reason`
* Update generated TypeScript types
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add support for translations in multi lang
* Add Tag Feedback System
* Update generated TypeScript types
* Fix typing issues and add tests to recognizer factory
* Updated `TagResourceTest.assertFieldChange` to fix broken test
The change description values had been serialized into strings, and the keys ended up in a different order. So instead of comparing strings, we now compare the values as JSON
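Why a JSON comparison is robust to key order, sketched in Python for brevity (the actual test is `TagResourceTest.assertFieldChange`; the payloads below are made-up examples):

```python
import json

# Two serializations of the same object with keys in a different order.
expected = '{"name": "tag", "state": "Generated"}'
actual = '{"state": "Generated", "name": "tag"}'

# Plain string comparison is sensitive to key order.
strings_equal = expected == actual

# Comparing the parsed values ignores key order.
json_equal = json.loads(expected) == json.loads(actual)
```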
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Eugenio Doñaque <eugenio.donaque@getcollate.io>
* Add PIICategoryTags and some utilities on top of them.
* Fix static-check
* Add test for fqn representation
* Add NEREntityGeneralTags.json from Collate
* Add test to check PIICategoryTags agree with the ones used by OM server
* Add LabelExtractor
* Fix style
* Add pylint ignore for `superfluous-parens`
* Add comment as per PR review
* Fix not-updated PII-IT
* Remove duplicated IT test for PII
---------
Co-authored-by: Pere Menal <pere.menal@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
* Reproduce failing behaviour with non-date-time data
* Add a presidio patch for DateTimes
* Fix type-check error
---------
Co-authored-by: Pere Menal <pere.menal@getcollate.io>
* Remove 'ORGANIZATION' PII Tag as it is no longer supported by our PII detectors.
* Update Presidio version to fix wrong regex for Indian passport
* Increase sample size of Indian passport numbers
---------
Co-authored-by: Pere Menal <pere.menal@getcollate.io>
* Add PII Tag and Sensitivity Level enums.
* Add feature-extraction for PII classification tasks
* Add faker as test dependency
* Add unit tests for presidio tag extractor
* Add PIISensitivityTags enum and update sensitivity mapping logic
* Add Presidio utility functions for PII analysis
* Extend column name regexes for PII
* Add tests for PAN, NIF, SSN entities
* Fix version of faker to prevent flaky tests. Fix failing tests.
* Add Generated to State enum
* Integrate PIISensitive classifier to PIIProcessor
* Add PII Tag and Sensitivity Level enums.
* Add feature-extraction for PII classification tasks
* Add faker as test dependency
* Add unit tests for presidio tag extractor
* Add PIISensitivityTags enum and update sensitivity mapping logic
* Add Presidio utility functions for PII analysis
* Extend column name regexes for PII
* Add column name split
* Move pii algorithms to dedicated package
* Add tests for PAN, NIF, SSN entities
* Fix linting
* Add comment on why we need to set a specific language on Presidio recognizers, as per PR suggestion.
* Fix version of faker to prevent flaky tests. Fix failing tests.
* Fix wrong import
---------
Co-authored-by: Pere Menal <pere.menal@getcollate.io>
* linting: fix python linting
* fix: get column types from parquet schema for parquet files
* style: python linting
* fix: remove displayType check in test, as it varies depending on OS