18 Commits

Author SHA1 Message Date
Pere Menal-Ferrer
44e09e41a2
Revert "FIX #1464 (#21520)" (#21726)
This reverts commit 1e86f9870fd663122b9bbb64f3cf17cf32619c7f.
2025-06-13 17:27:32 +02:00
Pere Menal-Ferrer
1e86f9870f
FIX #1464 (#21520)
* Add PIICategoryTags and some utilities on top of them.

* Fix static-check

* Add test for fqn representation

* Add NEREntityGeneralTags.json from Collate

* Add test to check PIICategoryTags agree with the ones used by OM server

* Add LabelExtractor

* Fix style

* Add ignore superfluous-parens for pylint

* Add comment as per PR review

* Fix not-updated PII-IT

* Remove duplicated IT test for PII

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
2025-06-09 16:05:35 -07:00
Pere Menal-Ferrer
6683c632f4
FIX #21464 (#21463)
* Reproduce failing behaviour with non-date-time data

* Add a presidio patch for DateTimes

* Fix type-check error

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
2025-05-30 08:18:50 +02:00
Pere Menal-Ferrer
3c6c762d9c
fix/indian-passport-detection (#21311)
* Remove 'ORGANIZATION' PII Tag as it is no longer supported by our PII detectors.

* Update presidio version to fix wrong regex for Indian passport

* Increase sample size of Indian passport numbers

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
2025-05-20 15:32:21 +02:00
Pere Menal-Ferrer
5d2dfa712a
feature/pii-processor-improvement (#21248)
* Add PII Tag and Sensitivity Level enums.

* Add feature-extraction for PII classification tasks

* Add faker as test dependency

* Add unit tests for presidio tag extractor

* Add PIISensitivityTags enum and update sensitivity mapping logic

* Add Presidio utility functions for PII analysis

* Extend column name regexes for PII

* Add tests for PAN, NIF, SSN entities

* Fix version of faker to prevent flaky tests. Fix failing tests.

* Add Generated to State enum

* Integrate PIISensitive classifier to PIIProcessor
2025-05-19 17:52:17 +00:00
Pere Menal-Ferrer
a7e2f33adc
feature/pii-column-classifier (#21200)
* Add PII Tag and Sensitivity Level enums.

* Add feature-extraction for PII classification tasks

* Add faker as test dependency

* Add unit tests for presidio tag extractor

* Add PIISensitivityTags enum and update sensitivity mapping logic

* Add Presidio utility functions for PII analysis

* Extend column name regexes for PII

* Add column name split

* Move pii algorithms to dedicated package

* Add tests for PAN, NIF, SSN entities

* Fix linting

* Add comment on why we need to set a specific language for Presidio recognizers, as per PR suggestion.

* Fix version of faker to prevent flaky tests. Fix failing tests.

* Fix wrong import

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
2025-05-16 14:03:49 +02:00
Mayur Singal
7760663b22
MINOR: Change ingestion licence header (#20549) 2025-04-03 10:39:47 +05:30
Pere Miquel Brull
2e7c9a0875
FIX #19765 - Improve Column Name Scanner (#20136) 2025-03-07 14:32:59 +01:00
Pere Miquel Brull
c309906a1b
MINOR - Bump Presidio Analyzer and validate support for legal entities (#17750) 2024-09-06 16:07:08 +02:00
Pere Miquel Brull
8191202850
MINOR - Better PII classification for JSON data (#17734)
* MINOR - Better PII classification for JSON data

* linting
2024-09-06 08:54:23 +02:00
Pere Miquel Brull
2237d5a8d5
MINOR - PII Scanner tests and log levels (#17686)
* MINOR - PII Scanner tests and log levels

* MINOR - PII Scanner tests and log levels
2024-09-04 12:11:07 +02:00
Teddy
9a4a9df836
Fix #14895 - Get Metadata from Parquet Schema (#14956)
* linting: fix python linting

* fix: get column types from parquet schema for parquet files

* style: python linting

* fix: remove displayType check in test as variation depending on OS
2024-02-01 09:02:52 +01:00
Pere Miquel Brull
0282574bdd
Create ometa client once and pass it around & improve pycln config (#13310)
* Create ometa client once and pass it around & improve pycln config

* Fix

* Fix

* Fix tests

* Fix maven ci

* Fix tests

* Fix tests

* Fix tests

* Format

* Fix DI
2023-10-04 09:14:03 +02:00
Pere Miquel Brull
de7e06d024
Update structure for PII processing (#13079)
* Update structure for PII processing

* Fix tests

* Fix tests

* Lint

* Remove typo
2023-09-06 11:30:46 +02:00
Pere Miquel Brull
a3bfd4e696
Part of #11968 - Restructure Profiler Workflow and PII Processor (#13059)
* Structure PII

* Restructure Profiler Workflow

* Update signature for abc

* remove profiler sink

* Fix tests

* Fix lint

* Fix test

* Fix test
2023-09-04 11:02:57 +02:00
Pere Miquel Brull
0eb2201f94
Restructure NER Scanner internals (#11690)
* Simplify col name scanner

* Restructure NER Scanner internals
2023-05-19 18:21:01 +02:00
Pere Miquel Brull
8795337f88
Clean NER Scanner imports (#11653) 2023-05-18 12:53:22 +02:00
Pere Miquel Brull
1b90badd0e
Restructure PII processor (#11640)
* Restructure PII processor

* Restructure PII processor

* Format
2023-05-17 15:58:17 +02:00