* Add PIICategoryTags and some utilities on top of them.
* Fix static-check
* Add test for fqn representation
* Add NEREntityGeneralTags.json from Collate
* Add test to check PIICategoryTags agree with the ones used by OM server
* Add LabelExtractor
* Fix style
* Add ignore superfluous-parens for pylint
* Add comment as per PR review
* Fix not-updated PII-IT
* Remove duplicated IT test for PII
---------
Co-authored-by: Pere Menal <pere.menal@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
* Add PII Tag and Sensitivity Level enums.
* Add feature-extraction for PII classification tasks
* Add faker as test dependency
* Add unit tests for presidio tag extractor
* Add PIISensitivityTags enum and update sensitivity mapping logic
* Add Presidio utility functions for PII analysis
* Extend column name regexes for PII
* Add tests for PAN, NIF, SSN entities
* Fix version of faker to prevent flaky tests. Fix failing tests.
* Add Generated to State enum
* Integrate PIISensitive classifier to PIIProcessor
* tests(datalake): use minio
1. use minio instead of moto for mimicking s3 behavior.
2. removed moto dependency as it is not compatible with aiobotocore (https://github.com/getmoto/moto/issues/7070#issuecomment-1828484982)
* - moved test_datalake_profiler_e2e.py to datalake/test_profiler
- use minio instead of moto
* fixed tests
* removed default name for minio container
* Refactor output_handlers to a WorkflowOutputHandler class
* Add old methods as deprecated to avoid breaking changes
* Extract WorkflowInitErrorHandler from workflow_output_handler
* Fix static checks
* Fix tests
* Update code based on comments from PR
* Update comment
* fix: removed sqlparse dependency for system metrics
* fix: update sample query
* fix: move system test os retrieval to `.get()`
* fix: move os.environ to `get`
* feat: fetch metrics from system tables
* feat: add permission doc for fetching metrics from system tables
* feat: fix E2E tests to reflect full table row count after table metric update
* feat: ran linting
* feat: fix doc string engine name + function typing
* feat: ran python linting
* feat(profiler): renamed module to
* feat(profiler): added dbt-artifacts-parser to test setup.py
* feat(profiler): refactor workflow and interface
* feat(profiler): linting
* feat(profiler): removed old profiler modules
* feat(profiler): added support for value and integer range partition
* feat(profiler): fixed linting
* feat(profiler): added partitioning support for datalake profiler
* feat(profiler): removed `ProfilerInterfaceArgs` class
* feat(profiler): address comments
* feat(profiler): Added `OTHER` as an `IntervalType` for UI type generation
* Change entityReference to entity name or fullyQualifiedName
* Change backend code and tests to use FQN
* UI change for using fqns instead of EntityReference
* Ingestion framework changes for using fqns instead of EntityReference
* Fix test failures
* Fixed python tests and sample data new
* fix: minor ui changes for fqn
* Fixed python integration tests
* Fixed superset tests
* fix UI tests
* fix type issue
* fix cypress
* fix name for testcase
---------
Co-authored-by: Onkar Ravgan <onkar.10r@gmail.com>
Co-authored-by: karanh37 <karanh37@gmail.com>
Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
* refactor(profiler): integrated getter func.
Removed metric getter functions from their own file.
Added metric getters to their own interface classes.
Created dispatch-by-value method to dispatch metric getter functions.
* feature(profiler): added systemProfiler schema
* feat(profiler): freshness workflow & snowflake implementation
* feat(profiler): freshness endpoint for put and get
* feat(profiler): added system metrics for redshift
* feat(profiler): freshness metrics for bigquery
* fix(profiler): keyword not found in func
* feat(profiler): Added sample data for freshness
* fix(profiler): fetch previous day for BQ
* fix(profiler): sonar + data fetching logic
* fix: typo in SystemMetric Class
* fix: linting
* fix: extracted out EntityList class into models.py
* Fix #6571: Add EntityLink for the testCase to ID columns
* Fix #6782: Separate TableProfile and ColumnProfile api calls
* Fix #6782: Separate TableProfile and ColumnProfile api calls - fix tests
* Fix setFields
* Fix tests
* Update pipeline status endpoint
* updated ui side as per new schema for profiler tab
* updated profiler details with new API
* Fix Profiler tests and validation errors (#6827)
* add profilerSample field in TableProfile
* get columnProfile with field profile
* Fixed sample data and python tests
* fixed date range filter change issue
* handled empty profiler case
* Added column level test case and results
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
Co-authored-by: Ayush Shah <ayush@getcollate.io>
Co-authored-by: Teddy Crepineau <teddy.crepineau@gmail.com>
* Added database filter in workflow
* Removed association between profiler and data quality
* fixed tests with removed association
* Fixed sonar code smells and bugs
* Updated profiler workflow to:
- support only running profiler (removed test run)
- support column inclusion and exclusion
- add back support for partitioned tables and sampling
* moved status to workflow
* Fixed tests
* removed test logic from profiler sink
* Added logic to return sample from workflow sample value
* Added profiler examples
* Updated documentation for profiler
* Fixed code smells
* Added tests for multithreading SQA interface
* Added multithread support for metric computation
* Added thread ID to log debugger
* Cleaned up tests
* Fixed python formatting issues
* Added non-blocking result processing + threadCount in config file to set the number of threads
* Added frontend input field to set number of threads
* Fixed code smell, bug and comments from reviewer