* chore: implement logger level tests for deprecation
* fix: use METADATA_LOGGER instead of warnings
* use unit test syntax
* isort
* black
* fix test
---------
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
* Update `TableDiffParamsSetter` to move data to the table level
This means that `key_columns` and `extra_columns` will be defined per table instead of "globally", just like `data_diff` expects
* Update `TableDiffValidator` to use table's `key_columns`
Call `data_diff` and run validations using each table's `key_columns`
* Create migration to update `tableDiff` test definition
* Fix Playwright test
* Refactor presidio utils
Extract the spacy model functionality from the analyzer building function
* Added a new `TagClassifier`
This classifier uses tags to dynamically build presidio `RecognizerRegistry`s
* Added a new `TagProcessor`
This processor uses `TagClassifier` to label a column based on the tags' recognizers
* Create `TagProcessor` based on workflow configuration
* Create decorator to apply threshold to recognizers
This is so that we can apply thresholds on recognizer results without subclassing or having to keep a map between the presidio recognizer and the recognizer configuration
* Fix broken test
* Add `reason` property to `TagLabel`
This is to understand what score was used for selecting the entity
* Build `TagLabel`s with `reason`
* Increase `PIIProcessor._tolerance`
This is so we correctly filter out low scores from classifiers while still maintaining the normalization that filters out confusing outcomes.
e.g. an output with scores 0.3, 0.7 and 0.75 would first filter out the 0.3 and then discard the other two, because both are relatively high results.
* Make database and DAO changes needed to persist `TagLabel.reason`
* Update generated TypeScript types
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
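
The threshold decorator described above can be sketched roughly as follows. This is a minimal illustration only, not the code from this change: `Result`, `EmailRecognizer` and `apply_threshold` are hypothetical names, and the recognizer is a plain Python stand-in rather than a presidio class. It assumes a recognizer exposes an `analyze` method returning scored results.

```python
import functools
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Hypothetical stand-in for a recognizer result."""
    entity_type: str
    score: float


class EmailRecognizer:
    """Hypothetical recognizer; a real one would inspect the text."""

    def analyze(self, text: str) -> List[Result]:
        return [Result("EMAIL", 0.9), Result("EMAIL", 0.4)]


def apply_threshold(recognizer, threshold: float):
    """Wrap `recognizer.analyze` so results below `threshold` are dropped.

    The recognizer instance is patched in place, so no subclass and no
    external recognizer-to-configuration map is needed.
    """
    original = recognizer.analyze  # bound method of this instance

    @functools.wraps(original)
    def analyze(*args, **kwargs):
        return [r for r in original(*args, **kwargs) if r.score >= threshold]

    recognizer.analyze = analyze  # shadows the class method on this instance only
    return recognizer


recognizer = apply_threshold(EmailRecognizer(), threshold=0.5)
kept = recognizer.analyze("contact: someone@example.com")
```

Because the wrapper closes over the threshold, each recognizer carries its own cut-off and other instances of the same class are unaffected.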
* Add support for translations in multi lang
* Add Tag Feedback System
* Update generated TypeScript types
* Fix typing issues and add tests to recognizer factory
* Updated `TagResourceTest.assertFieldChange` to fix broken test
Change description values had been serialized into strings, and for some reason the keys ended up in a different order. Instead of comparing the strings, we now compare them as JSON.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Eugenio Doñaque <eugenio.donaque@getcollate.io>
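
The string-versus-JSON comparison issue behind the `TagResourceTest.assertFieldChange` fix can be illustrated with a short sketch. The actual test is Java; this Python version only shows the idea, and the payloads and the `json_equal` helper are hypothetical.

```python
import json


def json_equal(expected: str, actual: str) -> bool:
    """Compare two serialized JSON documents by structure, not by text."""
    return json.loads(expected) == json.loads(actual)


# Hypothetical payloads: same content, different key order.
expected = '{"tagFQN": "PII.Sensitive", "reason": "score 0.9"}'
actual = '{"reason": "score 0.9", "tagFQN": "PII.Sensitive"}'

strings_match = expected == actual          # False: key order differs
parsed_match = json_equal(expected, actual)  # True: parsed objects are equal
```

Parsing before comparing makes the assertion independent of key order, which is what a serialized change description cannot guarantee.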
* feat: databricks oauth and azure ad auth setup
* refactor: add auth type changes in databricks.md
* fix: test after oauth changes
* refactor: unity catalog connection to databricks connection code
* Feat: show dbt project name
* Update generated TypeScript types
* added dbtSourceProject in data asset header properties
* Added tests
* Addressed comments
* Update generated TypeScript types
* move from dataAssetHeader to the dbt tab itself
* added unit test for added code
* test name change
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ashish Gupta <ashish@getcollate.io>
* fix: use a separate engine and session for each project in profiler and classification, and refactor billing project approach
* fix: added entity.database check, bigquery sampling tests
* fix: system metrics logic when bigquery billing project is provided
* MINOR: Improve UDF Lineage Processing & Better Logging Time & MultiProcessing (#20848)
* Fix multiprocessing with better memory management and Airflow 2+ compatibility
* Add support for both multiprocessing and multithreading for relevant platforms
* Handle conflicting cross-db lineage changes of service_name parameter change
* Handle stored proc queries without caching all and increase the thread timeout times to cover 100% lineage
* Fix `get_table_query` inheritance and pylint
* Remove mocks from db_utils tests
* Better db_utils test and fix the service_names parameter in case of schema_fallback
---------
Co-authored-by: Mayur Singal <39544459+ulixius9@users.noreply.github.com>
* Fix Oracle DataDiff and Change Oracle Connection to BaseConnection
* Add small unittest
* Fix Test
* Fix logic to avoid denormalizing table/schema names for other engines
* Add calculated view columns' formula parsing logic with correct source reference
* Handle top level column formula parsing and pass formula expression in column lineage detail
---------
Co-authored-by: Suman Maharana <sumanmaharana786@gmail.com>
* fix: ingestion fails for Iceberg tables with nested partition column
* test: added test to cover nested partition column for iceberg
* refactor: used if-else in tablePartition check
* fix: partition_column_name & column_partition_type typo