66 Commits

Author SHA1 Message Date
Pere Menal-Ferrer
44e09e41a2
Revert "FIX #1464 (#21520)" (#21726)
This reverts commit 1e86f9870fd663122b9bbb64f3cf17cf32619c7f.
2025-06-13 17:27:32 +02:00
Pere Menal-Ferrer
1e86f9870f
FIX #1464 (#21520)
* Add PIICategoryTags and some utilities on top of them.

* Fix static-check

* Add test for fqn representation

* Add NEREntityGeneralTags.json from Collate

* Add test to check PIICategoryTags agree with the ones used by OM server

* Add LabelExtractor

* Fix style

* Add ignore superfluous-parens for pylint

* Add comment as per PR review

* Fix not-updated PII-IT

* Remove duplicated IT test for PII

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
2025-06-09 16:05:35 -07:00
Pere Menal-Ferrer
5d2dfa712a
feature/pii-processor-improvement (#21248)
* Add PII Tag and Sensitivity Level enums.

* Add feature-extraction for PII classification tasks

* Add faker as test dependency

* Add unit tests for presidio tag extractor

* Add PIISensitivityTags enum and update sensitivity mapping logic

* Add Presidio utility functions for PII analysis

* Extend column name regexes for PII

* Add tests for PAN, NIF, SSN entities

* Fix version of faker to prevent flaky tests. Fix failing tests.

* Add Generated to State enum

* Integrate PIISensitive classifier to PIIProcessor
2025-05-19 17:52:17 +00:00
Pere Miquel Brull
3186937cc2
MINOR - Update Auto Classification defaults for sample data & classif… (#20587)
* MINOR - Update Auto Classification defaults for sample data & classification

* fix tests
2025-04-07 15:56:57 +02:00
Mayur Singal
7760663b22
MINOR: Change ingestion licence header (#20549) 2025-04-03 10:39:47 +05:30
Teddy
28bd01c471
MINOR: Remove default 100 when profileSample is None (#19672)
* fix: remove default 100% percent

* fix: use get_dataset

* fix: orm_profiler tests
2025-02-05 19:14:31 +01:00
Teddy
58699063db
MINOR -- Fix DQ Partition Issue (#18641)
* fix: renamed `random_sample` to `get_dataset` and change dunder method access for SQA Table object

* fix: removed handle_partition decorator

* fix: fixed DQ partition issue + moved to `tablesample` method

* style: ran python linting

* style: fix python format check issues

* feat: added postgres tablesample

* style: ran python linting

* fix: sampling delta

* fix: merge conflicts

* fix: resolved conflicts

* style: ran python linting

* fix: patch orm call in test case

* fix: mock build_table_orm call in tests

* fix: test case failures and errors

* fix: removed unused import

* fix: patch typo

* fix: trino table schema retrieval

* fix: remove tuple context manager for 3.8 test support
2024-11-27 08:50:54 +01:00
Pere Miquel Brull
c68a45e7d8
Create new Auto Classification Workflow (#18610) 2024-11-19 08:10:45 +01:00
Pere Miquel Brull
4cccaae446
GEN-996 - Allow PII Processor without storing Sample Data (#17927)
* GEN-996 - Allow PII Processor without storing Sample Data

* fix import

* fix import
2024-09-20 16:05:29 +02:00
Imri Paran
a3d6c1dd20
MINOR: tests(datalake): use minio (#17805)
* tests(datalake): use minio

1. use minio instead of moto for mimicking s3 behavior.
2. removed moto dependency as it is not compatible with aiobotocore (https://github.com/getmoto/moto/issues/7070#issuecomment-1828484982)

* - moved test_datalake_profiler_e2e.py to datalake/test_profiler
- use minio instead of moto

* fixed tests

* fixed tests

* removed default name for minio container
2024-09-12 07:13:01 +02:00
IceS2
c522f14178
MINOR: Refactor output_handlers to a WorkflowOutputHandler class (#17149)
* Refactor output_handlers to a WorkflowOutputHandler class

* Add old methods as deprecated to avoid breaking changes

* Extract WorkflowInitErrorHandler from workflow_output_handler

* Fix static checks

* Fix tests

* Fix tests

* Update code based on comments from PR

* Update comment
2024-07-29 09:20:34 +02:00
Pere Miquel Brull
cb72a22b59
Fix - e2e tests for pydantic V2 (#16551)
* Fix - e2e tests for pydantic V2

* add correct default

* add correct default

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* fix apis

* format
2024-06-06 19:36:17 -07:00
Pere Miquel Brull
d8e2187980
#15243 - Pydantic V2 & Airflow 2.9 (#16480)
* pydantic v2

* pydanticv2

* fix parser

* fix annotated

* fix model dumping

* mysql ingestion

* clean root models

* clean root models

* bump airflow

* bump airflow

* bump airflow

* optionals

* optionals

* optionals

* jdk

* airflow migrate

* fab provider

* fab provider

* fab provider

* some more fixes

* fixing tests and imports

* model_dump and model_validate

* model_dump and model_validate

* model_dump and model_validate

* union

* pylint

* pylint

* integration tests

* fix CostAnalysisReportData

* integration tests

* tests

* missing defaults

* missing defaults
2024-06-05 21:18:37 +02:00
juntao
8dd613caa5
Fixes #16235: need quote fullyQualifiedName in Ingestion Framework (#16273)
* Fixes #16235: need quote fullyQualifiedName in Ingestion Framework

* MINOR: fix UT issue

* revert: fix UT issue

* revert code

* revert code

* format code
2024-05-23 17:45:47 +02:00
Pere Miquel Brull
b786064bc2
#11857 - Store workflow status in the Ingestion Pipeline Status (#14462)
* Register StackTraceError in spec

* Register StackTraceError in spec

* Register StackTraceError in spec

* Add todos

* Update status

* docs

* format

* Fix tests

* Fix tests

* Fix tests

* Ignore generated

* Fix tests

* Fix tests

* Tests

* Try constants

* Try constants

* Print

* Print

* Print

* order

* Fix service name

* fix ui error

---------

Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
2023-12-22 15:43:50 +01:00
Teddy
31d2595e4f
fix: pass rnd table bound columns to sample query (#13561) 2023-10-13 14:57:28 +05:30
Teddy
1cbdfb3ae7
Fixes #12601 - column filter for profiler workflow (#13535)
* fix: sample data ingestion to match entity profiler column setting

* fix: python linting

* fix: updated fn call

* fix: added logic to handle json field in datalake connector

* fix: handle NA values in parsing

* fix: reverted sampler changes from #13338

* fix: reverted metric changes from #13338

* fix: added datalake profiler ingestion test

* fix: python linting

* fix: removed normalization of json blob in NoSQL db
2023-10-12 14:51:38 +02:00
Pere Miquel Brull
0282574bdd
Create ometa client once and pass it around & improve pycln config (#13310)
* Create ometa client once and pass it around & improve pycln config

* Fix

* Fix

* Fix tests

* Fix maven ci

* Fix tests

* Fix tests

* Fix tests

* Format

* Fix DI
2023-10-04 09:14:03 +02:00
Pere Miquel Brull
b5596a4640
Batch PII tagging (#13385)
* Batch PII tagging

* Batch PII tagging

* Fix tests

* Fix tests
2023-10-02 14:44:41 +02:00
Pere Miquel Brull
de7e06d024
Update structure for PII processing (#13079)
* Update structure for PII processing

* Fix tests

* Fix tests

* Lint

* Remove typo
2023-09-06 11:30:46 +02:00
Pere Miquel Brull
a3bfd4e696
Part of #11968 - Restructure Profiler Workflow and PII Processor (#13059)
* Structure PII

* Restructure Profiler Workflow

* Update signature for abc

* remove profiler sink

* Fix tests

* Fix lint

* Fix test

* Fix test
2023-09-04 11:02:57 +02:00
Pere Miquel Brull
6c0e9f5061
Part of #7272 - Centralize Workflows, Status, and Exception Management (#13029)
* Prep changes

* Prep changes

* prep changes

* Update imports

* Format

* Prep delete

* Prep delete

* Fix sink

* Prep test

* Commit

* passing either

* passing either

* Prep Either

* Metadata source with Either

* Update status

* Merge remote-tracking branch 'upstream/main' into issue-7272

* Format

* Linting

* Linting

* Linting

* Linting

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* Fix tests

* Comments
2023-08-30 15:49:42 +02:00
Teddy
101cd0ebac
Issue 8930 - Update profiler timestamp from seconds to milliseconds (#12948) 2023-08-25 08:47:16 +02:00
Suresh Srinivas
28b5e00c0c
Clean up documentation typos and grammar issues (#12930) 2023-08-20 20:08:30 -07:00
Teddy
bfa0cc7598
fix: python tests failure after PR #12865 (#12927)
* fix: python tests failure after https://github.com/open-metadata/OpenMetadata/pull/12865

* fix: test in ometa_table_api

* fix: skip is None test temporarily
2023-08-18 18:11:47 +02:00
Ayush Shah
ab1ec50c2c
Fixes Mssql Ntext, text and Image (#12490) 2023-07-20 13:34:35 +05:30
Teddy
1e86b6533c
Fixes #11743 - Remove SQLParse dependency for System Metrics (#12072)
* fix: removed sqlparse dependency for system metrics

* fix: update sample query

* fix: move system test os retrieval to `.get()`

* fix: move os.environ to `get`
2023-06-22 06:51:24 +02:00
Ayush Shah
f80eaf3a26
Fixes 11068: mysql & postgres iam auth (#11937) 2023-06-16 13:18:12 +05:30
Teddy
8c50d1af52
Fixes #4565 - Fetch Metrics from System tables (#11645)
* feat: fetch metrics from system tables

* feat: add permission doc for fetching metrics from system tables

* feat: fix E2E tests to reflect full table row count after table metric update

* feat: ran linting

* feat: fix doc string engine name + function typing

* feat: ran python linting
2023-05-22 09:04:18 +02:00
Pere Miquel Brull
1b90badd0e
Restructure PII processor (#11640)
* Restructure PII processor

* Restructure PII processor

* Format
2023-05-17 15:58:17 +02:00
Ayush Shah
2c9ba537eb
Fix min max on rowversion/timestamp mssql (#11455) 2023-05-08 14:52:53 +05:30
Teddy
754074f1be
Fixes #7758 - Added Column value and Integer Range Partitioning (#10350)
* feat(profiler): renamed  module to

* feat(profiler): added dbt-artifacts-parser to test setup.py

* feat(profiler): refactor workflow and interface

* feat(profiler): linting

* feat(profiler): removed old profiler modules

* feat(profiler): added support for value and integer range partition

* feat(profiler): fixed linting

* feat(profiler): added partitioning support for datalake profiler

* feat(profiler): removed `ProfilerInterfaceArgs` class

* feat(profiler): address comments

* feat(profiler): Added `OTHER` as an `IntervalType` for UI type generation
2023-03-01 08:20:38 +01:00
Suresh Srinivas
afad0a4769
Fixes #10123 - Change entityReference in createRequests to fullyQualifiedName (#10124)
* Change entityReference to entity name or fullyQualifiedName

* Change backend code and tests to use FQN

* UI change for using fqns instead of EntityReference

* Ingestion framework changes for using fqns instead of EntityReference

* Fix test failures

* Fixed python tests and sample data new

* fix: minor ui changes for fqn

* Fixed python integration tests

* Fixed superset tests

* fix UI tests

* fix type issue

* fix cypress

* fix name for testcase

---------

Co-authored-by: Onkar Ravgan <onkar.10r@gmail.com>
Co-authored-by: karanh37 <karanh37@gmail.com>
Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
2023-02-13 13:38:55 +05:30
Pere Miquel Brull
7f21a7bced
Fix #8088 - Restructure source connections & clients (#9545) 2023-01-02 13:52:27 +01:00
Ayush Shah
2bf5eb9051
fix 7995: profileSample % and row number (#9104) 2022-12-20 14:55:11 +05:30
Teddy
ac77f33b08
Fixes #7447 -- Add freshness metrics to profiler (#9159)
* refactor(profiler): integrated getter func.

Removed metric getter function from their own file.
Added metric getter to their own interface class.
Created dispatch by value method to dispatch metric getter func.

* feature(profiler): added systemProfiler schema

* feat(profiler): workflow fresh. & snflk impl.

* feat(profiler): freshness endpoint for put and get

* feat(profiler): added system met. for redshift

* feat(profiler): freshness met. for bigquery

* fix(profiler): keyword not found in func

* feat(profiler): Added sample data for freshness

* fix(profiler): fetch previous day for BQ

* fix(profiler): sonar + data fetching logic

* fix: typo in SystemMetric Class

* fix: linting

* fix: extracted out EntityList class into models.py
2022-12-07 14:33:30 +01:00
Sriharsha Chintalapani
25449001ca
Fix #9040: Remove fields such as tableQueries, tableProfile, tests, sample data as part of table fields (#9041) 2022-12-06 21:07:04 -08:00
Ayush Shah
5be0f8ee76
Dl Profiler (#8694)
* DQ commit

* Add DL Profiler

* Fix Ingestion and Profiling pylint checks

* Fix Tests

* PyFormat files

* Fix Tests

* Resolve Comments

* Fix Tests and Format Files

* Resolve Comments

* Fix Pylint and Code smells

* Resolve Comments

* Fix S3 parquet

* Fix Metrics Code Smell
2022-11-15 16:01:10 +01:00
Onkar Ravgan
35efd49256
Added control for DBT descriptions (#7653)
* Added control for DBT descriptions

* Fixed tests

* Added UI changes

* fixed maven ci tests

* Java formatting changes

* ui review fixes

* Fixed pytests

* Fixed python integration tests

* fixed airflow tests

Co-authored-by: Onkar Ravgan <onkarravgan@Onkars-MacBook-Pro.local>
2022-09-26 16:19:47 +05:30
Nahuel
2a6c6134f4
Fix #7272: Improve logging when initializing workflow from CLI (#7522)
* Improve logging when initializing workflow from CLI

* Fix broken tests
2022-09-19 08:00:00 -07:00
Sriharsha Chintalapani
821d70eae4
Fix #6782: Separate TableProfile and ColumnProfile api calls (#6783)
* Fix #6571: Add EntityLink for the testCase to ID columns

* Fix #6571: Add EntityLink for the testCase to ID columns

* Fix #6782: Separate TableProfile and ColumnProfile api calls

* Fix #6782: Separate TableProfile and ColumnProfile api calls - fix tests

* Fix #6782: Separate TableProfile and ColumnProfile api calls - fix tests

* Fix setFields

* Fix tests

* Update pipeline status endpoint

* updated ui side as per new schema for profiler tab

* updated profiler details with new API

* Fix Profiler tests and validation errors (#6827)

* add profilerSample field in TableProfile

* add profilerSample field in TableProfile

* get columnProfile with field profile

* get columnProfile with field profile

* Fixed sample data and python tests

* fixed date range filter change issue

* handled empty profiler case

* Added column level test case and results

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
Co-authored-by: Ayush Shah <ayush@getcollate.io>
Co-authored-by: Teddy Crepineau <teddy.crepineau@gmail.com>
2022-08-22 21:31:24 +05:30
Ayush Shah
383f4497cc
Update Entity Reference parameter fields (#6841) 2022-08-22 19:37:24 +05:30
Teddy
78b5f8c8e2
Part 1 of #5831 -- Profiler workflow implementation (#6809)
* Added database filter in workflow

* Removed association between profiler and data quality

* fixed tests with removed association

* Fixed sonar code smells and bugs

* Updated profiler workflow to:
- support only running profiler (removed test run)
- support column inclusion and exclusion
- added back support for partitioned table and sample

* moved status to workflow

* Fixed tests

* removed test logic from profiler sink

* Added logic to return sample from workflow sample value

* Added profiler examples

* Updated documentation for profiler

* Fixed code smells
2022-08-19 10:52:08 +02:00
Ayush Shah
a6db2e8a84
Fix for profiler: modified filter patterns and added error handling (#6608) 2022-08-08 10:43:17 +05:30
Sriharsha Chintalapani
1a42428e42
Add time series extension (#6416)
Co-authored-by: Vivek Ratnavel Subramanian <vivekratnavel90@gmail.com>
Co-authored-by: Teddy <teddy.crepineau@gmail.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
2022-08-04 07:22:47 -07:00
Teddy
818736e2ca
Fix SQLite same thread error (#6486) 2022-08-01 17:33:53 +02:00
Teddy
6397b6a0b1
Fixes #6325 -- Implement multithreading for metrics computation (#6406)
* Added tests for multithreading SQA interface

* Added multithread support for metric computation

* Added thread ID to log debugger

* Cleaned up tests

* Fixed python formatting issues

* Added non blocking result processing + threadCount in config file to set numbers of threads

* Added frontend input field to set number of threads

* Fixed code smell, bug and comments from reviewer
2022-07-29 10:41:53 +02:00
Teddy
aae4410c93
Fixes #6183 - Ability to set profile sample at the profiler workflow level (#6292)
Fixes #6183 - Ability to set profile sample at the profiler workflow level (#6292)
2022-07-25 12:08:20 +02:00
Teddy
5067e24374
[ISSUE-4723] Fix Snowflake Case Sensitive Error with Profiler (#5533)
* Fixed snowflake profiler + enabled profiler integration tests

* Fixed py formatting
2022-06-20 22:23:17 +02:00
Pere Miquel Brull
8e9d0a73f6
Fix #3573 - Sample Data refactor & ORM converter improvements (#5265)
Fix #3573 - Sample Data refactor & ORM converter improvements (#5265)
2022-06-08 16:10:40 +02:00