151 Commits

Author SHA1 Message Date
Teddy
29450d1104
feat: add support for DBX system metrics (#22044)
* feat: add support for DBX system metrics

* feat: add support for DBX system metrics

* fix: added WRITE back

* fix: failing test cases

* fix: failing test
2025-07-02 08:54:16 +02:00
IceS2
040a33117c
MINOR: Fix Profiler Infinite Loop (#21843) 2025-06-19 10:33:45 +05:30
IceS2
e79c54e6a5
MINOR: Add injection to profiler (#21738)
* Initial implementation for our Connection Class

* Implement the Initial Connection class

* Add Unit Tests

* Implement Dependency Injection for the Ingestion Framework

* Fix Test

* Fix Profile Test Connection

* Add Injection to Metrics in Profiler

* Add Injection to the Profiler

* Fix UnitTests

* Fix Pytests

* Fix Tests

* Fix types
2025-06-17 19:01:00 +02:00
Pere Menal-Ferrer
ca812852d6
ci/nox-setup-testing (#21377)
* Make pytest use code from src rather than from the installed package

* Fix test_amundsen: missing None

* Update pytest configuration to use importlib mode

* Fix custom_basemodel_validation to check model_fields on type(values) to prevent noisy warnings

* Refactor referencedByQueries validation to use field_validator as per deprecation warning

* Update ColumnJson to use model_rebuild as a replacement for forward reference updates, as per deprecation warning

* Move superset test to integration test as they are using testcontainers

* Update coverage source path

* Fix wrong import.

* Add install_dev_env target to Makefile for development dependencies

* Add test-unit as extra in setup.py

* Modify dependencies in dev environment.

* Ignore all airflow tests

* Remove coverage in unit_ingestion_dev_env. Revert coverage source to prevent broken CI.

* Add nox for running unit test

* Fix PowerBI integration test to use pathlib for resource paths and not os.getcwd to prevent failures when not executed from the right path

* Move test_helpers.py to unit test, as it is not an integration test.

* Remove utils empty folder in integration tests

* Refactor testcontainers configuration to avoid pitfalls with max_tries setting

* Add nox unit testing basic setup

* Add format check session

* Refactor nox-unit and add plugins tests

* Add GHA for py-nox-ci

* Add comment to GHA

* Restore conftest.py file

* Clarify comment

* Simplify function

* Fix matrix strategy and nox mismatch

* Improve python version strategy with nox and GHA

---------

Co-authored-by: Pere Menal <pere.menal@getcollate.io>
2025-05-27 10:56:52 +02:00
Mayur Singal
7760663b22
MINOR: Change ingestion licence header (#20549) 2025-04-03 10:39:47 +05:30
Pere Miquel Brull
6b7a9fe76c
MINOR - Update sampler tablenames (#19976)
* MINOR - Update sampler tablenames

* MINOR - Update sampler tablenames

* MINOR - Update sampler tablenames
2025-02-26 14:08:14 +01:00
Pere Miquel Brull
91b62fdc32
FIX #19798 - Shortening SQA __tablename__ to avoid hitting errors in … (#19809)
* FIX #19798 - Shortening SQA __tablename__ to avoid hitting errors in postgres

* fix tests

---------

Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
2025-02-17 09:37:06 +01:00
Pere Miquel Brull
e4fa16e574
FIX - Profiler Source Include Views Filter (#19746)
* FIX - Profiler Source Include Views Filter

* FIX - Profiler Source Include Views Filter
2025-02-12 08:35:47 +01:00
Teddy
e1b3e08317
MINOR BQ sampler type missing (#19696)
* fix: missing entity type in bq sampler

* fix: failing tests
2025-02-11 10:46:34 -08:00
Teddy
ef131d7e20
MINOR: Wrong attribute name in SampleConfig model (#19641)
* fix: wrong attribute name in SampleConfig model

* fix: test attribute

* fix: failing tests

* fix: trino filter error + adjust test to take into account null value

* fix: mssql and azuresql tablesample on views
2025-02-04 10:40:40 +01:00
Teddy
79b2888bb5
fix: azuresql sampler logic (#19034) 2024-12-13 07:35:04 +01:00
Teddy
610322ffed
MINOR - MSSQL timestamp type profiler fix (#18935)
* fix: mssql timestamp processing

* fix: min/max test type on datetime column

* style: fix python format
2024-12-06 08:03:42 +01:00
Teddy
03bd8e9dc4
FEAT: added TABLESAMPLE for MSSQL (#18926)
* feat: added TABLESAMPLE for sqlserver

* fix: class name

* test: added test to generated sample query
2024-12-05 14:17:39 +01:00
Teddy
ac2f6d7132
MINOR - Fix sqa table reference (#18839)
* fix: sqa table reference

* style: ran python linting

* fix: added raw dataset to query runner

* fix: get table and schema name from orm object

* fix: get table level config for table tests
2024-11-28 18:49:11 +01:00
Imri Paran
cd74d8f55a
MINOR: ref(data-quality): modularized test case validator import (#18716)
* ref(data-quality): modularized test case validator import

- removed test_suite_factory
- implemented TestCaseImporter
- removed SQAValidatorBuilder and PandasValidatorBuilder in favor of a SourceType enum
- removed the orm table creation from test suite source

* format

* IValidatorBuilder -> ValidatorBuilder

* use the table from the sampler in the test suite interface

* linting

* fixed the profiler with similar solution

* removed unused inheritance

* removed unneeded super().__init__()

* removed all instances of orm_table

* fixed tests

* add reportExplicitAny=false

* fixed tests
2024-11-27 16:25:12 +01:00
Teddy
58699063db
MINOR -- Fix DQ Partition Issue (#18641)
* fix: renamed `random_sample` to `get_dataset` and change dunder method access for SQA Table object

* fix: removed handle_partition decorator

* fix: fixed DQ partition issue + moved to `tablesample` method

* style: ran python linting

* style: fix python format check issues

* feat: added postgres tablesample

* style: ran python linting

* fix: sampling delta

* fix: merge conflicts

* fix: resolved conflicts

* style: ran python linting

* fix: patch orm call in test case

* fix: mock build_table_orm call in tests

* fix: test case failures and errors

* fix: removed unused import

* fix: patch typo

* fix: trino table schema retrieval

* fix: remove tuple context manager for 3.8 test support
2024-11-27 08:50:54 +01:00
Pere Miquel Brull
c68a45e7d8
Create new Auto Classification Workflow (#18610) 2024-11-19 08:10:45 +01:00
Teddy
45d27a377d
GEN 1184 - Added Workflow Classification and Metric LevelConfig (#18572) 2024-11-11 15:59:42 +01:00
Imri Paran
95982b9395
[GEN-356] Use ServiceSpec for loading sources based on connectors (#18322)
* ref(profiler): use di for system profile

- use source classes that can be overridden in system profiles
- use a manifest class instead of factory to specify which class to resolve for connectors
- example usage can be seen in redshift and snowflake

* - added manifests for all custom profilers
- used super() dependency injection in order for system metrics source
- formatting

* - implement spec for all source types
- added docs for the new specification
- added some pylint ignores in the importer module

* remove TYPE_CHECKING in core.py

* - deleted valuedispatch function
- deleted get_system_metrics_by_dialect
- implemented BigQueryProfiler with a system metrics source
- moved import_source_class to BaseSpec

* - removed tests related to the profiler factory

* - reverted start_time
- removed DML_STAT_TO_DML_STATEMENT_MAPPING
- removed unused logger

* - reverted start_time
- removed DML_STAT_TO_DML_STATEMENT_MAPPING
- removed unused logger

* fixed tests

* format

* bigquery system profile e2e tests

* fixed module docstring

* - removed import_side_effects from redshift. we still use it in postgres for the orm conversion maps.
- removed leftover methods

* - tests for BaseSpec
- moved get_class_path to importer

* - moved constructors around to get rid of useless kwargs

* - changed test_system_metric

* - added lineage and usage to service_spec
- fixed postgres native lineage test

* add comments on collaborative constructors
2024-10-24 07:47:50 +02:00
Pere Miquel Brull
7012e73d75
GEN-1166 - Improve Ingestion Workflow Error Summary (#18280)
* GEN-1166 - Improve Ingestion Workflow Error Summary

* fix test

* docs

* comments
2024-10-16 18:15:50 +02:00
Imri Paran
68e71cb3dc
GEN-970: Refactor redshift system metrics to support freshness test (#17981)
* ref(profiler): redshift system metrics

- moved redshift system metrics to the redshift source module
- use Timestamp in data quality
- added plugin feature to test utils

* use timezone.utc

* format

* reverted unintended snowflake changes

* fixed import test_system_metrics.py

* revert

* fixed import in tests
2024-10-10 08:32:07 +02:00
Pere Miquel Brull
4cccaae446
GEN-996 - Allow PII Processor without storing Sample Data (#17927)
* GEN-996 - Allow PII Processor without storing Sample Data

* fix import

* fix import
2024-09-20 16:05:29 +02:00
k.nakagaki
3d8e30142c
Fixes 8428: make it possible to choose a sampling method type when creating profile ingestion for Snowflake (#17831)
* Add test for existing code

* Add sampling method at ingestion.

* add samplingMethodType into UI

* modify init method to use new parameter.

* create descriptions

* execute isort

* fix an unintended change.

* apply py_format

* close section

* specify init arguments

* fix bug

* apply py_format

---------

Co-authored-by: Teddy <teddy.crepineau@gmail.com>
2024-09-15 21:51:17 +02:00
Imri Paran
a3d6c1dd20
MINOR: tests(datalake): use minio (#17805)
* tests(datalake): use minio

1. use minio instead of moto for mimicking s3 behavior.
2. removed moto dependency as it is not compatible with aiobotocore (https://github.com/getmoto/moto/issues/7070#issuecomment-1828484982)

* - moved test_datalake_profiler_e2e.py to datalake/test_profiler
- use minio instead of moto

* fixed tests

* fixed tests

* removed default name for minio container
2024-09-12 07:13:01 +02:00
Teddy
e4c01c5702
fix: region typo in test (#17766) 2024-09-09 17:54:07 +05:30
Imri Paran
84be1a3162
Fix 17698: use resolution logic for snowflake system metrics profiler (#17699)
* fix(profiler): snowflake

resolve tables using the snowflake engine instead of OpenMetadata

* added env for cleaning up dbs in E2E

* moved system metric method to profiler. all the rest stays in snowflake

* format

* revert unnecessary changes

* removed test for previous resolution method

* use shutdown39
2024-09-06 07:25:10 +00:00
Ayush Shah
9880f06b2c
Fixes #17489: Allow non numeric numbers to be sent via Json, Replace NaN value… (#17490)
* fix: Allow non numeric numbers to be sent via Json, Replace NaN values with None in SQAProfilerInterface

Replace NaN values with None in the SQAProfilerInterface class to maintain database parity. NaN values will be cast to null in OpenMetadata. This change ensures that data handling processes account for this conversion.

* fix: histogram overflow error

* test: Add Unit Test for Null and Null Ratio Metric

* chore: Address comments

* chore: Address comments

* fix: checkstyle and message

* fix: failing tests as null count works as expected
2024-08-20 16:33:55 +05:30
Ayush Shah
5be5a05390
Fixes #17051: Dynamic import for Profiler Interface (#17073) 2024-07-19 17:33:19 +05:30
Ayush Shah
421d191bae
Fixes 16562: Modify HiveCompiler to compile column names properly (#16954)
* Modify HiveCompiler to compile column names properly
2024-07-09 12:59:23 +05:30
IceS2
f0049853ec
FIXES 14885: Initial deltalake implementation for s3 (#16665)
* Initial deltalake implementation for s3

* Fix styles

* Fix test_amundsen

* Fix UnitTests

* Fix Checkstyle

* Fix integration tests due to datalake client refactor

* Fix unit tests

* Fix tests

* Fix Integration DeltaLake Storage test

* Skip delta storage integration test for python 3.8

* DeltaLake JSONSchema changes migrations

* Update import name

* Add some comments based on sonarcloud suggestions

* Update DeltaLake documentation

* Resolve some comments
2024-06-20 12:08:21 +05:30
Mayur Singal
7359d6210c
MINOR: Fix Profiler for SSL Enabled Source (#16613) 2024-06-12 11:40:30 +05:30
Pere Miquel Brull
cb72a22b59
Fix - e2e tests for pydantic V2 (#16551)
* Fix - e2e tests for pydantic V2

* add correct default

* add correct default

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* revert datetime aware

* fix apis

* format
2024-06-06 19:36:17 -07:00
Pere Miquel Brull
d8e2187980
#15243 - Pydantic V2 & Airflow 2.9 (#16480)
* pydantic v2

* pydanticv2

* fix parser

* fix annotated

* fix model dumping

* mysql ingestion

* clean root models

* clean root models

* bump airflow

* bump airflow

* bump airflow

* optionals

* optionals

* optionals

* jdk

* airflow migrate

* fab provider

* fab provider

* fab provider

* some more fixes

* fixing tests and imports

* model_dump and model_validate

* model_dump and model_validate

* model_dump and model_validate

* union

* pylint

* pylint

* integration tests

* fix CostAnalysisReportData

* integration tests

* tests

* missing defaults

* missing defaults
2024-06-05 21:18:37 +02:00
Teddy
449a5f2de3
FIX #11951 - ingestion logic for global profiler config (#15948)
* feat: add global metric configuration for the profiler

* style: ran java linting

* fix: renamed disable to disabled

* style: ran java linting

* feat: ometa sdk for profiler setting

* test: ingestion profiler global config tests

* fix: update metric name to use MetricType Enum

* fix: allow bot to retrieve settings

* fix: exclude GX artifacts

* feat: implement global profiler setting logic for ingestion side

* fix: exclude metrics if Metric is empty

* style: ran python linting

* style: ran python linting

* fix: skip empty metrics

* style: ran python linting

* fix: moved GET profiler config to separate endpoint in system resource

* fix: moved compute metric filter to MetricFilter + renamed container

* fix: test failures

* fix: profiler test case
2024-04-22 22:35:37 +02:00
Ayush Shah
b79e5c064b
Fix 15576 - Eval Data Type issue fix (#15702) 2024-04-03 15:51:19 +05:30
Teddy
056e6368d0
Issue #14765 - Preparatory Work (#15312)
* refactor!: change partition metadata structure for table entities

* refactor!: updated json schema for TypeScript code gen

* chore: migration of partition for table entities

* style: python & java linting

* updated ui side change for table partitioned key

* minor fix

* addressing comments

* fixed ci error

---------

Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
2024-02-28 07:11:00 +01:00
Teddy
9a4a9df836
Fix #14895 - Get Metadata from Parquet Schema (#14956)
* linting: fix python linting

* fix: get column types from parquet schema for parquet files

* style: python linting

* fix: remove displayType check in test as it varies depending on OS
2024-02-01 09:02:52 +01:00
Pere Miquel Brull
db985fda57
MINOR - Snowflake system queries to work with ES & IDENTIFIER (#14864) 2024-01-26 18:41:16 +05:30
NiharDoshi99
2efa0c9e28
#13974 handle for hyphen in schema and median function (#14834) 2024-01-24 15:57:36 +05:30
Teddy
d228a93fbf
fix: increase floating point precision (#14827) 2024-01-24 09:19:19 +01:00
Ayush Shah
831fce5b7e
Fixes 10709: Add useFqnForFiltering to profiler workflow (#14717) 2024-01-18 18:52:43 +05:30
Teddy
61ef55290e
MINOR - generic profiler optimization for sampling and BQ (#14507)
* fix: limit sampling to specific column

* fix: handle bigquery struct columns

* fix: default partition to 1 DAY for BQ

* fix: default to __TABLES__ for BQ table metrics

* style: ran python linting

* style: fix linting

* fix: python style

* fix: set partition to DAY if not HOUR
2023-12-27 19:13:44 +01:00
Ayush Shah
ebc0a551e5
Fixes 12947: Add Support For DQ and Profiler in Databricks Unity Catalog (#14424) 2023-12-20 21:18:05 +05:30
Teddy
c7ac28f2c2
Fixes #11357 - Implement profiler custom metric processing (#14021)
* feat: add backend support for custom metrics

* feat: fix python test

* feat: support custom metrics computation

* feat: updated tests for custom metrics

* feat: added dl support for min max of datetime

* feat: added is safe query check for query sampler

* feat: added support for custom metric computation in dl

* feat: added explicit addProper for pydantic model import for Extra

* feat: added custom metric to returned obj

* feat: wrapped trino import in __init__

* feat: fix python linting

* feat: fix typing in 3.8
2023-11-17 17:51:39 +01:00
Teddy
f3da919329
Feat: Backend Support for Custom Metrics (#13965)
* feat: add backend support for custom metrics

* feat: fix python test
2023-11-17 19:16:35 +05:30
Mayur Singal
a8145a82fa
Fix #13603: Configurable Sample Data Rows for Profiler (#13807)
* Fix #13603: Configurable Sample Data Rows

* Fix #13603: Configurable Sample Data Rows for Profiler

* fix table config

* support configurable overwriting of sample data

* add support for schema and database profiler configuration

* chore(ui): put sampleDataStorageConfig under advanced config

* fix tests

* py format

* chore(ui): add sampleDataCount in table profiler config

* fix tests

* pylint & tests

* feat(ui): add profiler settings tab in database and database schema page

* chore(ui): show different inputs for profile sample type

* schema changes to make default storage config null

* add unit test

* schema changes to fix api

* update profiler setting schema

* move profiler settings to manage button

* sync locals

* fix(ui): unit tests

* fix tests

* py format

* fix lint

* minor improvements

* chore(ui): update profiler settings schema

* resolve review comments

* pytest

---------

Co-authored-by: Sachin Chaurasiya <sachinchaurasiyachotey87@gmail.com>
2023-11-09 18:49:42 +05:30
Teddy
10904049e4
fix: handle lower and upper case name (#13778) 2023-10-31 09:51:13 +01:00
Teddy
1cbdfb3ae7
Fixes #12601 - column filter for profiler workflow (#13535)
* fix: sample data ingestion to match entity profiler column setting

* fix: python linting

* fix: updated fn call

* fix: added logic to handle json field in datalake connector

* fix: handle NA values in parsing

* fix: reverted sampler changes from #13338

* fix: reverted metric changes from #13338

* fix: added datalake profiler ingestion test

* fix: python linting

* fix: removed normalization of json blob in NoSQL db
2023-10-12 14:51:38 +02:00
Ayush Shah
08d7ee6d55
Fixes #13052: Datalake Nested Columns Sample Data ingestion (#13338) 2023-10-08 20:08:51 +05:30
Ayush Shah
5fea08cd33
Datalake: Add manifest file support, fix profiler metrics, add array and json column type support (#13017) 2023-09-13 15:15:49 +05:30