3592 Commits

Author SHA1 Message Date
Teddy
e103a8c805
MINOR: Fix uppercase DBT to lowercase dbt (#23900)
* fix: uppercase DBT to lowercase dbt

* fix: change DBT to lowercase dbt in TestPlatform enum

* fix: fix dbt syntax in valueMax

---------

Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
2025-10-21 07:59:09 +02:00
Eugenio
ae1b3ce953
[DQaC] Simplified API (#23850)
* Extend `metadata.sdk.configure` function

* Create convenience classes for existing `TestDefinition`s

* Create `WorkflowConfigBuilder` for data quality

* Create `ResultCapturingProcessor` for data quality

This is so we can intercept results from `TestCaseRunner` and return results to the calling application

* Implement `TestRunner` interface to run test cases as code

* Add an example of the simplified API

Also, fix some static check errors in `builder_end_to_end.py`
2025-10-20 12:12:57 +00:00
Keshav Mohta
7ea87e7ca2
fix: table column description (#23928) 2025-10-20 09:59:23 +05:30
Keshav Mohta
e49d3ee31a
Fixes: protobuf version (#23878)
* fix: upgraded opentelemetry-exporter-otlp & google-cloud-secret-manager for protobuf

* deps: upgrade pandas, numpy, opentelemetry-exporter-otlp, & asammdf

* fix: revert numpy and asammdf versions

* deps: downgrade pandas to 2.0.3
2025-10-20 09:55:15 +05:30
Keshav Mohta
1afe32f0c1
deps: upgraded sqlalchemy-bigquery to 1.15.0 (#23909) 2025-10-20 09:52:45 +05:30
mmigdiso
64d468188e
Fixes 23881: Added native query lineage extraction for powerbi-databricks (#23882)
* Added native query lineage extraction for powerbi-databricks

* improved error handling and logging

* checkstyle fix

---------

Co-authored-by: m.migdisoglu <m.migdisoglu@criteo.com>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
2025-10-16 15:21:55 +05:30
Suman Maharana
63b663d884
Improve Tableau logging (#23892)
* Improve Tableau logging

* Addressed comments
2025-10-16 09:52:05 +05:30
sonika-shah
303ee47d6f
Add assets API and deprecate inline assets field for Domain and Dataproduct (#23856)
* Add assets API and deprecate inline assets field for Domain and Dataproduct

* fix mvn test

* fix py test and add new tests

* fix py test

* fix py test

* fix timeout for workflow test

* address pr feedback

* Update generated TypeScript types

* minor- remove unused function

---------

Co-authored-by: Bhanu Agrawal <bhanuagrawal2018@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-10-16 05:23:05 +05:30
Mayur Singal
3c527ca83b
MINOR: Fix Databricks DLT Pipeline Lineage to Track Table (#23888)
* MINOR: Fix Databricks DLT Pipeline Lineage to Track Table

* fix tests

* add support for s3 pipeline lineage as well
2025-10-15 10:54:01 +02:00
Akash Verma
9b16119ab5
feat: Add Hex dashboard connector support (#23246)
* feat: Add Hex dashboard connector support

* files

* Added tests and UI image

* fix tests

---------

Co-authored-by: Akash Verma <akashverma@Mac.lan>
2025-10-15 11:05:42 +05:30
Mohit Tilala
09c851265e
[Redshift] Add better handling of incomplete redshift view definition (#23866)
* Add better handling of incomplete redshift view definition

* Match exact definitions in tests

* Correct isort on tests
2025-10-14 12:51:07 +05:30
Keshav Mohta
50dbe6fe44
fix: view_names issue when incremental enabled (#23858) 2025-10-13 19:21:07 +05:30
Mayur Singal
a638bdcfe0
MINOR: Fix databricks pipeline repeating tasks issue (#23851) 2025-10-13 00:41:05 +05:30
Copilot
c8722faf47
Fix Grafana connector validation error for integer format fields (#23202)
* Initial plan

* Fix Grafana connector format field validation issue

- Update GrafanaTarget.format field to accept both str and int types
- Add field_validator to convert integer format codes to string equivalents
- Add comprehensive tests for format field validation scenarios
- Add test fixture with integer format fields that reproduces the original issue
- Ensure backwards compatibility with existing string format values

This resolves the issue where Grafana dashboards with integer format fields
(e.g., format: 0 instead of format: "table") were causing validation errors
and being skipped during ingestion.

Co-authored-by: ulixius9 <39544459+ulixius9@users.noreply.github.com>

* fix: GrafanaTarget model format type from str to Any

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ulixius9 <39544459+ulixius9@users.noreply.github.com>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
Co-authored-by: Keshav Mohta <keshavmohta09@gmail.com>
2025-10-12 23:14:16 +05:30
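
The coercion described in #23202 above can be sketched roughly as follows. This is a minimal illustration only: it uses a stdlib dataclass as a stand-in for the pydantic model and its `field_validator`, the names `Target` and `coerce_format` are hypothetical, and integers are simply stringified because the commit message does not give the actual mapping from integer codes to string formats.

```python
from dataclasses import dataclass
from typing import Any, Optional


def coerce_format(value: Any) -> Optional[str]:
    """Return the format as a string, accepting str, int, or None."""
    if value is None or isinstance(value, str):
        return value
    if isinstance(value, bool):
        # bool is a subclass of int; reject it rather than emit "True"/"False".
        raise TypeError("format must be a str or int, not bool")
    if isinstance(value, int):
        # Assumption: stringify the code; the real mapping is not in the commit.
        return str(value)
    raise TypeError(f"unsupported format type: {type(value).__name__}")


@dataclass
class Target:
    """Hypothetical stand-in for a dashboard target with a `format` field."""

    format: Any = None

    def __post_init__(self) -> None:
        # Normalise on construction so downstream code only ever sees str or None.
        self.format = coerce_format(self.format)
```

With this shape, `Target(format=0)` and `Target(format="table")` both construct without a validation error, which is the backwards-compatible behaviour the commit describes.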
harshsoni2024
c32a9b957f
Add AWS kinesis firehose connector [OSS] (#23807)
* AWS Firehose

* Add AWS Firehose

* add kinesis firehose support

* remove unnecessary doc

* Update generated TypeScript types

* add connection doc, optional msg service name

* Update generated TypeScript types

---------

Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Shah <ayush@getcollate.io>
2025-10-12 08:27:13 -07:00
Ayush Shah
d71a47db1d
fix(kafkaconnect): update table search method to use search_in_any_service (#23852) 2025-10-12 20:02:12 +05:30
Sriharsha Chintalapani
ce3a9bd654
Kafka connect improvements (#23845)
* Kafka Connect Lineage Improvements

* Remove specific Kafka topic example from docstring

Removed example from the documentation regarding the earnin.bank.dev topic.

* fix: update comment to reflect accurate example for database server name handling

* fix: improve expected FQN display in warning messages for missing Kafka topics

* fix: update table entity retrieval method in KafkaconnectSource

* fix: enhance lineage information checks and improve logging for missing configurations in KafkaconnectSource

* Kafka Connect Lineage Improvements

* address comments; work without the table.include.list

---------

Co-authored-by: Ayush Shah <ayush@getcollate.io>
2025-10-11 22:26:14 +02:00
Sriharsha Chintalapani
5c638f5c8e
Databricks DLT pipelines parsing (#23848) 2025-10-11 22:25:43 +02:00
Ayush Shah
a90cacc93b
MINOR: fix Kafka connect CDC lineage (#23836) 2025-10-11 15:40:03 +05:30
Teddy
1f8cf64dd4
chore: added python 3.12 to CI (#23835)
* chore: added python 3.12 to CI

* chore: changed py-test-skip to 3.12
2025-10-10 17:26:45 +02:00
Teddy
93e5ee8cb1
fix: url encode fqn when retrieving test case results in python sdk (#23834) 2025-10-10 17:25:33 +02:00
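
A minimal sketch of the kind of encoding #23834 refers to, assuming the FQN is placed in a URL path segment. The helper names and the route are hypothetical and are not taken from the SDK; only `urllib.parse.quote` is a real call.

```python
from urllib.parse import quote


def encode_fqn(fqn: str) -> str:
    """Percent-encode an FQN so it fits in a single URL path segment."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return quote(fqn, safe="")


def results_path(fqn: str) -> str:
    """Build a request path for a test case's results."""
    # Hypothetical route, for illustration only.
    return f"/testCases/{encode_fqn(fqn)}/results"
```

Without the encoding, an FQN containing `/` or a space would be split into extra path segments or rejected by the server.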
Sriharsha Chintalapani
76020bd0e7
Fix Kafka Connect for lineage parsing (#23819)
* Fix Kafka Connect for lineage parsing

* Fix Kafka Connect for lineage parsing
2025-10-09 14:01:36 -07:00
Mayur Singal
88115e1218
MINOR: Fix trailing / issue in UC S3 lineage (#23816) 2025-10-09 18:44:07 +02:00
Antoine Balliet
be3a91f7df
fix: logger level should work for deprecation warnings (#23784)
* chore: implement logger levels tests for deprecation

* fix: use METADATA_LOGGER instead of warnings

* use unit test syntax

* isort

* black

* fix test

---------

Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
2025-10-09 18:21:28 +02:00
Mayur Singal
05f064787f
Feat: Add kafka lineage support in databricks pipelines (#23813)
* Add dlt pipeline support

* Fix code style

* Add variable parsing

* Fix kafka lineage

---------

Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
2025-10-09 16:42:08 +02:00
Sriharsha Chintalapani
454d7367b0
Kafka Connect: Support Confluent Cloud connectors (#23780) 2025-10-09 01:28:27 +05:30
Mohit Tilala
da8c50d2a0
Add pagination for snowflake usage and lineage queries sql (#23781)
* Add pagination for snowflake usage and lineage queries sql

* py_format
2025-10-08 20:45:14 +05:30
Mayur Singal
4708c2b64f
feat: Unity Catalog Lineage Enhancement: External Location Support (#23790) 2025-10-08 20:26:39 +05:30
harshsoni2024
f2819ce4e4
Fix: PowerBI snowflake query lineage parsing (#23746) 2025-10-08 18:32:25 +05:30
Mohit Tilala
61e4c1ffba
Pin pydantic to <2.12.0 (#23782)
* Bump datamodel-code-generator to 0.34.0

* Pin down pydantic to <2.12

* Revert "Bump datamodel-code-generator to 0.34.0"

This reverts commit c69116d2935eea49e9c78b2607f2fea94bc44738.
2025-10-08 13:24:27 +05:30
Eugenio
af0672e4cf
Fixes #22302: add table2.keyColumns parameter for table diff validation (#23667)
* Update `TableDiffParamsSetter` to move data at table level

This means that `key_columns` and `extra_columns` will be defined per table instead of "globally", just like `data_diff` expects

* Update `TableDiffValidator` to use table's `key_columns`

Call `data_diff` and run validations using each table's `key_columns`

* Create migration to update `tableDiff` test definition

* Fix Playwright test
2025-10-08 09:32:00 +02:00
Eugenio
a6ac42371d
Ensure recognizers are created (#23645)
* Add the migration classes and data for recognizers

This is so that we can run a migration that sets `json->recognizers` of `PII.Sensitive` and `PII.NonSensitive` tags from json values.

The issue with normal migrations was that the value of recognizers was too long to be persisted in the server migrations log.

Created a common `migration.utils.v1110.MigrationProcessBase`

* Ensure building automatically with the right parameters

* Update typescript types
2025-10-07 15:13:35 +00:00
Eugenio
47e953f9d3
PLAYWRIGHT FIXES: ensure sample data is passed to the right columns (#23761)
* Ensure we take columns ordered from the sampler

This is to avoid analyzing columns with data from other columns

* Remove expectation of address to have Sensitive tag

This is for a couple of reasons:
- First: per our internal definition it should actually be Non Sensitive.
- Second: presidio actually picks SOME of them up as PERSON (Sensitive) entities, but since we've raised the tolerance, now we're not classifying them as Sensitive.
2025-10-07 09:39:24 +02:00
harshsoni2024
9ba65ac0d2
Fix: Add support for datamodel source url (#23715) 2025-10-06 20:04:43 +00:00
Mohit Tilala
0cf0394d0b
Fixes #22406: Add workflow resource utilisation metrics for better troubleshooting (#23696)
* Add workflow resource utilization metrics for better troubleshooting

* Add types for correct static type checking

* Remove duplicate type annotations
2025-10-06 13:20:06 +05:30
harshsoni2024
da7a2778f6
MINOR: iceberg load table retry backoff (#23579) 2025-10-05 23:42:56 +05:30
Sriharsha Chintalapani
fc7412f6dd
Add Timescale Connector (#23665)
* Add Timescale Connector

* Update generated TypeScript types

* Add UI changes for the Timescale

* lineage, usage and java

* Add beta tag

* update logo

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com>
Co-authored-by: Akash Verma <akashverma@Mac.lan>
2025-10-03 19:00:59 -07:00
Mohit Tilala
b15dc8fe42
Add better handling of no columns found/permission issue exceptions (#23695) 2025-10-03 21:07:16 +05:30
Keshav Mohta
3d49b6689d
Fixes #23356: Databricks & UnityCatalog OAuth and Azure AD Auth (#23561)
* feat: databricks oauth and azure ad auth setup

* refactor: add auth type changes in databricks.md

* fix: test after oauth changes

* refactor: unity catalog connection to databricks connection code

* feat: added oauth and azure ad for unity catalog

* fix: unitycatalog tests, doc & required type in connection.json

* fix: generated tx files

* fix: exporter databricksConnection file

* refactor: unitycatalog example file

* fix: usage example files

* fix: unity catalog sqlalchemy connection

* fix: unity catalog client headers

* refactor: make common auth.py for dbx and unitycatalog

* fix: auth functions import

* fix: test unity catalog tags as None

* fix: type hinting and sql migration

* fix: migration for postgres
2025-10-03 19:53:19 +05:30
harshsoni2024
ea54b6b883
MINOR: datalake column subfields fix (#23576) 2025-10-03 16:13:10 +05:30
Akash Verma
06453a925d
Fix #21093 : Update test connection improvements (#23516)
* Update test connection improvements

* Update queries

* checkstyle

* fix test failure

---------

Co-authored-by: Akash Verma <akashverma@Akashs-MacBook-Pro-2.local>
2025-10-03 13:50:46 +05:30
Akash Verma
5bb2924a6a
Fix #16081 : Add support for SQL Server hierarchyid, geography, and geometry types (#23527) 2025-10-03 11:46:01 +05:30
Akash Verma
4d68fe7a10
feat: Add ML model lineage support (#23494) 2025-10-03 11:38:41 +05:30
Suman Maharana
c8055576ba
Fixes #21686 : Add missing includeOwners check in dashboard services (#22514) 2025-10-03 10:53:25 +05:30
Keshav Mohta
48ff77c917
Fixes: MF4 Import Error (#23659)
* fix: asammdf and avro import error

* fix: mf4 import only

* test: fix mf4 test
2025-10-01 20:08:45 +05:30
Eugenio
5da2d32b34
Use recognizer in classification (#23628)
* Refactor presidio utils

Extract the spacy model functionality from the analyzer building function

* Added a new `TagClassifier`

This classifier uses tags to dynamically build presidio `RecognizerRegistry`s

* Added a new `TagProcessor`

This processor uses `TagClassifier` to label a column based on the tags' recognizers

* Create `TagProcessor` based on workflow configuration

* Create decorator to apply threshold to recognizers

This is so that we can apply thresholds on recognizer results without subclassing or having to keep a map between the presidio recognizer and the recognizer configuration

* Fix broken test
2025-10-01 14:43:28 +02:00
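
The threshold decorator mentioned in #23628 above might look roughly like this. It is a sketch under stated assumptions: `Result`, `with_threshold` and the toy `analyze` function are hypothetical and do not use the presidio API. The point is only that wrapping the analyze call avoids subclassing each recognizer or keeping a separate recognizer-to-configuration map.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List


@dataclass
class Result:
    """Hypothetical stand-in for a recognizer result."""

    entity_type: str
    score: float


def with_threshold(threshold: float) -> Callable:
    """Decorate an analyze-style function so low-scoring results are dropped."""

    def decorator(analyze: Callable[..., List[Result]]) -> Callable[..., List[Result]]:
        @wraps(analyze)
        def wrapper(*args, **kwargs) -> List[Result]:
            # Keep only results whose score meets the configured threshold.
            return [r for r in analyze(*args, **kwargs) if r.score >= threshold]

        return wrapper

    return decorator


@with_threshold(0.6)
def analyze(text: str) -> List[Result]:
    # Toy recognizer: returns fixed results regardless of input.
    return [Result("EMAIL", 0.9), Result("PERSON", 0.4)]
```

Calling `analyze("...")` here returns only the `EMAIL` result, since the `PERSON` score falls below the threshold carried by the decorator.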
Eugenio
dff2b394d5
Fix classification scoring (#23523)
* Add `reason` property to `TagLabel`

This is to understand what score was used for selecting the entity

* Build `TagLabel`s with `reason`

* Increase `PIIProcessor._tolerance`

This is so we correctly filter out low scores from classifiers while still maintaining the normalization that filters out confusing outcomes.

e.g., an output with scores 0.3, 0.7 and 0.75 would first have the 0.3 filtered out, and the other two would then be discarded because both are relatively high results.

* Make database and DAO changes needed to persist `TagLabel.reason`

* Update generated TypeScript types

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-10-01 12:11:14 +00:00
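
The two-step filtering described in #23523 above can be illustrated with a small sketch. This is not the project's actual algorithm: `pick_label`, `tolerance` and `min_share` are hypothetical names, and the rule "accept the top label only if its normalised share is large enough" is an assumption chosen to reproduce the 0.3 / 0.7 / 0.75 example from the commit message.

```python
from typing import Dict, Optional


def pick_label(
    scores: Dict[str, float], tolerance: float, min_share: float
) -> Optional[str]:
    """Return the winning label, or None when the outcome is ambiguous."""
    # Step 1: drop scores below the tolerance.
    kept = {label: s for label, s in scores.items() if s >= tolerance}
    total = sum(kept.values())
    if total <= 0:
        return None
    # Step 2: normalise the survivors so their shares sum to 1.
    shares = {label: s / total for label, s in kept.items()}
    # Step 3: accept the top label only if it clearly dominates (assumed rule).
    best = max(shares, key=shares.get)
    return best if shares[best] >= min_share else None
```

With `tolerance=0.5` and `min_share=0.6`, scores of 0.3, 0.7 and 0.75 yield no label: the 0.3 is filtered first, and the remaining two split the normalised total almost evenly, so neither dominates.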
Keshav Mohta
6b7262a8ea
Feature: MF4 File Reader (#23308)
* feat: mf4 file reader

* refactor: removed schema_from_data implementation

* test: added tests for mf4 files
2025-10-01 11:19:00 +02:00
Pere Miquel Brull
375e001dd9
MINOR - Fix S3 logging from ingestion pipelines (#23590)
* MINOR - Fix S3 logging from ingestion pipelines

* Update generated TypeScript types

* config

* update s3 configurations for streamable logs

* Update generated TypeScript types

* update s3 configurations for streamable logs

* update s3 configurations for streamable logs

* update s3 configurations for streamable logs

* SSE off by default

* Update log retrieval to use s3 if ingestion runner has streamable logs enabled

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Pablo Takara <pjt1991@gmail.com>
2025-10-01 09:44:17 +02:00
Ayush Shah
dd99ab5678
feat: Add Unity Catalog data diff module to use DBX connection instead of workspaceclient (#23404) 2025-09-30 20:56:54 +05:30