Compare commits

875 Commits

Author SHA1 Message Date
Harshal Sheth
308c6bd28e
docs(airflow): follow-ups for dropping airflow <2.7 (#13619) (#13946) 2025-07-04 17:20:33 +02:00
Sergio Gómez Villamor
2388d770dd
docs(dbt): incremental-lineage (#13959) 2025-07-04 11:16:00 +01:00
Andrew Sikowitz
416c2093d1
fix(ui/navBar): Fix logic to display manage tags link (#13564)
Co-authored-by: v-tarasevich-blitz-brain <v.tarasevich@blitz-brain.com>
2025-07-03 13:53:02 -07:00
Saketh Varma
ae234d671e
fix(ui): reset page in embedded components (#13949) 2025-07-03 16:35:02 -04:00
Michael Maltese
9d79914295
fix(graph/client): use fixed GMS URL consistently (#13945) 2025-07-03 11:57:11 -06:00
Aseem Bansal
ef3446f066
feat(cli): add kafka helper, improve restore indices helper (#13951) 2025-07-03 20:07:25 +05:30
Jay
8352617375
feat(docs): v0.3.12.2 release notes (#13932) 2025-07-03 10:10:53 -04:00
Aseem Bansal
a7c5895d98
feat(ingest): add aspects by subtype in report, telemetry (#13921) 2025-07-03 17:07:39 +05:30
Sergio Gómez Villamor
1561a6c8ca
feat(hex): add retry logic with exponential backoff for 429 rate limiting (#13905)
Co-authored-by: Claude <noreply@anthropic.com>
2025-07-03 11:12:31 +02:00
Tamas Nemeth
eaf2bf6dec
feat(ingest/kafka-connect): Add more connectors to the regexp transformation support (#13748) 2025-07-03 08:57:50 +02:00
Chris Collins
69ac3147d4
feat(templates/modules) Ingest default page templates and modules (#13925) 2025-07-02 18:11:09 -04:00
v-tarasevich-blitz-brain
b64bf9bd36
feat(customHomePage): add asset item component (#13936)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
2025-07-02 17:42:58 -04:00
david-leifker
e7463ac1f4
fix(token-service): extend validation for actor (#13947) 2025-07-02 16:27:42 -05:00
Chris Collins
5292a268c9
feat(graphql) Add GraphQL Objects, Type classes, and Mappers for Templates and Modules (#13923) 2025-07-02 16:44:38 -04:00
Anthony Burdi
1c72cb3a79
feat(docs): Instructions for bulk creation using new assertions SDK (#13924) 2025-07-02 16:13:48 -04:00
david-leifker
9b26c8d598
feat(secret): increase secret encryption strength (#13942) 2025-07-02 14:34:10 -05:00
Saketh Varma
fa7b4132d2
fix(ui): Fix non-functional page size in tags page (#13891) 2025-07-02 15:33:56 -04:00
Jay
1345e63977
fix(docs): hide unsupported assertion API feature docs (#13943) 2025-07-02 15:31:05 -04:00
Aseem Bansal
661a4ae9f1
fix(ingest): mypy lint python 3.8 (#13939)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-07-02 12:14:39 -07:00
RyanHolstien
87c8378b12
fix(smoke): prevent audit events smoke flake (#13944) 2025-07-02 13:00:17 -05:00
hkr
77a5671256
fix(airflow): set minimum supported version to 2.7.0 (#13619)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-07-02 09:54:15 -07:00
v-tarasevich-blitz-brain
cfadc80843
feat(customHomePage): add generic module component (#13922) 2025-07-02 11:46:37 -04:00
Harshal Sheth
68130a33e4
chore(actions): bump acryl-executor (#13930) 2025-07-02 08:37:34 -07:00
Anna Everhart
0a82d4a5f9
update hover color in selects (#13931) 2025-07-02 08:14:33 -07:00
Aseem Bansal
7b7038d8d3
refactor(ingest): centralise subtype strings (#13935) 2025-07-02 19:06:49 +05:30
Hyejin Yoon
fedbfa3f7e
feat: update fivetran connector with new sdk (#13859) 2025-07-02 21:47:02 +09:00
Tamas Nemeth
ecc24da0fa
fix(ingest/bigquery): Emit dataset profile when table does not have rows (#13919) 2025-07-02 13:53:12 +02:00
Michael Maltese
f7aa9ba1c2
fix(ingest/unity): don't crash when processing Platform Resources hits an error (#13877) 2025-07-02 08:03:11 +02:00
gabriel-morais-rokos
242fc1e50d
feat(ingestion): add patch structured properties to data product (#13813)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-07-02 08:29:10 +09:00
Amanda Ng
3867c1cbce
fix(login): show login errors provided in response (#10871)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-07-02 08:28:47 +09:00
Chris Collins
7457319c0a
feat(models) Add data models for Custom Home page project - Templates and Modules (#13911) 2025-07-01 19:22:23 -04:00
RyanHolstien
1c34f96b6d
feat(auditEvents): add in top level delete policy event by default (#13928) 2025-07-01 17:46:25 -05:00
Jay
aa6e658cda
feat(docs): clearer slack troubleshoot guide (#13926) 2025-07-01 17:10:13 -04:00
sleeperdeep
70a39b70f2
fix(ingest): support ownership types in AddDatasetOwnership transformer (#13081) 2025-07-01 11:34:29 -07:00
Harshal Sheth
ee9a3ea0a8
docs(slack): add docs on slackbot permissions (#13908) 2025-07-01 09:33:58 -07:00
Harshal Sheth
bc03e9452d
docs(mcp): add troubleshooting section (#13893) 2025-07-01 09:33:47 -07:00
Aseem Bansal
92784ec3a4
feat(ingest/lineage): generate static json lineage file (#13906) 2025-07-01 20:51:18 +05:30
Hyejin Yoon
6b1817902d
docs: archive 1.0.0 versioned docs (#13912) 2025-07-01 08:15:02 -07:00
Alex Haynes
fd2f0b15d1
chore: Fix typo (#13910) 2025-07-01 23:18:06 +09:00
Aseem Bansal
1dab349517
feat(ingest): add source aspect number in telemetry (#13914) 2025-07-01 19:08:23 +05:30
Sergio Gómez Villamor
96bb33bed6
chore: aggregator_generate_timer for snowflake (#13913) 2025-07-01 11:16:28 +02:00
Benjamin Maquet
34b340e3b9
feat(preset): add preset to the list of platforms (#13896) 2025-07-01 10:33:56 +02:00
david-leifker
2d2c3754d9
fix(openapi): fix example (#13899) 2025-06-30 19:08:26 -05:00
Kevin Karch
85e6b7eb6f
docs: fix api-gateway version in release notes (#13909) 2025-06-30 17:50:11 -04:00
Harshal Sheth
3583aa1eff
docs(airflow): document background operation (#13907) 2025-06-30 14:35:19 -07:00
v-tarasevich-blitz-brain
31ee414008
feat(homePageRedesign): add header (#13904) 2025-06-30 15:34:59 -04:00
purnimagarg1
73cb3621e0
feat(ui/homepage): render skeleton of the new homepage (#13886) 2025-06-30 12:49:29 -04:00
v-tarasevich-blitz-brain
bcfa3e08b5
fix(searchBarV2): do not clean filters on the search hit with empty filters (#13706) 2025-06-30 12:48:56 -04:00
v-tarasevich-blitz-brain
aa186c2887
feat(ingestion): backend changes from saas (#13796) 2025-06-30 12:47:42 -04:00
v-tarasevich-blitz-brain
f13fb828d7
fix(ingestion): fix group avatar (changes from SaaS) (#13772) 2025-06-30 12:36:54 -04:00
purnimagarg1
e5ef9a0f3a
feat(homepage): create new feature flag for homepage redesign (#13882) 2025-06-30 11:58:55 -04:00
RyanHolstien
aeef5d6870
docs(v.0.3.12.1): adding maintenance release docs (#13892) 2025-06-30 09:55:33 -05:00
Aseem Bansal
d567c5d4bb
tests(doc): add tests for doc gen (#13903)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-06-30 20:16:56 +05:30
Benjamin Maquet
bd9a3f5e2a
docs: add missing preset logo (#13897) 2025-06-30 15:02:40 +02:00
skrydal
e8b5a60d1d
fix(cli): explicit default field value for optional field (pydantic v2) (#13901) 2025-06-30 14:16:42 +02:00
Aseem Bansal
7345af898d
feat(ingest): generate capability summary (#13881) 2025-06-30 15:16:08 +05:30
Aseem Bansal
03309b7ffa
feat(mock-data-source): add first seen urn in report (#13889) 2025-06-30 15:15:50 +05:30
Deepak Garg
856da011c8
refactor(ingestion): make ingestion scheduler properties configurable (#13887) 2025-06-28 21:37:33 -05:00
david-leifker
55cfb95cad
fix(search): additional search size defaults (#13888) 2025-06-27 16:43:37 -05:00
Harshal Sheth
05d029d690
feat(ingest/snowflake): add extra_info for snowflake (#13539) 2025-06-27 12:23:28 -07:00
david-leifker
ddb4e17772
feat(openapi): entity registry api (#13878) 2025-06-27 13:32:29 -05:00
rahul MALAWADKAR
b162d6f365
chore(deps): fix (org.glassfish:jakarta.json) (#13269) 2025-06-27 11:17:26 -07:00
sleeperdeep
f3c8bf9cb4
fix(ingest/sql_server): switch to engine inspector instead of connection (#13104) 2025-06-27 11:15:33 -07:00
Aseem Bansal
54db272c4d
doc(cloud): make datahub cloud docs prominent in side bar (#13860) 2025-06-27 22:56:06 +05:30
Harshal Sheth
da54ea6fbd
fix(docs): add recommended versions for v0.3.12 (#13863) 2025-06-27 10:08:14 -07:00
dependabot[bot]
e1477ee48b
build(deps): bump brace-expansion from 1.1.11 to 1.1.12 in /datahub-web-react (#13770)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-27 11:20:55 -05:00
Aseem Bansal
8c1aaaf02f
doc(ingest/azuread): remove outdated information (#13885) 2025-06-27 19:40:52 +05:30
Kevin Karch
d05791a4e6
fix(docs): revise slack bot troubleshoot instructions (#13884) 2025-06-27 09:19:15 -04:00
Aseem Bansal
c19b91d9a6
doc(cli): update for restore-indices (#13880)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-06-27 18:07:09 +05:30
Aseem Bansal
5759711992
fix(ingest/rest): out-of-date structured report being sent (#13866)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-06-27 15:07:28 +05:30
Chakru
aac98c7d12
fix(quickstart): use kafka version that supports zookeeper (#13879) 2025-06-27 12:58:59 +05:30
Hyejin Yoon
827a2308cd
docs: add sdk entity guides (#13870) 2025-06-27 15:28:20 +09:00
RyanHolstien
a3688f78e7
fix(docs): update patch support documentation (#13093) 2025-06-26 16:01:02 -05:00
RyanHolstien
6a74f8f12f
feat(fabrictype): add sbx fabrictype (#13875) 2025-06-26 16:00:42 -05:00
Shirshanka Das
820a449b2a
docs(ingestion): preset - update source certification status (#13641) 2025-06-26 11:29:20 -07:00
Sergio Gómez Villamor
7ed7dd25c5
feat(docs): add subscription client docs to assertions tutorial (#13872) 2025-06-26 18:24:01 +02:00
RyanHolstien
52e49eb79b
fix(emitter): fix emitter handling of unicode characters (#13867) 2025-06-26 10:43:55 -05:00
Chakru
933e75f5e1
fix(quickstart): restore kafka version (#13873) 2025-06-26 21:08:04 +05:30
Sergio Gómez Villamor
9a32dd7f7f
feat(dremio): add configurable time range for query lineage extraction, sql aggregator report and fix schema_pattern filtering (#13613)
Co-authored-by: Claude <noreply@anthropic.com>
2025-06-26 15:28:42 +02:00
Chakru
02fa0ac77f
fix(quickstart): use latest zookeeper version (#13871) 2025-06-26 07:37:50 -05:00
Jay
b28a65bdf2
fix(docs-site): updated weekly demo link (#13868) 2025-06-25 23:32:55 -07:00
david-leifker
79cdf78339
feat(): lineage registry via openapi (#13865) 2025-06-25 18:37:03 -05:00
Michael Maltese
db0873058d
ci: add ligfx to team in pr-labeler.yml (#13864) 2025-06-25 18:31:26 -04:00
Michael Maltese
0f0119f219
feat(ingestion): use approx_distinct when profiling Athena and Trino (#13671) 2025-06-25 16:29:26 -04:00
david-leifker
677182daf7
feat(operations): add es raw operations endpoints (#13855) 2025-06-25 12:30:41 -05:00
Chakru
83b9eca358
fix(quickstart): limit aws_endpoint_url override only for localstack (#13861) 2025-06-25 21:56:32 +05:30
Aseem Bansal
468e62b8cc
deprecate(ingest): match_fully_qualified_names for redshift,bigquery (#13858) 2025-06-25 16:25:29 +05:30
Aseem Bansal
a9e9ac9808
fix(ingest/dremio): fix report, mark usage stats capability (#13851) 2025-06-25 15:09:52 +05:30
Aseem Bansal
3b44ed847c
doc(ingest): mark for usage (#13850) 2025-06-25 14:48:28 +05:30
Aseem Bansal
e9f208f514
docs(ingest): update integrations page (#13846) 2025-06-25 14:35:57 +05:30
Aseem Bansal
40452f7c54
docs(ingest): docs for lineage (#13847) 2025-06-25 14:35:39 +05:30
Harshal Sheth
d58f41c78b
docs(release): add MCP server note to release notes (#13853) 2025-06-24 14:22:43 -07:00
Aseem Bansal
f554679be4
fix(ui/ingest): ingestion run report (#13838) 2025-06-24 19:19:05 +05:30
Aseem Bansal
9fe319bc4d
fix(ingest): add fineGrainedLineages as a special case for aspects (#13844) 2025-06-24 16:31:38 +05:30
RyanHolstien
6c067cd997
feat(fabricType): add new supported fabric types (#13842) 2025-06-24 10:29:27 +05:30
Aseem Bansal
19440271df
log(nullpointer): when elastic doc has missing urn (#13839) 2025-06-24 10:05:01 +05:30
Harshal Sheth
e4c5cbdbba
docs: add note about MCP transport types (#13822) 2025-06-23 15:31:29 -07:00
Harshal Sheth
9bfcdb2a3d
docs: add docs on DataHub slack bot (#13834) 2025-06-23 15:31:18 -07:00
RyanHolstien
a17dbf4849
feat(policies): support policy privilege constraints and ingestion aspect validators (#13819)
Co-authored-by: Saketh Varma <sakethvarma397@gmail.com>
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: David Leifker <david.leifker@acryl.io>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-06-23 16:22:23 -05:00
Aseem Bansal
582164401e
fix(ui/ingest): mark more sources as supporting test connection (#13836) 2025-06-23 21:43:59 +05:30
Aseem Bansal
d08a293225
doc(ingest): mark test connection capability (#13837) 2025-06-23 20:08:32 +05:30
Chakru
1964bcf4d1
docs(iceberg): added notes on session expiry config, some minor updates (#13832) 2025-06-23 18:01:15 +05:30
Chakru
a8c5202993
telemetry: add support for unified mixpanel + kafka tracking in GMS (#13795)
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
2025-06-23 17:01:38 +05:30
Harshal Sheth
32039dc6a1
docs: add cloud v0.3.12 release notes (#13784)
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-06-22 15:23:25 -07:00
david-leifker
e9103f1851
feat(openapi-31): properly update openapi spec to 3.1.0 (#13828) 2025-06-21 15:41:12 -05:00
Chris Collins
889441217b
feat(docs) Update managed ingestion docs with new UI (#13799)
Co-authored-by: Maggie Hays <maggiem.hays@gmail.com>
2025-06-20 16:28:51 -05:00
Harshal Sheth
8bc136d350
docs: add screenshots for AI documentation (#13821) 2025-06-20 09:37:50 -07:00
Aseem Bansal
f8c6db07d8
ingest(snowflake): remove email_as_user_identifier support (#13827) 2025-06-20 19:38:16 +05:30
Aseem Bansal
b3a25d6fbd
fix(ingest/bigquery): use email as user urn (#13831) 2025-06-20 18:45:41 +05:30
Aseem Bansal
dbcbca9a38
feat(doc): hide repeating allow/deny things in config tables (#13826) 2025-06-20 15:04:05 +05:30
Aseem Bansal
614e627720
feat(cli): --streaming-batch option delete large hierarchy (#13824) 2025-06-19 17:53:56 +05:30
Aseem Bansal
135e905e7c
doc(mysql,kafka): fix support status (#13812)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-06-19 16:51:14 +05:30
Harshal Sheth
4aa3a928c0
feat(cli): add restore-indices CLI command (#13820) 2025-06-19 16:33:47 +05:30
Aseem Bansal
85b29c9361
feat(docs): add showing specific fields to docs of specific connectors (#13810) 2025-06-19 15:08:11 +05:30
Sergio Gómez Villamor
b12b9aa919
fix(sql-parsing): catch pyo3_runtime.PanicException instead of BaseException (#13806) 2025-06-19 08:58:28 +02:00
Maggie Hays
d4123c8fa9
docs(Forms) Update Form Creation guide with notification details (#13786) 2025-06-18 22:41:43 -05:00
Saketh Varma
bb1a593e2a
fix(docs): Update proposals docs (#13800)
Co-authored-by: John Joyce <john@acryl.io>
2025-06-18 14:49:57 -06:00
Kevin Karch
ee7adcc1b4
fix(docs): remove ref to github_info (#13816) 2025-06-18 15:25:40 -04:00
david-leifker
70a6135b6b
fix(reindex): fix cast exception during reindex (#13815) 2025-06-18 13:55:12 -05:00
Kevin Karch
f4f7b7ade3
fix(docs): add missing privileges to docs (#13805) 2025-06-18 08:13:25 -04:00
Shirshanka Das
e827fdfebc
docs(actions): schema registry configuration tips (#13789) 2025-06-17 18:30:02 -07:00
Gabe Lyons
efa8d7dc27
fix(snowflake summary): fixing snowflake summary source (#13785) 2025-06-17 19:17:09 -04:00
RyanHolstien
f67481fcc8
fix(metadata_change_sync): fix unicode handling (#13804) 2025-06-17 15:12:46 -05:00
Jay
49f3a1e24c
feat(docs): smart assertions and dataset health dashboard docs (#13782)
Co-authored-by: Anthony Burdi <anthony.burdi@acryl.io>
2025-06-17 11:31:17 -04:00
Deepak Garg
d3c2a36ded
refactor(neo4j): parameterized neo4j Queries execution to escape special characters in urn (#13793) 2025-06-17 08:54:34 -05:00
Sergio Gómez Villamor
edd3324553
fix(sql-parsing): handle pyo3_runtime.PanicException from SQLGlot Rust tokenizer (#13758)
Co-authored-by: Claude <noreply@anthropic.com>
2025-06-17 11:45:16 +02:00
Harshal Sheth
0c24b94612
docs: add MCP server guide (#13779) 2025-06-16 21:59:35 -07:00
Anthony Burdi
eb9de90ecc
fix(sdk): Move resolver client import into resolve method (#13780) 2025-06-16 10:26:47 -04:00
Harshal Sheth
2651bb1e76
fix(sdk): fix typos + improve links in sdk docs (#13777) 2025-06-16 01:38:46 -04:00
david-leifker
cbd97186e2
feat(kafka): bump confluent kafka (#13767) 2025-06-15 10:52:16 -05:00
Gabe Lyons
4924602773
fix(embedded search): making embedded search list use smart defaults for types (#13778) 2025-06-14 09:53:28 -04:00
Tamas Nemeth
6ec7b9292c
doc(unity-catalog): Add doc for Databricks metadata sync (#13760)
Co-authored-by: John Joyce <john@acryl.io>
2025-06-14 10:42:32 +02:00
david-leifker
e03149ba03
fix(ownershipOwnerTypes): fix missing features in hook (#13775) 2025-06-13 16:38:58 -05:00
david-leifker
bf833c0fc7
feat(ci): restrict workflow runs for publish jars (#13776) 2025-06-13 15:23:51 -05:00
david-leifker
65a8206605
fix(dataschema): fix for reflections upgrade (#13774) 2025-06-13 14:26:47 -05:00
Saketh Varma
9add18eff5
fix(doc): Update proposals documentation (#13769) 2025-06-13 12:57:16 -06:00
rahul MALAWADKAR
3c94cbc0b4
chore(deps): fix (org.reflections:reflections) (#13298) 2025-06-13 11:23:27 -05:00
mihai103
c39e150e29
Handle Cyclic References in AVRO Schema Conversion (#13608)
Co-authored-by: Mihai Ciocirdel <mihai.ciocirdel@swisscom.com>
2025-06-13 11:21:23 -05:00
leaderofrogue
5404ee9b39
Update README.md (#13668) 2025-06-13 11:20:01 -05:00
rahul MALAWADKAR
faae9a6b1f
chore(deps): fix (com.google.code.gson:gson) (#13672) 2025-06-13 11:19:02 -05:00
dependabot[bot]
b27bd43fbb
build(deps): bump aquasecurity/trivy-action from 0.30.0 to 0.31.0 (#13719)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-13 11:17:42 -05:00
Chakru
e08b997797
fix(irc): map conflict exception per iceberg spec (#5960) (#13751) 2025-06-13 11:16:44 -05:00
Shuixi Li
28d58e8973
fix(ingest/preset): url for view in source from edit view to explore view (#13666) 2025-06-13 15:20:58 +05:30
Aseem Bansal
47eee11257
feat(ui): add copy urn tag and structured properties (#13766) 2025-06-13 13:21:19 +05:30
Hyejin Yoon
e24ef15e77
docs: use sphinx-markdown-builder for sdk doc generation (#13721) 2025-06-13 16:50:19 +09:00
v-tarasevich-blitz-brain
f3c49c3174
fix(ingestion): replace nonexistent icon (#13750) 2025-06-12 19:36:46 -04:00
v-tarasevich-blitz-brain
782b3e531a
fix(ingestion): UI fixes (#13765) 2025-06-12 19:35:55 -04:00
david-leifker
b90837ded9
chore(): bump postgresql lib (#13764) 2025-06-12 12:14:07 -05:00
Hyejin Yoon
c879836ea6
docs: lineage client SDK guide (#13700) 2025-06-13 00:57:16 +09:00
dependabot[bot]
a53e62f701
build(deps): bump brace-expansion from 1.1.11 to 1.1.12 in /smoke-test/tests/cypress (#13756)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-12 10:40:47 -05:00
Aseem Bansal
bc1c9bdaec
feat(ingest): add debug source (#13759) 2025-06-12 18:41:22 +05:30
readl1
6bf91d1563
fix(ingestion/nifi): Incorrect SSL context usage for client certificate auth (#13531)
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
2025-06-12 11:53:13 +05:30
Hyejin Yoon
d423c9bb59
docs: add search sdk guide (#13682)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-06-12 14:14:23 +09:00
Hyejin Yoon
45e5ef84be
docs: update the example scripts with the new sdk (#13717) 2025-06-12 14:00:26 +09:00
Chris Collins
081edb11dd
fix(theme) Fix env var for customizing order of home page sidebar (#13749) 2025-06-12 00:32:54 -04:00
Tamas Nemeth
606439cafe
fix(ingest/bigquery): Set qualified name for bigquery containers (#13747) 2025-06-12 05:53:59 +02:00
Harshal Sheth
cb0f7ec801
fix(ingest): restrict snowflake-sqlalchemy dep (#13753) 2025-06-11 20:37:00 -07:00
Andrew Sikowitz
2376f56934
feat(ingest,graphql): Support displaying actor of CLI ingestion (#13754) 2025-06-11 20:34:31 -07:00
Chris Collins
c91ebf7e9d
fix(graphql) Allow updating names of groups without corpGroupInfo (#13745) 2025-06-11 11:26:53 -04:00
Chris Collins
425d4004b1
fix(ui) Take users to page 1 after page size change (#13744) 2025-06-11 11:25:13 -04:00
Gabe Lyons
476ac881a3
feat(application): editing application assignment via UI (#13739) 2025-06-11 09:26:40 -04:00
Tamas Nemeth
6b8dfb0aa8
feat(ingest/glue): Lake formation tags ingestion (#13693) 2025-06-11 12:26:27 +02:00
Chris Collins
3f81f90c73
fix(ui) Fix cursor jumping around in search bar (#13743) 2025-06-10 22:15:48 -04:00
purnimagarg1
7acc41ba63
feat(ui/ingestion): show the newly added ingestion source at the top of the list (#13723) 2025-06-10 21:27:26 -04:00
Chris Collins
53c0b11c26
fix(ui) Fix location of data-testid for schema field drawer (#13741) 2025-06-10 17:44:37 -04:00
Gabe Lyons
f62a58c080
fix(application): adjusting application to new search signature (#13742) 2025-06-10 15:31:45 -05:00
david-leifker
794a2ab317
feat(config): update configuration caching (#13740) 2025-06-10 15:17:05 -05:00
purnimagarg1
9d8033ef58
feat(ui/ingestions): maintain the system sources filter across tabs (#13734) 2025-06-10 15:54:26 -04:00
Anthony Burdi
eba14287bb
refactor(sdk): Use sdk instead of _sdk_extras (#13736) 2025-06-10 14:10:57 -04:00
Harshal Sheth
78cfc49703
chore(ingest): bump sqlglot dep (#13730) 2025-06-10 11:09:37 -07:00
david-leifker
c58a49886e
refactor(limits): refactor configuration of query limits (#13726) 2025-06-10 12:49:44 -05:00
Jonny Dixon
0fa88189c8
feat(ingestion/mssql): detection of rds or managed sql server for jobs history (#13731) 2025-06-10 18:41:26 +01:00
Saketh Varma
523e16d8b7
fix(ui): Move buttons outside the form (#13715) 2025-06-10 11:24:06 -06:00
Anna Everhart
f632f970a4
Updated the Did You Mean to not be cut off (#13725) 2025-06-10 08:32:48 -07:00
Jay
72b1dd5053
feat(graphql) enriching entity health results with more context (#13728) 2025-06-10 11:11:36 -04:00
david-leifker
4f5e9c7508
feat(rest-emitter): set 60s ttl on gms config cache (#13729) 2025-06-10 09:21:29 -05:00
v-tarasevich-blitz-brain
4020494bc9
feat(ingestion): add hidden sources message to table's footer (#13685) 2025-06-09 18:26:36 -04:00
purnimagarg1
16a3211a96
feat(ui/ingestion): implement hover state for stacked avatars in owners column (#13703) 2025-06-09 18:25:42 -04:00
david-leifker
8889181b31
feat(smoke-test): support fixtures with timestamp updates (#13727) 2025-06-09 12:02:10 -05:00
RyanHolstien
fa55e5e39a
fix(usageEvent): fix hook interface (#13724) 2025-06-09 11:11:59 -05:00
Anthony Burdi
1dc3a19145
feat(sdk): Add subscriptions client to main client OSS (#13713) 2025-06-09 11:29:15 -04:00
Anna Everhart
c997668110
Updated drawer to match entity drawer styling (#13665) 2025-06-09 08:12:31 -07:00
Gabe Lyons
c913aa4161
feat(application): Adding application entity models, apis, search and entity page. (#13660) 2025-06-09 10:12:40 -04:00
v-tarasevich-blitz-brain
4574768b3a
fix(ingestion): show dash for null values in sources and execution tables (#13705) 2025-06-08 21:49:27 -04:00
v-tarasevich-blitz-brain
7e72dca891
fix(ingestion): rename name column and filter to sources on the executions tab (#13704) 2025-06-08 21:47:15 -04:00
purnimagarg1
83581dc14f
fix(ui/ingestion): fix double onboarding modals on ingestion page (#13709) 2025-06-08 21:02:35 -04:00
purnimagarg1
b3f4d5f135
fix(ui/ingestion): filter current user to prevent the owner from getting added twice (#13708) 2025-06-08 21:01:50 -04:00
david-leifker
289abb6463
fix(ci): handle missing github ref (#13714) 2025-06-07 23:14:19 -07:00
david-leifker
ab5b08e725
chore(): global tomcat exclude (#13698) 2025-06-06 17:38:24 -05:00
Anthony Burdi
79e8da01e2
fix(sdk): Ignore mypy error for conditional ResolverClient import (#13712) 2025-06-06 18:01:02 -04:00
Kevin Karch
bf57943f3d
fix(docs): policies footnotes out of order (#13691) 2025-06-06 13:42:07 -04:00
Andrew Sikowitz
11cd28ebc8
fix(build/storybook): Surface errors (#13711) 2025-06-06 10:12:04 -07:00
Anthony Burdi
92af5546b6
feat(sdk): Add support for Assertion and Monitor entities (#13699) 2025-06-06 09:38:11 -04:00
Tamas Nemeth
0eef7a02c7
feat(ingest/unity-catalog): Tag extraction (#13642) 2025-06-06 13:24:56 +02:00
Hyejin Yoon
e82cc6672b
feat(sdk) add dashboard & chart entity (#13669) 2025-06-06 16:28:28 +09:00
Hyejin Yoon
e169b4ac05
feat(sdk): add get_lineage (#13654) 2025-06-06 12:34:52 +09:00
purnimagarg1
01357940b1
feat(ui/ingestion): use routed tabs and add links between sources and execution logs tab (#13694) 2025-06-05 23:21:25 -04:00
Aseem Bansal
81a510aff6
fix(ui): null deref (#13696) 2025-06-05 12:37:08 -04:00
dependabot[bot]
0e2a79665a
build(deps): bump tar-fs from 2.1.2 to 2.1.3 in /docs-website (#13673)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-05 11:13:33 -05:00
Chakru
57ffbc17e5
fix(build): restore deleted code that broke build (#13695)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-06-05 09:01:19 -05:00
Sergio Gómez Villamor
67844f1a53
feat(sdk): add EntityClient.delete documentation and tests (#13688) 2025-06-05 14:46:14 +02:00
Hyejin Yoon
83881e4129
feat(sdk): add add_lineage to lineage subclient (#13622) 2025-06-05 15:26:53 +09:00
Hyejin Yoon
488f011fa0
feat(sdk): add structured properties aspect (#13689) 2025-06-05 14:24:22 +09:00
Harshal Sheth
4114392647
fix(docs): make docs inclusion opt-in (#13680) 2025-06-04 16:33:08 -07:00
Andrew Sikowitz
454f76f2b5
feat(ui/lineage): Support changing home node via double click (#13403) 2025-06-04 15:04:27 -07:00
david-leifker
799a9f2e31
feat(search): lineage search performance (#13545) 2025-06-04 13:54:45 -05:00
jmacryl
235a773eea
Pfp 1277/indexing perf (#13480) 2025-06-04 19:56:10 +02:00
purnimagarg1
995c1ed730
improvement(ui/ingestion): improvements in redesigned ingestion table (#13674)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
2025-06-04 13:39:10 -04:00
v-tarasevich-blitz-brain
80ad1cc7a3
fix(ingestion): fix size of the sources filter (#13686) 2025-06-04 11:52:26 -04:00
v-tarasevich-blitz-brain
93aa64200a
feat(ingestion): add rollback button (#13677) 2025-06-04 11:50:33 -04:00
purnimagarg1
65ae48e774
feat(ingestion): add owners in create/edit ingestion source and show owners in ingestion table (#13663) 2025-06-04 11:35:24 -04:00
Jonny Dixon
612f68eced
feat(ingestion/sql-common): add column level lineage for external tables (#11997) 2025-06-04 13:24:44 +01:00
Aseem Bansal
13df355675
docs(cloud): update remote executor version (#13675) 2025-06-04 15:03:55 +05:30
Hyejin Yoon
082cc9e8d6
docs: skip release note details for 1.1.0 on docs site to fix build (#13683) 2025-06-04 15:42:12 +09:00
Anthony Burdi
f1f6fd9fa9
fix(web): Only display assertion tab buttons if there is more than one button (#13678) 2025-06-03 17:24:50 -04:00
david-leifker
b6ae134a56
chore(): bump jetty version (#13679) 2025-06-03 14:06:02 -05:00
Connell Donaghy
07e1c2d6b3
fix(cli): safely initialize schema_fields (#13586) 2025-06-03 15:16:27 +02:00
RyanHolstien
5311434c70
feat(kafka): add interface design for listeners (#13637) 2025-06-02 13:39:32 -05:00
Anthony Burdi
7d5519f2e1
refactor(web): Embed url mapping into Tabs component for general use (#13648) 2025-06-02 14:01:56 -04:00
v-tarasevich-blitz-brain
5041677082
feat(ingestion): add execution log tab (#13611) 2025-06-02 10:47:27 -04:00
ksrinath
ee5d1e5260
feat(iceberg): add namespace permissions (#13414)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-06-02 17:25:48 +05:30
Atanu Chatterjee
487d52ad6e
fix(openapi_parser): add ability to parse openapi 3.0+ schemas (#13624) 2025-06-01 18:52:52 +02:00
david-leifker
1194f910bd
bump(): upgrade Kafka 7.9.1 (#13667) 2025-05-30 21:51:36 -05:00
purnimagarg1
2620af8e2a
feat(ui/ingestion): add empty and loading states for sources and secrets tables (#13646)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
2025-05-30 20:44:02 -04:00
v-tarasevich-blitz-brain
313997cc76
feat(ingestion): add source to executionRequest (#13614) 2025-05-30 19:46:15 -04:00
purnimagarg1
e050a10922
feat(components): show the remaining number in stacked avatar component (#13633) 2025-05-30 16:43:38 -04:00
purnimagarg1
d2f9db2d9d
feat(ingestion): update the layout of secrets tab (#13627) 2025-05-30 16:42:07 -04:00
purnimagarg1
de20d4ec20
feat(ingestions): support backend sort and combine type and name sorts in ingestion table (#13625) 2025-05-30 15:51:41 -04:00
Jay
bf174e47e7
fix(web) fallback to search to get count (#13556) 2025-05-30 15:17:31 -04:00
Pedro Silva
4f41832ca8
fix(quickstart): Enable V2 UI by default (#13664) 2025-05-30 11:31:05 -05:00
david-leifker
85bd9cf21a
config(docker-compose): localstack healthcheck (#13652) 2025-05-30 11:01:33 -05:00
david-leifker
4145ff3d36
chore(): bump beanutils version (#13656)
Co-authored-by: Esteban Gutierrez <esteban.gutierrez@acryl.io>
2025-05-30 11:00:41 -05:00
Chakru
ad84b0575c
build: add depot.json (#13595) 2025-05-30 21:24:36 +05:30
Aseem Bansal
acb7210040
fix(log): improve log for list-source-runs without privileges (#13653) 2025-05-30 15:06:26 +05:30
jmacryl
075c57bb60
change ES default refresh_interval:3s (#13155)
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-05-29 17:46:33 -05:00
Chakru
906eba2b3e
ci(coverage): run coverage without ci-optimization on release and schedule (#13579) 2025-05-30 01:11:24 +05:30
purnimagarg1
528d705544
feat(ingestions): redesign ingestion table and page (#13585) 2025-05-29 12:27:44 -04:00
Pedro Silva
14fc4f0343
fix(docs): Correct Access Policy references to GraphQL endpoints (#13645) 2025-05-29 15:35:35 +01:00
Hyejin Yoon
a142a9e2d2
feat(sdk): add dataflow and datajob entity (#13551) 2025-05-29 22:53:56 +09:00
purnimagarg1
f335093607
improvement(ui): bring compact markdown viewer changes to OSS (#13593) 2025-05-29 09:47:39 -04:00
Pedro Silva
7b21a5cea3
feat(docs): Publish docs for 1.1.0 release (#13647) 2025-05-29 11:51:12 +01:00
Aseem Bansal
4976656df8
fix(policies): more assertions, add missing policy for editor role (#13644) 2025-05-29 15:44:54 +05:30
david-leifker
7cb26edb5b
feat(docker-compose): add localstack to compose profiles (#13650) 2025-05-28 15:11:27 -05:00
david-leifker
8787892cbb
config(): disable service name in header (#13638) 2025-05-28 15:11:08 -05:00
Sergio Gómez Villamor
fea6e6ac4f
fix(snowflake): pass correct BaseTimeWindowConfig instead of SnowflakeV2Config (#13643)
Co-authored-by: Claude <noreply@anthropic.com>
2025-05-28 13:08:43 +02:00
Sergio Gómez Villamor
f2ca93d275
feat(openapi): add verify_ssl configuration option (#13634)
Co-authored-by: Claude <noreply@anthropic.com>
2025-05-28 09:35:45 +02:00
david-leifker
492b55322f
feat(tracing): trace error log with timestamp & update system-metadata (#13628) 2025-05-27 17:33:40 -05:00
joseph-sentry
747e42497e
build: add codecov bundle analysis (#13087) 2025-05-27 12:21:04 -07:00
Chakru
3de438692d
build: support tag computation from branch with slash in name (#13636) 2025-05-27 22:00:11 +05:30
Andrew R Smith
647fb792de
feat(ingest): add snowflake ingestion config options (#12841)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-05-27 09:22:16 -07:00
Esteban Gutierrez
ec9719c801
ci(smoke-tests): run tests on push to release branches (#13629) (addendum) (#13635) 2025-05-27 10:04:25 -05:00
Sergio Gómez Villamor
d8d1de431f
fix(iceberg): update MinIO client commands for compatibility (#13631)
Co-authored-by: Claude <noreply@anthropic.com>
2025-05-27 11:15:41 +02:00
Aseem Bansal
2b2f6e7d82
fix(test): url encode urn (#13626) 2025-05-27 12:00:13 +05:30
Chakru
79ff05abcd
ci(smoke-tests): run tests on push to release branches (#13629) 2025-05-27 00:32:06 +05:30
david-leifker
650cff172d
fix(generic-patch): fix mixed attributed/non-attributed patches (#13618) 2025-05-26 12:28:44 -05:00
David Leifker
b956132e3c Revert "feat(tracing): python logging & update system metadata on no-op"
This reverts commit 64f315eb64417467cfef5cec473068dc42886293.
2025-05-26 12:15:18 -05:00
David Leifker
64f315eb64 feat(tracing): python logging & update system metadata on no-op 2025-05-26 10:22:51 -05:00
Tamas Nemeth
2ffa84be5c
fix(ingest/datahub): Create Structured property templates in advance and batch processing (#13355)
Co-authored-by: Pedro Silva <pedro@acryl.io>
2025-05-26 14:05:17 +02:00
david-leifker
4c6672213c
config(gradle): pin jackson via bom (#13617) 2025-05-24 13:42:10 -05:00
Esteban Gutierrez
09facfdfc5
feat(docs): Add section on updating DataHub for 1.1.0 (#13561) 2025-05-23 20:15:58 -05:00
Pedro Silva
da3f82ea58
feat(docs): Add docs for 0.3.11.1 Cloud Release (#13604) 2025-05-23 20:30:54 +01:00
david-leifker
62ab7d817f
fix(config-servlet): fix config endpoint to be thread-safe (#13616) 2025-05-23 13:39:43 -05:00
Harshal Sheth
e1de763c17
docs: add note on column transformation logic (#13615) 2025-05-23 10:07:45 -07:00
Aseem Bansal
947761f875
doc(assertion): change to millis in example (#13610) 2025-05-23 21:19:30 +05:30
Jay
76a9fc0d7f
fix(web) set domain dropdown ui to match domains page (#13294) 2025-05-23 11:14:44 -04:00
Sergio Gómez Villamor
7bd7a33d0b
feat(hex): consider additional context when parsing hex query metadata (#13596) 2025-05-23 08:01:04 +02:00
Harshal Sheth
69f368b00e
fix(ingest): restrict duckdb dep on old python versions (#13605) 2025-05-22 16:19:31 -07:00
david-leifker
d234f5580a
fix(config): fix mcp batch configuration (#13598) 2025-05-22 15:54:48 -05:00
david-leifker
c5d83683ed
fix(test): adjust exclusion rule (#13603) 2025-05-22 15:20:15 -05:00
Chris Collins
9a0a9ac19b
fix(forms) Remove schema field entities from form assignment entity types (#13599) 2025-05-22 14:39:33 -04:00
Jay
d2578b7d34
refactor(web) move access management tab to the front (#13092) 2025-05-22 13:03:35 -04:00
Chakru
550a801196
fix(build): fix quickstartslim to use datahub-actions (#13592) 2025-05-22 20:48:16 +05:30
Aseem Bansal
dda24dd169
docs(datahub cloud): fix remote executor version (#13597) 2025-05-22 20:22:24 +05:30
Saketh Varma
d473b3f580
fix(ui): Changes to components (#13589) 2025-05-22 09:00:00 -05:00
Hyejin Yoon
76e952f4ac
docs: update markdown_process_inline_directive to work with indentations (#13590) 2025-05-22 17:35:14 +09:00
Jay
a02d31c604
feat(docs-website): announcement banner features series b (#13588) 2025-05-21 19:38:48 -04:00
Jay
7e377592b4
feat(web): display external assertion result links in run event view (#13587) 2025-05-21 16:04:32 -04:00
John Joyce
3a81adb94a
fix(tags): Support null tagProperties aspect when updating tag color (#13572)
Co-authored-by: John Joyce <john@Mac.lan>
2025-05-21 12:32:08 -07:00
Andrew Sikowitz
822be781cd
fix(ui): Fix pagination overflow on embedded list search results (#13580) 2025-05-21 09:58:48 -07:00
Harshal Sheth
7e60587dec
fix(cli): strictly validate structured property values (#13576)
Co-authored-by: Chakravarthy Racharla <chakru.racharla@acryl.io>
2025-05-21 08:36:50 -07:00
v-tarasevich-blitz-brain
7fb9140d41
feat(ingestionSource/ownership): add ownership aspect to ingestionSource (#13567)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-05-21 07:16:28 -07:00
Tamas Nemeth
9fca1737ff
fix(ingest/dbt): Fix urn validation in ownership type check (#13563) 2025-05-21 13:02:26 +02:00
Jay
f4a8d9e7fc
fix(web): glossary term create buttons inlined in content area (#13571) 2025-05-21 00:32:10 -04:00
david-leifker
d8099da973
docs(): Update v_0_3_11 Release notes (#13555) 2025-05-20 14:13:59 -05:00
John Joyce
47509ce85b
fix(ui): Add Admin Onboarding Steps + Change display name for iceberg policies DES-369 (#13456)
Co-authored-by: John Joyce <john@Mac-136.lan>
Co-authored-by: John Joyce <john@Mac-191.lan>
Co-authored-by: John Joyce <john@Johns-MacBook-Pro.local>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-05-20 11:39:29 -07:00
Andrew Sikowitz
e9df5401cb
fix(ui): Hide 404 page while loading permissions (#13570) 2025-05-20 11:29:56 -07:00
Harshal Sheth
e37b3c3394
feat(ingest/dbt): fallback to schema from graph (#13438) 2025-05-20 10:20:50 -07:00
purnimagarg1
3613e7e0d7
feat(ui/ingestion): create ingestionV2 folder and copy files (#13565) 2025-05-20 10:02:24 -07:00
Chakru
f8ad7de412
ci(publish): restore image publish on push to master (#13562) 2025-05-20 09:12:42 -05:00
Aseem Bansal
489db812b5
fix(hook): collect write mutation hook to ensure side effects (#13554) 2025-05-20 10:09:22 +05:30
Andrew Sikowitz
28b052c367
fix(ui/filters): Improve platform instance filter (#13559) 2025-05-19 16:30:56 -07:00
Andrew Sikowitz
3ee25c465c
ci(docs): Fix md prettier lint by ignoring inline blocks (#13558) 2025-05-19 17:52:26 -05:00
Andrew Sikowitz
5eca21dfbe
feat(lineage): Add feature flag to hide expand more action (#13557) 2025-05-19 13:57:43 -07:00
Hyejin Yoon
a0787c3abe
docs: fix inline code format (#13549) 2025-05-19 11:14:20 -05:00
Esteban Gutierrez
52e01bd599
fix(smoke-test): use full quickstart image instead of slim for spark tests (#13543) 2025-05-19 11:04:53 -05:00
Tamas Nemeth
0eca4dfde2
fix(ingest/hive): Fix hive storage path formats (#13536) 2025-05-19 16:17:28 +02:00
Aseem Bansal
1dec8d8ccb
fix(ingest/gc): remove default cli version (#13552) 2025-05-19 18:53:30 +05:30
Jonny Dixon
9584006a72
fix(ingestion/datahubapply): fix typos in config descriptions (#13546) 2025-05-19 11:25:25 +01:00
Sergio Gómez Villamor
8cae980286
tests(ingestion): moving some tests so they are available for sdk users (#13540) 2025-05-19 08:39:53 +02:00
Jonny Dixon
132ff7081f
feat(ingestion/s3): Add externalUrls for datasets in s3 and gcs (#12763) 2025-05-17 17:03:40 +01:00
Harshal Sheth
d3944ded93
feat(ingest/snowflake): generate lineage through temp views (#13517) 2025-05-16 21:27:13 -07:00
purnimagarg1
0f227a364a
feat(ingestion): create feature flag for ingestion page redesign (#13532) 2025-05-16 16:39:02 -07:00
jmacryl
8f52bdc5e4
feat(search): PFP-1275/look-into-0-doc-indices (#13296)
Co-authored-by: David Leifker <david.leifker@acryl.io>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-05-16 12:59:46 -05:00
Pedro Silva
7e6b853a6d
fix(graphql): Add default parameters to access token resolver (#13535) 2025-05-16 15:23:58 +01:00
RyanHolstien
8e21ae3211
fix(docs): add known issue for server_config (#13530) 2025-05-15 16:54:12 -05:00
John Joyce
c9e6831f08
feat(ui): Add menu action for copying the full name of the asset (#13224)
Co-authored-by: John Joyce <john@Mac-2465.lan>
Co-authored-by: John Joyce <john@Mac-1293.lan>
2025-05-15 13:50:31 -07:00
John Joyce
0817041232
fix(docs): Improve backup and restore doc (#13466)
Co-authored-by: John Joyce <john@Mac-191.lan>
Co-authored-by: John Joyce <john@Johns-MacBook-Pro.local>
2025-05-15 13:48:47 -07:00
david-leifker
064e3618f2
fix(): remove unused recursive code (#13528) 2025-05-15 15:06:48 -05:00
Andrew Sikowitz
53c25adc9b
fix(ui/table): Fix column / ml feature description show more button (#13525) 2025-05-15 11:31:12 -07:00
Andrew Sikowitz
c6ffab1fe3
fix(ui/glossary): Display custom properties on glossary nodes (#13526) 2025-05-15 10:54:07 -07:00
david-leifker
13dbfd453f
chore(): bump kafka-setup base image (#13527) 2025-05-15 12:42:26 -05:00
Kevin Karch
e47606d21c
fix(docs): update support limits for liquid-python (#13524) 2025-05-15 11:59:58 -04:00
Chakru
9cee32c963
build(ci): fix a gradle implicit dependency error (#13522) 2025-05-15 14:48:01 +05:30
Chakru
37182d9e25
build(quickstart): skip composeForceDownOnFailure for debug variants (#13521) 2025-05-15 13:47:04 +05:30
Aseem Bansal
401214fa5d
fix(cli): move to stderr instead of stdout (#13512)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-05-15 13:35:56 +05:30
Aseem Bansal
199144167a
fix(sdk): change deprecated value use (#13511) 2025-05-15 13:35:34 +05:30
Harshal Sheth
1c7836dce8
fix(cli): avoid click 8.2.0 due to bugs (#13518) 2025-05-14 23:26:18 -07:00
Felix Lüdin
a00e65cd2f
fix(ui): fix useGetUserGroupUrns when user urn is empty (#13359) 2025-05-14 22:11:55 -05:00
david-leifker
44efca3961
fix(ci): fix typo (#13520) 2025-05-14 21:48:10 -05:00
Andrew Sikowitz
b4cc77cfaf
fix(graphql,lineage): Fix CLL through queries (#13519) 2025-05-14 15:23:06 -07:00
Kevin Karch
18d1a3c2e3
fix(docs): broken link in spark docs (#13516) 2025-05-14 16:11:44 -04:00
david-leifker
7e5295ac3c
update(ci): specify branch (#13515) 2025-05-14 13:55:45 -05:00
david-leifker
978d7b1afa
fix(mcp-processor): prevent exception in mcp processor (#13513) 2025-05-14 13:55:40 -05:00
uk555-git
2b5ba356e5
Support different container runtimes aliased as docker (#13207)
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-05-14 13:49:43 -05:00
dependabot[bot]
8d2ce281df
build(deps): bump actions/cache from 3 to 4 (#13346)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-05-14 13:33:44 -05:00
dependabot[bot]
3de036ed0b
build(deps): bump gradle/gradle-build-action from 2 to 3 (#12951)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-05-14 13:33:07 -05:00
david-leifker
6e499d3731
fix(ci): metadata-io test (#13514) 2025-05-14 12:00:37 -05:00
Harshal Sheth
56618a5e96
feat(ingest): support pydantic v2 in mysql source (#13501) 2025-05-14 09:51:11 -07:00
Harshal Sheth
9a892c6eca
feat(ingest): improve join extraction (#13502) 2025-05-14 09:50:17 -07:00
Jonny Dixon
61ea244659
feat(platform): up limit of corpuser char length from 64 to 128 (#13510) 2025-05-14 16:48:16 +01:00
Michael Minichino
8b4217f7fa
fix(ingest/mode): Additional 404 handling and caching update (#13508) 2025-05-14 09:09:40 -05:00
jmacryl
5749f6f970
Refactor elasticsearch search indexed (#13451) 2025-05-14 08:57:08 -05:00
Jonny Dixon
c756af31b1
feat(ingestion/looker): extract group_labels from looker and add as tags in datahub (#13503) 2025-05-14 13:08:13 +01:00
david-leifker
1ac41e95d6
fix(ci): enable publish on release (#13506) 2025-05-13 20:31:22 -05:00
david-leifker
effc5e2f4d
fix(config): fix mcp consumer batch property (#13504) 2025-05-13 14:46:37 -05:00
david-leifker
2a8c7b2aa0
ci(workflow): postgres consolidation & release unit tests (#13500) 2025-05-13 13:24:04 -05:00
RyanHolstien
dd377b33dc
feat(config): add configurable search filter min length (#13499) 2025-05-13 12:37:59 -05:00
Michael Minichino
bc860181d8
fix(ingest/mode): Additional pagination and timing metrics (#13497)
Co-authored-by: NehaGslab <neha.marne@gslab.com>
2025-05-13 08:48:38 -05:00
Sergio Gómez Villamor
184fb09fc0
fix(mssql): improve stored proc lineage + add temporary_tables_pattern config (#13415)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-05-13 10:36:52 +02:00
Sergio Gómez Villamor
a03b35e167
chore(hex): debug logs (#13473) 2025-05-13 08:22:12 +02:00
Harshal Sheth
7c791db087
feat(ingest/sql): column logic + join extraction (#13426) 2025-05-12 17:19:44 -07:00
david-leifker
dbcab5e404
chore(): graphiql latest versions (#13484) 2025-05-12 16:36:28 -05:00
Tamas Nemeth
93c04ad5b4
fix(ingest/presto): Presto/Trino property extraction fix (#13487) 2025-05-12 22:47:56 +02:00
Gabe Lyons
aeda8f4c95
feat(cassandra): Support ssl auth with cassandra (#13465) 2025-05-12 13:15:49 -04:00
Rafael Francisco Luque Cerezo
a97af36e93
Adding KafkaClients dependency to the datahub-upgrade module (#13488)
Co-authored-by: Jonny Dixon <45681293+acrylJonny@users.noreply.github.com>
2025-05-12 09:37:57 -05:00
Aseem Bansal
6bcec4ea15
docs(cloud): fix indent, remove note not relevant for cloud users (#13493) 2025-05-12 09:02:03 -05:00
Aseem Bansal
dd1b056d35
fix(ui): enable to edit tag when properties aspect was not present (#13470) 2025-05-12 15:08:20 +05:30
Aseem Bansal
01b5cdb541
fix(cli): warn more strongly about hard deletion (#13471) 2025-05-12 15:03:38 +05:30
Jay
bcedd5d372
docs: Adding notes on remote executors handling smart assertions (#13479)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-05-12 14:14:51 +05:30
Sergio Gómez Villamor
ff3f36a291
tests(smoke): removes non existing mix_stderr param in CliRunner (#13485) 2025-05-12 07:14:24 +02:00
Harshal Sheth
53af032b6f
fix(sdk): always url-encode entity links (#13483) 2025-05-11 17:36:42 -07:00
Sergio Gómez Villamor
fb7bcbaf17
tests(ingestion): fixes hex and hive docker flakiness (#13476) 2025-05-11 19:06:23 +02:00
david-leifker
7264743920
docs(): v0.3.11 DataHub Cloud Docs (#13439)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-05-10 20:36:28 -05:00
Andrew Sikowitz
f094e50ba2
docs: Add show manage tags environment var (#13482) 2025-05-10 07:51:18 -05:00
Anthony Burdi
f71ff7722a
fix(sdk): use pluralized assertions (#13481) 2025-05-09 16:19:50 -04:00
Tamas Nemeth
60f79153ac
fix(ingest/hive): Fix hive properties with double colon (#13478) 2025-05-09 19:59:40 +02:00
RyanHolstien
25991a5ec1
fix(test): prevent audit test flakiness (#13475) 2025-05-09 11:34:27 -05:00
Jay
c94ca1e094
fix(web) domain search result item visual cleanup (#13474) 2025-05-09 11:18:55 -04:00
Kevin Karch
4c77c71315
feat(ingest): filter by database in superset and preset (#13409) 2025-05-09 09:51:59 -04:00
Jonny Dixon
ac3ad8bb36
feat(ingestion/kafka): Add optional externalURL base for link to external platform (#12675) 2025-05-09 12:24:24 +01:00
Pedro Silva
601d3e6010
fix(docs): Add requirement on yarn for documentation for local development (#13461) 2025-05-09 16:15:11 +05:30
Hyejin Yoon
62363016bd
docs: remove 0.15.0 from archived list (#13468) 2025-05-09 17:48:38 +09:00
Pedro Silva
2b613e139b
feat(docs): Add doc links for 1.0.0 (#13462)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-05-08 20:41:40 -07:00
Hyejin Yoon
f6e2f296a0
feat(sdk): update lineage sample script to use client.lineage (#13467) 2025-05-09 12:08:28 +09:00
Andrew Sikowitz
5aabf650fc
fix(versioning): Properly set versioning scheme on unlink; always run side effects (#13440) 2025-05-08 19:29:04 -07:00
Hyejin Yoon
a414bbb798
feat(sdk): add datajob lineage & dataset sql parsing lineage (#13365) 2025-05-09 10:20:48 +09:00
Hyejin Yoon
71e104068e
docs: add runllm chatbot (#13464) 2025-05-09 09:08:57 +09:00
Hyejin Yoon
79dac4ed45
docs: remove markprompt (#13463) 2025-05-09 00:36:26 +09:00
Pedro Silva
e0f57b5ef0
fix(docs): Add feature availability to audit API (#13459) 2025-05-08 14:07:11 +01:00
Jay
cb9796e047
fix(authentication) redirection for native login and sso to function within iframes (#13453)
Co-authored-by: Esteban Gutierrez <esteban.gutierrez@acryl.io>
2025-05-08 12:37:09 +01:00
Pedro Silva
e97488de18
feat(docs): Add 0.3.10.4 hotfix release notes (#13458) 2025-05-08 11:01:39 +01:00
Harshal Sheth
926bb3ceba
chore(ingest): bump bounds on cooperative timeout test (#13449) 2025-05-08 12:12:21 +05:30
purnimagarg1
81c100c0fc
improvement(ui): add wrapper component for stop propagation (#13434) 2025-05-08 10:34:25 +05:30
Gabe Lyons
79a1ac22c0
feat(UI): funnel subtype for dataflows and datajobs all the way to the UI (#13455) 2025-05-07 21:24:24 -04:00
david-leifker
2be8c07b74
ci(): run smoke tests on release (#13454) 2025-05-07 16:43:40 -05:00
Esteban Gutierrez
826caa7935
chore(avro): bump parquet-avro version (#13452) 2025-05-07 16:28:43 -05:00
Tamas Nemeth
f7ea0b9d5d
fix(ingest/mode): Not failing if queries endpoint returns 404 (#13447) 2025-05-07 22:27:15 +02:00
Jay
5cd115fc7c
docs: Adding color to 3.10 release notes (#13448) 2025-05-07 16:22:24 -04:00
Chakru
2679e75071
fix(build): fix local quickstart builds (#13445) 2025-05-07 20:13:52 +05:30
Hyejin Yoon
b48774c1ec
docs: remove old pages & assets (#13367) 2025-05-07 23:16:49 +09:00
Aseem Bansal
87199ff9d7
fix(ingest/snowflake): parsing issues with empty queries (#13446) 2025-05-07 18:57:04 +05:30
Aseem Bansal
035af5bdf9
fix(cli): ignore extra configs (#13444) 2025-05-07 16:45:25 +05:30
Tamas Nemeth
c3852dada5
fix(ingest/tableau): Fix infinite loop in Tableau retry (#13442) 2025-05-07 12:16:55 +02:00
John Joyce
ed86feec5d
docs(release): Adding notes for v0.3.10.3 release (#13437)
Co-authored-by: John Joyce <john@Mac-1108.lan>
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-05-07 14:03:44 +05:30
Anna Everhart
7a4232d105
updated search menu items after search update (#13422) 2025-05-07 00:19:54 -07:00
Chris Collins
a2b1667d9d
fix(ui) Add ellipses and tooltip to long names on home page header (#13425) 2025-05-06 18:10:21 -04:00
Tamas Nemeth
2f44cc74a9
fix(docker): Fix for metadata ingestion docker build (#13435) 2025-05-06 23:07:47 +02:00
v-tarasevich-blitz-brain
a002793e63
fix(searchBarAutocomplete): ui tweaks (#13430)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-05-06 14:18:10 -04:00
Harshal Sheth
287f373a9c
fix(ingest/snowflake): fix previously broken tests (#13428) 2025-05-06 10:19:25 -07:00
RyanHolstien
2dfe84afe3
fix(smoke-test): fix flakiness of audit smoke test (#13429) 2025-05-06 11:15:05 -05:00
Chakru
e9b867ddd8
fix(build): fix version in jars (#13432) 2025-05-06 21:13:08 +05:30
david-leifker
131de0f026
fix(): DUE Producer Configuration & tracking message validation (#13427)
Co-authored-by: Pedro Silva <pedro@acryl.io>
2025-05-06 10:28:58 -05:00
Anthony Burdi
294ad23500
feat(sdk): scaffold assertion client (#13362)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-05-06 10:51:34 -04:00
skrydal
65d1c2b43c
feat(ingestion): Make jsonProps of schemaMetadata less verbose (#13416) 2025-05-06 16:18:26 +02:00
Jay
75e3d29231
fix(graphql): remove false deprecation note (#13402) 2025-05-06 13:33:51 +05:30
Andrew Sikowitz
eb41e0bbdd
ci: Add yaml format check (#13407) 2025-05-05 16:21:34 -07:00
Andrew Sikowitz
2aa11a6bf6
feat(ui/lineage): Make show ghost entities toggle local storage sticky (#13424) 2025-05-05 14:40:00 -07:00
RyanHolstien
d8739c4e3e
feat(changeSyncAction): support RESTATE type syncs (#13406) 2025-05-05 16:22:07 -05:00
Aseem Bansal
964a4b70c2
chore(airflow): update dev mypy to 1.14.1 (#13374) 2025-05-05 13:15:20 -07:00
Anna Everhart
ca7853c5a4
Update search results page (#13303) 2025-05-05 11:44:38 -07:00
Harshal Sheth
cb3988a5f3
feat(ingest): associate queries with operations (#13404) 2025-05-05 11:27:33 -07:00
david-leifker
1b3173ace3
test(audit-events): updates for audit event tests (#13419) 2025-05-05 13:16:49 -05:00
Harshal Sheth
eefdded9f0
ci: don't rerun docker workflows on labels (#13405) 2025-05-05 10:28:37 -07:00
Harshal Sheth
2e3328fce0
chore(ingest): bump sqlglot dep (#13411) 2025-05-05 09:24:50 -07:00
david-leifker
4e7bb3998d
feat(ingestion): refactor api-tracing EmitMode (#13397) 2025-05-05 10:54:31 -05:00
Chakru
214d376a96
fix(build): fix regression in local quickstart builds (#13413) 2025-05-05 08:18:31 -05:00
Harshal Sheth
f83460255a
feat(ingest): add urn -> url helpers (#13410) 2025-05-02 19:54:01 -07:00
Harshal Sheth
24f9bc0f18
feat(ci): use local ingestion in actions (#13408) 2025-05-02 19:22:09 -07:00
Harshal Sheth
096e6d9af4
chore(ingest/snowflake): clean up unused params in Snowflake connections (#13379) 2025-05-02 13:05:12 -07:00
Harshal Sheth
e2844b6c95
fix(ingest): move to acryl-great-expectations (#13398) 2025-05-02 13:04:53 -07:00
Harshal Sheth
b7ef234bc7
fix(ingest): fix deps for fivetran (#13385) 2025-05-02 12:31:07 -07:00
jmacryl
854ec614b9
feat(search) use parametrized painless in updates see https://linear.… (#13401) 2025-05-02 19:19:25 +02:00
david-leifker
96c92fda71
fix(mce-consumer): prevent too large SQL statements (#13392) 2025-05-02 10:21:45 -05:00
Aseem Bansal
e0e41b33e7
fix(ui): null pointers on frontend (#13400) 2025-05-02 18:06:32 +05:30
Aseem Bansal
03531520ce
fix(ingest/dynamodb): put primary keys correctly (#13373) 2025-05-02 15:25:34 +05:30
Aseem Bansal
42aeed074f
chore: upgrade dev dependencies mypy and ruff (#13375) 2025-05-02 15:02:00 +05:30
Aseem Bansal
7f681ee339
docs(dynamodb): add privileges for dynamodb (#13372) 2025-05-02 12:24:03 +05:30
RyanHolstien
6f1968fbd1
feat(auditSearch): support backend audit events and search api (#13377) 2025-05-01 21:19:58 -05:00
David Leifker
e6babc3b81 Revert "feat(ingestion): refactor api-tracing EmitMode"
This reverts commit bf598aed9687e9b08ccfbd72257fc890b505d775.
2025-05-01 21:06:10 -05:00
David Leifker
bf598aed96 feat(ingestion): refactor api-tracing EmitMode
* Created EmitMode to control write guarantees
    * IMMEDIATE, QUEUE, BLOCKING_QUEUE
2025-05-01 20:30:21 -05:00
Esteban Gutierrez
9b6960ac8e
feat(azure): include azure-identity-extensions for Microsoft Entra Workload Identity connections (#13395) 2025-05-01 20:22:47 -05:00
david-leifker
8919154f02
fix(ingestion): fix cloud vs core logic (#13387) 2025-05-01 16:16:41 -05:00
Maggie Hays
8f02b90c57
docs(remote executor) Fix typo in k8s snippets (#13393) 2025-05-01 16:01:14 -05:00
david-leifker
143e2b7ae3
chore(): bump Spring 3.4.5 (#13390) 2025-05-01 14:40:49 -05:00
Harshal Sheth
bb9838d789
fix(ingest): use server config method from graph (#13391) 2025-05-01 12:32:56 -07:00
Chris Collins
3e54a842ef
fix(ui) Fix bug with entity select modal with no entity types passed in (#13388) 2025-05-01 13:40:57 -04:00
Chris Collins
95531a3944
fix(ui) Fix a few bugs around new search bar experience (#13382) 2025-05-01 13:40:42 -04:00
Chris Collins
f6d962ff96
fix(validation) Fix bug in duplicate prompt ID validator (#13351) 2025-05-01 13:39:57 -04:00
dependabot[bot]
bdf9e195ef
build(deps-dev): bump vite from 4.5.11 to 4.5.14 in /datahub-web-react (#13384)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-01 11:54:58 -05:00
Harshal Sheth
591b6ce0c9
feat(actions): support pydantic v2 (#13378) 2025-04-30 19:39:35 -07:00
david-leifker
d25d318233
feat(ingestion-sdk): OpenAPI & Tracing With SDK (#13349)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-04-30 21:09:12 -05:00
Harshal Sheth
34e74d826e
fix(ingest): update dremio golden files (#13381) 2025-04-30 17:45:54 -05:00
david-leifker
7ae750e69f
feat(graphiql): upgrade graphiql interface (#13380) 2025-04-30 17:43:21 -05:00
Saketh Varma
3154289dd8
fix(ui): Backfill proposals changes (#13350) 2025-04-30 14:15:25 -05:00
v-tarasevich-blitz-brain
afa9209f5d
fix(ui/select): replace clear icon and decrease size of it (#13335)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-04-30 12:49:40 -04:00
v-tarasevich-blitz-brain
7f3bbe1d33
fix(searchBarAutocomplete): fix search bar issues (#13315)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-04-30 12:49:11 -04:00
Chris Collins
5f14e1d8ee
feat(glossary) Add better scaling support for business glossary (#13353) 2025-04-30 12:46:26 -04:00
Chakru
8e6d78403f
fix(policyEngine): policy evaluation incorrect without type (#13371) 2025-04-30 21:42:05 +05:30
Chris Collins
ae50381c30
feat(ui) Support the foundations for basic theme support with primary color (#13361) 2025-04-30 11:21:41 -04:00
Saketh Varma
87dd6bc30a
fix(ui): Fix sidebar edit structured properties permissions (#13339) 2025-04-30 10:14:52 -05:00
purnimagarg1
80cb574d1e
improvement(component-library): changes in the table component (#13348)
Co-authored-by: Saketh Varma <sakethvarma397@gmail.com>
2025-04-30 18:21:45 +05:30
Sergio Gómez Villamor
4d524ecf3b
fix(actions): h11 dependency (#13370) 2025-04-30 14:45:33 +02:00
Aseem Bansal
f09bd8ac31
feat(granted privileges): Report reasons for denied access (#13231) 2025-04-30 17:32:55 +05:30
Hyejin Yoon
f6ab586fdc
docs: update GTM & GA ID (#13366) 2025-04-30 09:59:52 +09:00
Esteban Gutierrez
ccd4565418
fix(datahub-actions) bump h11 version to 0.16 (#13364) 2025-04-29 18:37:48 -05:00
Chris Collins
3cb45f56db
feat(analytics) Add tracking events to all the home page modules (#13334) 2025-04-29 18:11:16 -04:00
Harshal Sheth
d264a7afba
feat(ingest/dbt): make catalog.json optional (#13352) 2025-04-29 10:39:53 -07:00
Chakru
0029cbedf6
fix(policy): show platformInstances in search when applicable (#13356) 2025-04-29 21:40:13 +05:30
Hyejin Yoon
f986315582
doc: Acryl to DataHub, datahubproject.io to datahub.com (#13252)
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
2025-04-28 10:34:33 -04:00
Sergio Gómez Villamor
312e1ff573
chore(looker): reduce verbosity if error during initialization (#13331) 2025-04-28 13:11:58 +02:00
Sergio Gómez Villamor
e143422d2d
feat(dbt): log catalog and manifest metadata (#13329) 2025-04-28 10:19:03 +02:00
Aseem Bansal
1de63cc817
fix(auth): admin role missing privileges (#13337) 2025-04-28 11:42:41 +05:30
Shirshanka Das
894655622d
fix(gms): class collision in mcp/mcl consumers (#13345)
Co-authored-by: David Leifker <david.leifker@acryl.io>
2025-04-27 22:50:28 -07:00
Shirshanka Das
2bdf4933e2
ci: ensure smoke-tests run on python 3.11 (#13344)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-27 20:48:14 -07:00
david-leifker
5fea1b9f69
config(): swagger ui path (#13343) 2025-04-27 18:46:35 -07:00
Jay
c800ac3131
feat(docs): updating links to demo.datahub.com (#13336) 2025-04-27 18:37:30 -07:00
Jonny Dixon
9c718c870e
feat(ingestion/neo4j): Add stateful_ingestion and platform_instance capabilities to connector (#12631)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-26 11:52:20 +01:00
John Joyce
c844e485b5
fix(ui): Hide deleted assets on home page recommendations (#13328)
Co-authored-by: John Joyce <john@ip-192-168-1-63.us-west-2.compute.internal>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-04-25 18:37:36 -07:00
Michael Minichino
fee67788a3
fix(ingest/mode): Add pagination and warn on missing reports (#13322) 2025-04-25 18:21:27 -05:00
Andrew Sikowitz
284f26ddd3
fix(ui/lineage): Fix bug when hiding transformations: edges via queries would disappear (#13323) 2025-04-25 17:39:54 -05:00
Andrew Sikowitz
825ac9ae7d
fix(graphql/stats): Support SAMPLE dataset profiles with sample size (#13327) 2025-04-25 15:24:54 -07:00
Anna Everhart
0c2ffb3409
changed logo (#13338) 2025-04-25 14:45:27 -07:00
Andrew Sikowitz
632d14e592
dev(cypress): Add docs on running cypress tests against a remote instance (#13341) 2025-04-25 16:43:27 -05:00
Harshal Sheth
ab760f0fef
fix(ingest): handle newlines in CLI-internal log tracker (#13340) 2025-04-25 13:08:10 -07:00
david-leifker
9b0634805a
feat(ingestion-openapi): patch support (#13282)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-04-25 13:54:28 -05:00
Esteban Gutierrez
c37eee18e6
chore(aws-msk-iam-auth): bump dependency version (#12600)
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-04-25 11:20:50 -05:00
Deepak Garg
5f73d21b7e
fix(gms): resolve cyclic dependency in Neo4jGraphService.java (#13302) 2025-04-25 11:06:25 -05:00
Chakru
25a78d4960
ci: fix publish and scan tasks and schedule (#13332) 2025-04-25 20:48:08 +05:30
Jay
66e59f6ee7
feat(web) Update OSS demo button to link to new site (#13324) 2025-04-25 11:11:57 -04:00
Aseem Bansal
49ee849382
chore: update lint dependencies (#13316)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-04-25 20:29:11 +05:30
Chakru
0f473232a3
build: support reload of some modules with env changes (#13325) 2025-04-25 08:12:41 +05:30
Andrew Sikowitz
3e11bb7d04
fix(ui): Sort less files first (#13268) 2025-04-24 15:25:37 -07:00
Anthony Burdi
ea47645ca4
ci: add anthonyburdi to team in pr-labeler.yml (#13326) 2025-04-24 16:27:48 -05:00
Deepak Garg
ce82a96bbd
fix(UI): fix business attributes related schemaFields (#13313) 2025-04-24 12:16:32 -07:00
Chakru
51863325a5
build: use versioning in gradle consistent with ci (#13259)
Enable gradle for all image builds used for publishing, eliminating the per-image build action in docker-unified.yml that duplicated what was in gradle but used slightly different mechanisms to determine the tag. The gradle build now consumes the tags provided by the workflow and produces the same tags as before.

Use bake matrix builds to build slim/full versions of datahub-ingestion and datahub-actions.

The image publish and scan tasks rely on gradle, via depot, to get the list of images.

Image publishing and scans run once a day on a schedule or on manual triggers only.
Pending work: split the publish and scan steps into a separate workflow that runs on a schedule and could also run other tests.
2025-04-24 23:00:03 +05:30
Anna Everhart
34feb0f3f1
removed dashed line and weird border radius in sidebar sections (#13266) 2025-04-24 10:23:40 -07:00
Anna Everhart
03025b67b5
updated nav logo (#13308) 2025-04-24 08:47:36 -07:00
Chris Collins
0ce24e3199
fix(cypress) Improve flakiness of managing_secrets v1 and v2 (#13311) 2025-04-24 10:00:59 -04:00
purnimagarg1
d638d6f030
fix(ui/structured-properties): add data contract entity in v1 to fix structured properties page issue (#13300)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
2025-04-24 09:55:59 -04:00
v-tarasevich-blitz-brain
49bb2b50a5
feat(searchBarAutocomplete): add description to matched fields to results of the search bar (#13314) 2025-04-24 09:48:53 -04:00
david-leifker
effff339e5
fix(cypress): fix cypress test data incidents (#13305) 2025-04-24 12:31:50 +05:30
Harshal Sheth
f6764ee17a
chore(sdk): rename _schema_classes to mark it as internal-only (#13309) 2025-04-24 12:22:01 +05:30
jmacryl
3c9da5a0e9
feat(search): use zstd-no-dict codec in Opensearch (#13273)
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-04-23 20:22:46 -05:00
Chris Collins
86daf2bd5b
fix(cypress) Fix flaky mutations/domains cypress test (#13310) 2025-04-23 19:04:31 -04:00
Chris Collins
fb4c505800
fix(cypress) Fix occasional socket closed exception in cypress (#13312) 2025-04-23 18:10:16 -04:00
Chris Collins
050d003169
fix(cypress) Fix flakiness in nested_domains and v2_nested_domains cypress tests (#13304) 2025-04-23 17:59:06 -04:00
Andrew Sikowitz
f3a41201f2
fix(ui/graphql): Fetch glossary node details when fetching glossary node children (#13307) 2025-04-23 12:50:28 -07:00
John Joyce
1cb616cff9
refactor(ui): Use phosphor icons for asset health (#13293)
Co-authored-by: John Joyce <john@Mac-242.lan>
2025-04-23 11:55:20 -07:00
purnimagarg1
c52cae72cb
improvement(component-library): make improvements in the checkbox component (#13299) 2025-04-23 14:20:18 -04:00
v-tarasevich-blitz-brain
854d2025f4
feat(searchBarAutocomplete): add support of matched fields to the search bar (#13255) 2025-04-23 12:53:55 -04:00
Chakru
294d77446b
ci: increase runner size for smoke tests (#13301) 2025-04-23 21:59:37 +05:30
Sergio Gómez Villamor
1c5b7c18fc
chore(ingestion): removes ignore for SIM117 ruff rule (#13295) 2025-04-23 15:55:46 +02:00
skrydal
1c1734bf41
doc(ingestion/iceberg): Improve Iceberg docs (#13097) 2025-04-23 14:24:35 +02:00
Sergio Gómez Villamor
1563b0e9fb
fix(ingestion): use default generate_browse_path_v2 even if no pipeline_config (#13117) 2025-04-23 13:25:58 +02:00
Aseem Bansal
64829a3279
chore(build): upgrade dependencies (#13286) 2025-04-23 14:49:40 +05:30
Aseem Bansal
1de5fb3e6f
fix(cli): redact more secrets (#13287) 2025-04-23 14:49:26 +05:30
Sergio Gómez Villamor
0b6fd75d37
feat(slack): restores retry logic for get_user_to_be_updated (#13228) 2025-04-23 09:14:12 +02:00
Esteban Gutierrez
40c4579810
fix(): Remove embedded tomcat transitive dependency from spring boot (#13283) 2025-04-22 20:07:22 -05:00
v-tarasevich-blitz-brain
ee40115bc6
fix(searchBarAutocomplete): UI fixes for the search bar (#13229) 2025-04-22 18:40:26 -04:00
v-tarasevich-blitz-brain
e3c45a7da7
feat(searchBarAutocomplete): add options to mixpanel events (#13284) 2025-04-22 17:59:13 -04:00
Jay
6651c8ed95
fix(web) clean up domains ui (#13143) 2025-04-22 17:50:57 -04:00
Harshal Sheth
82fafceba4
ci(airflow): separate airflow constraints from deps (#13291) 2025-04-22 13:54:15 -07:00
Chris Collins
45c5a620e7
feat(ui) Add new Tabs component and replace on home page (#13144) 2025-04-22 16:26:45 -04:00
Andrew Sikowitz
c140450a1e
fix(ui/storybook): Allow relative imports to fix storybook build (#13292) 2025-04-22 13:05:16 -07:00
Harshal Sheth
a88e15c0d2
fix(ci): ensure extra airflow requirements are respected (#13289) 2025-04-22 12:08:28 -07:00
Harshal Sheth
fe3ae92a5e
chore: only show pull request checklist on PR creation (#13290) 2025-04-22 12:08:10 -07:00
Chris Collins
fa531d70c8
fix(cypress) Catch resizeObserverLoop globally and fix setThemeV2 (#13288) 2025-04-22 14:12:42 -04:00
david-leifker
169c982b4d
chore(pegasus): bump pegasus 29.65.7 (#13285) 2025-04-22 12:46:42 -05:00
david-leifker
0b2e0ef100
fix(): docker mysql env (#13274) 2025-04-22 12:37:24 -05:00
v-tarasevich-blitz-brain
7b8aed85a7
refactor(searchBarAutocomplete): refactoring of the new search bar (#13199)
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
2025-04-22 12:16:01 -04:00
purnimagarg1
6b1806465a
improvement(component-library): bring back component library changes (#13263) 2025-04-22 11:07:55 -04:00
Chris Collins
886f3f8c43
fix(cypress) Fix flakiness in dataset_ownership cypress test (#13280) 2025-04-22 10:45:48 -04:00
Chris Collins
3ae7d1f3f4
fix(cypress) Fix flakiness in query_tab cypress test (#13279) 2025-04-22 10:45:19 -04:00
Sergio Gómez Villamor
a8637abfe2
tests(kafka): fixing flaky tests (#13171) 2025-04-22 12:58:47 +02:00
Chris Collins
a02ca68386
fix(pytest) Fix broken pytest after recent schema field urn change (#13278) 2025-04-21 18:03:22 -04:00
Harshal Sheth
f48c6b53ee
feat(ingest/snowflake): show returned query row counts (#13246) 2025-04-21 14:41:40 -07:00
Chakru
54156ea78a
fix(cli): use patch to update dataset properties (#13226) 2025-04-21 14:41:31 -07:00
Esteban Gutierrez
8a17ba14d6
fix(): Fixes multiple minor security vulnerabilities (#13222)
bug(snappy): Make sure right snappy version is installed
fix(docker): update Dockerize to version v0.9.3
fix(gms): fixes dgraph4j netty deps
fix(docker): remove SGID on /home/datahub and /home/datahub-integration
fix(datahub-actions): bump setuptools and wheel version
fix(docker): update c-ares version
fix(docker): datahub-actions addendum
2025-04-21 16:39:26 -05:00
Harshal Sheth
fa750573e2
fix(actions): fix datahub-actions publishing + wheels (#13276) 2025-04-21 14:19:25 -07:00
Harshal Sheth
796331a960
docs: fix fivetran docs formatting (#13277) 2025-04-21 12:38:15 -07:00
Harshal Sheth
08453cfbb1
fix(ingest/hive): support multiline view definitions (#13248) 2025-04-21 11:11:35 -07:00
Harshal Sheth
9f7f3cb886
chore(ingest/snowflake): remove unused query code (#13245) 2025-04-21 10:31:36 -07:00
v-tarasevich-blitz-brain
6f71a2d204
fix(select): fix select opening on clear (#13202) 2025-04-21 12:16:57 -04:00
v-tarasevich-blitz-brain
8e479f8296
feat(searchBarAutocomplete): add support of searchAPI in the search bar (#13151) 2025-04-21 12:16:16 -04:00
david-leifker
a409037a9e
feat(lineage-graph): optimize lineage queries (#13257) 2025-04-21 10:41:43 -05:00
dependabot[bot]
9f91edc724
build(deps): bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 (#13271)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-21 09:18:07 -05:00
Harshal Sheth
64bda48b51
feat(sdk): auto-fix bad entity type casing (#13218) 2025-04-20 21:40:00 -07:00
Harshal Sheth
5ba8b7d173
fix(ingest/fivetran): use project id by default for bigquery (#13250) 2025-04-20 21:39:40 -07:00
Pedro Silva
61a71de360
docs(fivetran): Update docs on connection mapping to source systems (#13256) 2025-04-20 10:19:46 +01:00
Andrew Sikowitz
e9a7d35cb8
docs: Format stragglers via prettier (#13247) 2025-04-18 15:08:45 -07:00
Andrew Sikowitz
23ceff950b
fix(ui): Various minor fixes (#13253) 2025-04-18 13:54:24 -07:00
dependabot[bot]
d0c31a0ec5
build(deps-dev): bump vite from 4.5.6 to 4.5.11 in /datahub-web-react (#13054)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
2025-04-18 11:53:31 -05:00
Felix Lüdin
05a42196d7
fix(ui): fix a type error when a CorpGroup entity appears in the search result with theme V2 (#13254) 2025-04-18 12:17:10 -04:00
dependabot[bot]
210dcdc557
build(deps): bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 (#12892)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-18 11:15:01 -05:00
RyanHolstien
98c88e7263
fix(patch): add encoding to patch builders (#13164) 2025-04-18 11:02:53 -05:00
rahul MALAWADKAR
8ef571b3c4
chore(deps): fix (org.glassfish:javax.json) (#13197) 2025-04-18 11:02:05 -05:00
Dmitry Bryazgin
bdd8c9935a
fix(metadata-io): Fixes a random failure for LineageDataFixtureTestBase.testDatasetLineage() (#13215)
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
2025-04-18 11:00:24 -05:00
Rafael Sousa
d72361ddb3
feat(deps): update OpenTelemetry version to 2.15.0 (#13237) 2025-04-18 10:57:01 -05:00
dependabot[bot]
b49a86505f
build(deps): bump http-proxy-middleware from 2.0.7 to 2.0.9 in /docs-website (#13243)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-18 10:55:18 -05:00
amit-apptware
9bea6341c2
Fix(ui/incident): Refactor code for updating incidents (#13172) 2025-04-17 19:15:15 -07:00
Chris Collins
36cce7d1be
fix(ui) Show warning in the UI when we use lightning cache in impact analysis (#13139) 2025-04-17 17:54:43 -04:00
Chris Collins
de9de49f22
Update 0.3.10 docs explaining bug and fix in 0.3.10.2 (#13114)
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
2025-04-17 17:53:10 -04:00
david-leifker
abfdfb6a3a
v1.0.0 release notes update (#13260) 2025-04-17 16:24:25 -05:00
Chris Collins
f39ad1f735
fix(gms) Generate schema field urns consistently (#5583) (#13221) 2025-04-17 16:21:24 -04:00
Kevin Karch
24a866fcd8
docs(snowflake): remove procedures from list of assets not ingested (#13258) 2025-04-17 15:11:29 -04:00
skrydal
47490ec050
feat(ingestion/iceberg): Add capability to extract namespace properties to the iceberg ingestor (#13238) 2025-04-17 16:29:43 +02:00
Hyejin Yoon
72aab9fe63
feat(sdk): add sdk lineage client (#13244)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-17 17:44:07 +09:00
Harshal Sheth
b75dbaa3a1
feat(ci): make datahub-actions docker build standalone (#13241) 2025-04-17 08:51:46 +02:00
Hyejin Yoon
4e37202373
docs(mlflow): add docs on version requirement for mlflow (#13251) 2025-04-17 15:13:13 +09:00
Sergio Gómez Villamor
f3dfd86680
fix(looker): missing Looker Explore relationship when Look references multiple models (#13198) 2025-04-17 07:53:13 +02:00
Maggie Hays
0875f79477
docs(forms-analytics) Adding Forms Analytics guide (#13205)
Co-authored-by: Chakru <161002324+chakru-r@users.noreply.github.com>
2025-04-16 20:43:34 -05:00
Andrew Sikowitz
d138a64a6a
ci(graphql,workflows): Format .md, .graphql, and workflow .yml files via prettier (#13220) 2025-04-16 16:55:51 -07:00
Andrew Sikowitz
45ebaa7d1a
build(ui): Sort imports and enforce absolute imports with aliases (#13113) 2025-04-16 16:55:38 -07:00
John Joyce
dbad52283b
fix(events API): Add back the original fixes for events API (#13242)
Co-authored-by: John Joyce <john@Mac-2582.lan>
2025-04-16 16:52:01 -07:00
Anna Everhart
dc21386247
fix(ui): updated sidebar tabs, removed margins, updated icon in ML (#13234) 2025-04-16 15:33:17 -07:00
Anna Everhart
2660276b4e
fix(ui): removed line height from oss nav header (#13235) 2025-04-16 15:32:26 -07:00
Saketh Varma
c37c42aaa1
fix(ci): cloudflare github actions syntax (#13225) 2025-04-16 11:31:33 -04:00
Tamas Nemeth
a1291faad9
feat(actions): Moving datahub-actions into oss datahub (#13120) 2025-04-16 12:23:29 +02:00
Tamas Nemeth
5edd41c4bf
doc(ingestion/gc): Add doc for GC source (#12296) 2025-04-16 11:01:44 +02:00
Hyejin Yoon
da83cb6afe
feat(ui) : add dpi stat column to v2 search card (#12972) 2025-04-16 12:22:38 +09:00
John Joyce
b2e3d0c1c6
fix(): Hotfix assertions tag add / remove (#13216)
Co-authored-by: John Joyce <john@Mac-2447.lan>
2025-04-15 18:52:49 -07:00
Hyejin Yoon
6cc3fff57f
fix(ingest/mlflow): pin mlflow-skinny version (#13208) 2025-04-16 08:27:31 +09:00
Anna Everhart
4e30fa78ed
feat(ui): Introducing flows for creating and deleting manage tags (#13107) 2025-04-15 15:16:33 -07:00
david-leifker
4df375df32
fix(api-validation): correct mcp/urn entity mismatch (#13212) 2025-04-15 16:02:56 -05:00
ryota-cloud
e79445e469
fix(metadata-ingestion) update vertexAI source doc with permissions detail (#13219) 2025-04-15 13:37:26 -07:00
david-leifker
5da4698ed9
docs(): Update generic patch openapi-usage-guide (#13210) 2025-04-15 15:13:52 -05:00
Chakru
912fac3eb5
fix(ci): fix image publish task, python-docker version (#13206) 2025-04-15 08:44:09 -07:00
Esteban Gutierrez
4c1c12a1e0
Update pr-labeler.yml (#13204) 2025-04-15 13:23:35 +05:30
Hyejin Yoon
bafd93d38f
feat(sdk): add mlmodel and mlmodelgroup (#13150) 2025-04-15 16:12:38 +09:00
Chakru
5b1e194a3c
build: use docker bake to build all images in a single step (#13191) 2025-04-15 09:48:11 +05:30
John Joyce
4c6b356a50
refactor(ui): UI hotfixes for the deprecated popover + sidebar logic section (#13142)
Co-authored-by: John Joyce <john@ip-192-168-1-63.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Mac-1768.lan>
Co-authored-by: John Joyce <john@Mac-2177.lan>
2025-04-14 17:00:05 -07:00
v-tarasevich-blitz-brain
e7a0447267
feat(searchBarAutocomplete): Add redesigned search bar (#13106)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-04-14 16:34:58 -04:00
Dmitry Bryazgin
16471479d6
fix(metadata-io): Fixes a typo in the test code (#13192)
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
2025-04-14 15:08:46 -04:00
Andrew Sikowitz
c971cdaebc
fix(ui/v2): Remove code around usage and storage features (#13049) 2025-04-14 12:05:42 -07:00
Chris Collins
3fd0e37111
fix(ui) Sanitize V1 UI sidebar description section (#13203) 2025-04-14 13:34:15 -04:00
Chris Collins
319b849532
fix(ui) Fix backwards compatibility bug with entity structured props (#13159) 2025-04-14 13:33:04 -04:00
david-leifker
143c18679b
test(config): update metadata-io tests (#13193) 2025-04-14 12:05:46 -05:00
Anna Everhart
1a55dd357c
fix(ui styles): updated hard coded button gradient and added it only for violet (#13201) 2025-04-14 09:03:11 -07:00
Chris Collins
9258226e2f
fix(cypress) Fix flaky manage_policies cypress test (#13188) 2025-04-14 10:33:45 -04:00
Sergio Gómez Villamor
60b769fbf6
feat(airflow-plugin): ability to disable datajob lineage (#13187) 2025-04-14 16:15:19 +02:00
david-leifker
5ee0b66920
feat(spring): upgrade to SpringBoot 3.4 (#13186) 2025-04-12 15:34:17 -05:00
Andrew Sikowitz
3e37f76428
feat(ingest/tableau): Allow specifying asset types for ingest_hidden_assets (#13190) 2025-04-11 20:07:37 -07:00
Harshal Sheth
8df235a40c
fix: more Python dockerfile refactoring (#13180) 2025-04-11 14:57:49 -07:00
david-leifker
a5f9bd94ae
fix(platform-events): add platform events privilege to platform list (#13189) 2025-04-11 16:38:58 -05:00
Anna Everhart
0b26ee0ad5
fix(): Update OSS to DataHub Brand (#13160)
Co-authored-by: John Joyce <john@acryl.io>
2025-04-11 10:33:42 -07:00
Anna Everhart
ff34e11d52
fix(): Match navigation with sidebar icons fill when selected (#13088) 2025-04-11 09:25:37 -07:00
Anna Everhart
3744668ada
fix(): Clear icons were purple after adding button gradient (#13141) 2025-04-11 09:22:48 -07:00
Gabe Lyons
1bcdda740d
feat(data contracts): supporting structured properties on data contracts (#13176) 2025-04-11 08:30:20 -07:00
Chris Collins
ed0cfe911f
fix(cypress) Fix broken searchFilters.js cypress test (#13185) 2025-04-11 10:06:04 -05:00
Deepak Garg
ac9f3ca8ef
fix(gms): bean not found (#13183) 2025-04-11 09:00:25 -05:00
david-leifker
b9e5d213b2
feat(openapi): platform events endpoint (#13179) 2025-04-11 08:55:03 -05:00
Tamas Nemeth
e048cf7ce7
fix(ingest/sigma): Fix missing key in workspace_counts (#13182) 2025-04-11 15:28:43 +02:00
ryota-cloud
ca4eab52e4
VertexAI Connector (v3 - pipeline and pipeline task) (#12960) 2025-04-10 22:54:17 -07:00
Maggie Hays
4b85e86cdb
docs(champions) Update DataHub Champions (#13025) 2025-04-10 19:22:24 -05:00
Maggie Hays
b87602a32f
docs: remove scarf (#13177) 2025-04-10 19:22:12 -05:00
david-leifker
d49b7841cb
feat(openapi): add entity patch support (#13165) 2025-04-10 18:17:27 -05:00
david-leifker
2add837621
fix(): handle null systemmetadata corner cases (#13086) 2025-04-10 17:44:58 -05:00
david-leifker
b82ec1c65e
fix(runId): make sure runid includes urn (#13175) 2025-04-10 15:21:42 -05:00
Andrew Sikowitz
ca51df880f
fix(ingest/snowflake): Use CREATE change type when creating structured properties; support MCP headers (#13158) 2025-04-10 15:13:13 -05:00
Jay
bb479db079
feat(docs-site) banner color alignment (#13174) 2025-04-10 15:20:36 -04:00
david-leifker
075175c749
fix(): ingestion backfill source v2 (#13173) 2025-04-10 13:34:40 -05:00
Jay
7365ac6c64
fix(web) execution request details modal prioritizes stats from ingestion report (#13161) 2025-04-10 12:34:07 -04:00
Anna Everhart
2504d28255
refactor(ui): Update colors (#13154) 2025-04-10 09:21:19 -07:00
david-leifker
c17549a6f2
docs(tracing): Add known limitations about openapi tracing (#13091) 2025-04-10 11:09:16 -05:00
Michael Minichino
0b105395e9
feat(ingest/powerbi): Support ODBC Data Source (#13090) 2025-04-10 08:57:37 -05:00
Harshal Sheth
dd3aff90a0
fix(ci): enforce docker snippet validation in CI (#13163) 2025-04-09 22:49:59 -07:00
Hyejin Yoon
443134ca96
fix(ingest/mlflow): skip experiment/run ingestion for older version of mlflow (#13122) 2025-04-10 14:09:09 +09:00
jayacryl
07f68d1278 feat(docs-site) added a little line spacing in announcement banner 2025-04-09 21:22:24 -04:00
jayacryl
bd1e313197 feat(web) touch up spacing on banner 2025-04-09 20:47:49 -04:00
Harshal Sheth
0c67336f06
fix(ci): add actionlint file (#13157) 2025-04-09 17:14:38 -05:00
Chris Collins
ac6bca7f61
fix(ui) Fix query tab filter dropdowns showing raw urns (#5230) (#13132) 2025-04-09 16:12:32 -04:00
Harshal Sheth
9f0c0aa3dd
refactor(ingest/sigma): make some error cases more clear (#13110)
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-04-09 12:03:17 -07:00
skrydal
7b6ab3ba15
fix(ingest): Make workunit processor that ensures schema size more aggressive (#13153) 2025-04-09 12:02:32 -07:00
Harshal Sheth
275535d4d3
feat: start removing ingestion-base image (#13146) 2025-04-09 11:53:34 -07:00
jmacryl
bbbeab8467
Update pr-labeler.yml (#13156) 2025-04-09 13:43:25 -05:00
amit-apptware
48ff755b72
refactor(ui/incident-v2) : Raise incident from search card and header (#13149) 2025-04-09 09:35:17 -07:00
Harshal Sheth
e4a8c77344
fix(ingest): quote db name in streams query (#13131) 2025-04-09 09:03:36 -07:00
Jay
ae95a5c408
fix(web) structured props to display right (#13128) 2025-04-09 11:24:03 -04:00
Shirshanka Das
2f07dc3fcd
docs-website: fix search modal positioning and support smaller screens (#13152) 2025-04-09 08:05:04 -07:00
Harshal Sheth
072cd8b30f
docs: link to mcp-server-datahub repo (#13134) 2025-04-09 07:47:53 -07:00
Sergio Gómez Villamor
e7d8f2913c
fix(snowflake): fixes deduplication and fingerprint requirements for Hex (#13121) 2025-04-09 10:17:43 +02:00
Sergio Gómez Villamor
75894399f0
fix(hex): fixes AccessType model (#13123) 2025-04-09 09:04:05 +02:00
Aseem Bansal
2a75a981ca
chore(ci): upgrade ruff version (#13125) 2025-04-09 11:24:07 +05:30
ryota-cloud
ba9df6c4f1
fix(metadata-io) improve logging to add search response when ES search fails (#13119) 2025-04-08 22:47:49 -07:00
amit-apptware
7d9d8b925a
fix(ui/incident): Change for showing the custom type and note (#13056) 2025-04-08 20:38:25 -07:00
Shirshanka Das
b8fad7e383
docs-website: add announcement for MCP. Some improvements in rendering (#13137) 2025-04-08 20:17:43 -07:00
Anna Everhart
7ac926d7da
refactor(ui): Replace blues and greens with violets (#13138) 2025-04-08 18:27:30 -07:00
Anna Everhart
ebb4d2e487
refactor(ui): Updated impact / explore switch to colors.violet[600] (#13136) 2025-04-08 18:27:00 -07:00
Harshal Sheth
1fca9855ee
fix(ingest/snowflake): fix error on stored procs in non-SQL languages (#13127) 2025-04-08 17:00:56 -07:00
John Joyce
742d060722
feat(models): Making ML deployment status searchable! (#13140)
Co-authored-by: John Joyce <john@ip-192-168-1-63.us-west-2.compute.internal>
2025-04-08 15:09:25 -07:00
Chris Collins
538072069b
feat(forms) Add validator preventing duplicate form prompt IDs globally (#13135) 2025-04-08 17:14:26 -04:00
Maggie Hays
edd052c4dc
docs(remote-executor) Remote Executor guides (#13115) 2025-04-08 16:07:57 -05:00
Chris Collins
23edbca6cd
fix(ui) Render assets owned by groups you are member of as your own assets (#13133) 2025-04-08 15:50:40 -04:00
Sergio Gómez Villamor
5c7b8e10ce
fix(hex): filter out queries if non scheduled runs (#13126) 2025-04-08 20:55:28 +02:00
John Joyce
967db2a136
refactor(ui): Fix appearance of skeleton loading indicator for search bar (#13129)
Co-authored-by: John Joyce <john@Johns-MacBook-Pro.local>
2025-04-08 11:07:10 -07:00
Anna Everhart
9b7b534283
refactor(ui): Updated Button Component (#13130) 2025-04-08 11:06:45 -07:00
v-tarasevich-blitz-brain
c906547ea5
feat(searchBarAutocomplete): improve select components for autocomplete (#13083) 2025-04-08 13:26:00 -04:00
v-tarasevich-blitz-brain
dbbe5639ee
feat(searchBarAutocomplete): add autocomplete entity item component (#12879)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-04-08 11:32:01 -04:00
Sergio Gómez Villamor
7dd4f06e71
docs(hex): additional limitations (#13103) 2025-04-08 07:59:21 +02:00
Anna Everhart
c1187c8fa4
refactor(ui): updated Titles and styling for Features page OSS (#13070) 2025-04-07 16:53:05 -07:00
Anna Everhart
81f36d8224
refactor(ui): Updating icons, gaps, and pills in selects (#13098) 2025-04-07 16:27:42 -07:00
John Joyce
f758785cee
fix(ui): Styling fix for css regression (#13077)
Co-authored-by: John Joyce <john@Mac-32.lan>
2025-04-07 15:52:14 -07:00
Chris Collins
ffb4b5f627
Update docs discussing 0.3.10.1 (#13108) 2025-04-07 17:53:19 -04:00
v-tarasevich-blitz-brain
2dea3780d8
fix UI bugs on queries tab (#13060) 2025-04-07 15:50:15 -04:00
Chakru
d664b9f4ff
fix(ci): make depot remote container builder optional (#13105) 2025-04-07 23:21:03 +05:30
v-tarasevich-blitz-brain
3f8d61c39d
feat(searchBarAutocomplete): add autocomplete component to components library (#12867) 2025-04-07 12:10:56 -04:00
david-leifker
f16f056e91
config(): cache telemetry id (#13089) 2025-04-07 10:51:28 -05:00
david-leifker
dd14507d7d
fix(openapi): required fields w/ defaults (#13095) 2025-04-07 10:51:03 -05:00
Chris Collins
cfd891bdc8
docs(cloud): DataHub Cloud v0.3.10 release notes (#13034)
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: Maggie Hays <maggiem.hays@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>
2025-04-07 15:22:15 +05:30
Chakru
cf40116680
CI speedup (#13057) 2025-04-07 10:13:07 +05:30
Sergio Gómez Villamor
4e48e098dc
fix(ingestion): fixes missing platform instance aspect for DataFlow entity (#13080) 2025-04-06 08:19:47 +02:00
Gabe Lyons
dadc27fd0c
feat(structured properties): use wider search select modal to edit structured properties (#13076) 2025-04-05 08:24:55 -07:00
david-leifker
287fda19c7
fix(docker): also rename group to datahub (#13094) 2025-04-04 18:30:26 -05:00
Anna Everhart
9a2aedbac7
refactor(ui): update searchbar width in manage tags (#13064) 2025-04-04 14:40:27 -07:00
skrydal
38f1553315
feat(ingestion): Refactoring timestamping logic for WorkUnits + custom logic for Iceberg (#13030)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-04-04 22:30:27 +02:00
Dmitry Bryazgin
ec7c099384
feat(metadata-models-custom): Use java-library plugin to extend the java plugin and add additional features specifically for building Java libraries. (#12965)
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
2025-04-04 15:04:23 -05:00
Pedro Silva
a4b343cc82
fix(ingest/delta-lake): Bump delta-lake dependency (#12766)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-04-04 12:19:13 -07:00
Anna Everhart
4301b1f08a
refactor(ui): Updated page title and styling to match new ui (#13079)
Co-authored-by: John Joyce <john@acryl.io>
2025-04-04 11:07:16 -07:00
Tamas Nemeth
df119cea1a
fix(ingest/mlflow): Fix stateful ingestion setup (#13084) 2025-04-04 18:47:19 +02:00
Tamas Nemeth
250b100a93
doc(ingestion/s3): Document permissions requirements for s3 source (#12816)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-04 18:28:42 +02:00
Rafael Sousa
c1f6bd171e
fix(opentelemetry): Resolve type mismatch in metrics exporter (#13053)
Co-authored-by: Pedro Silva <pedro@acryl.io>
2025-04-04 15:51:10 +01:00
Sergio Gómez Villamor
b37fa03846
chore: fixes SIM118 ruff rule (#13069) 2025-04-04 11:59:43 +02:00
Andrew Sikowitz
0e068e2fe3
docs(cloud): DataHub Cloud v0.3.9.2 release notes (#13075) 2025-04-04 14:07:13 +05:30
Rasnar
38e240e916
feat(ingest/airflow): platform_instance support in Airflow plugin (#12751)
Co-authored-by: rasnar <11248833+Rasnar@users.noreply.github.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-04-04 09:26:58 +02:00
Hyejin Yoon
34928ad5f5
fix(ui): humanize timestamps on ML entities UI (#12788) 2025-04-04 15:00:25 +09:00
Jay
372feeeade
feat(ingestion) cleaning up ingestion page UI (#12710) 2025-04-03 17:55:18 -04:00
david-leifker
046c59bdb5
chore(): bump base ubuntu image 22.04 -> 24.04 (#13072) 2025-04-03 14:39:41 -05:00
John Joyce
e465c99e8b
fix(ui): Adding incident changes from DataHub Cloud QA (#13074)
Co-authored-by: John Joyce <john@Mac-27.lan>
2025-04-03 11:37:10 -07:00
Gabe Lyons
1e999090e6
docs(oss-vs-cloud): update to re-align with current offering (#13063)
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
2025-04-03 13:22:48 -04:00
david-leifker
efa98e558b
chore(): bump parquet-avro (#13071) 2025-04-03 11:28:43 -05:00
Chris Collins
c4866a959c
fix(docs) Update impact analysis docs to call out lightning cache bugs (#12918) 2025-04-03 09:16:40 -07:00
Anna Everhart
497ac3c58b
fix(lint): update icon props in sidebar components to match IconProps type (#13068) 2025-04-03 12:15:16 -04:00
Pedro Silva
f8f2fc1b60
feat(docs): Add environment variables for OSS 1.0.0 (#12894)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-03 09:13:11 -07:00
Harshal Sheth
4d53df63a2
fix(ingest/sigma): include workspace names in report (#13055) 2025-04-03 09:09:44 -07:00
dependabot[bot]
c5aa6cb56a
build(deps): bump prismjs from 1.29.0 to 1.30.0 in /docs-website (#12849)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-03 10:29:33 -05:00
dependabot[bot]
ebc02f8f83
chore(deps): bump tar-fs from 2.1.1 to 2.1.2 in /docs-website (#13023)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-03 10:29:18 -05:00
dependabot[bot]
9dafe1bd02
build(deps): bump image-size from 1.1.1 to 1.2.1 in /docs-website (#13059)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-03 10:29:01 -05:00
Sergio Gómez Villamor
d2bb33f7c5
feat(ingest): new hex connector - part 2 (#12985) 2025-04-03 12:44:37 +02:00
Nate Bryant
7618af549c
fix(logging): fixes slow query logging formatting and adds parsing fo… (#12955)
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
2025-04-02 15:10:46 -05:00
Andrew Sikowitz
90666ef0f5
refactor(ui/v2): Update icons and search bars on schema tab, glossary search, domain search (#12740) 2025-04-02 12:16:50 -07:00
david-leifker
a1096e615a
fix(test): improve test stability (#13062) 2025-04-02 13:46:52 -05:00
Anna Everhart
156d4de375
refactor(ui): Updated Sidebar for Glossary and Domains to have same styling and fixed count badge (#13018) 2025-04-02 11:03:23 -07:00
Peter Wang
719cc67cac
feat(ingest/superset): leverage threads for superset API calls (#13006)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-04-02 10:27:57 -07:00
Kevin Karch
cd05c0f2fe
docs(ingest): more specific CLL limitation language (#13061) 2025-04-02 13:12:25 -04:00
david-leifker
f11d7f9287
refactor(auth-filter): refactor exception and logging (#13035) 2025-04-02 12:00:41 -05:00
david-leifker
1e6db0680a
fix(openapi): restore openapi v3 aspect version endpoint (#13047) 2025-04-02 09:15:34 -05:00
Saketh Varma
9071d67545
fix(ci): Avoid meticulous steps on fork PRs (#13051) 2025-04-02 11:01:03 -03:00
skrydal
ff799c9370
feat(ingestion/iceberg): source lastModified from table metadata field (#13052) 2025-04-02 12:05:41 +02:00
Sergio Gómez Villamor
e072a42d03
feat(ingest): adds get_entities_v3 method to DataHubGraph (#13045) 2025-04-02 10:22:14 +02:00
Chakru
40106be208
build: optimizations for incremental builds and faster CI (#13033)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-04-02 11:51:10 +05:30
John Joyce
87af4b9d53
refactor(): Fix incidents feedback on QA (#13044)
Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>
Co-authored-by: John Joyce <john@Mac-307.lan>
2025-04-01 17:22:41 -07:00
Harshal Sheth
18aa1f076d
fix(ingest/trino): always use table properties fallback (#13048) 2025-04-01 15:35:23 -07:00
Andrew Sikowitz
cc5ce6f19c
fix(ui/lineageV2): Convert toggle to hide data process instances instead of show (#13022) 2025-04-01 13:59:22 -07:00
Andrew Sikowitz
3132ca7c0c
test(metadata-io/graph-service): Update lineage registry creation for dgraph and neo4j tests (#13037) 2025-04-01 15:09:52 -05:00
Kevin Karch
d75de77d6b
docs(ingest): clarify snowflake key language (#13050) 2025-04-01 14:59:51 -04:00
Hugo Hobson
b394ae6350
docs(ingest): make fail_safe_threshold config visible in docs (#13017) 2025-04-01 12:49:13 +01:00
Hugo Hobson
acc84c2459
fix(cli): stop deployment config being overwritten by cli defaults (#13036)
`executor_id` and `time_zone` values set in the `deployment` block of a recipe are not used by the `datahub ingest deploy` cli command when deploying recipes. This is because the [cli values take precedence over the deployment config](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/utilities/ingest_utils.py#L63), so wherever the cli has default values, those are always used.

Default values should be set in [`DeployOptions`](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/utilities/ingest_utils.py#L23), not in the cli options; see the sketch after this entry.
2025-04-01 10:21:30 +01:00
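A minimal sketch of the precedence pattern this fix describes: defaults live on the options model, and the cli only overrides values that were explicitly passed. Field names and defaults here are illustrative assumptions, not the actual datahub implementation.

```python
# Hypothetical sketch of the deployment-config precedence issue; not the real
# datahub code, just the pattern: defaults on the model, CLI overrides only
# when explicitly provided.
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeployOptions:
    # Defaults belong here, on the options model, not on the CLI flags.
    executor_id: str = "default"
    time_zone: str = "UTC"


def merge_deploy_config(
    recipe_deployment: dict,
    cli_executor_id: Optional[str] = None,
    cli_time_zone: Optional[str] = None,
) -> DeployOptions:
    opts = DeployOptions(**recipe_deployment)
    # CLI values win only when the flag was explicitly passed (not None).
    if cli_executor_id is not None:
        opts.executor_id = cli_executor_id
    if cli_time_zone is not None:
        opts.time_zone = cli_time_zone
    return opts


# The recipe's `deployment` block is respected unless the CLI overrides it.
print(merge_deploy_config({"executor_id": "remote-pool", "time_zone": "Europe/London"}))
# DeployOptions(executor_id='remote-pool', time_zone='Europe/London')
```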
Sergio Gómez Villamor
c6acce9906
feat(powerbi): capture dataset report lineage (#12993) 2025-04-01 11:12:47 +02:00
ryota-cloud
ebea3b7ca3
fix(ingestion) create ExperimentKey instead of containerKeyId used in MLflow and Vertex AI (#12995) 2025-04-01 00:00:09 -07:00
ryota-cloud
b5af0084cb
fix(ui) fix subtype name of datajob on dataflow card (#13021)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
2025-03-31 23:59:19 -07:00
Hyejin Yoon
9e28c1af63
docs(mlflow): add docs for the mlflow dataset config (#12973) 2025-04-01 12:20:32 +09:00
ryota-cloud
b6af240e97
fix(ui) remove SQL from TaskIcon for Vertex AI Pipeline task (#13024)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-03-31 18:16:20 -07:00
Harshal Sheth
e7033e4d89
docs: hide rc releases from autogenerated docs (#13040) 2025-03-31 15:31:15 -07:00
John Joyce
00edc4205f
feat(Tags): Support Managing Tags via "Manage Tags" nav bar page (V1) (#12983)
Co-authored-by: Annadoesdesign <annaerocca@gmail.com>
Co-authored-by: Anna Everhart <149417426+annadoesdesign@users.noreply.github.com>
Co-authored-by: John Joyce <john@ip-192-168-1-64.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Mac.lan>
Co-authored-by: amit-apptware <132869468+amit-apptware@users.noreply.github.com>
Co-authored-by: John Joyce <john@Mac-302.lan>
2025-03-31 15:30:51 -07:00
Harshal Sheth
c2ca0c28c2
fix(ui): fix hex logo filename (#13041) 2025-03-31 15:28:09 -07:00
Harshal Sheth
c79192090d
feat(ingest): propagate backpressure in ThreadedIteratorExecutor (#13027) 2025-03-31 10:23:00 -07:00
Harshal Sheth
58169ad7cc
fix(sdk): fix bugs in v2 sdk search client (#13026) 2025-03-31 08:59:32 -07:00
Anna Everhart
dcfc5e6f15
refactor(ui): Updated entitysidebar tabs to have the same styling as the redesigned… (#12917) 2025-03-31 08:58:01 -07:00
Harshal Sheth
2ef0086394
feat(ingest): allow sources to produce sdk entities (#13028) 2025-03-31 08:33:17 -07:00
Aseem Bansal
68b4fcb054
fix(ingest): add mutator for ownership types (#13002) 2025-03-31 15:43:38 +05:30
Gabe Lyons
ee4827e1b2
fix(oracle): fixing oracle CLL for view parsing. (#13029) 2025-03-30 07:42:55 -07:00
Chakru
04a8e305a2
fix(lineage): lineage incorrect for some entities (#13020) 2025-03-30 19:01:19 +05:30
Chakru
d2dd54acf1
feat(dataset_cli): add dry-run support (#12814) 2025-03-29 18:08:52 +05:30
Harshal Sheth
3e4d14734b
feat(ingest/sigma): add reporting on filtered workspaces (#12998) 2025-03-28 16:58:34 -07:00
david-leifker
a37a1a502e
fix(system-update): make buildIndices first step (#13015) 2025-03-28 18:11:53 -05:00
david-leifker
fa80c8dbfe
refactor(): use static yaml mapper es search service (#13016) 2025-03-28 17:47:28 -05:00
Andrew Sikowitz
a846f9d92d
fix(ui/lineageV2): Fix bug with not getting all lineage nodes when hiding transformations (#13005) 2025-03-28 14:56:36 -07:00
Harshal Sheth
5bc8a895f9
chore(ingest): remove calls to deprecated methods (#13009) 2025-03-28 13:42:54 -07:00
Harshal Sheth
d3ef859802
fix(airflow): drop airflow 2.3 and 2.4 support (#13004) 2025-03-28 12:58:50 -07:00
Harshal Sheth
716bf87f8f
fix(ingest): make emitter endpoint fully env-controlled (#13019) 2025-03-28 12:21:54 -07:00
amit-apptware
5ea572aac5
Feat(ui/incident-v2) : Added smoke test and unit test cases for sibling create incidents (#12953) 2025-03-28 11:41:18 -07:00
Michael Minichino
8757a0e01a
feat(ingest/powerbi): PowerBI source updates (#12857) 2025-03-28 13:22:25 -05:00
skrydal
b7bb0b056c
feat(ingestion/iceberg): Refactor iceberg connector (#12921) 2025-03-28 18:34:36 +01:00
Aseem Bansal
d4812d7b05
fix(ingest/redshift): respect skip external table config (#13003) 2025-03-28 12:33:39 +05:30
Harshal Sheth
29e09185eb
fix(ingest): warn when API tracing is unexpectedly inactive (#13007) 2025-03-27 22:21:33 -07:00
Jay
c9999d4cf0
fix(web) stylistic lint downgraded to support legacy lint definition path (#12999) 2025-03-27 19:08:36 -04:00
Felix Lüdin
54fe8bbd8d
fix(ingest/powerbi): fix KeyError when ingesting apps from Power BI (#12975) 2025-03-27 13:42:15 -07:00
kyungryun choi
26d9962a98
fix(airflow): fix logging typo (#13000) 2025-03-27 13:41:37 -07:00
amit-apptware
6b96844bae
fix(ui/incident): Add validation to restrict unauthorized edit or resolve (#12950) 2025-03-27 11:32:33 -07:00
Harshal Sheth
f58febc040
feat(sdk): add FilterOperator enum to search sdk (#12997) 2025-03-27 11:09:56 -07:00
Mayuri Nehate
f9d6c9f1f9
docs(ingest/snowflake): update docs to capture key pair auth section (#13001) 2025-03-27 11:02:32 -07:00
Gabe Lyons
c045cf15a5
feat(edit lineage): add edit lineage functionality to datahub (#12976) 2025-03-27 09:34:40 -07:00
Peter Wang
e0d805c8f7
feat(ingestion/superset): column level lineage for charts (#12930)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
Co-authored-by: Shuixi Li <llance@users.noreply.github.com>
Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: amit-apptware <132869468+amit-apptware@users.noreply.github.com>
Co-authored-by: Austin SeungJun Park <110667795+eagle-25@users.noreply.github.com>
Co-authored-by: skrydal <piotr.skrydalewicz@acryl.io>
Co-authored-by: Saketh Varma <sakethvarma397@gmail.com>
Co-authored-by: Dmitry Bryazgin <58312247+bda618@users.noreply.github.com>
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: John Joyce <john@Mac.lan>
Co-authored-by: Chakru <161002324+chakru-r@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: Maggie Hays <maggiem.hays@gmail.com>
Co-authored-by: Pedro Silva <pedro@acryl.io>
Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com>
2025-03-27 08:05:54 -07:00
Sergio Gómez Villamor
6ab518faf8
feat(databricks): approx percentile for median (#12987) 2025-03-27 15:58:39 +01:00
Michael Minichino
ca695a7429
fix(ingest/redshift): resolve ingestion errors (#12992) 2025-03-27 07:52:44 -05:00
Mayuri Nehate
1bf395bf55
fix(ingest): split merge statements correctly (#12989) 2025-03-27 14:24:50 +05:30
Tamas Nemeth
a61a5856c0
fix(ingest/gc): Fix for slow soft-deleted entity deletion (#12931) 2025-03-27 09:43:48 +01:00
Tamas Nemeth
0093aa8185
chore(ingestion/airflow): Example custom operator with sql parsing (#12959) 2025-03-27 09:18:29 +01:00
Hyejin Yoon
55cf62b49f
feat(ingest/mlflow): add mlflow auth config (#12984) 2025-03-27 11:30:30 +09:00
amit-apptware
fbeedac497
feat(ui/incident): Add sibling button dropdown for create incident (#12941) 2025-03-26 19:08:09 -07:00
ryota-cloud
95be7eb185
fix(ingestion) Adding Vertex AI Connector documentation (#12967) 2025-03-26 16:37:34 -07:00
John Joyce
d3becb0ede
feat(ui): Support for Filtering by Deprecated, Showing Deprecation in Upstream Health Indicator (#12991)
Co-authored-by: John Joyce <john@ip-192-168-1-64.us-west-2.compute.internal>
2025-03-26 15:18:56 -07:00
david-leifker
952f3cc311
fix(entity-service): fix delete non-existent row (#12990) 2025-03-26 14:51:20 -05:00
ryota-cloud
7cc0301fd5
fix(ingestion) Added externalURL of model group for Vertex AI (#12981) 2025-03-26 12:32:46 -07:00
Harshal Sheth
1d22c58e13
chore(ingest): use typing-aware deprecation type (#12982) 2025-03-26 11:32:46 -07:00
John Joyce
98dc67fa4c
fix(): Aligning tests and data process instance models (#12943)
Co-authored-by: John Joyce <john@ip-192-168-1-64.us-west-2.compute.internal>
2025-03-26 10:18:21 -07:00
Mayuri Nehate
ac9997d970
feat(ingest/snowflake): ingest stored procedures (#12929)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-03-26 20:02:55 +05:30
Sergio Gómez Villamor
29d05c214a
feat(hex): warehouse integration via Query enrichment (#12949) 2025-03-26 09:35:50 +01:00
amit-apptware
048bbf83c5
fix(ui/incident): Add validation for custom type (#12924)
Co-authored-by: John Joyce <john@acryl.io>
2025-03-25 18:47:32 -07:00
John Joyce
0c315d2c62
feat(ui): Adding support for 'has siblings' filter behind a feature flag. (#12685)
Co-authored-by: John Joyce <john@Mac-4613.lan>
2025-03-25 18:40:47 -07:00
Chris Collins
3e5a3928f8
fix(cypress) Fix flaky managing_secrets cypress test (#12971) 2025-03-25 20:24:14 -04:00
Jay
7e474e9856
feat(web) require trailing commas (#12934) 2025-03-25 14:48:23 -04:00
Sergio Gómez Villamor
5c58dbacce
feat(ingest): new hex connector - part 1 (#12915) 2025-03-25 19:47:11 +01:00
Harshal Sheth
7535583958
feat(ingest/mode): fix issue in mode request validation (#12948) 2025-03-25 08:03:40 -07:00
Saketh Varma
9b3743db27
feat(ui): enabling meticulous recording (#12966) 2025-03-25 08:46:51 -03:00
Chakru
88994607d9
ci(docs): adjust build and test workflow for doc only PRs (#12952) 2025-03-25 14:36:25 +05:30
Sergio Gómez Villamor
87a84d35a9
feat(model): QueryProperties updates (#12923) 2025-03-25 09:36:07 +01:00
Jay
bb8bf6ae7b
feat(ingestion) slack source v2 - now ingests all users and channels (#12795) 2025-03-24 21:19:42 -04:00
ryota-cloud
11353d1172
fix(metadata-ingestion) fix connector test error for Vertex AI (#12963) 2025-03-24 13:17:57 -07:00
Harshal Sheth
2cc8856c6b
feat(ingest): allow MCPWs instead of workunits (#12947) 2025-03-24 08:10:48 -07:00
Mayuri Nehate
a3a1f50886
fix(ingest/dbt): consider dbt run results with success status (#12942) 2025-03-24 15:09:48 +05:30
trialiya
566cdf8bc1
docs(Timeseries): Update Timeseries aspect documentation to add support for @Searchable and @Relationship annotations (#12945)
Co-authored-by: trialiya <trialiya@gmail.com>
Co-authored-by: Chakru <161002324+chakru-r@users.noreply.github.com>
2025-03-22 09:14:33 -07:00
Günther Hackl
f48b8dd6c0
fix(ingestion): schema-metadata - fix jsonProps not being ingested for optional fields (#12927)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-03-22 08:41:11 -07:00
Harshal Sheth
dd4aff2208
fix(cli): fix unknown aspect bug in dataset upsert cli (#12946) 2025-03-22 08:30:44 -07:00
Harshal Sheth
1d6c15edd3
docs(sdk): update some examples with the new SDK (#12933) 2025-03-21 16:00:05 -07:00
ryota-cloud
9f8753bcb2
feat(ingestion) Adding vertexAI ingestion source (v2 - experiment and experiment run) (#12836) 2025-03-21 14:06:17 -07:00
Gabe Lyons
8ada20636c
docs(view authorization): document view authorization in application.yaml (#12871) 2025-03-21 14:02:17 -07:00
Pedro Silva
449f10970b
feat(docs): Add section on updating DataHub for 1.0.0 (#12907) 2025-03-21 11:46:49 -07:00
Harshal Sheth
fbd4c1e012
docs(sdk): add docstrings for some sdk classes (#12940) 2025-03-21 11:39:45 -07:00
Harshal Sheth
9d245fb2d6
fix(ci): update docker helpers script (#12935) 2025-03-21 11:36:36 -07:00
Maggie Hays
21ce3f40d1
docs(website) update docusaurus config (#12936) 2025-03-21 12:59:57 -05:00
Peter Wang
6914df3721
feat(ingestion/superset): add timeout values to config to prevent hanging queries from blocking ingestion (#12884) 2025-03-21 08:08:56 -07:00
Mayuri Nehate
8323bc3910
fix(ingest/dremio): simplify and fix build source map (#12908) 2025-03-21 11:31:16 +05:30
Saketh Varma
56d92f3e66
fix(ui): versions null reference issues (#12919) 2025-03-20 23:50:14 -03:00
Chakru
90ad3935a7
fix(build): use sync instead of copy so excess files are deleted (#12925) 2025-03-20 19:19:00 -07:00
Hyejin Yoon
ac2a8e2ef0
feat(ingest/mlflow): add dataset lineage (#12837) 2025-03-21 08:38:43 +09:00
John Joyce
c15bc04a4f
fix(): Fixes from merge release (#12932)
Co-authored-by: John Joyce <john@Mac.lan>
2025-03-20 16:27:26 -07:00
Hyejin Yoon
0384407511
feat(ui): add external url button for ml entities for v2 (#12893) 2025-03-21 07:40:15 +09:00
Dmitry Bryazgin
a21fc54931
fix(metadata-models-custom): fix at entity-registry.yaml to load plugins correctly (#12681) 2025-03-20 15:41:24 -05:00
Harshal Sheth
8943e6d7b0
fix(ui): improve mixpanel analytics support (#12902) 2025-03-20 10:05:09 -07:00
Saketh Varma
083827d148
ci(ui): enable frontend previews (#12909) 2025-03-20 11:25:05 -03:00
skrydal
1185ba8121
feat(ingestion/iceberg): Refactor iceberg source to use MCPWs instead of MCEs (#12912) 2025-03-20 10:46:08 +01:00
Austin SeungJun Park
41895fe24f
feat(ingest/s3): add table filtering (#12661) 2025-03-20 07:57:43 +01:00
amit-apptware
54cccc79ba
feat(ui/incident-v2) : Add Incident V2 Integration (#12851) 2025-03-19 15:16:25 -07:00
Chris Collins
30e6b9b3b4
fix(ui) Fix styling of new nav bar header in safari (#12877) 2025-03-19 10:44:55 -04:00
Sergio Gómez Villamor
2aba2e3ed8
fix(powerbi): fixes direction of the dashboard-report lineage (#12881) 2025-03-19 12:25:42 +01:00
Gabe Lyons
ecd9ffd137
feat(openapi): Adding subtype for openapi source (#12873) 2025-03-18 12:35:46 -07:00
Shuixi Li
78f4852d55
fix(ingestion/superset): fixed changed_on_utc value being a string (#12883) 2025-03-17 17:46:14 -07:00
Jay
92581f01b7
fix(gql) add incident assignee owner type resolver (#12897) 2025-03-17 17:39:38 -04:00
Harshal Sheth
85a5b5cea1
fix(ingest): pin lookml liquid dep (#12896) 2025-03-17 13:51:04 -07:00
Harshal Sheth
8fbc4d125f
fix(ingest): fix superset declared deps (#12889) 2025-03-17 10:29:02 -07:00
ryota-cloud
eb1cd7f38c
feat(models): Support DPI in edges fields of DPI relationship aspects (#12886) 2025-03-15 00:39:03 -07:00
Shirshanka Das
e695cf59fd
docs: making top level blog links point to medium directly (#12885) 2025-03-14 17:45:47 -07:00
david-leifker
9a9cd384f1
fix(open-telemetry): include missing dependency (#12882) 2025-03-14 17:09:26 -05:00
Andrew Sikowitz
c9d77fdcb1
docs(cloud): DataHub Cloud v0.3.9 release notes (#12794)
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: Maggie Hays <maggiem.hays@gmail.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>
2025-03-14 12:52:03 -07:00
Peter Wang
cffc6d4693
feat(ingestion/superset): superset column level lineage (#12786) 2025-03-14 08:02:54 -07:00
david-leifker
3df12dcb8a
fix(python-version): Fix with dash docker_helpers.sh (#12876) 2025-03-14 12:55:11 +05:30
Hyejin Yoon
669c67ad53
feat(docs/mlflow): update sample scripts to be compatible with edges/versioning (#12878) 2025-03-14 12:20:25 +09:00
ryota-cloud
106d7755d5
fix(ingest): fix formatting to resolve lint error (#12875) 2025-03-13 16:09:39 -07:00
skrydal
ba8932cdc2
fix(ingest/snowflake): Fixing table rename query handling (#12852) 2025-03-13 23:26:23 +01:00
david-leifker
d1c804e323
feat(system-metrics): track api usage by user, client, api (#12872) 2025-03-13 16:39:46 -05:00
Maggie Hays
bce580501d
docs(website) update docusaurus config (#12862) 2025-03-13 16:26:15 -05:00
david-leifker
453d82a0cc
fix(api-tracing): handle corner case for historic (#12870) 2025-03-13 15:24:21 -05:00
Deepak Garg
eb17f80e8a
feat(ingest/hive): identify partition columns in hive tables (#12833) 2025-03-13 11:37:52 -07:00
ryota-cloud
0e62e8c77a
feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) (#12632) 2025-03-13 11:02:15 -07:00
John Joyce
f507e2c942
hotfix(ui): Addressing assertions hotfixes (#12785)
Co-authored-by: John Joyce <john@ip-10-209-186-159.us-west-2.compute.internal>
2025-03-13 10:14:53 -07:00
david-leifker
ebd3a5078d
feat(ingestion-tracing): implement ingestion with tracing api (#12714) 2025-03-13 11:33:28 -05:00
Mayuri Nehate
f9d71d67a0
feat(ingest/salesforce): include formula in field description (#12840) 2025-03-13 22:00:06 +05:30
Chakru
71d0f125ce
ci(tests): show cypress smoke tests in junit format for better reporting (#12865) 2025-03-13 21:05:45 +05:30
david-leifker
463803e2d1
feat(restore-indices): createDefaultAspects argument (#12859) 2025-03-13 10:17:14 -05:00
Aseem Bansal
298917542f
docs(ingest): custom transformer remote executor (#12864) 2025-03-13 20:38:09 +05:30
Harshal Sheth
4305a62b10
fix(ingest): fix error in deploy command (#12820) 2025-03-13 08:04:18 -07:00
Sergio Gómez Villamor
30719ac87d
fix(databricks): fixes profile median (#12856) 2025-03-13 12:39:53 +01:00
Sergio Gómez Villamor
0d54352e3f
fix(ge-profiler): catch TimeoutError (#12855) 2025-03-13 10:55:06 +01:00
Hyejin Yoon
86f4b805cf
feat(ingest/mlflow): update dpi to use edge for lineage (#12861) 2025-03-13 10:05:00 +09:00
Chris Collins
976fecdc31
fix(ui) Support glossary nodes in autocomplete (#12858) 2025-03-12 16:47:07 -07:00
k7ragav
95205a01a0
feat(ui): Update ExternalUrlButton to include self-hosted gitlab URLs (#12734)
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
2025-03-12 15:13:52 -07:00
Andrew Sikowitz
887e30f3e2
feat(models): Add edges fields to data process instance relationship aspects (#12860) 2025-03-12 15:11:49 -07:00
Aseem Bansal
0176543c53
fix(ingest/dynamodb): pass env to dataset urn function (#12853) 2025-03-12 14:50:40 -07:00
Jay
85e27511ec
feat(gql) allow unsetting optional incident fields (#12801) 2025-03-12 13:11:44 -07:00
david-leifker
7e749ff0c5
fix(jaas): fix jaas login (#12848) 2025-03-12 10:25:35 +00:00
dependabot[bot]
3ce7651cd5
build(deps): bump @babel/helpers from 7.24.4 to 7.26.10 in /docs-website (#12847)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 03:19:46 -04:00
dependabot[bot]
666da6a11f
build(deps): bump @babel/runtime-corejs3 from 7.24.4 to 7.26.10 in /docs-website (#12846)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 03:19:39 -04:00
dependabot[bot]
4c2722163d
build(deps): bump @babel/runtime from 7.24.4 to 7.26.10 in /docs-website (#12844)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 03:19:34 -04:00
Aseem Bansal
a3a01752ff
docs: clear remote executor docs (#12839) 2025-03-12 03:18:32 -04:00
Andrew Sikowitz
d1d91d81d8
fix(graphql/search): Remove schema field and data process instance from default search types (#12845) 2025-03-11 18:20:32 -07:00
Saketh Varma
9a4b829853
fix(UI): Multiple data product delete modals (#12781) 2025-03-11 14:53:06 -07:00
Chris Collins
e59f497ad5
fix(ui) Fix submitting when selecting replacement in deprecation modal (#12842) 2025-03-11 14:01:02 -06:00
Chris Collins
c73b6660eb
fix(ui) Hide default filters we want to hide from impact analysis (#12843) 2025-03-11 14:00:51 -06:00
david-leifker
fa4ff7bced
feat(openapi-ingestion): implement openapi ingestion (#12757) 2025-03-11 08:25:53 -05:00
Mayuri Nehate
31df9c4e67
feat(ingest/redshift): lineage for external schema created from redshift (#12826) 2025-03-11 17:43:34 +05:30
Sergio Gómez Villamor
dd371130c1
chore(ruff): enable some ignored rules (#12815) 2025-03-11 09:30:06 +01:00
Andrew Sikowitz
b28291bf57
fix(workflows): Update pr-labeler (#12835) 2025-03-10 18:16:55 -07:00
Hyejin Yoon
8bf1f718da
feat(ingestion/mlflow): improve mlflow connector to pull run and experiments (#12587) 2025-03-11 07:55:48 +09:00
Felix Lüdin
598cda2684
feat(ui): support all entities with display names in browse paths v2 (#11657) 2025-03-10 13:27:11 -07:00
Harshal Sheth
e5486a5353
fix(doc): re-enable Algolia search (#12834) 2025-03-10 13:23:30 -07:00
Blize
2688bf39b4
Add variable to show full title in lineage by default (#12078)
Co-authored-by: Matthias Mantsch <matthias.mantsch@swisscom.com>
2025-03-10 12:58:20 -07:00
Jay
679d4cbae1
feat(docs) add perms req to ai docs (#12819) 2025-03-10 10:25:42 -07:00
Chakru
7c1ed744f4
fix(build): build improvements to help with incremental builds (#12823) 2025-03-10 22:43:31 +05:30
Tamas Nemeth
b9f3d07455
fix(doc): Disable Algolia search (#12831) 2025-03-10 14:05:52 +01:00
Peter Wang
48b6581f12
fix(ingest/superset): fixed iterate over int error for building urns (#12807)
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
2025-03-07 15:41:27 -08:00
Harshal Sheth
c8347a72ee
docs(ingest/mode): update mode workspace docs (#12774) 2025-03-07 15:41:11 -08:00
david-leifker
47f59e62dd
fix(openapi): fix openapi timeseries async ingestion (#12812) 2025-03-07 16:21:16 -06:00
Harshal Sheth
55fbb71a53
fix(ingest/oracle): refresh golden files (#12818) 2025-03-07 12:38:39 -08:00
Harshal Sheth
8ff905f0d3
docs(ingest): update metadata-ingestion dev guide (#12779) 2025-03-07 11:22:17 -08:00
trialiya
a51713a378
refactor(graphql): simplify getLastIngestionRun method (#12706)
Co-authored-by: trialiya <trialiya@gmail.com>
2025-03-07 11:19:03 -06:00
Pedro Silva
593838795c
feat(docs): Release for DataHub Cloud 0.3.8.2 (#12811) 2025-03-07 12:04:23 +00:00
Chakru
f053188bb6
fix: search cache invalidation for iceberg entities (#12805) 2025-03-07 15:24:14 +05:30
Saketh Varma
a101c27388
fix(UI): Showing platform instances only once (#12806) 2025-03-06 17:01:04 -05:00
Harshal Sheth
a6461853dc
feat(ingest): improve extract-sql-agg-log command (#12803) 2025-03-06 11:08:56 -08:00
david-leifker
41b0629e70
feat(api): URN, Entity, and Aspect name Async Validation (#12797) 2025-03-06 12:49:22 -06:00
david-leifker
2bc1e52253
chore(aws): bump aws libraries (#12809) 2025-03-06 11:55:57 -06:00
david-leifker
08f9c6833a
chore(postgres): bump version (#12808) 2025-03-06 11:55:42 -06:00
skrydal
4f50861996
feat(ingest/iceberg): Introduce network problems resiliency for Iceberg source (#12804) 2025-03-06 18:20:36 +01:00
Jonny Dixon
fcabe88962
fix(ingestion/oracle): Improved foreign key handling (#11867)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-03-06 14:32:03 +00:00
Jonny Dixon
a700448bad
feat(ingestion/business-glossary): Automatically generate predictable glossary term and node URNs when incompatible URL characters are specified in term and node names. (#12673) 2025-03-06 14:30:10 +00:00
Mayuri Nehate
4714f46f11
feat(ingest/redshift): support for datashares lineage (#12660)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-03-06 19:37:18 +05:30
Hyejin Yoon
ba8affbc7a
docs: add exporting from source to write mcp guide (#12800) 2025-03-05 19:23:22 -08:00
Chakru
6d4744f93b
doc(iceberg): iceberg doc updates (#12787)
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
2025-03-05 16:58:46 -08:00
Jay
484faee243
fix(web) move form entity sidebar to right to align with cloud (#12796) 2025-03-05 16:57:38 -08:00
Hyejin Yoon
1d1ed78be7
docs: update mlflow ingestion docs to include new concept mappings (#12791)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
2025-03-05 15:41:18 -08:00
ryota-cloud
cf0dc3ac6b
Support container in ML Model Group, Model and Deployment (#12793) 2025-03-05 14:52:39 -08:00
Chris Collins
1068e2b512
fix(ui) Fix changing color and icon for domains in UI (#12792) 2025-03-05 17:45:28 -05:00
Hyejin Yoon
9cb5886d6d
fix(ui): change tags to properties in ml model view (#12789) 2025-03-05 11:59:21 -08:00
Alex Bransky
dbf33dba77
docs(ingest): update azure.md by removing extra word (#12780) 2025-03-05 11:07:04 -08:00
Harshal Sheth
de60ca30e8
fix(ingest): enable fuzzy case resolution for oracle sql (#12778) 2025-03-05 10:28:29 -08:00
v-tarasevich-blitz-brain
256e488d28
feat(searchBarAutocomplete): add feature flag for search bar's autocomplete redesign (#12690)
Co-authored-by: Victor Tarasevich <v.tarasevitch@invento.by>
2025-03-05 11:54:26 -05:00
Sergio Gómez Villamor
69981675a5
feat(mssql): adds subtypes aspect for dataflow and datajobs (#12775) 2025-03-05 17:11:04 +01:00
Sergio Gómez Villamor
85d3a9d31d
feat(okta): custom properties for okta user (#12773) 2025-03-05 15:53:01 +01:00
Sergio Gómez Villamor
a0319af7db
fix(ingestion): fixes producing some URNs with reserved characters (#12772) 2025-03-05 14:25:20 +01:00
Sergio Gómez Villamor
aed2433c4c
fix: fixes mypy complaints about pkgresources (#12790) 2025-03-05 12:41:03 +01:00
Shirshanka Das
cc3782ecfd
feat(models): adds subtypes to most entities in the model (#12783) 2025-03-04 18:03:03 -08:00
Kevin Karch
f32798125b
feat(ingest): allowdenypattern for dashboard, chart, dataset in superset (#12782) 2025-03-04 17:59:42 -05:00
Peter Wang
9e7f48278a
feat(ingestion/superset): ownership info for charts, dashboards and datasets (#12750) 2025-03-04 12:28:24 -08:00
Chakru
be42e11bd2
dataset cli - add support for schema, round-tripping to yaml (#12764) 2025-03-04 10:17:51 -08:00
6165 changed files with 327016 additions and 96591 deletions

View File

@ -5,6 +5,7 @@
**/.tox/
**/.mypy_cache/
**/.pytest_cache/
**/.ruff_cache/
**/__pycache__/
out
**/*.class
@ -16,3 +17,6 @@ out
.git/COMMIT_*
.git/index
.gradle
/metadata-ingestion/tests
/metadata-ingestion/examples

View File

@ -3,8 +3,7 @@ name: "\U0001F41EBug report"
about: Create a report to help us improve
title: A short description of the bug
labels: bug
assignees: ''
assignees: ""
---
**Describe the bug**
@ -12,6 +11,7 @@ A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
@ -24,6 +24,7 @@ A clear and concise description of what you expected to happen.
If applicable, add screenshots to help explain your problem.
**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

View File

@ -4,7 +4,6 @@ about: Report issues found in DataHub v1.0 Release Candidates
title: "[v1.0-rc/bug] Description of Bug"
labels: bug, datahub-v1.0-rc
assignees: chriscollins3456, david-leifker, maggiehays
---
**Describe the bug**
@ -12,6 +11,7 @@ A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
@ -24,6 +24,7 @@ A clear and concise description of what you expected to happen.
If applicable, add screenshots and/or screen recordings to help explain the issue.
**System details (please complete the following information):**
- DataHub Version Tag [e.g. v1.0-rc1]
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]

8
.github/actionlint.yaml vendored Normal file
View File

@ -0,0 +1,8 @@
self-hosted-runner:
labels:
- "depot-ubuntu-22.04-small"
- "depot-ubuntu-22.04-4"
- "depot-ubuntu-22.04"
- "depot-ubuntu-24.04-small"
- "depot-ubuntu-24.04"
- "depot-ubuntu-24.04-4"

View File

@ -41,6 +41,10 @@ outputs:
smoke-test-change:
description: "Smoke test change"
value: ${{ steps.filter.outputs.smoke-test == 'true' }}
actions-change:
description: "Actions code has changed"
value: ${{ steps.filter.outputs.actions == 'true' }}
runs:
using: "composite"
steps:
@ -97,3 +101,6 @@ runs:
- "docker/elasticsearch-setup/**"
smoke-test:
- "smoke-test/**"
actions:
- "datahub-actions/**"
- "docker/datahub-actions/**"

View File

@ -1,5 +1,5 @@
name: 'Ensure codegen is updated'
description: 'Will check the local filesystem against git, and abort if there are uncommitted changes.'
name: "Ensure codegen is updated"
description: "Will check the local filesystem against git, and abort if there are uncommitted changes."
runs:
using: "composite"

View File

@ -1,8 +1,13 @@
<!--
## Checklist
Thank you for contributing to DataHub!
Before you submit your PR, please go through the checklist below:
- [ ] The PR conforms to DataHub's [Contributing Guideline](https://github.com/datahub-project/datahub/blob/master/docs/CONTRIBUTING.md) (particularly [Commit Message Format](https://github.com/datahub-project/datahub/blob/master/docs/CONTRIBUTING.md#commit-message-format))
- [ ] Links to related issues (if applicable)
- [ ] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide for it has been added as well.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in [Updating DataHub](https://github.com/datahub-project/datahub/blob/master/docs/how/updating-datahub.md)
-->

View File

@ -13,14 +13,32 @@ without_info = []
metadata_privileges = set()
platform_privileges = set()
root_user_platform_policy_privileges = set()
root_user_all_privileges = set()
admin_role_platform_privileges = set()
admin_role_all_privileges = set()
reader_role_all_privileges = set()
editor_role_all_privileges = set()
for policy in all_policies:
urn = policy["urn"]
if urn == "urn:li:dataHubPolicy:0":
root_user_platform_policy_privileges = policy["info"]["privileges"]
root_user_all_privileges.update(set(root_user_platform_policy_privileges))
elif urn == "urn:li:dataHubPolicy:1":
root_user_all_privileges.update(set(policy["info"]["privileges"]))
elif urn == "urn:li:dataHubPolicy:admin-platform-policy":
admin_role_platform_privileges = policy["info"]["privileges"]
admin_role_all_privileges.update(set(admin_role_platform_privileges))
elif urn == "urn:li:dataHubPolicy:admin-metadata-policy":
admin_role_all_privileges.update(set(policy["info"]["privileges"]))
elif urn == "urn:li:dataHubPolicy:editor-platform-policy":
editor_platform_policy_privileges = policy["info"]["privileges"]
elif urn == "urn:li:dataHubPolicy:7":
all_user_platform_policy_privileges = policy["info"]["privileges"]
elif urn.startswith("urn:li:dataHubPolicy:reader-"):
reader_role_all_privileges.update(set(policy["info"]["privileges"]))
elif urn.startswith("urn:li:dataHubPolicy:editor-"):
editor_role_all_privileges.update(set(policy["info"]["privileges"]))
try:
doc_type = policy["info"]["type"]
privileges = policy["info"]["privileges"]
@ -49,11 +67,41 @@ print(
"""
)
# Root user has all privileges
diff_policies = set(platform_privileges).difference(
set(root_user_platform_policy_privileges)
)
assert len(diff_policies) == 0, f"Missing privileges for root user are {diff_policies}"
# admin role and root user have same platform privileges
diff_root_missing_from_admin = set(root_user_platform_policy_privileges).difference(set(admin_role_platform_privileges))
diff_admin_missing_from_root = set(admin_role_platform_privileges).difference(set(root_user_platform_policy_privileges))
assert len(diff_root_missing_from_admin) == 0, f"Admin role missing: {diff_root_missing_from_admin}"
assert len(diff_admin_missing_from_root) == 0, f"Root user missing: {diff_admin_missing_from_root}"
# admin role and root user have same privileges
diff_root_missing_from_admin_all = set(root_user_all_privileges).difference(set(admin_role_all_privileges))
diff_admin_missing_from_root_all = set(admin_role_all_privileges).difference(set(root_user_all_privileges))
## Admin user has EDIT_ENTITY privilege, which is a super-privilege for editing entities
diff_admin_missing_from_root_all_new = set()
for privilege in diff_admin_missing_from_root_all:
if privilege.startswith("EDIT_"):
continue
diff_admin_missing_from_root_all_new.add(privilege)
diff_admin_missing_from_root_all = diff_admin_missing_from_root_all_new
assert len(diff_root_missing_from_admin_all) == 0, f"Admin role missing: {diff_root_missing_from_admin_all}"
assert len(diff_admin_missing_from_root_all) == 0, f"Root user missing: {diff_admin_missing_from_root_all}"
# Editor role has all privileges of Reader
diff_reader_missing_from_editor = set(reader_role_all_privileges).difference(set(editor_role_all_privileges))
assert len(diff_reader_missing_from_editor) == 0, f"Editor role missing: {diff_reader_missing_from_editor}"
# Admin role has all privileges of editor
diff_editor_missing_from_admin = set(editor_role_all_privileges).difference(set(admin_role_all_privileges))
assert len(diff_editor_missing_from_admin) == 0, f"Admin role missing: {diff_editor_missing_from_admin}"
# All users privileges checks
assert "MANAGE_POLICIES" not in all_user_platform_policy_privileges
assert "MANAGE_USERS_AND_GROUPS" not in all_user_platform_policy_privileges

View File

@ -1,4 +1,12 @@
echo "GITHUB_REF: $GITHUB_REF"
#!/bin/bash
REF="${GITHUB_REF:-${GITHUB_REF_FALLBACK:-}}"
if [ -z "$REF" ]; then
echo "Error: No ref available from GITHUB_REF or fallback"
exit 1
fi
echo "GITHUB_REF: $REF"
echo "GITHUB_SHA: $GITHUB_SHA"
export MAIN_BRANCH="master"
@ -12,37 +20,82 @@ export SHORT_SHA=$(get_short_sha)
echo "SHORT_SHA: $SHORT_SHA"
function get_tag {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG},g" -e 's,refs/tags/,,g' -e 's,refs/pull/\([0-9]*\).*,pr\1,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG},g" -e 's,refs/tags/,,g' -e 's,refs/heads/,,g' -e 's,refs/heads/,,g' -e 's,refs/pull/\([0-9]*\).*,pr\1,g' -e 's,/,-,g')
}
function get_tag_slim {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG}-slim,g" -e 's,refs/tags/\(.*\),\1-slim,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-slim,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG}-slim,g" -e 's,refs/tags/\(.*\),\1-slim,g' -e 's,refs/heads/\(.*\),\1-slim,g' -e 's,refs/heads/\(.*\),\1-slim,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-slim,g' -e 's,/,-,g')
}
function get_tag_full {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG}-full,g" -e 's,refs/tags/\(.*\),\1-full,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-full,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${MAIN_BRANCH_TAG}-full,g" -e 's,refs/tags/\(.*\),\1-full,g' -e 's,refs/heads/\(.*\),\1-full,g' -e 's,refs/heads/\(.*\),\1-full,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-full,g' -e 's,/,-,g')
}
function get_python_docker_release_v {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},1!0.0.0+docker.${SHORT_SHA},g" -e 's,refs/tags/v\(.*\),1!\1+docker,g' -e 's,refs/pull/\([0-9]*\).*,1!0.0.0+docker.pr\1,g')
function get_python_docker_release_v() {
echo "$(echo "${REF}" | \
sed -e "s,refs/heads/${MAIN_BRANCH},1\!0.0.0+docker.${SHORT_SHA},g" \
-e 's,refs/heads/\(.*\),1!0.0.0+docker.\1,g' \
-e 's,refs/heads/\(.*\),1!0.0.0+docker.\1,g' \
-e 's,refs/tags/v\([0-9a-zA-Z.]*\).*,\1+docker,g' \
-e 's,refs/pull/\([0-9]*\).*,1!0.0.0+docker.pr\1,g' \
-e 's,/,-,g'
)"
}
# To run these, set TEST_DOCKER_HELPERS=1 and then copy the function + test cases into a bash shell.
if [ ${TEST_DOCKER_HELPERS:-0} -eq 1 ]; then
REF="refs/pull/4788/merge" get_python_docker_release_v # '1!0.0.0+docker.pr4788'
REF="refs/tags/v0.1.2-test" get_python_docker_release_v # '0.1.2'
REF="refs/tags/v0.1.2.1-test" get_python_docker_release_v # '0.1.2.1'
REF="refs/tags/v0.1.2rc1-test" get_python_docker_release_v # '0.1.2rc1'
REF="refs/heads/branch-name" get_python_docker_release_v # '1!0.0.0+docker.branch-name'
REF="refs/heads/releases/branch-name" get_python_docker_release_v # 1!0.0.0+docker.releases-branch-name'
GITHUB_REF="refs/tags/v0.1.2rc1" get_tag # '0.1.2rc1'
GITHUB_REF="refs/tags/v0.1.2rc1" get_tag_slim # '0.1.2rc1-slim'
GITHUB_REF="refs/tags/v0.1.2rc1" get_tag_full # '0.1.2rc1-full'
GITHUB_REF="refs/pull/4788/merge" get_tag # 'pr4788'
GITHUB_REF="refs/pull/4788/merge" get_tag_slim # 'pr4788-slim'
GITHUB_REF="refs/pull/4788/merge" get_tag_full # 'pr4788-full'
GITHUB_REF="refs/heads/branch-name" get_tag # 'branch-name'
GITHUB_REF="refs/heads/branch-name" get_tag_slim # 'branch-name-slim'
GITHUB_REF="refs/heads/branch-name" get_tag_full # 'branch-name-full'
GITHUB_REF="refs/heads/releases/branch-name" get_tag # 'releases-branch-name'
GITHUB_REF="refs/heads/releases/branch-name" get_tag_slim # 'releases-branch-name-slim'
GITHUB_REF="refs/heads/releases/branch-name" get_tag_full # 'releases-branch-name-full'
fi
function get_unique_tag {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA},g" -e 's,refs/tags/,,g' -e 's,refs/pull/\([0-9]*\).*,pr\1,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA},g" -e 's,refs/tags/,,g' -e "s,refs/heads/.*,${SHORT_SHA},g" -e 's,refs/pull/\([0-9]*\).*,pr\1,g')
}
function get_unique_tag_slim {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA}-slim,g" -e 's,refs/tags/\(.*\),\1-slim,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-slim,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA}-slim,g" -e 's,refs/tags/\(.*\),\1-slim,g' -e "s,refs/heads/.*,${SHORT_SHA}-slim,g" -e 's,refs/pull/\([0-9]*\).*,pr\1-slim,g')
}
function get_unique_tag_full {
echo $(echo ${GITHUB_REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA}-full,g" -e 's,refs/tags/\(.*\),\1-full,g' -e 's,refs/pull/\([0-9]*\).*,pr\1-full,g')
echo $(echo ${REF} | sed -e "s,refs/heads/${MAIN_BRANCH},${SHORT_SHA}-full,g" -e 's,refs/tags/\(.*\),\1-full,g' -e "s,refs/heads/.*,${SHORT_SHA}-full,g" -e 's,refs/pull/\([0-9]*\).*,pr\1-full,g')
}
function get_platforms_based_on_branch {
if [ "${{ github.event_name }}" == 'push' && "${{ github.ref }}" == "refs/heads/${MAIN_BRANCH}" ]; then
if [ "${GITHUB_EVENT_NAME}" == "push" ] && [ "${REF}" == "refs/heads/${MAIN_BRANCH}" ]; then
echo "linux/amd64,linux/arm64"
else
echo "linux/amd64"
fi
}
function echo_tags {
echo "short_sha=${SHORT_SHA}"
echo "tag=$(get_tag)"
echo "slim_tag=$(get_tag_slim)"
echo "full_tag=$(get_tag_full)"
echo "unique_tag=$(get_unique_tag)"
echo "unique_slim_tag=$(get_unique_tag_slim)"
echo "unique_full_tag=$(get_unique_tag_full)"
echo "python_release_version=$(get_python_docker_release_v)"
echo "branch_name=${GITHUB_HEAD_REF:-${REF#refs/heads/}}"
echo "repository_name=${GITHUB_REPOSITORY#*/}"
}

View File

@ -19,6 +19,7 @@ class ProjectType(Enum):
JAVA = auto()
PYTHON = auto()
PRETTIER = auto()
@dataclass
@ -27,6 +28,8 @@ class Project:
path: str
type: ProjectType
taskName: str | None = None # Used for prettier projects
filePattern: str | None = None # Used for prettier projects
@property
def gradle_path(self) -> str:
@ -151,8 +154,12 @@ class HookGenerator:
for project in self.projects:
if project.type == ProjectType.PYTHON:
hooks.append(self._generate_lint_fix_hook(project))
else: # ProjectType.JAVA
elif project.type == ProjectType.JAVA:
hooks.append(self._generate_spotless_hook(project))
elif project.type == ProjectType.PRETTIER:
hooks.append(self._generate_prettier_hook(project))
else:
print(f"Warning: Unsupported project type {project.type} for {project.path}")
config = {"repos": [{"repo": "local", "hooks": hooks}]}
@ -203,6 +210,17 @@ class HookGenerator:
"pass_filenames": False,
}
def _generate_prettier_hook(self, project: Project) -> dict:
"""Generate a prettier hook for projects."""
return {
"id": f"{project.project_id}-{project.taskName}",
"name": f"{project.taskName}",
"entry": f"./gradlew {project.gradle_path}:{project.taskName}",
"language": "system",
"files": project.filePattern,
"pass_filenames": False,
}
class PrecommitDumper(yaml.Dumper):
"""Custom YAML dumper that maintains proper indentation."""
@ -253,7 +271,21 @@ def main():
# Find projects
finder = ProjectFinder(root_dir)
projects = finder.find_all_projects()
prettier_projects = [
Project(
path="datahub-web-react",
type=ProjectType.PRETTIER,
taskName="mdPrettierWriteChanged",
filePattern="^.*\\.md$",
),
Project(
path="datahub-web-react",
type=ProjectType.PRETTIER,
taskName="githubActionsPrettierWriteChanged",
filePattern="^\\.github/.*\\.(yml|yaml)$"
),
]
projects = [*prettier_projects, *finder.find_all_projects()]
# Print summary
print("Found projects:")

View File

@ -7,3 +7,15 @@ repos:
language: system
files: ^smoke-test/tests/cypress/.*\.tsx$
pass_filenames: false
- id: update-capability-summary
name: update-capability-summary
entry: ./gradlew :metadata-ingestion:capabilitySummary
language: system
files: ^metadata-ingestion/src/datahub/ingestion/source/.*\.py$
pass_filenames: false
- id: update-lineage-file
name: update-lineage-file
entry: ./gradlew :metadata-ingestion:lineageGen
language: system
files: ^(metadata-ingestion-modules/.*|metadata-models/.*)$
pass_filenames: false

70
.github/workflows/actions.yml vendored Normal file
View File

@ -0,0 +1,70 @@
name: DataHub Actions
on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/actions.yml"
- "datahub-actions/**"
- "metadata-ingestion/**"
- "metadata-models/**"
pull_request:
branches:
- "**"
paths:
- ".github/workflows/actions.yml"
- "datahub-actions/**"
- "metadata-ingestion/**"
- "metadata-models/**"
release:
types: [published]
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
build:
runs-on: ubuntu-latest
timeout-minutes: 60
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Test packages are correct
run: |
cd datahub-actions;
python -c 'import setuptools; where="./src"; assert setuptools.find_packages(where) == setuptools.find_namespace_packages(where), "you seem to be missing or have extra __init__.py files"'
- name: Gradle build (and test)
run: |
./gradlew :datahub-actions:build
- uses: actions/upload-artifact@v4
if: always()
with:
name: Test Results (build)
path: |
**/build/reports/tests/test/**
**/build/test-results/test/**
**/junit.*.xml
- name: Upload datahub-actions coverage to Codecov
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
#handle_no_reports_found: true
fail_ci_if_error: false
name: datahub-actions
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest
steps:
- name: Upload
uses: actions/upload-artifact@v4
with:
name: Event File
path: ${{ github.event_path }}

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/airflow-plugin.yml"
- "metadata-ingestion-modules/airflow-plugin/**"
@ -34,21 +35,20 @@ jobs:
include:
# Note: this should be kept in sync with tox.ini.
- python-version: "3.8"
extra_pip_requirements: "apache-airflow~=2.3.4"
extra_pip_extras: test-airflow23
extra_pip_requirements: "apache-airflow~=2.7.3"
extra_pip_constraints: "-c https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.8.txt"
- python-version: "3.10"
extra_pip_requirements: "apache-airflow~=2.4.3"
extra_pip_extras: test-airflow24
extra_pip_requirements: "apache-airflow~=2.7.3"
extra_pip_constraints: "-c https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.10.txt"
- python-version: "3.10"
extra_pip_requirements: "apache-airflow~=2.6.3 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"
- python-version: "3.10"
extra_pip_requirements: "apache-airflow~=2.7.3 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.10.txt"
- python-version: "3.10"
extra_pip_requirements: "apache-airflow~=2.8.1 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"
extra_pip_requirements: "apache-airflow~=2.8.1"
extra_pip_constraints: "-c https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"
- python-version: "3.11"
extra_pip_requirements: "apache-airflow~=2.9.3 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.11.txt"
extra_pip_requirements: "apache-airflow~=2.9.3"
extra_pip_constraints: "-c https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.11.txt"
- python-version: "3.11"
extra_pip_requirements: "apache-airflow~=2.10.3 -c https://raw.githubusercontent.com/apache/airflow/constraints-2.10.3/constraints-3.11.txt"
extra_pip_requirements: "apache-airflow~=2.10.3"
extra_pip_constraints: "-c https://raw.githubusercontent.com/apache/airflow/constraints-2.10.3/constraints-3.11.txt"
fail-fast: false
steps:
- name: Set up JDK 17
@ -65,7 +65,7 @@ jobs:
- name: Install dependencies
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Install airflow package and test (extras ${{ matrix.extra_pip_requirements }})
run: ./gradlew -Pextra_pip_requirements='${{ matrix.extra_pip_requirements }}' -Pextra_pip_extras='${{ matrix.extra_pip_extras }}' :metadata-ingestion-modules:airflow-plugin:build
run: ./gradlew -Pextra_pip_requirements='${{ matrix.extra_pip_requirements }}' -Pextra_pip_constraints='${{ matrix.extra_pip_constraints }}' -Pextra_pip_extras='${{ matrix.extra_pip_extras }}' :metadata-ingestion-modules:airflow-plugin:build
- name: pip freeze show list installed
if: always()
run: source metadata-ingestion-modules/airflow-plugin/venv/bin/activate && uv pip freeze
@ -88,11 +88,13 @@ jobs:
flags: ingestion-airflow
name: pytest-airflow-${{ matrix.python-version }}-${{ matrix.extra_pip_requirements }}
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

View File

@ -3,16 +3,16 @@ on:
push:
branches:
- master
- releases/**
paths-ignore:
- "docs/**"
- "**.md"
pull_request:
branches:
- "**"
paths-ignore:
- "docs/**"
- "**.md"
workflow_dispatch:
schedule:
- cron: "0 0 * * *" # Run at midnight UTC every day
release:
types: [published]
@ -24,10 +24,10 @@ jobs:
setup:
runs-on: ubuntu-latest
outputs:
frontend_change: ${{ steps.ci-optimize.outputs.frontend-change == 'true' }}
frontend_change: ${{ steps.ci-optimize.outputs.frontend-change == 'true' || github.event_name != 'pull_request' }}
ingestion_change: ${{ steps.ci-optimize.outputs.ingestion-change == 'true' }}
backend_change: ${{ steps.ci-optimize.outputs.backend-change == 'true' }}
docker_change: ${{ steps.ci-optimize.outputs.docker-change == 'true' }}
backend_change: ${{ steps.ci-optimize.outputs.backend-change == 'true' || github.event_name != 'pull_request'}}
docker_change: ${{ steps.ci-optimize.outputs.docker-change == 'true' || github.event_name != 'pull_request' }}
frontend_only: ${{ steps.ci-optimize.outputs.frontend-only == 'true' }}
ingestion_only: ${{ steps.ci-optimize.outputs.ingestion-only == 'true' }}
kafka_setup_change: ${{ steps.ci-optimize.outputs.kafka-setup-change == 'true' }}
@ -106,17 +106,21 @@ jobs:
-x :datahub-web-react:build \
-x :metadata-integration:java:datahub-schematron:cli:test \
--parallel
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
- name: Gradle build (and test) for frontend
if: ${{ matrix.command == 'frontend' && needs.setup.outputs.frontend_change == 'true' }}
run: |
./gradlew :datahub-frontend:build :datahub-web-react:build --parallel
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
- name: Gradle compile (jdk8) for legacy Spark
if: ${{ matrix.command == 'except_metadata_ingestion' && needs.setup.outputs.backend_change == 'true' }}
run: |
./gradlew -PjavaClassVersionDefault=8 :metadata-integration:java:spark-lineage:compileJava
- name: Gather coverage files
run: |
echo "BACKEND_FILES=`find ./build/coverage-reports/ -type f | grep -E '(metadata-models|entity-registry|datahuyb-graphql-core|metadata-io|metadata-jobs|metadata-utils|metadata-service|medata-dao-impl|metadata-operation|li-utils|metadata-integration|metadata-events|metadata-auth|ingestion-scheduler|notifications|datahub-upgrade)' | xargs | sed 's/ /,/g'`" >> $GITHUB_ENV
echo "BACKEND_FILES=`find ./build/coverage-reports/ -type f | grep -E '(metadata-models|entity-registry|datahub-graphql-core|metadata-io|metadata-jobs|metadata-utils|metadata-service|medata-dao-impl|metadata-operation|li-utils|metadata-integration|metadata-events|metadata-auth|ingestion-scheduler|notifications|datahub-upgrade)' | xargs | sed 's/ /,/g'`" >> $GITHUB_ENV
echo "FRONTEND_FILES=`find ./build/coverage-reports/ -type f | grep -E '(datahub-frontend|datahub-web-react).*\.(xml|json)$' | xargs | sed 's/ /,/g'`" >> $GITHUB_ENV
- name: Generate tz artifact name
run: echo "NAME_TZ=$(echo ${{ matrix.timezone }} | tr '/' '-')" >> $GITHUB_ENV
@ -132,7 +136,7 @@ jobs:
- name: Ensure codegen is updated
uses: ./.github/actions/ensure-codegen-updated
- name: Upload backend coverage to Codecov
if: ${{ matrix.command == 'except_metadata_ingestion' && needs.setup.outputs.backend_change == 'true' }}
if: ${{ (matrix.command == 'except_metadata_ingestion' && needs.setup.outputs.backend_change == 'true' && github.event_name != 'release') }}
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
@ -143,8 +147,22 @@ jobs:
flags: backend
name: ${{ matrix.command }}
verbose: true
- name: Upload backend coverage to Codecov on release
if: ${{ (matrix.command == 'except_metadata_ingestion' && github.event_name == 'release' ) }}
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ${{ env.BACKEND_FILES }}
disable_search: true
#handle_no_reports_found: true
fail_ci_if_error: false
flags: backend
name: ${{ matrix.command }}
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload frontend coverage to Codecov
if: ${{ matrix.command == 'frontend' && needs.setup.outputs.frontend_change == 'true' }}
if: ${{ (matrix.command == 'frontend' && needs.setup.outputs.frontend_change == 'true' && github.event_name != 'release') }}
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
@ -155,13 +173,33 @@ jobs:
flags: frontend
name: ${{ matrix.command }}
verbose: true
- name: Upload frontend coverage to Codecov on Release
if: ${{ (matrix.command == 'frontend' && github.event_name == 'release') }}
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ${{ env.FRONTEND_FILES }}
disable_search: true
#handle_no_reports_found: true
fail_ci_if_error: false
flags: frontend
name: ${{ matrix.command }}
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
if: ${{ !cancelled() && github.event_name != 'release' }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
- name: Upload test results to Codecov on release
if: ${{ !cancelled() && github.event_name == 'release' }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
quickstart-compose-validation:
docker-codegen-validation:
runs-on: ubuntu-latest
needs: setup
if: ${{ needs.setup.outputs.docker_change == 'true' }}
@ -173,6 +211,8 @@ jobs:
python-version: "3.10"
- name: Quickstart Compose Validation
run: ./docker/quickstart/generate_and_compare.sh
- name: Docker Snippet Validation
run: python python-build/generate_ingestion_docker.py --check
event-file:
runs-on: ubuntu-latest

View File

@ -4,6 +4,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- "metadata-integration/**"
pull_request:

View File

@ -19,7 +19,7 @@ jobs:
days-before-issue-close: 30
stale-issue-label: "stale"
stale-issue-message:
"This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io.\
"This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://datahub.com/slack.\
\ For feature requests please use https://feature-requests.datahubproject.io"
close-issue-message: "This issue was closed because it has been inactive for 30 days since being marked as stale."
days-before-pr-stale: -1

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/dagster-plugin.yml"
- "metadata-ingestion-modules/dagster-plugin/**"
@ -75,11 +76,13 @@ jobs:
flags: ingestion-dagster-plugin
name: pytest-dagster
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

View File

@ -28,6 +28,9 @@ jobs:
uses: acryldata/sane-checkout-action@v3
- name: Compute Tag
id: tag
env:
GITHUB_REF_FALLBACK: ${{ github.ref }}
GITHUB_EVENT_NAME: ${{ github.event_name }}
run: |
source .github/scripts/docker_helpers.sh
echo "tag=$(get_tag)" >> $GITHUB_OUTPUT

View File

@ -1,61 +0,0 @@
name: postgres-setup docker
on:
push:
branches:
- master
paths:
- "docker/postgres-setup/**"
- ".github/workflows/docker-postgres-setup.yml"
pull_request:
branches:
- "**"
paths:
- "docker/postgres-setup/**"
- ".github/workflows/docker-postgres-setup.yml"
release:
types: [published]
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
setup:
runs-on: ubuntu-latest
outputs:
tag: ${{ steps.tag.outputs.tag }}
publish: ${{ steps.publish.outputs.publish }}
steps:
- name: Checkout
uses: acryldata/sane-checkout-action@v3
- name: Compute Tag
id: tag
run: |
source .github/scripts/docker_helpers.sh
echo "tag=$(get_tag)" >> $GITHUB_OUTPUT
- name: Check whether publishing enabled
id: publish
env:
ENABLE_PUBLISH: ${{ secrets.ACRYL_DOCKER_PASSWORD }}
run: |
echo "Enable publish: ${{ env.ENABLE_PUBLISH != '' }}"
echo "publish=${{ env.ENABLE_PUBLISH != '' }}" >> $GITHUB_OUTPUT
push_to_registries:
name: Build and Push Docker Image to Docker Hub
runs-on: ubuntu-latest
needs: setup
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- name: Build and push
uses: ./.github/actions/docker-custom-build-and-push
with:
images: |
acryldata/datahub-postgres-setup
image_tag: ${{ needs.setup.outputs.tag }}
username: ${{ secrets.ACRYL_DOCKER_USERNAME }}
password: ${{ secrets.ACRYL_DOCKER_PASSWORD }}
publish: ${{ needs.setup.outputs.publish == 'true' }}
context: .
file: ./docker/postgres-setup/Dockerfile
platforms: linux/amd64,linux/arm64

File diff suppressed because it is too large

View File

@ -53,6 +53,9 @@ jobs:
key: ${{ runner.os }}-uv-${{ hashFiles('**/requirements.txt') }}
- name: Install Python dependencies
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Run tests
run: |
./gradlew --info :metadata-ingestion:testScripts
- name: Build Docs
run: |
./gradlew --info docs-website:build

View File

@ -0,0 +1,30 @@
name: github actions format
on:
push:
branches:
- master
paths:
- ".github/**/*.{yml,yaml}"
pull_request:
branches:
- "**"
paths:
- ".github/**/*.{yml,yaml}"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
github_actions_format_check:
name: github_actions_format_check
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: run prettier --check
run: |-
./gradlew :datahub-web-react:githubActionsPrettierCheck

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/gx-plugin.yml"
- "metadata-ingestion-modules/gx-plugin/**"
@ -79,11 +80,13 @@ jobs:
flags: ingestion-gx-plugin
name: pytest-gx
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

30
.github/workflows/markdown-format.yml vendored Normal file
View File

@ -0,0 +1,30 @@
name: markdown format
on:
push:
branches:
- master
paths:
- "**/*.md"
pull_request:
branches:
- "**"
paths:
- "**/*.md"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
markdown_format_check:
name: markdown_format_check
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: run prettier --check
run: |-
./gradlew :datahub-web-react:mdPrettierCheck

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/metadata-ingestion.yml"
- "metadata-ingestion/**"
@ -68,9 +69,25 @@ jobs:
run: ./metadata-ingestion/scripts/install_deps.sh
- name: Install package
run: ./gradlew :metadata-ingestion:installPackageOnly
- name: Run lint alongwith testQuick
- name: Check lint passes and autogenerated JSON files are up-to-date
if: ${{ matrix.command == 'testQuick' }}
run: ./gradlew :metadata-ingestion:lint
run: |
./gradlew :metadata-ingestion:lint
- name: Check autogenerated JSON files are up-to-date
if: ${{ matrix.command == 'testQuick' }}
run: |
./gradlew :metadata-ingestion:capabilitySummary :metadata-ingestion:lineageGen
for json_file in metadata-ingestion/src/datahub/ingestion/autogenerated/*.json; do
filename=$(basename "$json_file")
if git diff --quiet "$json_file"; then
echo "✅ $filename is unchanged"
else
echo "❌ $filename has changed. Please commit the updated file."
echo "Changed lines:"
git diff "$json_file"
exit 1
fi
done
- name: Run metadata-ingestion tests
run: ./gradlew :metadata-ingestion:${{ matrix.command }}
- name: Debug info
@ -99,11 +116,13 @@ jobs:
flags: ingestion
name: pytest-${{ matrix.python-version }}-${{ matrix.command }}
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- "**/*.gradle"
- "li-utils/**"
@ -30,9 +31,9 @@ jobs:
setup:
runs-on: ubuntu-latest
outputs:
frontend_change: ${{ steps.ci-optimize.outputs.frontend-change == 'true' }}
ingestion_change: ${{ steps.ci-optimize.outputs.ingestion-change == 'true' }}
backend_change: ${{ steps.ci-optimize.outputs.backend-change == 'true' }}
frontend_change: ${{ steps.ci-optimize.outputs.frontend-change == 'true' || github.event_name == 'release' }}
ingestion_change: ${{ steps.ci-optimize.outputs.ingestion-change == 'true' || github.event_name == 'release' }}
backend_change: ${{ steps.ci-optimize.outputs.backend-change == 'true' || github.event_name == 'release' }}
docker_change: ${{ steps.ci-optimize.outputs.docker-change == 'true' }}
frontend_only: ${{ steps.ci-optimize.outputs.frontend-only == 'true' }}
ingestion_only: ${{ steps.ci-optimize.outputs.ingestion-only == 'true' }}
@ -58,10 +59,6 @@ jobs:
- name: Disk Check
run: df -h . && docker images
- uses: acryldata/sane-checkout-action@v3
- uses: actions/setup-python@v5
with:
python-version: "3.10"
cache: "pip"
- name: Set up JDK 17
uses: actions/setup-java@v4
with:
@ -92,11 +89,13 @@ jobs:
flags: metadata-io
name: metadata-io-test
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

View File

@ -19,9 +19,8 @@ jobs:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
configuration-path: ".github/pr-labeler-config.yml"
- uses: actions-ecosystem/action-add-labels@v1.1.3
# only add names of Acryl Data team members here
if:
${{
# only add names of DataHub team members here
if: ${{
!contains(
fromJson('[
"anshbansal",
@ -52,7 +51,14 @@ jobs:
"chakru-r",
"brock-acryl",
"mminichino",
"jayacryl"
"jayacryl",
"v-tarasevich-blitz-brain",
"ryota-cloud",
"annadoesdesign",
"jmacryl",
"esteban",
"anthonyburdi",
"ligfx"
]'),
github.actor
)
@ -63,8 +69,7 @@ jobs:
community-contribution
- uses: actions-ecosystem/action-add-labels@v1.1.3
# only add names of champions here. Confirm with DevRel Team
if:
${{
if: ${{
contains(
fromJson('[
"siladitya2",

View File

@ -3,6 +3,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- ".github/workflows/prefect-plugin.yml"
- "metadata-ingestion-modules/prefect-plugin/**"
@ -71,11 +72,13 @@ jobs:
flags: ingestion-prefect-plugin
name: pytest-prefect-${{ matrix.python-version }}
verbose: true
override_branch: ${{ github.head_ref || github.ref_name }}
- name: Upload test results to Codecov
if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
override_branch: ${{ github.head_ref || github.ref_name }}
event-file:
runs-on: ubuntu-latest

View File

@ -27,8 +27,13 @@ jobs:
env:
SIGNING_KEY: ${{ secrets.SIGNING_KEY }}
run: |
echo "Enable publish: ${{ env.SIGNING_KEY != '' }}"
if [[ "${{ github.repository }}" == "acryldata/datahub" ]]; then
echo "Enable publish for main repository: ${{ env.SIGNING_KEY != '' }}"
echo "publish=${{ env.SIGNING_KEY != '' }}" >> $GITHUB_OUTPUT
else
echo "Skipping publish for repository: ${{ github.repository }}"
echo "publish=false" >> $GITHUB_OUTPUT
fi
setup:
if: startsWith(github.ref, 'refs/tags/v')
runs-on: ubuntu-latest
@ -39,6 +44,9 @@ jobs:
uses: acryldata/sane-checkout-action@v3
- name: Compute Tag
id: tag
env:
GITHUB_REF_FALLBACK: ${{ github.ref }}
GITHUB_EVENT_NAME: ${{ github.event_name }}
run: |
source .github/scripts/docker_helpers.sh
TAG=$(echo ${GITHUB_REF} | sed -e 's,refs/tags/v,,g')

View File

@ -6,16 +6,20 @@ on:
paths:
- ".github/workflows/python-build-pages.yml"
- "metadata-ingestion/**"
- "datahub-actions/**"
- "metadata-ingestion-modules/**"
- "metadata-models/**"
- "python-build/**"
pull_request:
branches:
- "**"
paths:
- ".github/workflows/python-build-pages.yml"
- "metadata-ingestion/**"
- "datahub-actions/**"
- "metadata-ingestion-modules/**"
- "metadata-models/**"
- "python-build/**"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}

View File

@ -0,0 +1,63 @@
name: Frontend Preview
on:
push:
branches:
- master
paths-ignore:
- "docs/**"
- "**.md"
pull_request:
branches:
- "**"
paths-ignore:
- "docs/**"
- "**.md"
release:
types: [published]
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
setup:
runs-on: ubuntu-22.04
outputs:
frontend_change: ${{ steps.ci-optimize.outputs.frontend-change == 'true' }}
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- uses: ./.github/actions/ci-optimization
id: ci-optimize
deploy:
runs-on: ubuntu-22.04
permissions:
contents: read
deployments: write
timeout-minutes: 30
needs: setup
if: ${{ github.event.pull_request.head.repo.fork != 'true' }}
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- name: Set up JDK 17
uses: actions/setup-java@v4
with:
distribution: "zulu"
java-version: 17
- uses: gradle/gradle-build-action@v3
- name: Gradle build for frontend
if: ${{ needs.setup.outputs.frontend_change == 'true' }}
run: |
./gradlew :datahub-web-react:build -x test -x check --parallel
- name: Publish
if: ${{ needs.setup.outputs.frontend_change == 'true' }}
uses: cloudflare/pages-action@1
with:
apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
projectName: datahub-project-web-react
workingDirectory: datahub-web-react
directory: dist
gitHubToken: ${{ secrets.GITHUB_TOKEN }}

View File

@ -5,6 +5,7 @@ on:
push:
branches:
- master
- releases/**
paths:
- "metadata_models/**"
- "metadata-integration/java/datahub-client/**"

View File

@ -2,7 +2,15 @@ name: Test Results
on:
workflow_run:
workflows: ["build & test", "metadata ingestion", "Airflow Plugin", "Dagster Plugin", "Prefect Plugin", "GX Plugin"]
workflows:
[
"build & test",
"metadata ingestion",
"Airflow Plugin",
"Dagster Plugin",
"Prefect Plugin",
"GX Plugin",
]
types:
- completed

30
.github/workflows/yaml-format.yml vendored Normal file
View File

@ -0,0 +1,30 @@
name: yaml format
on:
push:
branches:
- master
paths:
- "**/*.{yml,yaml}"
pull_request:
branches:
- "**"
paths:
- "**/*.{yml,yaml}"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
yaml_format_check:
name: yaml_format_check
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: acryldata/sane-checkout-action@v3
- uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: run prettier --check
run: |-
./gradlew :datahub-web-react:githubActionsPrettierCheck

View File

@ -1,9 +1,30 @@
# Auto-generated by .github/scripts/generate_pre_commit.py at 2025-02-11 10:00:11 UTC
# Auto-generated by .github/scripts/generate_pre_commit.py at 2025-07-01 10:36:31 UTC
# Do not edit this file directly. Run the script to regenerate.
# Add additional hooks in .github/scripts/pre-commit-override.yaml
repos:
- repo: local
hooks:
- id: datahub-web-react-mdPrettierWriteChanged
name: mdPrettierWriteChanged
entry: ./gradlew :datahub-web-react:mdPrettierWriteChanged
language: system
files: ^.*\.md$
pass_filenames: false
- id: datahub-web-react-githubActionsPrettierWriteChanged
name: githubActionsPrettierWriteChanged
entry: ./gradlew :datahub-web-react:githubActionsPrettierWriteChanged
language: system
files: ^\.github/.*\.(yml|yaml)$
pass_filenames: false
- id: datahub-actions-lint-fix
name: datahub-actions Lint Fix
entry: ./gradlew :datahub-actions:lintFix
language: system
files: ^datahub-actions/.*\.py$
pass_filenames: false
- id: datahub-graphql-core-spotless
name: datahub-graphql-core Spotless Apply
entry: ./gradlew :datahub-graphql-core:spotlessApply
@ -53,6 +74,13 @@ repos:
files: ^metadata-dao-impl/kafka-producer/.*\.java$
pass_filenames: false
- id: metadata-events-mxe-avro-spotless
name: metadata-events/mxe-avro Spotless Apply
entry: ./gradlew :metadata-events:mxe-avro:spotlessApply
language: system
files: ^metadata-events/mxe-avro/.*\.java$
pass_filenames: false
- id: metadata-events-mxe-registration-spotless
name: metadata-events/mxe-registration Spotless Apply
entry: ./gradlew :metadata-events:mxe-registration:spotlessApply
@ -291,6 +319,13 @@ repos:
files: ^metadata-service/configuration/.*\.java$
pass_filenames: false
- id: metadata-service-events-service-spotless
name: metadata-service/events-service Spotless Apply
entry: ./gradlew :metadata-service:events-service:spotlessApply
language: system
files: ^metadata-service/events-service/.*\.java$
pass_filenames: false
- id: metadata-service-factories-spotless
name: metadata-service/factories Spotless Apply
entry: ./gradlew :metadata-service:factories:spotlessApply
@ -458,3 +493,17 @@ repos:
language: system
files: ^smoke-test/tests/cypress/.*\.tsx$
pass_filenames: false
- id: update-capability-summary
name: update-capability-summary
entry: ./gradlew :metadata-ingestion:capabilitySummary
language: system
files: ^metadata-ingestion/src/datahub/ingestion/source/.*\.py$
pass_filenames: false
- id: update-lineage-file
name: update-lineage-file
entry: ./gradlew :metadata-ingestion:lineageGen
language: system
files: ^(metadata-ingestion-modules/.*|metadata-models/.*)$
pass_filenames: false

40
CLAUDE.MD Normal file
View File

@ -0,0 +1,40 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) or any other agent when working with code in this repository.
## Coding conventions
- Keep code maintainable. This is not throw-away code. This goes to production.
- Generate unit tests where appropriate.
- Do not start generating random scripts to run the code you generated unless asked for.
- Do not add comments which are redundant given the function names
## Core concept docs
- `docs/what/urn.md` defines what a URN is
## Overall Directory structure
- This is the repository for the DataHub project.
- `README.MD` should give some basic information about the project.
- This is a multi-project Gradle build, so you will find a `build.gradle` file in most folders.
### metadata-ingestion module details
- `metadata-ingestion` contains the source and tests for the DataHub OSS CLI.
- `metadata-ingestion/developing.md` contains details about the environment used for testing.
- `.github/workflows/metadata-ingestion.yml` contains our GitHub workflow that is used in CI
- `metadata-ingestion/build.gradle` defines the Gradle tasks for this module
- `pyproject.toml`, `setup.py`, `setup.cfg` in the folder contain rules about the code style for the repository
- The `.md` files at the top level of this folder give you important information about the concepts of ingestion
- You can see examples of how to define various aspect types in `metadata-ingestion/src/datahub/emitter/mcp_builder.py`
- Source code goes in `metadata-ingestion/src/`
- Tests go in `metadata-ingestion/tests/` (not in `src/`)
- **Testing conventions for metadata-ingestion**:
- Unit tests: `metadata-ingestion/tests/unit/`
- Integration tests: `metadata-ingestion/tests/integration/`
- Test files should mirror the source directory structure
- Use pytest, not unittest
- Use `assert` statements, not `self.assertEqual()` or `self.assertIsNone()`
- Use regular classes, not `unittest.TestCase`
- Import `pytest` in test files
- Test files should be named `test_*.py` and placed in the appropriate test directory, not alongside source files (a minimal sketch following these conventions is shown below)
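As a concrete illustration of these conventions, here is a minimal sketch of what such a unit test could look like. It is not part of this changeset: the file name is hypothetical, and the imports and signatures assumed from the datahub package (make_dataset_urn, MetadataChangeProposalWrapper, DatasetPropertiesClass) should be verified against the installed version.
# metadata-ingestion/tests/unit/test_dataset_properties_mcp.py (hypothetical example)
import pytest
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass
@pytest.mark.parametrize("platform,name", [("snowflake", "db.schema.table")])
def test_dataset_properties_mcp(platform: str, name: str) -> None:
    # Build a dataset URN and a simple aspect, then wrap them in an MCP.
    urn = make_dataset_urn(platform=platform, name=name, env="PROD")
    props = DatasetPropertiesClass(name=name, description="example table")
    mcp = MetadataChangeProposalWrapper(entityUrn=urn, aspect=props)
    # Plain assert statements, per the conventions above (no unittest.TestCase).
    assert urn == f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)"
    assert mcp.entityUrn == urn
    assert mcp.aspect is props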

View File

@ -18,7 +18,7 @@ export const Logo = (props) => {
<!--
HOSTED_DOCS_ONLY-->
<p align="center">
<a href="https://datahubproject.io">
<a href="https://datahub.com">
<img alt="DataHub" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/datahub-logo-color-mark.svg" height="150" />
</a>
</p>
@ -26,7 +26,7 @@ HOSTED_DOCS_ONLY-->
# DataHub: The Data Discovery Platform for the Modern Data Stack
### Built with ❤️ by <img src="https://datahubproject.io/img/acryl-logo-light-mark.png" width="20"/> [Acryl Data](https://acryldata.io) and <img src="https://datahubproject.io/img/LI-In-Bug.png" width="20"/> [LinkedIn](https://engineering.linkedin.com)
### Built with ❤️ by <img src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/datahub-logo-color-mark.svg" width="20"/> [DataHub](https://datahub.com) and <img src="https://docs.datahub.com/img/LI-In-Bug.png" width="20"/> [LinkedIn](https://engineering.linkedin.com)
<div>
<a target="_blank" href="https://github.com/datahub-project/datahub/blob/master/LICENSE">
@ -36,11 +36,11 @@ HOSTED_DOCS_ONLY-->
<a target="_blank" href="https://github.com/datahub-project/datahub/pulse">
<img alt="GitHub commit activity" src="https://img.shields.io/github/commit-activity/m/datahub-project/datahub?label=commits&labelColor=133554&color=1890ff" /></a>
<br />
<a target="_blank" href="https://pages.acryl.io/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme">
<a target="_blank" href="https://datahub.com/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme">
<img alt="Slack" src="https://img.shields.io/badge/slack-join_community-red.svg?logo=slack&labelColor=133554&color=1890ff" /></a>
<a href="https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w">
<img alt="YouTube" src="https://img.shields.io/youtube/channel/subscribers/UC3qFQC5IiwR5fvWEqi_tJ5w?style=flat&logo=youtube&label=subscribers&labelColor=133554&color=1890ff"/></a>
<a href="https://blog.datahubproject.io/">
<a href="https://medium.com/datahub-project/">
<img alt="Medium" src="https://img.shields.io/badge/blog-DataHub-red.svg?style=flat&logo=medium&logoColor=white&labelColor=133554&color=1890ff" /></a>
<a href="https://x.com/datahubproject">
<img alt="X (formerly Twitter) Follow" src="https://img.shields.io/badge/follow-datahubproject-red.svg?style=flat&logo=x&labelColor=133554&color=1890ff" /></a>
@ -48,26 +48,26 @@ HOSTED_DOCS_ONLY-->
---
### 🏠 Docs: [datahubproject.io](https://datahubproject.io/docs)
### 🏠 Docs: [docs.datahub.com](https://docs.datahub.com/)
[Quickstart](https://datahubproject.io/docs/quickstart) |
[Features](https://datahubproject.io/docs/) |
[Quickstart](https://docs.datahub.com/docs/quickstart) |
[Features](https://docs.datahub.com/docs/features) |
[Roadmap](https://feature-requests.datahubproject.io/roadmap) |
[Adoption](#adoption) |
[Demo](https://demo.datahubproject.io/) |
[Town Hall](https://datahubproject.io/docs/townhalls)
[Demo](https://demo.datahub.com/) |
[Town Hall](https://docs.datahub.com/docs/townhalls)
---
> 📣DataHub Town Hall is the 4th Thursday at 9am US PT of every month - [add it to your calendar!](https://rsvp.datahubproject.io/)
> 📣DataHub Town Hall is the 4th Thursday at 9am US PT of every month - [add it to your calendar!](https://lu.ma/datahubevents/)
>
> - Town-hall Zoom link: [zoom.datahubproject.io](https://zoom.datahubproject.io)
> - [Meeting details](docs/townhalls.md) & [past recordings](docs/townhall-history.md)
> ✨DataHub Community Highlights:
>
> - Read our Monthly Project Updates [here](https://blog.datahubproject.io/tagged/project-updates).
> - Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data: [Data Engineering Podcast](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
> - Read our Monthly Project Updates [here](https://medium.com/datahub-project/tagged/project-updates).
> - Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At DataHub: [Data Engineering Podcast](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
> - Check out our most-read blog post, [DataHub: Popular Metadata Architectures Explained](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained) @ LinkedIn Engineering Blog.
> - Join us on [Slack](docs/slack.md)! Ask questions and keep up with the latest announcements.
@ -82,18 +82,18 @@ Check out DataHub's [Features](docs/features.md) & [Roadmap](https://feature-req
## Demo and Screenshots
There's a [hosted demo environment](https://demo.datahubproject.io/) courtesy of [Acryl Data](https://acryldata.io) where you can explore DataHub without installing it locally.
There's a [hosted demo environment](https://demo.datahub.com/) courtesy of DataHub where you can explore DataHub without installing it locally.
## Quickstart
Please follow the [DataHub Quickstart Guide](https://datahubproject.io/docs/quickstart) to run DataHub locally using [Docker](https://docker.com).
Please follow the [DataHub Quickstart Guide](https://docs.datahub.com/docs/quickstart) to run DataHub locally using [Docker](https://docker.com).
## Development
If you're looking to build & modify datahub please take a look at our [Development Guide](https://datahubproject.io/docs/developers).
If you're looking to build & modify datahub please take a look at our [Development Guide](https://docs.datahub.com/docs/developers).
<p align="center">
<a href="https://demo.datahubproject.io/">
<a href="https://demo.datahub.com/">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/entity.png"/>
</a>
</p>
@ -102,11 +102,12 @@ If you're looking to build & modify datahub please take a look at our [Developme
- [datahub-project/datahub](https://github.com/datahub-project/datahub): This repository contains the complete source code for DataHub's metadata model, metadata services, integration connectors and the web application.
- [acryldata/datahub-actions](https://github.com/acryldata/datahub-actions): DataHub Actions is a framework for responding to changes to your DataHub Metadata Graph in real time.
- [acryldata/datahub-helm](https://github.com/acryldata/datahub-helm): Repository of helm charts for deploying DataHub on a Kubernetes cluster
- [acryldata/meta-world](https://github.com/acryldata/meta-world): A repository to store recipes, custom sources, transformations and other things to make your DataHub experience magical
- [dbt-impact-action](https://github.com/acryldata/dbt-impact-action) : This repository contains a github action for commenting on your PRs with a summary of the impact of changes within a dbt project
- [datahub-tools](https://github.com/makenotion/datahub-tools) : Additional python tools to interact with the DataHub GraphQL endpoints, built by Notion
- [business-glossary-sync-action](https://github.com/acryldata/business-glossary-sync-action) : This repository contains a github action that opens PRs to update your business glossary yaml file.
- [acryldata/datahub-helm](https://github.com/acryldata/datahub-helm): Helm charts for deploying DataHub on a Kubernetes cluster
- [acryldata/meta-world](https://github.com/acryldata/meta-world): A repository to store recipes, custom sources, transformations and other things to make your DataHub experience magical.
- [dbt-impact-action](https://github.com/acryldata/dbt-impact-action): A github action for commenting on your PRs with a summary of the impact of changes within a dbt project.
- [datahub-tools](https://github.com/makenotion/datahub-tools): Additional python tools to interact with the DataHub GraphQL endpoints, built by Notion.
- [business-glossary-sync-action](https://github.com/acryldata/business-glossary-sync-action): A github action that opens PRs to update your business glossary yaml file.
- [mcp-server-datahub](https://github.com/acryldata/mcp-server-datahub): A [Model Context Protocol](https://modelcontextprotocol.io/) server implementation for DataHub.
## Releases
@ -118,7 +119,7 @@ We welcome contributions from the community. Please refer to our [Contributing G
## Community
Join our [Slack workspace](https://pages.acryl.io/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme) for discussions and important announcements. You can also find out more about our upcoming [town hall meetings](docs/townhalls.md) and view past recordings.
Join our [Slack workspace](https://datahub.com/slack?utm_source=github&utm_medium=readme&utm_campaign=github_readme) for discussions and important announcements. You can also find out more about our upcoming [town hall meetings](docs/townhalls.md) and view past recordings.
## Security
@ -173,11 +174,11 @@ Here are the companies that have officially adopted DataHub. Please feel free to
## Select Articles & Talks
- [DataHub Blog](https://blog.datahubproject.io/)
- [DataHub Blog](https://medium.com/datahub-project/)
- [DataHub YouTube Channel](https://www.youtube.com/channel/UC3qFQC5IiwR5fvWEqi_tJ5w)
- [Optum: Data Mesh via DataHub](https://opensource.optum.com/blog/2022/03/23/data-mesh-via-datahub)
- [Saxo Bank: Enabling Data Discovery in Data Mesh](https://medium.com/datahub-project/enabling-data-discovery-in-a-data-mesh-the-saxo-journey-451b06969c8f)
- [Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
- [Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At DataHub](https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
- [DataHub: Popular Metadata Architectures Explained](https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained)
- [Driving DataOps Culture with LinkedIn DataHub](https://www.youtube.com/watch?v=ccsIKK9nVxk) @ [DataOps Unleashed 2021](https://dataopsunleashed.com/#shirshanka-session)
- [The evolution of metadata: LinkedIn's story](https://speakerdeck.com/shirshanka/the-evolution-of-metadata-linkedins-journey-strata-nyc-2019) @ [Strata Data Conference 2019](https://conferences.oreilly.com/strata/strata-ny-2019.html)

View File

@ -1,6 +1,6 @@
# Reporting Security Issues
If you think you have found a security vulnerability, please send a report to security@datahubproject.io. This address can be used for all of Acryl Data's open source and commercial products (including but not limited to DataHub and Acryl Data). We can accept only vulnerability reports at this address.
If you think you have found a security vulnerability, please send a report to security@datahubproject.io. This address can be used for all of DataHub's open source and commercial products (including but not limited to DataHub Core and DataHub Cloud). We can accept only vulnerability reports at this address.
It's not mandatory, but if you'd like to encrypt your message to us, please use our PGP key. The key fingerprint is:
@ -8,9 +8,9 @@ A50B10A86CC21F4B7BE102E170764C95B4FACEBF
The key is available from [keyserver.ubuntu.com](https://keyserver.ubuntu.com/pks/lookup?search=A50B10A86CC21F4B7BE102E170764C95B4FACEBF&fingerprint=on&op=index).
Acryl Data will send you a response indicating the next steps in handling your report. After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.
DataHub will send you a response indicating the next steps in handling your report. After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.
**Important:** We ask you not to disclose the vulnerability before it has been fixed and announced, unless you have received a response from the Acryl Data security team that you can do so.
**Important:** We ask you not to disclose the vulnerability before it has been fixed and announced, unless you have received a response from the DataHub security team that you can do so.
## Security announcements


@ -1,3 +1,6 @@
import org.apache.tools.ant.filters.ReplaceTokens
buildscript {
ext.jdkVersionDefault = 17
ext.javaClassVersionDefault = 11
@ -32,36 +35,37 @@ buildscript {
ext.junitJupiterVersion = '5.6.1'
// Releases: https://github.com/linkedin/rest.li/blob/master/CHANGELOG.md
ext.pegasusVersion = '29.57.0'
ext.pegasusVersion = '29.65.7'
ext.mavenVersion = '3.6.3'
ext.versionGradle = '8.11.1'
ext.springVersion = '6.1.14'
ext.springBootVersion = '3.2.9'
ext.springKafkaVersion = '3.1.6'
ext.openTelemetryVersion = '1.45.0'
ext.springVersion = '6.2.5'
ext.springBootVersion = '3.4.5'
ext.springKafkaVersion = '3.3.6'
ext.openTelemetryVersion = '1.49.0'
ext.neo4jVersion = '5.20.0'
ext.neo4jTestVersion = '5.20.0'
ext.neo4jApocVersion = '5.20.0'
ext.testContainersVersion = '1.17.4'
ext.testContainersVersion = '1.21.1'
ext.elasticsearchVersion = '2.11.1' // ES 7.10, Opensearch 1.x, 2.x
ext.jacksonVersion = '2.15.3'
ext.jettyVersion = '12.0.16'
ext.jacksonVersion = '2.18.4'
ext.jettyVersion = '12.0.21'
// see also datahub-frontend/play.gradle
ext.playVersion = '2.8.22'
ext.playScalaVersion = '2.13'
ext.akkaVersion = '2.6.21' // 2.7.0+ has incompatible license
ext.log4jVersion = '2.23.1'
ext.slf4jVersion = '1.7.36'
ext.logbackClassic = '1.4.14'
ext.logbackClassic = '1.5.18'
ext.hadoop3Version = '3.3.6'
ext.kafkaVersion = '5.5.15'
ext.kafkaVersion = '8.0.0'
ext.hazelcastVersion = '5.3.6'
ext.ebeanVersion = '15.5.2'
ext.googleJavaFormatVersion = '1.18.1'
ext.openLineageVersion = '1.25.0'
ext.logbackClassicJava8 = '1.2.12'
ext.awsSdk2Version = '2.30.33'
ext.docker_registry = 'acryldata'
ext.docker_registry = project.getProperties().getOrDefault("dockerRegistry", 'acryldata')
apply from: './repositories.gradle'
buildscript.repositories.addAll(project.repositories)
@ -81,13 +85,13 @@ plugins {
id 'com.gorylenko.gradle-git-properties' version '2.4.1'
id 'com.gradleup.shadow' version '8.3.5' apply false
id 'com.palantir.docker' version '0.35.0' apply false
id 'com.avast.gradle.docker-compose' version '0.17.6'
id 'com.avast.gradle.docker-compose' version '0.17.12'
id "com.diffplug.spotless" version "6.23.3"
// https://blog.ltgt.net/javax-jakarta-mess-and-gradle-solution/
// TODO id "org.gradlex.java-ecosystem-capabilities" version "1.0"
}
apply from: "gradle/docker/docker.gradle"
apply from: "gradle/docker/docker-utils.gradle"
project.ext.spec = [
'product' : [
@ -108,24 +112,26 @@ project.ext.spec = [
project.ext.externalDependency = [
'akkaHttp': "com.typesafe.akka:akka-http-core_$playScalaVersion:10.2.10", // max version due to licensing
'akkaParsing': "com.typesafe.akka:akka-parsing_$playScalaVersion:10.2.10", // akka-parsing is part of akka-http, so use akka http version
'akkaActor': "com.typesafe.akka:akka-actor_$playScalaVersion:$akkaVersion",
'akkaStream': "com.typesafe.akka:akka-stream_$playScalaVersion:$akkaVersion",
'akkaActorTyped': "com.typesafe.akka:akka-actor-typed_$playScalaVersion:$akkaVersion",
'akkaSlf4j': "com.typesafe.akka:akka-slf4j_$playScalaVersion:$akkaVersion",
'akkaJackson': "com.typesafe.akka:akka-serialization-jackson_$playScalaVersion:$akkaVersion",
'akkaParsing': "com.typesafe.akka:akka-parsing_$playScalaVersion:$akkaVersion",
'akkaProtobuf': "com.typesafe.akka:akka-protobuf-v3_$playScalaVersion:$akkaVersion",
'antlr4Runtime': 'org.antlr:antlr4-runtime:4.9.3',
'antlr4': 'org.antlr:antlr4:4.9.3',
'assertJ': 'org.assertj:assertj-core:3.11.1',
'avro': 'org.apache.avro:avro:1.11.4',
'avroCompiler': 'org.apache.avro:avro-compiler:1.11.4',
'awsGlueSchemaRegistrySerde': 'software.amazon.glue:schema-registry-serde:1.1.17',
'awsMskIamAuth': 'software.amazon.msk:aws-msk-iam-auth:2.0.3',
'awsS3': 'software.amazon.awssdk:s3:2.26.21',
'awsSecretsManagerJdbc': 'com.amazonaws.secretsmanager:aws-secretsmanager-jdbc:1.0.13',
'awsPostgresIamAuth': 'software.amazon.jdbc:aws-advanced-jdbc-wrapper:1.0.2',
'awsRds':'software.amazon.awssdk:rds:2.18.24',
'awsGlueSchemaRegistrySerde': 'software.amazon.glue:schema-registry-serde:1.1.23',
'awsMskIamAuth': 'software.amazon.msk:aws-msk-iam-auth:2.3.0',
'awsS3': "software.amazon.awssdk:s3:$awsSdk2Version",
'awsSecretsManagerJdbc': 'com.amazonaws.secretsmanager:aws-secretsmanager-jdbc:1.0.15',
'awsPostgresIamAuth': 'software.amazon.jdbc:aws-advanced-jdbc-wrapper:2.5.4',
'awsRds':"software.amazon.awssdk:rds:$awsSdk2Version",
'azureIdentityExtensions': 'com.azure:azure-identity-extensions:1.2.2',
'azureIdentity': 'com.azure:azure-identity:1.15.4',
'cacheApi': 'javax.cache:cache-api:1.1.0',
'commonsCli': 'commons-cli:commons-cli:1.5.0',
'commonsIo': 'commons-io:commons-io:2.17.0',
@ -137,7 +143,8 @@ project.ext.externalDependency = [
'datastaxOssCore': 'com.datastax.oss:java-driver-core:4.14.1',
'datastaxOssQueryBuilder': 'com.datastax.oss:java-driver-query-builder:4.14.1',
'dgraph4j' : 'io.dgraph:dgraph4j:24.1.1',
'dgraphNetty': 'io.grpc:grpc-netty-shaded:1.69.0',
'dgraphNetty': 'io.grpc:grpc-netty:1.71.0',
'dgraphShadedNetty': 'io.grpc:grpc-netty-shaded:1.71.0',
'dropwizardMetricsCore': 'io.dropwizard.metrics:metrics-core:4.2.3',
'dropwizardMetricsJmx': 'io.dropwizard.metrics:metrics-jmx:4.2.3',
'ebean': 'io.ebean:ebean:' + ebeanVersion,
@ -146,11 +153,10 @@ project.ext.externalDependency = [
'ebeanDdl': 'io.ebean:ebean-ddl-generator:' + ebeanVersion,
'ebeanQueryBean': 'io.ebean:querybean-generator:' + ebeanVersion,
'elasticSearchRest': 'org.opensearch.client:opensearch-rest-high-level-client:' + elasticsearchVersion,
'elasticSearchJava': 'org.opensearch.client:opensearch-java:2.6.0',
'findbugsAnnotations': 'com.google.code.findbugs:annotations:3.0.1',
'graphqlJava': 'com.graphql-java:graphql-java:21.5',
'graphqlJavaScalars': 'com.graphql-java:graphql-java-extended-scalars:21.0',
'gson': 'com.google.code.gson:gson:2.8.9',
'gson': 'com.google.code.gson:gson:2.12.0',
'guice': 'com.google.inject:guice:7.0.0',
'guicePlay': 'com.google.inject:guice:5.0.1', // Used for frontend while still on old Play version
'guava': 'com.google.guava:guava:32.1.3-jre',
@ -163,13 +169,17 @@ project.ext.externalDependency = [
'hazelcastSpring':"com.hazelcast:hazelcast-spring:$hazelcastVersion",
'hazelcastTest':"com.hazelcast:hazelcast:$hazelcastVersion:tests",
'hibernateCore': 'org.hibernate:hibernate-core:5.2.16.Final',
'httpClient': 'org.apache.httpcomponents.client5:httpclient5:5.3',
'httpClient': 'org.apache.httpcomponents.client5:httpclient5:5.4.3',
'iStackCommons': 'com.sun.istack:istack-commons-runtime:4.0.1',
'jacksonJDK8': "com.fasterxml.jackson.datatype:jackson-datatype-jdk8:$jacksonVersion",
'jacksonDataPropertyFormat': "com.fasterxml.jackson.dataformat:jackson-dataformat-properties:$jacksonVersion",
'jacksonCore': "com.fasterxml.jackson.core:jackson-core:$jacksonVersion",
'jacksonDataBind': "com.fasterxml.jackson.core:jackson-databind:$jacksonVersion",
'jacksonDataFormatYaml': "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:$jacksonVersion",
// The jacksonBom controls the version of other jackson modules, pin the version once
// implementation platform(externalDependency.jacksonBom)
'jacksonBom': "com.fasterxml.jackson:jackson-bom:$jacksonVersion",
'jacksonJDK8': 'com.fasterxml.jackson.datatype:jackson-datatype-jdk8',
'jacksonDataPropertyFormat': 'com.fasterxml.jackson.dataformat:jackson-dataformat-properties',
'jacksonCore': 'com.fasterxml.jackson.core:jackson-core',
'jacksonDataBind': 'com.fasterxml.jackson.core:jackson-databind',
'jacksonJsr310': 'com.fasterxml.jackson.datatype:jackson-datatype-jsr310',
'jacksonDataFormatYaml': 'com.fasterxml.jackson.dataformat:jackson-dataformat-yaml',
'woodstoxCore': 'com.fasterxml.woodstox:woodstox-core:6.4.0',
'javatuples': 'org.javatuples:javatuples:1.2',
'javaxInject' : 'javax.inject:javax.inject:1',
@ -197,7 +207,7 @@ project.ext.externalDependency = [
'kafkaAvroSerde': "io.confluent:kafka-streams-avro-serde:$kafkaVersion",
'kafkaAvroSerializer': "io.confluent:kafka-avro-serializer:$kafkaVersion",
'kafkaClients': "org.apache.kafka:kafka-clients:$kafkaVersion-ccs",
'snappy': 'org.xerial.snappy:snappy-java:1.1.10.5',
'snappy': 'org.xerial.snappy:snappy-java:1.1.10.7',
'logbackClassic': "ch.qos.logback:logback-classic:$logbackClassic",
'logbackClassicJava8' : "ch.qos.logback:logback-classic:$logbackClassicJava8",
'slf4jApi': "org.slf4j:slf4j-api:$slf4jVersion",
@ -222,10 +232,14 @@ project.ext.externalDependency = [
'opentelemetryApi': 'io.opentelemetry:opentelemetry-api:' + openTelemetryVersion,
'opentelemetrySdk': 'io.opentelemetry:opentelemetry-sdk:' + openTelemetryVersion,
'opentelemetrySdkTrace': 'io.opentelemetry:opentelemetry-sdk-trace:' + openTelemetryVersion,
'opentelemetrySdkMetrics': 'io.opentelemetry:opentelemetry-sdk-metrics:' + openTelemetryVersion,
'opentelemetryAutoConfig': 'io.opentelemetry:opentelemetry-sdk-extension-autoconfigure:' + openTelemetryVersion,
'opentelemetryAnnotations': 'io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:2.11.0',
'opentelemetryExporter': 'io.opentelemetry:opentelemetry-exporter-otlp:' + openTelemetryVersion,
'openTelemetryExporterLogging': 'io.opentelemetry:opentelemetry-exporter-logging:' + openTelemetryVersion,
'openTelemetryExporterCommon': 'io.opentelemetry:opentelemetry-exporter-otlp-common:' + openTelemetryVersion,
'opentelemetryAnnotations': 'io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:2.15.0',
'opentracingJdbc':'io.opentracing.contrib:opentracing-jdbc:0.2.15',
'parquet': 'org.apache.parquet:parquet-avro:1.12.3',
'parquet': 'org.apache.parquet:parquet-avro:1.15.2',
'parquetHadoop': 'org.apache.parquet:parquet-hadoop:1.13.1',
'picocli': 'info.picocli:picocli:4.5.0',
'playCache': "com.typesafe.play:play-cache_$playScalaVersion:$playVersion",
@ -240,11 +254,11 @@ project.ext.externalDependency = [
'playFilters': "com.typesafe.play:filters-helpers_$playScalaVersion:$playVersion",
'pac4j': 'org.pac4j:pac4j-oidc:6.0.6',
'playPac4j': "org.pac4j:play-pac4j_$playScalaVersion:12.0.0-PLAY2.8",
'postgresql': 'org.postgresql:postgresql:42.7.4',
'postgresql': 'org.postgresql:postgresql:42.7.7',
'protobuf': 'com.google.protobuf:protobuf-java:3.25.5',
'grpcProtobuf': 'io.grpc:grpc-protobuf:1.53.0',
'rangerCommons': 'org.apache.ranger:ranger-plugins-common:2.3.0',
'reflections': 'org.reflections:reflections:0.9.9',
'reflections': 'org.reflections:reflections:0.9.12',
'resilience4j': 'io.github.resilience4j:resilience4j-retry:1.7.1',
'rythmEngine': 'org.rythmengine:rythm-engine:1.3.0',
'servletApi': 'jakarta.servlet:jakarta.servlet-api:6.0.0',
@ -255,7 +269,7 @@ project.ext.externalDependency = [
'springBeans': "org.springframework:spring-beans:$springVersion",
'springContext': "org.springframework:spring-context:$springVersion",
'springCore': "org.springframework:spring-core:$springVersion",
'springDocUI': 'org.springdoc:springdoc-openapi-starter-webmvc-ui:2.3.0',
'springDocUI': 'org.springdoc:springdoc-openapi-starter-webmvc-ui:2.8.9',
'springJdbc': "org.springframework:spring-jdbc:$springVersion",
'springWeb': "org.springframework:spring-web:$springVersion",
'springWebMVC': "org.springframework:spring-webmvc:$springVersion",
@ -268,10 +282,11 @@ project.ext.externalDependency = [
'springBootStarterValidation': "org.springframework.boot:spring-boot-starter-validation:$springBootVersion",
'springKafka': "org.springframework.kafka:spring-kafka:$springKafkaVersion",
'springActuator': "org.springframework.boot:spring-boot-starter-actuator:$springBootVersion",
'springRetry': "org.springframework.retry:spring-retry:2.0.6",
'swaggerAnnotations': 'io.swagger.core.v3:swagger-annotations:2.2.15',
'springRetry': "org.springframework.retry:spring-retry:2.0.11",
'swaggerAnnotations': 'io.swagger.core.v3:swagger-annotations:2.2.30',
'swaggerCli': 'io.swagger.codegen.v3:swagger-codegen-cli:3.0.46',
'swaggerCore': 'io.swagger.core.v3:swagger-core:2.2.7',
'swaggerCore': 'io.swagger.core.v3:swagger-core:2.2.30',
'swaggerParser': 'io.swagger.parser.v3:swagger-parser:2.1.27',
'springBootAutoconfigureJdk11': 'org.springframework.boot:spring-boot-autoconfigure:2.7.18',
'testng': 'org.testng:testng:7.8.0',
'testContainers': 'org.testcontainers:testcontainers:' + testContainersVersion,
@ -280,7 +295,7 @@ project.ext.externalDependency = [
'testContainersElasticsearch': 'org.testcontainers:elasticsearch:' + testContainersVersion,
'testContainersCassandra': 'org.testcontainers:cassandra:' + testContainersVersion,
'testContainersKafka': 'org.testcontainers:kafka:' + testContainersVersion,
'testContainersOpenSearch': 'org.opensearch:opensearch-testcontainers:2.0.0',
'testContainersOpenSearch': 'org.opensearch:opensearch-testcontainers:2.1.3',
'typesafeConfig':'com.typesafe:config:1.4.1',
'wiremock':'com.github.tomakehurst:wiremock:2.10.0',
'zookeeper': 'org.apache.zookeeper:zookeeper:3.8.4',
@ -381,6 +396,12 @@ configure(subprojects.findAll {! it.name.startsWith('spark-lineage')}) {
exclude group: "org.slf4j", module: "slf4j-ext"
exclude group: "org.codehaus.jackson", module: "jackson-mapper-asl"
exclude group: "javax.mail", module: "mail"
exclude group: 'org.glassfish', module: 'javax.json'
exclude group: 'org.glassfish', module: 'jakarta.json'
// Tomcat excluded for jetty
exclude group: 'org.apache.tomcat.embed', module: 'tomcat-embed-el'
exclude group: 'org.springframework.boot', module: 'spring-boot-starter-tomcat'
resolutionStrategy.force externalDependency.antlr4Runtime
resolutionStrategy.force externalDependency.antlr4
@ -395,25 +416,56 @@ configure(subprojects.findAll {! it.name.startsWith('spark-lineage')}) {
}
}
subprojects {
apply plugin: 'maven-publish'
apply plugin: 'com.gorylenko.gradle-git-properties'
apply plugin: 'com.diffplug.spotless'
gitProperties {
keys = ['git.commit.id','git.commit.id.describe','git.commit.time']
// using any tags (not limited to annotated tags) for "git.commit.id.describe" property
// see http://ajoberstar.org/grgit/grgit-describe.html for more info about the describe method and available parameters
// 'it' is an instance of org.ajoberstar.grgit.Grgit
customProperty 'git.commit.id.describe', { it.describe(tags: true) }
gitPropertiesResourceDir = rootProject.buildDir
failOnNoGitDirectory = false
}
def gitPropertiesGenerated = false
apply from: 'gradle/versioning/versioning-global.gradle'
tasks.register("generateGitPropertiesGlobal", com.gorylenko.GenerateGitPropertiesTask) {
doFirst {
if (!gitPropertiesGenerated) {
println "Generating git.properties"
gitPropertiesGenerated = true
} else {
// Skip actual execution if already run
onlyIf { false }
}
}
}
subprojects {
apply plugin: 'maven-publish'
apply plugin: 'com.diffplug.spotless'
def gitPropertiesTask = tasks.register("copyGitProperties", Copy) {
dependsOn rootProject.tasks.named("generateGitPropertiesGlobal")
def sourceFile = file("${rootProject.buildDir}/git.properties")
from sourceFile
into "$project.buildDir/resources/main"
}
plugins.withType(JavaPlugin).configureEach {
project.tasks.named(JavaPlugin.CLASSES_TASK_NAME).configure{
dependsOn gitPropertiesTask
}
if (project.name == 'datahub-web-react') {
return
}
/* TODO: evaluate ignoring jar timestamps for increased caching (compares checksum instead)
jar {
preserveFileTimestamps = false
}*/
dependencies {
implementation externalDependency.annotationApi
@ -517,3 +569,17 @@ wrapper {
gradleVersion = project.versionGradle
distributionType = Wrapper.DistributionType.ALL
}
tasks.register('format') {
dependsOn(':datahub-web-react:graphqlPrettierWrite')
dependsOn(':datahub-web-react:githubActionsPrettierWrite')
dependsOn(':datahub-web-react:mdPrettierWrite')
dependsOn('spotlessApply')
}
tasks.register('formatChanged') {
dependsOn(':datahub-web-react:graphqlPrettierWriteChanged')
dependsOn(':datahub-web-react:githubActionsPrettierWriteChanged')
dependsOn(':datahub-web-react:mdPrettierWriteChanged')
dependsOn('spotlessApply')
}

datahub-actions/.gitignore (new file, 24 lines)

@ -0,0 +1,24 @@
venv*/
.coverage
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
junit.*xml

datahub-actions/README.md (new file, 242 lines)

@ -0,0 +1,242 @@
# ⚡ DataHub Actions Framework
Welcome to DataHub Actions! The Actions framework makes responding to realtime changes in your Metadata Graph easy, enabling you to seamlessly integrate [DataHub](https://github.com/datahub-project/datahub) into a broader events-based architecture.
For a detailed introduction, check out the [original announcement](https://www.youtube.com/watch?v=7iwNxHgqxtg&t=2189s) of the DataHub Actions Framework at the DataHub April 2022 Town Hall. For a more in-depth look at use cases and concepts, check out [DataHub Actions Concepts](../docs/actions/concepts.md).
## Quickstart
To get started right away, check out the [DataHub Actions Quickstart](../docs/actions/quickstart.md) Guide.
## Prerequisites
The DataHub Actions CLI commands are an extension of the base `datahub` CLI commands. We recommend
first installing the `datahub` CLI:
```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub --version
```
> Note that the Actions Framework requires a version of `acryl-datahub` >= v0.8.34
## Installation
Next, simply install the `acryl-datahub-actions` package from PyPI:
```shell
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub-actions
datahub actions version
```
## Configuring an Action
Actions are configured using a YAML file, much in the same way DataHub ingestion sources are. An action configuration file consists of the following:
1. Action Pipeline Name (Should be unique and static)
2. Source Configurations
3. Transform + Filter Configurations
4. Action Configuration
5. Pipeline Options (Optional)
6. DataHub API configs (Optional - required for select actions)
With each component being independently pluggable and configurable.
```yml
# 1. Required: Action Pipeline Name
name: <action-pipeline-name>
# 2. Required: Event Source - Where to source event from.
source:
type: <source-type>
config:
# Event Source specific configs (map)
# 3a. Optional: Filter to run on events (map)
filter:
event_type: <filtered-event-type>
event:
# Filter event fields by exact-match
<filtered-event-fields>
# 3b. Optional: Custom Transformers to run on events (array)
transform:
- type: <transformer-type>
config:
# Transformer-specific configs (map)
# 4. Required: Action - What action to take on events.
action:
type: <action-type>
config:
# Action-specific configs (map)
# 5. Optional: Additional pipeline options (error handling, etc)
options:
retry_count: 0 # The number of times to retry an Action with the same event. (If an exception is thrown). 0 by default.
failure_mode: "CONTINUE" # What to do when an event fails to be processed. Either 'CONTINUE' to make progress or 'THROW' to stop the pipeline. Either way, the failed event will be logged to a failed_events.log file.
failed_events_dir: "/tmp/datahub/actions" # The directory in which to write a failed_events.log file that tracks events which fail to be processed. Defaults to "/tmp/logs/datahub/actions".
# 6. Optional: DataHub API configuration
datahub:
server: "http://localhost:8080" # Location of DataHub API
# token: <your-access-token> # Required if Metadata Service Auth enabled
```
### Example: Hello World
A simple configuration file for a "Hello World" action, which simply prints all events it receives, is:
```yml
# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source: Where to source event from.
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Action: What action to take on events.
action:
type: "hello_world"
```
We can modify this configuration further to filter for specific events, by adding a "filter" block.
```yml
# 1. Action Pipeline Name
name: "hello_world"
# 2. Event Source - Where to source event from.
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 3. Filter - Filter events that reach the Action
filter:
event_type: "EntityChangeEvent_v1"
event:
category: "TAG"
operation: "ADD"
modifier: "urn:li:tag:pii"
# 4. Action - What action to take on events.
action:
type: "hello_world"
```
## Running an Action
To run a new Action, just use the `actions` CLI command
```
datahub actions -c <config.yml>
```
Once the Action is running, you will see
```
Action Pipeline with name '<action-pipeline-name>' is now running.
```
### Running multiple Actions
You can run multiple Actions Pipelines within the same command. Simply provide multiple
config files by repeating the "-c" command line argument.
For example,
```
datahub actions -c <config-1.yaml> -c <config-2.yaml>
```
### Running in debug mode
Simply append the `--debug` flag to the CLI to run your action in debug mode.
```
datahub actions -c <config.yaml> --debug
```
### Stopping an Action
Just issue a Control-C as usual. You should see the Actions Pipeline shut down gracefully, with a small
summary of processing results.
```
Actions Pipeline with name '<action-pipeline-name>' has been stopped.
```
## Supported Events
Two event types are currently supported. Read more about them below.
- [Entity Change Event V1](../docs/actions/events/entity-change-event.md)
- [Metadata Change Log V1](../docs/actions/events/metadata-change-log-event.md)
## Supported Event Sources
Currently, the only event source that is officially supported is `kafka`, which polls for events
via a Kafka Consumer.
- [Kafka Event Source](../docs/actions/sources/kafka-event-source.md)
## Supported Actions
By default, DataHub supports a set of standard actions plugins. These can be found inside the folder
`src/datahub-actions/plugins`.
Some pre-included Actions include
- [Hello World](../docs/actions/actions/hello_world.md)
- [Executor](../docs/actions/actions/executor.md)
## Development
### Build and Test
Notice that we support all actions commands using a separate `datahub-actions` CLI entry point. Feel free
to use this during development.
```
# Build datahub-actions module
./gradlew datahub-actions:build
# Drop into virtual env
cd datahub-actions && source venv/bin/activate
# Start hello world action
datahub-actions actions -c ../examples/hello_world.yaml
# Start ingestion executor action
datahub-actions actions -c ../examples/executor.yaml
# Start multiple actions
datahub-actions actions -c ../examples/executor.yaml -c ../examples/hello_world.yaml
```
### Developing a Transformer
To develop a new Transformer, check out the [Developing a Transformer](../docs/actions/guides/developing-a-transformer.md) guide.
### Developing an Action
To develop a new Action, check out the [Developing an Action](../docs/actions/guides/developing-an-action.md) guide.
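
For orientation, here is a minimal sketch of what a custom Action can look like. It assumes the `Action`, `EventEnvelope`, and `PipelineContext` classes shipped in the `datahub_actions` package; the `LoggingAction` class, its config model, and the `event_type` field access are illustrative assumptions, not part of the framework itself.

```python
from typing import Optional

from pydantic import BaseModel

from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext


class LoggingActionConfig(BaseModel):
    # Hypothetical config option: an optional prefix for printed lines.
    prefix: Optional[str] = None


class LoggingAction(Action):
    """Toy Action that prints every event it receives."""

    def __init__(self, config: LoggingActionConfig, ctx: PipelineContext):
        self.config = config
        self.ctx = ctx

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        # Parse the `action.config` block of the pipeline YAML.
        config = LoggingActionConfig.parse_obj(config_dict or {})
        return cls(config, ctx)

    def act(self, event: EventEnvelope) -> None:
        # event_type is assumed to carry the envelope's event type string.
        prefix = self.config.prefix or "event"
        print(f"{prefix}: {event.event_type}")

    def close(self) -> None:
        # Nothing to clean up in this toy example.
        pass
```

A class like this can then be wired up as the `action.type` of a pipeline config, for example by exposing it through the `datahub_actions.action.plugins` entry point declared in `setup.py`.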
## Contributing
Contributing guidelines follow those of the [main DataHub project](../docs/CONTRIBUTING.md). We are accepting contributions for Actions, Transformers, and general framework improvements (tests, error handling, etc).
## Resources
Check out the [original announcement](https://www.youtube.com/watch?v=7iwNxHgqxtg&t=2189s) of the DataHub Actions Framework at the DataHub April 2022 Town Hall.
## License
[Apache 2.0](./LICENSE)


@ -0,0 +1,162 @@
/**
* Copyright 2021 Acryl Data, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
plugins {
id 'scala'
id 'org.gradle.playframework'
}
apply from: "../gradle/versioning/versioning.gradle"
apply from: "../gradle/coverage/python-coverage.gradle"
apply from: '../gradle/docker/docker.gradle'
ext {
python_executable = 'python3'
venv_name = 'venv'
docker_registry = 'acryldata'
docker_repo = 'datahub-actions'
docker_target = project.getProperties().getOrDefault("dockerTarget", "slim")
python_docker_version = project.getProperties().getOrDefault("pythonDockerVersion", "1!0.0.0+docker.${version}")
}
if (!project.hasProperty("extra_pip_requirements")) {
ext.extra_pip_requirements = ""
}
def pip_install_command = "VIRTUAL_ENV=${venv_name} ${venv_name}/bin/uv pip install -e ../metadata-ingestion"
task checkPythonVersion(type: Exec) {
commandLine python_executable, '-c',
'import sys; assert sys.version_info >= (3, 8), f"Python version {sys.version_info[:2]} not allowed"'
}
task environmentSetup(type: Exec, dependsOn: checkPythonVersion) {
def sentinel_file = "${venv_name}/.venv_environment_sentinel"
inputs.file file('setup.py')
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"${python_executable} -m venv ${venv_name} && " +
"${venv_name}/bin/python -m pip install --upgrade uv && " +
"touch ${sentinel_file}"
}
task installPackage(type: Exec, dependsOn: [environmentSetup, ':metadata-ingestion:codegen']) {
def sentinel_file = "${venv_name}/.build_install_package_sentinel"
inputs.file file('setup.py')
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"${pip_install_command} -e . ${extra_pip_requirements} && " +
"touch ${sentinel_file}"
}
task install(dependsOn: [installPackage])
task installDev(type: Exec, dependsOn: [install]) {
def sentinel_file = "${venv_name}/.build_install_dev_sentinel"
inputs.file file('setup.py')
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"${pip_install_command} -e .[dev] ${extra_pip_requirements} && " +
"touch ${sentinel_file}"
}
task lint(type: Exec, dependsOn: installDev) {
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"ruff check src/ tests/ && " +
"ruff format --check src/ tests/ && " +
"mypy --show-traceback --show-error-codes src/ tests/"
}
task lintFix(type: Exec, dependsOn: installDev) {
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"ruff check --fix src/ tests/ && " +
"ruff format src/ tests/ "
}
task installDevTest(type: Exec, dependsOn: [installDev]) {
def sentinel_file = "${venv_name}/.build_install_dev_test_sentinel"
inputs.file file('setup.py')
outputs.dir("${venv_name}")
outputs.file(sentinel_file)
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"${pip_install_command} -e .[dev,integration-tests] ${extra_pip_requirements} && " +
"touch ${sentinel_file}"
}
task testFull(type: Exec, dependsOn: installDevTest) {
inputs.files(project.fileTree(dir: "src/", include: "**/*.py"))
inputs.files(project.fileTree(dir: "tests/"))
outputs.dir("${venv_name}")
commandLine 'bash', '-c',
"source ${venv_name}/bin/activate && set -x && " +
"pytest -vv ${get_coverage_args('full')} --continue-on-collection-errors --junit-xml=junit.full.xml"
}
task buildWheel(type: Exec, dependsOn: [environmentSetup]) {
commandLine 'bash', '-c', "source ${venv_name}/bin/activate && " +
'uv pip install build && RELEASE_VERSION="\${RELEASE_VERSION:-0.0.0.dev1}" RELEASE_SKIP_INSTALL=1 RELEASE_SKIP_UPLOAD=1 ./scripts/release.sh'
}
task cleanPythonCache(type: Exec) {
commandLine 'bash', '-x', '-c',
"find src -type f -name '*.py[co]' -delete -o -type d -name __pycache__ -delete -o -type d -empty -delete"
}
docker {
dependsOn ':metadata-ingestion:codegen'
name "${docker_registry}/${docker_repo}:${versionTag}"
dockerfile file("${rootProject.projectDir}/docker/datahub-actions/Dockerfile")
files fileTree(rootProject.projectDir) {
exclude "datahub-actions/scripts/**"
exclude "datahub-actions/build/**"
exclude "datahub-actions/venv/**"
exclude "datahub-actions/tests/**"
exclude "**/*.xml"
include ".dockerignore"
include "docker/datahub-actions/**"
include "docker/snippets/**"
include "metadata-ingestion/**"
include "datahub-actions/**"
include "python-build/**"
}.exclude {
i -> (!i.file.name.endsWith(".dockerignore") && i.file.isHidden())
}
additionalTag("Debug", "${docker_registry}/${docker_repo}:debug")
defaultVariant = "slim"
variants = [
"slim": [suffix: "-slim", args: [APP_ENV: "slim", RELEASE_VERSION: python_docker_version]],
"full": [suffix: "", args: [APP_ENV: "full", RELEASE_VERSION: python_docker_version]]
]
}
build.dependsOn install
check.dependsOn lint
check.dependsOn testFull
clean {
delete venv_name
delete 'build'
delete 'dist'
}
clean.dependsOn cleanPythonCache


@ -0,0 +1,19 @@
name: "ingestion_executor"
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
topic_routes:
mcl: ${METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME:-MetadataChangeLog_Versioned_v1}
filter:
event_type: "MetadataChangeLogEvent_v1"
event:
entityType: "dataHubExecutionRequest"
changeType: "UPSERT"
action:
type: "executor"
datahub:
server: "${DATAHUB_GMS_PROTOCOL:-http}://${DATAHUB_GMS_HOST:-localhost}:${DATAHUB_GMS_PORT:-8080}"
# token: <your-access-token> # Requires 'Manage Secrets' platform privilege.


@ -0,0 +1,12 @@
# hello_world.yaml
name: "hello_world"
# 1. Event Source: Where to source event from.
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
# 2. Action: What action to take on events.
action:
type: "hello_world"


@ -0,0 +1,13 @@
# hello_world.yaml
name: "hello_world_datahub_cloud"
# 1. DataHub Cloud Connection: Configure how to talk to DataHub Cloud
datahub:
server: "https://<your-organization>.acryl.io"
token: "<your-datahub-cloud-token>"
# 2. Event Source: Where to source event from.
source:
type: "datahub-cloud"
# 3. Action: What action to take on events.
# To learn how to develop a custom Action, see https://docs.datahub.com/docs/actions/guides/developing-an-action
action:
type: "hello_world"


@ -0,0 +1,28 @@
name: "metadata_change_sync"
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
filter:
event_type: "MetadataChangeLogEvent_v1"
event:
changeType: "UPSERT"
action:
type: "metadata_change_sync"
config:
gms_server: ${DEST_DATAHUB_GMS_URL}
# If you have METADATA_SERVICE_AUTH_ENABLED enabled in GMS, you'll need to configure the auth token here
gms_auth_token: ${DEST_DATAHUB_GMS_TOKEN}
# you can provide a list of aspects you would like to exclude
# By default, we are excluding these aspects:
# dataHubAccessTokenInfo, dataHubAccessTokenKey, dataHubSecretKey, dataHubSecretValue, dataHubExecutionRequestInput
# dataHubExecutionRequestKey, dataHubExecutionRequestResult
aspects_to_exclude: []
aspects_to_include: ['schemaMetadata','editableSchemaMetadata','ownership','domain']
# you can provide extra headers in the request in key value format
extra_headers: {}
# you can provide a regex pattern for URNs to include
# By default, we are including all URNs
urn_regex: ".*"


@ -0,0 +1,29 @@
name: "snowflake_tag_propagation"
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
filter:
event_type: "EntityChangeEvent_v1"
action:
type: "snowflake_tag_propagation"
config:
tag_propagation:
tag_prefixes:
- classification
term_propagation:
target_terms:
- Classification
term_groups:
- "Personal Information"
snowflake:
account_id: ${SNOWFLAKE_ACCOUNT_ID}
warehouse: COMPUTE_WH
username: ${SNOWFLAKE_USER_NAME}
password: ${SNOWFLAKE_PASSWORD}
role: ACCOUNTADMIN
datahub:
server: "http://localhost:8080"


@ -0,0 +1,60 @@
[build-system]
build-backend = "setuptools.build_meta"
requires = ["setuptools>65.5.1", "wheel>0.38.1", "pip>=21.0.0"]
[tool.ruff]
line-length = 88
target-version = "py38"
exclude = [
".git",
"venv",
".tox",
"__pycache__",
]
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
skip-magic-trailing-comma = false
line-ending = "auto"
[tool.ruff.lint.isort]
combine-as-imports = true
known-first-party = ["datahub"]
extra-standard-library = ["__future__"]
section-order = ["future", "standard-library", "third-party", "first-party", "local-folder"]
force-sort-within-sections = false
force-wrap-aliases = false
split-on-trailing-comma = false
order-by-type = true
relative-imports-order = "closest-to-furthest"
force-single-line = false
single-line-exclusions = ["typing"]
length-sort = false
from-first = false
required-imports = []
classes = ["typing"]
[tool.ruff.lint]
extend-select = [
"B", # flake8-bugbear
"C90", # mccabe complexity
"E", # pycodestyle errors
"F", # pyflakes
"G010", # logging.warn -> logging.warning
"I", # isort
"TID", # flake8-tidy-imports
"RUF100", # unused-noqa
]
ignore = [
"E501", # Line length violations (handled by formatter)
]
[tool.ruff.lint.mccabe]
max-complexity = 15
[tool.ruff.lint.flake8-tidy-imports]
ban-relative-imports = "all"
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["F401"]


@ -0,0 +1,31 @@
#!/bin/bash
# Auto-generated by python-build/generate_release_scripts.py. Do not edit manually.
set -euxo pipefail
ROOT=..
MODULE=datahub_actions
if [[ ! ${RELEASE_SKIP_TEST:-} ]] && [[ ! ${RELEASE_SKIP_INSTALL:-} ]]; then
${ROOT}/gradlew build # also runs tests
elif [[ ! ${RELEASE_SKIP_INSTALL:-} ]]; then
${ROOT}/gradlew install
fi
# Check packaging constraint.
python -c 'import setuptools; where="./src"; assert setuptools.find_packages(where) == setuptools.find_namespace_packages(where), "you seem to be missing or have extra __init__.py files"'
# Update the release version.
if [[ ! ${RELEASE_VERSION:-} ]]; then
echo "RELEASE_VERSION is not set"
exit 1
fi
sed -i.bak "s/__version__ = .*$/__version__ = \"$(echo $RELEASE_VERSION|sed s/-/+/)\"/" src/${MODULE}/_version.py
# Build and upload the release.
rm -rf build dist || true
python -m build
if [[ ! ${RELEASE_SKIP_UPLOAD:-} ]]; then
python -m twine upload 'dist/*'
fi
mv src/${MODULE}/_version.py.bak src/${MODULE}/_version.py

datahub-actions/setup.cfg (new file, 49 lines)

@ -0,0 +1,49 @@
[mypy]
plugins =
pydantic.mypy
exclude = ^(venv|build|dist)/
ignore_missing_imports = yes
strict_optional = yes
check_untyped_defs = yes
disallow_incomplete_defs = yes
disallow_untyped_decorators = yes
warn_unused_configs = yes
# eventually we'd like to enable these
disallow_untyped_defs = no
# try to be a bit more strict in certain areas of the codebase
[mypy-datahub.*]
ignore_missing_imports = no
[mypy-tests.*]
ignore_missing_imports = no
[tool:pytest]
asyncio_mode = auto
addopts = --cov=src --cov-report='' --cov-config setup.cfg --strict-markers -s -v
markers =
integration: marks tests to only run in integration (deselect with '-m "not integration"')
testpaths =
tests/unit
tests/integration
# [coverage:run]
# # Because of some quirks in the way setup.cfg, coverage.py, pytest-cov,
# # and tox interact, we should not uncomment the following line.
# # See https://pytest-cov.readthedocs.io/en/latest/config.html and
# # https://coverage.readthedocs.io/en/coverage-5.0/config.html.
# # We also have some additional pytest/cov config options in tox.ini.
# # source = src
# [coverage:paths]
# # This is necessary for tox-based coverage to be counted properly.
# source =
# src
# */site-packages
[coverage:report]
show_missing = true
exclude_lines =
pragma: no cover
@abstract
if TYPE_CHECKING:

datahub-actions/setup.py (new file, 251 lines)

@ -0,0 +1,251 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from typing import Dict, Set
import setuptools
package_metadata: dict = {}
with open("./src/datahub_actions/_version.py") as fp:
exec(fp.read(), package_metadata)
_version: str = package_metadata["__version__"]
_self_pin = (
f"=={_version}"
if not (_version.endswith(("dev0", "dev1")) or "docker" in _version)
else ""
)
def get_long_description():
root = os.path.dirname(__file__)
with open(os.path.join(root, "README.md")) as f:
description = f.read()
return description
lint_requirements = {
# This is pinned only to avoid spurious errors in CI.
# We should make an effort to keep it up to date.
"ruff==0.11.7",
"mypy==1.14.1",
}
base_requirements = {
f"acryl-datahub[datahub-kafka]{_self_pin}",
# Compatibility.
"typing_extensions>=3.7.4; python_version < '3.8'",
"mypy_extensions>=0.4.3",
# Actual dependencies.
"typing-inspect",
"pydantic>=1.10.21",
"ratelimit",
# Lower bounds on httpcore and h11 due to CVE-2025-43859.
"httpcore>=1.0.9",
"azure-identity==1.21.0",
"aws-msk-iam-sasl-signer-python==1.0.2",
"h11>=0.16",
}
framework_common = {
"click>=6.0.0",
"click-default-group",
"prometheus-client",
"PyYAML",
"toml>=0.10.0",
"entrypoints",
"python-dateutil>=2.8.0",
"stackprinter",
"progressbar2",
"tenacity",
}
# Note: for all of these, framework_common will be added.
plugins: Dict[str, Set[str]] = {
# Source Plugins
"kafka": {
"confluent-kafka[schemaregistry]",
},
# Action Plugins
"executor": {
"acryl-executor==0.2.2",
},
"slack": {
"slack-bolt>=1.15.5",
},
"teams": {
"pymsteams >=0.2.2",
},
"tag_propagation": set(),
"term_propagation": set(),
"snowflake_tag_propagation": {
f"acryl-datahub[snowflake-slim]{_self_pin}",
},
"doc_propagation": set(),
# Transformer Plugins (None yet)
}
mypy_stubs = {
"types-pytz",
"types-dataclasses",
"sqlalchemy-stubs",
"types-setuptools",
"types-six",
"types-python-dateutil",
"types-requests",
"types-toml",
"types-PyMySQL",
"types-PyYAML",
"types-freezegun",
"types-cachetools",
# versions 0.1.13 and 0.1.14 seem to have issues
"types-click==0.1.12",
}
base_dev_requirements = {
*lint_requirements,
*base_requirements,
*framework_common,
*mypy_stubs,
"coverage>=5.1",
"pytest>=6.2.2",
"pytest-cov>=2.8.1",
"pytest-dependency>=0.5.1",
"pytest-docker>=0.10.3",
"tox",
"deepdiff",
"requests-mock",
"freezegun",
"jsonpickle",
"build",
"twine",
*list(
dependency
for plugin in [
"kafka",
"executor",
"slack",
"teams",
"tag_propagation",
"term_propagation",
"snowflake_tag_propagation",
"doc_propagation",
]
for dependency in plugins[plugin]
),
}
dev_requirements = {
*base_dev_requirements,
}
full_test_dev_requirements = {
*list(
dependency
for plugin in [
"kafka",
"executor",
"slack",
"teams",
"tag_propagation",
"term_propagation",
"snowflake_tag_propagation",
"doc_propagation",
]
for dependency in plugins[plugin]
),
# In our tests, we want to always test against pydantic v2.
# However, we maintain compatibility with pydantic v1 for now.
"pydantic>2",
}
entry_points = {
"console_scripts": ["datahub-actions = datahub_actions.entrypoints:main"],
"datahub_actions.action.plugins": [
"executor = datahub_actions.plugin.action.execution.executor_action:ExecutorAction",
"slack = datahub_actions.plugin.action.slack.slack:SlackNotificationAction",
"teams = datahub_actions.plugin.action.teams.teams:TeamsNotificationAction",
"metadata_change_sync = datahub_actions.plugin.action.metadata_change_sync.metadata_change_sync:MetadataChangeSyncAction",
"tag_propagation = datahub_actions.plugin.action.tag.tag_propagation_action:TagPropagationAction",
"term_propagation = datahub_actions.plugin.action.term.term_propagation_action:TermPropagationAction",
"snowflake_tag_propagation = datahub_actions.plugin.action.snowflake.tag_propagator:SnowflakeTagPropagatorAction",
"doc_propagation = datahub_actions.plugin.action.propagation.docs.propagation_action:DocPropagationAction",
],
"datahub_actions.transformer.plugins": [],
"datahub_actions.source.plugins": [],
}
setuptools.setup(
# Package metadata.
name=package_metadata["__package_name__"],
version=package_metadata["__version__"],
url="https://docs.datahub.com/",
project_urls={
"Documentation": "https://docs.datahub.com/docs/actions",
"Source": "https://github.com/acryldata/datahub-actions",
"Changelog": "https://github.com/acryldata/datahub-actions/releases",
},
license="Apache License 2.0",
description="An action framework to work with DataHub real time changes.",
long_description=get_long_description(),
long_description_content_type="text/markdown",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Intended Audience :: Developers",
"Intended Audience :: Information Technology",
"Intended Audience :: System Administrators",
"License :: OSI Approved",
"License :: OSI Approved :: Apache Software License",
"Operating System :: Unix",
"Operating System :: POSIX :: Linux",
"Environment :: Console",
"Environment :: MacOS X",
"Topic :: Software Development",
],
# Package info.
zip_safe=False,
python_requires=">=3.8",
package_dir={"": "src"},
packages=setuptools.find_namespace_packages(where="./src"),
package_data={
"datahub_actions": ["py.typed"],
},
entry_points=entry_points,
# Dependencies.
install_requires=list(base_requirements | framework_common),
extras_require={
"base": list(framework_common),
**{
plugin: list(framework_common | dependencies)
for (plugin, dependencies) in plugins.items()
},
"all": list(
framework_common.union(
*[requirements for plugin, requirements in plugins.items()]
)
),
"dev": list(dev_requirements),
"integration-tests": list(full_test_dev_requirements),
},
)


@ -0,0 +1,15 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datahub_actions._version import __package_name__, __version__


@ -0,0 +1,13 @@
# Published at https://pypi.org/project/acryl-datahub-actions/.
__package_name__ = "acryl-datahub-actions"
__version__ = "1!0.0.0.dev0"
def is_dev_mode() -> bool:
return __version__.endswith("dev0")
def nice_version_name() -> str:
if is_dev_mode():
return "unavailable (installed in develop mode)"
return __version__


@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@ -0,0 +1,41 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABCMeta, abstractmethod
from datahub.ingestion.api.closeable import Closeable
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
class Action(Closeable, metaclass=ABCMeta):
"""
The base class for all DataHub Actions.
A DataHub action is a component capable of performing a specific action (notification, auditing, synchronization, & more)
when important events occur on DataHub.
Each Action may provide its own semantics, configurations, compatibility and guarantees.
"""
@classmethod
@abstractmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
"""Factory method to create an instance of an Action"""
pass
@abstractmethod
def act(self, event: EventEnvelope) -> None:
"""Take Action on DataHub events, provided an instance of a DataHub event."""
pass


@ -0,0 +1,21 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datahub.ingestion.api.registry import PluginRegistry
from datahub_actions.action.action import Action
from datahub_actions.plugin.action.hello_world.hello_world import HelloWorldAction
action_registry = PluginRegistry[Action]()
action_registry.register_from_entrypoint("datahub_actions.action.plugins")
action_registry.register("hello_world", HelloWorldAction)
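
For illustration, a locally developed Action could be registered against this registry in the same way; the `MyCustomAction` class and its module path below are hypothetical, and the import path for `action_registry` is assumed to be the module defined above.

```python
# Assumed import path for the registry module shown above.
from datahub_actions.action.action_registry import action_registry

# Hypothetical package providing a custom Action implementation.
from my_package.my_custom_action import MyCustomAction

# After registration, pipeline configs can reference it as `type: "my_custom_action"`.
action_registry.register("my_custom_action", MyCustomAction)
```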


@ -0,0 +1,40 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
# Class that stores running statistics for a single Action.
# TODO: Invocation time tracking.
class ActionStats:
# The number of exception raised by the Action.
exception_count: int = 0
# The number of events that were actually submitted to the Action
success_count: int = 0
def increment_exception_count(self) -> None:
self.exception_count = self.exception_count + 1
def get_exception_count(self) -> int:
return self.exception_count
def increment_success_count(self) -> None:
self.success_count = self.success_count + 1
def get_success_count(self) -> int:
return self.success_count
def as_string(self) -> str:
return json.dumps(self.__dict__, indent=4, sort_keys=True)
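
A short sketch of how these counters are used; the surrounding pipeline wiring is elided, and this simply exercises the `ActionStats` class defined above.

```python
stats = ActionStats()

# Record one successful invocation and one failure.
stats.increment_success_count()
stats.increment_exception_count()

assert stats.get_success_count() == 1
assert stats.get_exception_count() == 1

# JSON summary, e.g. as part of the processing summary printed on shutdown.
print(stats.as_string())
```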


@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@ -0,0 +1,413 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import urllib.parse
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from datahub.configuration.common import OperationalError
from datahub.ingestion.graph.client import DataHubGraph
from datahub.metadata.schema_classes import (
GlossaryTermAssociationClass,
TagAssociationClass,
)
from datahub.specific.dataset import DatasetPatchBuilder
logger = logging.getLogger(__name__)
@dataclass
class AcrylDataHubGraph:
def __init__(self, baseGraph: DataHubGraph):
self.graph = baseGraph
def get_by_query(
self,
query: str,
entity: str,
start: int = 0,
count: int = 100,
filters: Optional[Dict] = None,
) -> List[Dict]:
url_frag = "/entities?action=search"
url = f"{self.graph._gms_server}{url_frag}"
payload = {"input": query, "start": start, "count": count, "entity": entity}
if filters is not None:
payload["filter"] = filters
headers = {
"X-RestLi-Protocol-Version": "2.0.0",
"Content-Type": "application/json",
}
try:
response = self.graph._session.post(
url, data=json.dumps(payload), headers=headers
)
if response.status_code != 200:
return []
json_resp = response.json()
return json_resp.get("value", {}).get("entities")
except Exception as e:
print(e)
return []
def get_by_graphql_query(self, query: Dict) -> Dict:
url_frag = "/api/graphql"
url = f"{self.graph._gms_server}{url_frag}"
headers = {
"X-DataHub-Actor": "urn:li:corpuser:admin",
"Content-Type": "application/json",
}
try:
response = self.graph._session.post(
url, data=json.dumps(query), headers=headers
)
if response.status_code != 200:
return {}
json_resp = response.json()
return json_resp.get("data", {})
except Exception as e:
print(e)
return {}
def query_constraints_for_dataset(self, dataset_id: str) -> List:
resp = self.get_by_graphql_query(
{
"query": """
query dataset($input: String!) {
dataset(urn: $input) {
constraints {
type
displayName
description
params {
hasGlossaryTermInNodeParams {
nodeName
}
}
}
}
}
""",
"variables": {"input": dataset_id},
}
)
constraints: List = resp.get("dataset", {}).get("constraints", [])
return constraints
def query_execution_result_details(self, execution_id: str) -> Any:
resp = self.get_by_graphql_query(
{
"query": """
query executionRequest($urn: String!) {
executionRequest(urn: $urn) {
input {
task
arguments {
key
value
}
}
}
}
""",
"variables": {"urn": f"urn:li:dataHubExecutionRequest:{execution_id}"},
}
)
return resp.get("executionRequest", {}).get("input", {})
def query_ingestion_sources(self) -> List:
sources = []
start, count = 0, 10
while True:
resp = self.get_by_graphql_query(
{
"query": """
query listIngestionSources($input: ListIngestionSourcesInput!, $execution_start: Int!, $execution_count: Int!) {
listIngestionSources(input: $input) {
start
count
total
ingestionSources {
urn
type
name
executions(start: $execution_start, count: $execution_count) {
start
count
total
executionRequests {
urn
}
}
}
}
}
""",
"variables": {
"input": {"start": start, "count": count},
"execution_start": 0,
"execution_count": 10,
},
}
)
listIngestionSources = resp.get("listIngestionSources", {})
sources.extend(listIngestionSources.get("ingestionSources", []))
cur_total = listIngestionSources.get("total", 0)
if cur_total > count:
start += count
else:
break
return sources
def get_downstreams(
self, entity_urn: str, max_downstreams: int = 3000
) -> List[str]:
start = 0
count_per_page = 1000
entities = []
done = False
total_downstreams = 0
while not done:
# if start > 0:
# breakpoint()
url_frag = f"/relationships?direction=INCOMING&types=List(DownstreamOf)&urn={urllib.parse.quote(entity_urn)}&count={count_per_page}&start={start}"
url = f"{self.graph._gms_server}{url_frag}"
response = self.graph._get_generic(url)
if response["count"] > 0:
relnships = response["relationships"]
entities.extend([x["entity"] for x in relnships])
start += count_per_page
total_downstreams += response["count"]
if start >= response["total"] or total_downstreams >= max_downstreams:
done = True
else:
done = True
return entities
def get_upstreams(self, entity_urn: str, max_upstreams: int = 3000) -> List[str]:
start = 0
count_per_page = 100
entities = []
done = False
total_upstreams = 0
while not done:
url_frag = f"/relationships?direction=OUTGOING&types=List(DownstreamOf)&urn={urllib.parse.quote(entity_urn)}&count={count_per_page}&start={start}"
url = f"{self.graph._gms_server}{url_frag}"
response = self.graph._get_generic(url)
if response["count"] > 0:
relnships = response["relationships"]
entities.extend([x["entity"] for x in relnships])
start += count_per_page
total_upstreams += response["count"]
if start >= response["total"] or total_upstreams >= max_upstreams:
done = True
else:
done = True
return entities
def get_relationships(
self, entity_urn: str, direction: str, relationship_types: List[str]
) -> List[str]:
url_frag = (
f"/relationships?"
f"direction={direction}"
f"&types=List({','.join(relationship_types)})"
f"&urn={urllib.parse.quote(entity_urn)}"
)
url = f"{self.graph._gms_server}{url_frag}"
response = self.graph._get_generic(url)
if response["count"] > 0:
relnships = response["relationships"]
entities = [x["entity"] for x in relnships]
return entities
return []
def check_relationship(self, entity_urn, target_urn, relationship_type):
url_frag = f"/relationships?direction=INCOMING&types=List({relationship_type})&urn={urllib.parse.quote(entity_urn)}"
url = f"{self.graph._gms_server}{url_frag}"
response = self.graph._get_generic(url)
if response["count"] > 0:
relnships = response["relationships"]
entities = [x["entity"] for x in relnships]
return target_urn in entities
return False
def add_tags_to_dataset(
self,
entity_urn: str,
dataset_tags: List[str],
field_tags: Optional[Dict] = None,
context: Optional[Dict] = None,
) -> None:
if field_tags is None:
field_tags = {}
dataset = DatasetPatchBuilder(entity_urn)
for t in dataset_tags:
dataset.add_tag(
tag=TagAssociationClass(
tag=t, context=json.dumps(context) if context else None
)
)
for field_path, tags in field_tags.items():
field_builder = dataset.for_field(field_path=field_path)
for tag in tags:
field_builder.add_tag(
tag=TagAssociationClass(
tag=tag, context=json.dumps(context) if context else None
)
)
for mcp in dataset.build():
self.graph.emit(mcp)
def add_terms_to_dataset(
self,
entity_urn: str,
dataset_terms: List[str],
field_terms: Optional[Dict] = None,
context: Optional[Dict] = None,
) -> None:
if field_terms is None:
field_terms = {}
dataset = DatasetPatchBuilder(urn=entity_urn)
for term in dataset_terms:
dataset.add_term(
GlossaryTermAssociationClass(
term, context=json.dumps(context) if context else None
)
)
for field_path, terms in field_terms.items():
field_builder = dataset.for_field(field_path=field_path)
for term in terms:
field_builder.add_term(
GlossaryTermAssociationClass(
term, context=json.dumps(context) if context else None
)
)
for mcp in dataset.build():
self.graph.emit(mcp)
def get_corpuser_info(self, urn: str) -> Any:
return self.get_untyped_aspect(
urn, "corpUserInfo", "com.linkedin.identity.CorpUserInfo"
)
def get_untyped_aspect(
self,
entity_urn: str,
aspect: str,
aspect_type_name: str,
) -> Any:
url = f"{self.graph._gms_server}/aspects/{urllib.parse.quote(entity_urn)}?aspect={aspect}&version=0"
response = self.graph._session.get(url)
if response.status_code == 404:
# not found
return None
response.raise_for_status()
response_json = response.json()
aspect_json = response_json.get("aspect", {}).get(aspect_type_name)
if aspect_json:
return aspect_json
else:
raise OperationalError(
f"Failed to find {aspect_type_name} in response {response_json}"
)
def _get_entity_by_name(
self,
name: str,
entity_type: str,
indexed_fields: Optional[List[str]] = None,
) -> Optional[str]:
"""Retrieve an entity urn based on its name and type. Returns None if there is no match found"""
if indexed_fields is None:
indexed_fields = ["name", "displayName"]
filters = []
if len(indexed_fields) > 1:
for indexed_field in indexed_fields:
filter_criteria = [
{
"field": indexed_field,
"value": name,
"condition": "EQUAL",
}
]
filters.append({"and": filter_criteria})
search_body = {
"input": "*",
"entity": entity_type,
"start": 0,
"count": 10,
"orFilters": [filters],
}
else:
search_body = {
"input": "*",
"entity": entity_type,
"start": 0,
"count": 10,
"filter": {
"or": [
{
"and": [
{
"field": indexed_fields[0],
"value": name,
"condition": "EQUAL",
}
]
}
]
},
}
results: Dict = self.graph._post_generic(
self.graph._search_endpoint, search_body
)
num_entities = results.get("value", {}).get("numEntities", 0)
if num_entities > 1:
logger.warning(
f"Got {num_entities} results for {entity_type} {name}. Will return the first match."
)
entities_yielded: int = 0
entities = []
for x in results["value"]["entities"]:
entities_yielded += 1
logger.debug(f"yielding {x['entity']}")
entities.append(x["entity"])
return entities[0] if entities_yielded else None
def get_glossary_term_urn_by_name(self, term_name: str) -> Optional[str]:
"""Retrieve a glossary term urn based on its name. Returns None if there is no match found"""
return self._get_entity_by_name(
term_name, "glossaryTerm", indexed_fields=["name"]
)
def get_glossary_node_urn_by_name(self, node_name: str) -> Optional[str]:
"""Retrieve a glossary node urn based on its name. Returns None if there is no match found"""
return self._get_entity_by_name(node_name, "glossaryNode")
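# Usage sketch (editor's addition, not part of this file): wiring the graph
# helper above to fetch lineage and apply tags. The class name
# `AcrylDataHubGraph` matches the import used elsewhere in this change; the
# GMS URL and URNs below are illustrative assumptions.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub_actions.api.action_graph import AcrylDataHubGraph

graph = AcrylDataHubGraph(DataHubGraph(DatahubClientConfig(server="http://localhost:8080")))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleTable,PROD)"

upstreams = graph.get_upstreams(dataset_urn)       # follows OUTGOING DownstreamOf edges
downstreams = graph.get_downstreams(dataset_urn)   # follows INCOMING DownstreamOf edges
graph.add_tags_to_dataset(
    dataset_urn,
    dataset_tags=["urn:li:tag:pii"],
    field_tags={"user_id": ["urn:li:tag:pii"]},
)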

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,191 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import pathlib
import signal
import sys
import time
from typing import Any, List
import click
from click_default_group import DefaultGroup
from expandvars import UnboundVariable
import datahub_actions._version as actions_version
from datahub.configuration.config_loader import load_config_file
from datahub_actions.pipeline.pipeline import Pipeline
from datahub_actions.pipeline.pipeline_manager import PipelineManager
logger = logging.getLogger(__name__)
# Instantiate a singleton instance of the Pipeline Manager.
pipeline_manager = PipelineManager()
def pipeline_config_to_pipeline(pipeline_config: dict) -> Pipeline:
logger.debug(
f"Attempting to create Actions Pipeline using config {pipeline_config.get('name')}"
)
try:
return Pipeline.create(pipeline_config)
except Exception as e:
raise Exception(
f"Failed to instantiate Actions Pipeline using config {pipeline_config.get('name')}: {e}"
) from e
@click.group(cls=DefaultGroup, default="run")
def actions() -> None:
"""Execute one or more Actions Pipelines"""
pass
def load_raw_config_file(config_file: pathlib.Path) -> dict:
"""
Load a config file as raw YAML/JSON without variable expansion.
Args:
config_file: Path to the configuration file
Returns:
dict: Raw configuration dictionary
Raises:
Exception: If the file cannot be loaded or is invalid YAML/JSON
"""
try:
with open(config_file, "r") as f:
import yaml
return yaml.safe_load(f)
except Exception as e:
raise Exception(
f"Failed to load raw configuration file {config_file}: {e}"
) from e
def is_pipeline_enabled(config: dict) -> bool:
"""
Check if a pipeline configuration is enabled.
Args:
config: Raw configuration dictionary
Returns:
bool: True if pipeline is enabled, False otherwise
"""
enabled = config.get("enabled", True)
return not (enabled == "false" or enabled is False)
@actions.command(
name="run",
context_settings=dict(
ignore_unknown_options=True,
allow_extra_args=True,
),
)
@click.option("-c", "--config", required=True, type=str, multiple=True)
@click.option("--debug/--no-debug", default=False)
@click.pass_context
def run(ctx: Any, config: List[str], debug: bool) -> None:
"""Execute one or more Actions Pipelines"""
logger.info("DataHub Actions version: %s", actions_version.nice_version_name())
if debug:
logging.getLogger().setLevel(logging.DEBUG)
else:
logging.getLogger().setLevel(logging.INFO)
pipelines: List[Pipeline] = []
logger.debug("Creating Actions Pipelines...")
# Phase 1: Initial validation of configs
valid_configs = []
for pipeline_config in config:
pipeline_config_file = pathlib.Path(pipeline_config)
try:
# First just load the raw config to check if it's enabled
raw_config = load_raw_config_file(pipeline_config_file)
if not is_pipeline_enabled(raw_config):
logger.warning(
f"Skipping pipeline {raw_config.get('name') or pipeline_config} as it is not enabled"
)
continue
valid_configs.append(pipeline_config_file)
except Exception as e:
if len(config) == 1:
raise Exception(
f"Failed to load raw configuration file {pipeline_config_file}"
) from e
logger.warning(
f"Failed to load pipeline configuration! Skipping action config file {pipeline_config_file}...: {e}"
)
# Phase 2: Full config loading and pipeline creation
for pipeline_config_file in valid_configs:
try:
# Now load the full config with variable expansion
pipeline_config_dict = load_config_file(pipeline_config_file)
pipelines.append(pipeline_config_to_pipeline(pipeline_config_dict))
except UnboundVariable as e:
if len(valid_configs) == 1:
raise Exception(
"Failed to load action configuration. Unbound variable(s) provided in config YAML."
) from e
logger.warning(
f"Failed to resolve variables in config file {pipeline_config_file}...: {e}"
)
continue
# Exit early if no valid pipelines were created
if not pipelines:
logger.error(
f"No valid pipelines were started from {len(config)} config(s). "
"Check that at least one pipeline is enabled and all required environment variables are set."
)
sys.exit(1)
logger.debug("Starting Actions Pipelines")
# Start each pipeline
for p in pipelines:
pipeline_manager.start_pipeline(p.name, p)
logger.info(f"Action Pipeline with name '{p.name}' is now running.")
# Now, run forever only if we have valid pipelines
while True:
time.sleep(5)
@actions.command()
def version() -> None:
"""Print version number and exit."""
click.echo(f"DataHub Actions version: {actions_version.nice_version_name()}")
click.echo(f"Python version: {sys.version}")
# Handle shutdown signal. (ctrl-c)
def handle_shutdown(signum: int, frame: Any) -> None:
logger.info("Stopping all running Action Pipelines...")
pipeline_manager.stop_all()
sys.exit(1)
signal.signal(signal.SIGINT, handle_shutdown)
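# Quick sketch of the enable/disable semantics implemented by is_pipeline_enabled
# above (module path taken from the import used elsewhere in this change; the
# config dicts are illustrative).
from datahub_actions.cli.actions import is_pipeline_enabled

assert is_pipeline_enabled({"name": "p1"})                          # missing flag -> enabled
assert is_pipeline_enabled({"name": "p2", "enabled": True})
assert not is_pipeline_enabled({"name": "p3", "enabled": False})
assert not is_pipeline_enabled({"name": "p4", "enabled": "false"})  # string "false" also disables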

View File

@ -0,0 +1,140 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import platform
import sys
import click
import stackprinter
from prometheus_client import start_http_server
import datahub_actions._version as actions_version
from datahub.cli.env_utils import get_boolean_env_variable
from datahub_actions.cli.actions import actions
logger = logging.getLogger(__name__)
# Configure logger.
BASE_LOGGING_FORMAT = (
"[%(asctime)s] %(levelname)-8s {%(name)s:%(lineno)d} - %(message)s"
)
logging.basicConfig(format=BASE_LOGGING_FORMAT)
MAX_CONTENT_WIDTH = 120
@click.group(
context_settings=dict(
# Avoid truncation of help text.
# See https://github.com/pallets/click/issues/486.
max_content_width=MAX_CONTENT_WIDTH,
)
)
@click.option(
"--enable-monitoring",
type=bool,
is_flag=True,
default=False,
help="Enable prometheus monitoring endpoint. You can set the portnumber with --monitoring-port.",
)
@click.option(
"--monitoring-port",
type=int,
default=8000,
help="""Prometheus monitoring endpoint will be available on :<PORT>/metrics.
To enable monitoring, use the --enable-monitoring flag.
""",
)
@click.option("--debug/--no-debug", default=False)
@click.version_option(
version=actions_version.nice_version_name(),
prog_name=actions_version.__package_name__,
)
@click.option(
"-dl",
"--detect-memory-leaks",
type=bool,
is_flag=True,
default=False,
help="Run memory leak detection.",
)
@click.pass_context
def datahub_actions(
ctx: click.Context,
enable_monitoring: bool,
monitoring_port: int,
debug: bool,
detect_memory_leaks: bool,
) -> None:
# Insulate 'datahub_actions' and all child loggers from inadvertent changes to the
# root logger by the external site packages that we import.
# (Eg: https://github.com/reata/sqllineage/commit/2df027c77ea0a8ea4909e471dcd1ecbf4b8aeb2f#diff-30685ea717322cd1e79c33ed8d37903eea388e1750aa00833c33c0c5b89448b3R11
# changes the root logger's handler level to WARNING, causing any message below
# WARNING level to be dropped after this module is imported, irrespective
# of the logger's logging level! The lookml source was affected by this).
# 1. Create 'datahub' parent logger.
datahub_logger = logging.getLogger("datahub_actions")
# 2. Setup the stream handler with formatter.
stream_handler = logging.StreamHandler()
formatter = logging.Formatter(BASE_LOGGING_FORMAT)
stream_handler.setFormatter(formatter)
datahub_logger.addHandler(stream_handler)
# 3. Turn off propagation to the root handler.
datahub_logger.propagate = False
# 4. Adjust log-levels.
if debug or get_boolean_env_variable("DATAHUB_DEBUG", False):
logging.getLogger().setLevel(logging.INFO)
datahub_logger.setLevel(logging.DEBUG)
else:
logging.getLogger().setLevel(logging.WARNING)
datahub_logger.setLevel(logging.INFO)
if enable_monitoring:
start_http_server(monitoring_port)
# Setup the context for the memory_leak_detector decorator.
ctx.ensure_object(dict)
ctx.obj["detect_memory_leaks"] = detect_memory_leaks
def main(**kwargs):
# This wrapper prevents click from suppressing errors.
try:
sys.exit(datahub_actions(standalone_mode=False, **kwargs))
except click.exceptions.Abort:
# Click already automatically prints an abort message, so we can just exit.
sys.exit(1)
except click.ClickException as error:
error.show()
sys.exit(1)
except Exception as exc:
logger.error(
stackprinter.format(
exc,
line_wrap=MAX_CONTENT_WIDTH,
truncate_vals=10 * MAX_CONTENT_WIDTH,
suppressed_paths=[r"lib/python.*/site-packages/click/"],
show_vals=False,
)
)
logger.info(
f"DataHub Actions version: {actions_version.__version__} at {actions_version.__file__}"
)
logger.info(
f"Python version: {sys.version} at {sys.executable} on {platform.platform()}"
)
sys.exit(1)
datahub_actions.add_command(actions)

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,34 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABCMeta, abstractmethod
class Event(metaclass=ABCMeta):
"""
A DataHub Event.
"""
@classmethod
@abstractmethod
def from_json(cls, json_str: str) -> "Event":
"""
Convert from json format into the event object.
"""
@abstractmethod
def as_json(self) -> str:
"""
Convert the event into its JSON representation.
"""

View File

@ -0,0 +1,60 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
from dataclasses import dataclass
from typing import Any, Dict
from datahub_actions.event.event import Event
from datahub_actions.event.event_registry import event_registry
logger = logging.getLogger(__name__)
# An object representation of the actual change event.
@dataclass
class EventEnvelope:
# The type of the event. This corresponds to the shape of the payload.
event_type: str
# The event itself
event: Event
# Arbitrary metadata about the event
meta: Dict[str, Any]
# Convert an enveloped event to JSON representation
def as_json(self) -> str:
# Be careful about converting the meta bag, since anything can be put inside it at runtime.
meta_json = None
try:
if self.meta is not None:
meta_json = json.dumps(self.meta)
except Exception:
logger.warning(
f"Failed to serialize meta field of EventEnvelope to json {self.meta}. Ignoring it during serialization."
)
result = f'{{ "event_type": "{self.event_type}", "event": {self.event.as_json()}, "meta": {meta_json if meta_json is not None else "null"} }}'
return result
# Convert a json event envelope back into the object.
@classmethod
def from_json(cls, json_str: str) -> "EventEnvelope":
json_obj = json.loads(json_str)
event_type = json_obj["event_type"]
event_class = event_registry.get(event_type)
event = event_class.from_json(json.dumps(json_obj["event"]))
meta = json_obj["meta"] if "meta" in json_obj else {}
return EventEnvelope(event_type=event_type, event=event, meta=meta)
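# Sketch: wrapping an event in an envelope and serializing it. PingEvent is
# the hypothetical event from the sketch above; note that EventEnvelope.from_json
# additionally requires the event type to be registered in event_registry.
from datahub_actions.event.event_envelope import EventEnvelope

envelope = EventEnvelope(event_type="PingEvent_v1", event=PingEvent("hello"), meta={"origin": "docs"})
print(envelope.as_json())  # {"event_type": "PingEvent_v1", "event": {...}, "meta": {...}}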

View File

@ -0,0 +1,93 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from datahub.ingestion.api.registry import PluginRegistry
from datahub.metadata.schema_classes import (
EntityChangeEventClass,
MetadataChangeLogClass,
)
from datahub_actions.event.event import Event
# TODO: Figure out where to put these.
# TODO: Perform runtime validation based on the event types found in the registry.
# A DataHub Event representing a Metadata Change Log Event.
# See MetadataChangeLogEvent class object for full field set.
class MetadataChangeLogEvent(MetadataChangeLogClass, Event):
@classmethod
def from_class(cls, clazz: MetadataChangeLogClass) -> "MetadataChangeLogEvent":
instance = cls._construct({})
instance._restore_defaults()
# Shallow map inner dictionaries.
instance._inner_dict = clazz._inner_dict
return instance
@classmethod
def from_json(cls, json_str: str) -> "Event":
json_obj = json.loads(json_str)
return cls.from_class(cls.from_obj(json_obj))
def as_json(self) -> str:
return json.dumps(self.to_obj())
# A DataHub Event representing an Entity Change Event.
# See EntityChangeEventClass class object for full field set.
class EntityChangeEvent(EntityChangeEventClass, Event):
@classmethod
def from_class(cls, clazz: EntityChangeEventClass) -> "EntityChangeEvent":
instance = cls._construct({})
instance._restore_defaults()
# Shallow map inner dictionaries.
instance._inner_dict = clazz._inner_dict
return instance
@classmethod
def from_json(cls, json_str: str) -> "EntityChangeEvent":
json_obj = json.loads(json_str)
# Remove parameters from json_obj and add it back to _inner_dict later; this hack exists because EntityChangeEventClass does not support "AnyRecord"
parameters = json_obj.pop("parameters", None)
event = cls.from_class(cls.from_obj(json_obj))
# Hack: Since parameters is an "AnyRecord" (arbitrary json) we have to insert into the underlying map directly
# to avoid validation at object creation time. This means the reader is responsible for understanding the serialized JSON format, which
# is simply PDL serialized to JSON.
if parameters:
event._inner_dict["__parameters_json"] = parameters
return event
def as_json(self) -> str:
json_obj = self.to_obj()
# Insert parameters; this hack exists because EntityChangeEventClass does not support "AnyRecord"
if "__parameters_json" in self._inner_dict:
json_obj["parameters"] = self._inner_dict["__parameters_json"]
return json.dumps(json_obj)
# Standard Event Types for easy reference.
ENTITY_CHANGE_EVENT_V1_TYPE = "EntityChangeEvent_v1"
METADATA_CHANGE_LOG_EVENT_V1_TYPE = "MetadataChangeLogEvent_v1"
# Lightweight Event Registry
event_registry = PluginRegistry[Event]()
# Register standard event library. Each type can be considered a separate "stream" / "topic"
event_registry.register(METADATA_CHANGE_LOG_EVENT_V1_TYPE, MetadataChangeLogEvent)
event_registry.register(ENTITY_CHANGE_EVENT_V1_TYPE, EntityChangeEvent)
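# Sketch: the registry lookup used by EventEnvelope.from_json, plus registering
# a custom event type (PingEvent is the hypothetical class from the earlier sketch).
from datahub_actions.event.event_registry import (
    ENTITY_CHANGE_EVENT_V1_TYPE,
    event_registry,
)

event_class = event_registry.get(ENTITY_CHANGE_EVENT_V1_TYPE)  # -> EntityChangeEvent
event_registry.register("PingEvent_v1", PingEvent)             # each type is its own "stream"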

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,323 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from typing import List, Optional
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_config import FailureMode, PipelineConfig
from datahub_actions.pipeline.pipeline_stats import PipelineStats
from datahub_actions.pipeline.pipeline_util import (
create_action,
create_action_context,
create_event_source,
create_filter_transformer,
create_transformer,
normalize_directory_name,
)
from datahub_actions.source.event_source import EventSource
from datahub_actions.transform.transformer import Transformer
logger = logging.getLogger(__name__)
# Defaults for the location where failed events will be written.
DEFAULT_RETRY_COUNT = 0 # Do not retry unless instructed.
DEFAULT_FAILED_EVENTS_DIR = "/tmp/logs/datahub/actions"
DEFAULT_FAILED_EVENTS_FILE_NAME = "failed_events.log" # Not currently configurable.
DEFAULT_FAILURE_MODE = FailureMode.CONTINUE
class PipelineException(Exception):
"""
An exception thrown when a Pipeline encounters an unrecoverable situation.
Mainly a placeholder for now.
"""
pass
class Pipeline:
"""
A Pipeline is responsible for coordinating execution of a single DataHub Action.
This responsibility includes:
- sourcing events from an Event Source
- executing a configurable chain of Transformers
- invoking an Action with the final Event
- acknowledging the processing of an Event with the Event Source
Additionally, a Pipeline supports the following notable capabilities:
- Configurable retries of event processing in cases of component failure
- Configurable dead letter queue
- Capturing basic statistics about each Pipeline component
- At-will start and stop of an individual pipeline
"""
name: str
source: EventSource
transforms: List[Transformer] = []
action: Action
# Whether the Pipeline has been requested to shut down
_shutdown: bool = False
# Pipeline statistics
_stats: PipelineStats = PipelineStats()
# Options
_retry_count: int = DEFAULT_RETRY_COUNT # Number of times a single event should be retried in case of processing error.
_failure_mode: FailureMode = DEFAULT_FAILURE_MODE
_failed_events_dir: str = DEFAULT_FAILED_EVENTS_DIR # The top-level path where failed events will be logged.
def __init__(
self,
name: str,
source: EventSource,
transforms: List[Transformer],
action: Action,
retry_count: Optional[int],
failure_mode: Optional[FailureMode],
failed_events_dir: Optional[str],
) -> None:
self.name = name
self.source = source
self.transforms = transforms
self.action = action
if retry_count is not None:
self._retry_count = retry_count
if failure_mode is not None:
self._failure_mode = failure_mode
if failed_events_dir is not None:
self._failed_events_dir = failed_events_dir
self._init_failed_events_dir()
@classmethod
def create(cls, config_dict: dict) -> "Pipeline":
# Bind config
config = PipelineConfig.parse_obj(config_dict)
if not config.enabled:
raise Exception(
"Pipeline is disabled, but create method was called unexpectedly."
)
# Create Context
ctx = create_action_context(config.name, config.datahub)
# Create Event Source
event_source = create_event_source(config.source, ctx)
# Create Transforms
transforms = []
if config.filter is not None:
transforms.append(create_filter_transformer(config.filter, ctx))
if config.transform is not None:
for transform_config in config.transform:
transforms.append(create_transformer(transform_config, ctx))
# Create Action
action = create_action(config.action, ctx)
# Finally, create Pipeline.
return cls(
config.name,
event_source,
transforms,
action,
config.options.retry_count if config.options else None,
config.options.failure_mode if config.options else None,
config.options.failed_events_dir if config.options else None,
)
async def start(self) -> None:
"""
Start the action pipeline asynchronously. This method is non-blocking.
"""
self.run()
def run(self) -> None:
"""
Run the action pipeline synchronously. This method is blocking.
Raises an instance of PipelineException if an unrecoverable pipeline failure occurs.
"""
self._stats.mark_start()
# First, source the events.
enveloped_events = self.source.events()
for enveloped_event in enveloped_events:
# Then, process the event.
retval = self._process_event(enveloped_event)
# For legacy users w/o selective ack support, convert
# None to True, i.e. always commit.
if retval is None:
retval = True
# Finally, ack the event.
self._ack_event(enveloped_event, retval)
def stop(self) -> None:
"""
Stops a running action pipeline.
"""
logger.debug(f"Preparing to stop Actions Pipeline with name {self.name}")
self._shutdown = True
self._failed_events_fd.close()
self.source.close()
self.action.close()
def stats(self) -> PipelineStats:
"""
Returns basic statistics about the Pipeline run.
"""
return self._stats
def _process_event(self, enveloped_event: EventEnvelope) -> Optional[bool]:
# Attempt to process the incoming event, with retry.
curr_attempt = 1
max_attempts = self._retry_count + 1
retval = None
while curr_attempt <= max_attempts:
try:
# First, transform the event.
transformed_event = self._execute_transformers(enveloped_event)
# Then, invoke the action if the event is non-null.
if transformed_event is not None:
retval = self._execute_action(transformed_event)
# Short circuit - processing has succeeded.
return retval
except Exception:
logger.exception(
f"Caught exception while attempting to process event. Attempt {curr_attempt}/{max_attempts} event type: {enveloped_event.event_type}, pipeline name: {self.name}"
)
curr_attempt = curr_attempt + 1
logger.error(
f"Failed to process event after {self._retry_count} retries. event type: {enveloped_event.event_type}, pipeline name: {self.name}. Handling failure..."
)
# Increment failed event count.
self._stats.increment_failed_event_count()
# Finally, handle the failure
self._handle_failure(enveloped_event)
return retval
def _execute_transformers(
self, enveloped_event: EventEnvelope
) -> Optional[EventEnvelope]:
curr_event = enveloped_event
# Iterate through all transformers, sequentially apply them to the result of the previous.
for transformer in self.transforms:
# Increment stats
self._stats.increment_transformer_processed_count(transformer)
# Transform the event
transformed_event = self._execute_transformer(curr_event, transformer)
# Process result
if transformed_event is None:
# If the transformer has filtered the event, short circuit.
self._stats.increment_transformer_filtered_count(transformer)
return None
# Otherwise, set the result to the transformed event.
curr_event = transformed_event
# Return the final transformed event.
return curr_event
def _execute_transformer(
self, enveloped_event: EventEnvelope, transformer: Transformer
) -> Optional[EventEnvelope]:
try:
return transformer.transform(enveloped_event)
except Exception as e:
self._stats.increment_transformer_exception_count(transformer)
raise PipelineException(
f"Caught exception while executing Transformer with name {type(transformer).__name__}"
) from e
def _execute_action(self, enveloped_event: EventEnvelope) -> Optional[bool]:
try:
retval = self.action.act(enveloped_event)
self._stats.increment_action_success_count()
return retval
except Exception as e:
self._stats.increment_action_exception_count()
raise PipelineException(
f"Caught exception while executing Action with type {type(self.action).__name__}"
) from e
def _ack_event(self, enveloped_event: EventEnvelope, processed: bool) -> None:
try:
self.source.ack(enveloped_event, processed)
self._stats.increment_success_count()
except Exception:
self._stats.increment_failed_ack_count()
logger.exception(
f"Caught exception while attempting to ack successfully processed event. event type: {enveloped_event.event_type}, pipeline name: {self.name}",
)
logger.debug(f"Failed to ack event: {enveloped_event}")
def _handle_failure(self, enveloped_event: EventEnvelope) -> None:
# First, always save the failed event to a file. Useful for investigation.
self._append_failed_event_to_file(enveloped_event)
if self._failure_mode == FailureMode.THROW:
raise PipelineException("Failed to process event after maximum retries.")
elif self._failure_mode == FailureMode.CONTINUE:
# Simply return, nothing left to do.
pass
def _append_failed_event_to_file(self, enveloped_event: EventEnvelope) -> None:
# First, convert the event to JSON.
try:
json = enveloped_event.as_json()
# Then append to failed events file.
self._failed_events_fd.write(json + "\n")
self._failed_events_fd.flush()
except Exception as e:
# This is a serious issue: failing to handle it can mean losing an event altogether.
# Raise an exception to ensure this issue is reported to the operator.
raise PipelineException(
f"Failed to log failed event to file! {enveloped_event}"
) from e
def _init_failed_events_dir(self) -> None:
# Create a directory for failed events from this actions pipeline.
failed_events_dir = os.path.join(
self._failed_events_dir, normalize_directory_name(self.name)
)
try:
os.makedirs(failed_events_dir, exist_ok=True)
failed_events_file_name = os.path.join(
failed_events_dir, DEFAULT_FAILED_EVENTS_FILE_NAME
)
self._failed_events_fd = open(failed_events_file_name, "a")
except Exception as e:
logger.debug(e)
raise PipelineException(
f"Caught exception while attempting to create failed events log file at path {failed_events_dir}. Please check your file system permissions."
) from e
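# Sketch: building and running a pipeline from a plain config dict. The
# "kafka" / "hello_world" type names and the connection settings are
# illustrative; they must match plugins registered in the event source and
# action registries.
from datahub_actions.pipeline.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "name": "hello_world_pipeline",
        "source": {"type": "kafka", "config": {"connection": {"bootstrap": "localhost:9092"}}},
        "action": {"type": "hello_world", "config": {"to_upper": True}},
    }
)
pipeline.run()  # blocking; PipelineManager.start_pipeline runs it on a thread instead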

View File

@ -0,0 +1,74 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any, Dict, List, Optional, Union
from pydantic import BaseModel
from datahub.configuration import ConfigModel
from datahub.configuration.common import ConfigEnum
from datahub.ingestion.graph.client import DatahubClientConfig
class FailureMode(ConfigEnum):
# Log the failed event to the failed events log. Then throw a pipeline exception to stop the pipeline.
THROW = "THROW"
# Log the failed event to the failed events log. Then continue processing the event stream.
CONTINUE = "CONTINUE"
class SourceConfig(ConfigModel):
type: str
config: Optional[Dict[str, Any]] = None
class TransformConfig(ConfigModel):
type: str
config: Optional[Dict[str, Any]] = None
class FilterConfig(ConfigModel):
event_type: Union[str, List[str]]
event: Optional[Dict[str, Any]] = None
class ActionConfig(ConfigModel):
type: str
config: Optional[dict]
class PipelineOptions(BaseModel):
retry_count: Optional[int] = None
failure_mode: Optional[FailureMode] = None
failed_events_dir: Optional[str] = (
None # The path where failed events should be logged.
)
class PipelineConfig(ConfigModel):
"""
Configuration required to create a new Actions Pipeline.
This exactly matches the structure of the YAML file used
to configure a Pipeline.
"""
name: str
enabled: bool = True
source: SourceConfig
filter: Optional[FilterConfig] = None
transform: Optional[List[TransformConfig]] = None
action: ActionConfig
datahub: Optional[DatahubClientConfig] = None
options: Optional[PipelineOptions] = None
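# Sketch: validating a raw config dict against the models above (type names
# and values are illustrative).
from datahub_actions.pipeline.pipeline_config import FailureMode, PipelineConfig

config = PipelineConfig.parse_obj(
    {
        "name": "example",
        "source": {"type": "kafka"},
        "action": {"type": "hello_world"},
        "options": {"retry_count": 2, "failure_mode": "THROW"},
    }
)
assert config.enabled and config.options.failure_mode is FailureMode.THROW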

View File

@ -0,0 +1,31 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass
from typing import Optional
from datahub_actions.api.action_graph import AcrylDataHubGraph
@dataclass
class PipelineContext:
"""
Context which is provided to each component in a Pipeline.
"""
# The name of the running pipeline.
pipeline_name: str
# An instance of a DataHub client.
graph: Optional[AcrylDataHubGraph]

View File

@ -0,0 +1,105 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import traceback
from dataclasses import dataclass
from threading import Thread
from typing import Dict
from datahub_actions.pipeline.pipeline import Pipeline, PipelineException
logger = logging.getLogger(__name__)
@dataclass
class PipelineSpec:
# The pipeline name
name: str
# The pipeline
pipeline: Pipeline
# The thread which is executing the pipeline.
thread: Thread
# Run a pipeline in blocking fashion
# TODO: Exit process on failure of single pipeline.
def run_pipeline(pipeline: Pipeline) -> None:
try:
pipeline.run()
except PipelineException:
logger.error(
f"Caught exception while running pipeline with name {pipeline.name}: {traceback.format_exc(limit=3)}"
)
pipeline.stop()
logger.debug(f"Thread for pipeline with name {pipeline.name} has stopped.")
# A manager of multiple Action Pipelines.
# This class manages 1 thread per pipeline registered.
class PipelineManager:
# A catalog of all the currently executing Action Pipelines.
pipeline_registry: Dict[str, PipelineSpec] = {}
def __init__(self) -> None:
pass
# Start a new Action Pipeline.
def start_pipeline(self, name: str, pipeline: Pipeline) -> None:
logger.debug(f"Attempting to start pipeline with name {name}...")
if name not in self.pipeline_registry:
thread = Thread(target=run_pipeline, args=(pipeline,))
thread.start()
spec = PipelineSpec(name, pipeline, thread)
self.pipeline_registry[name] = spec
logger.debug(f"Started pipeline with name {name}.")
else:
raise Exception(f"Pipeline with name {name} is already running.")
# Stop a running Action Pipeline.
def stop_pipeline(self, name: str) -> None:
logger.debug(f"Attempting to stop pipeline with name {name}...")
if name in self.pipeline_registry:
# First, stop the pipeline.
try:
pipeline_spec = self.pipeline_registry[name]
pipeline_spec.pipeline.stop()
pipeline_spec.thread.join() # Wait for the pipeline thread to terminate.
logger.info(f"Actions Pipeline with name '{name}' has been stopped.")
pipeline_spec.pipeline.stats().pretty_print_summary(
name
) # Print the pipeline's statistics.
del self.pipeline_registry[name]
except Exception as e:
# Failed to stop a pipeline. This is a critical issue; we should avoid starting another action of the same type
# until this pipeline is confirmed killed.
logger.error(
f"Caught exception while attempting to stop pipeline with name {name}: {traceback.format_exc(limit=3)}"
)
raise Exception(
f"Caught exception while attempting to stop pipeline with name {name}."
) from e
else:
raise Exception(f"No pipeline with name {name} found.")
# Stop all running pipelines.
def stop_all(self) -> None:
logger.debug("Attempting to stop all running pipelines...")
# Stop each running pipeline.
names = list(self.pipeline_registry.keys()).copy()
for name in names:
self.stop_pipeline(name)
logger.debug("Successfully stop all running pipelines.")

View File

@ -0,0 +1,131 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import datetime
import json
from time import time
from typing import Dict
import click
from datahub_actions.action.action_stats import ActionStats
from datahub_actions.pipeline.pipeline_util import get_transformer_name
from datahub_actions.transform.transformer import Transformer
from datahub_actions.transform.transformer_stats import TransformerStats
# Class that stores running statistics for a single Actions Pipeline.
class PipelineStats:
# Timestamp in milliseconds when the pipeline was launched.
started_at: int
# Number of events that failed processing even after retry.
failed_event_count: int = 0
# Number of events that failed when "ack" was invoked.
failed_ack_count: int = 0
# Top-level number of succeeded processing executions.
success_count: int = 0
# Transformer Stats
transformer_stats: Dict[str, TransformerStats] = {}
# Action Stats
action_stats: ActionStats = ActionStats()
def mark_start(self) -> None:
self.started_at = int(time() * 1000)
def increment_failed_event_count(self) -> None:
self.failed_event_count = self.failed_event_count + 1
def increment_failed_ack_count(self) -> None:
self.failed_ack_count = self.failed_ack_count + 1
def increment_success_count(self) -> None:
self.success_count = self.success_count + 1
def increment_transformer_exception_count(self, transformer: Transformer) -> None:
transformer_name = get_transformer_name(transformer)
if transformer_name not in self.transformer_stats:
self.transformer_stats[transformer_name] = TransformerStats()
self.transformer_stats[transformer_name].increment_exception_count()
def increment_transformer_processed_count(self, transformer: Transformer) -> None:
transformer_name = get_transformer_name(transformer)
if transformer_name not in self.transformer_stats:
self.transformer_stats[transformer_name] = TransformerStats()
self.transformer_stats[transformer_name].increment_processed_count()
def increment_transformer_filtered_count(self, transformer: Transformer) -> None:
transformer_name = get_transformer_name(transformer)
if transformer_name not in self.transformer_stats:
self.transformer_stats[transformer_name] = TransformerStats()
self.transformer_stats[transformer_name].increment_filtered_count()
def increment_action_exception_count(self) -> None:
self.action_stats.increment_exception_count()
def increment_action_success_count(self) -> None:
self.action_stats.increment_success_count()
def get_started_at(self) -> int:
return self.started_at
def get_failed_event_count(self) -> int:
return self.failed_event_count
def get_failed_ack_count(self) -> int:
return self.failed_ack_count
def get_success_count(self) -> int:
return self.success_count
def get_transformer_stats(self, transformer: Transformer) -> TransformerStats:
transformer_name = get_transformer_name(transformer)
if transformer_name not in self.transformer_stats:
self.transformer_stats[transformer_name] = TransformerStats()
return self.transformer_stats[transformer_name]
def get_action_stats(self) -> ActionStats:
return self.action_stats
def as_string(self) -> str:
return json.dumps(self.__dict__, indent=4, sort_keys=True)
def pretty_print_summary(self, name: str) -> None:
curr_time = int(time() * 1000)
click.echo()
click.secho(f"Pipeline Report for {name}", bold=True, fg="blue")
click.echo()
click.echo(
f"Started at: {datetime.datetime.fromtimestamp(self.started_at / 1000.0)} (Local Time)"
)
click.echo(f"Duration: {(curr_time - self.started_at) / 1000.0}s")
click.echo()
click.secho("Pipeline statistics", bold=True)
click.echo()
click.echo(self.as_string())
click.echo()
if len(self.transformer_stats.keys()) > 0:
click.secho("Transformer statistics", bold=True)
for key in self.transformer_stats:
click.echo()
click.echo(f"{key}: {self.transformer_stats[key].as_string()}")
click.echo()
click.secho("Action statistics", bold=True)
click.echo()
click.echo(self.action_stats.as_string())
click.echo()

View File

@ -0,0 +1,156 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import re
from typing import Optional
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub_actions.action.action import Action
from datahub_actions.action.action_registry import action_registry
from datahub_actions.api.action_graph import AcrylDataHubGraph
from datahub_actions.pipeline.pipeline_config import (
ActionConfig,
FilterConfig,
SourceConfig,
TransformConfig,
)
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.plugin.transform.filter.filter_transformer import (
FilterTransformer,
FilterTransformerConfig,
)
from datahub_actions.source.event_source import EventSource
from datahub_actions.source.event_source_registry import event_source_registry
from datahub_actions.transform.transformer import Transformer
from datahub_actions.transform.transformer_registry import transformer_registry
logger = logging.getLogger(__name__)
def create_action_context(
pipeline_name: str, datahub_config: Optional[DatahubClientConfig]
) -> PipelineContext:
return PipelineContext(
pipeline_name,
(
AcrylDataHubGraph(DataHubGraph(datahub_config))
if datahub_config is not None
else None
),
)
def create_event_source(
source_config: SourceConfig, ctx: PipelineContext
) -> EventSource:
event_source_type = source_config.type
event_source_class = event_source_registry.get(event_source_type)
event_source_instance = None
try:
logger.debug(
f"Attempting to instantiate new Event Source of type {source_config.type}.."
)
event_source_config = (
source_config.config if source_config.config is not None else {}
)
event_source_instance = event_source_class.create(event_source_config, ctx)
except Exception as e:
raise Exception(
f"Caught exception while attempting to instantiate Event Source of type {source_config.type}"
) from e
if event_source_instance is None:
raise Exception(
f"Failed to create Event Source with type {event_source_type}. Event Source create method returned 'None'."
)
return event_source_instance
def create_filter_transformer(
filter_config: FilterConfig, ctx: PipelineContext
) -> Transformer:
try:
logger.debug("Attempting to instantiate filter transformer..")
filter_transformer_config = FilterTransformerConfig(
event_type=filter_config.event_type, event=filter_config.event
)
return FilterTransformer(filter_transformer_config)
except Exception as e:
raise Exception(
"Caught exception while attempting to instantiate Filter transformer"
) from e
def create_transformer(
transform_config: TransformConfig, ctx: PipelineContext
) -> Transformer:
transformer_type = transform_config.type
transformer_class = transformer_registry.get(transformer_type)
transformer_instance = None
try:
logger.debug(
f"Attempting to instantiate new Transformer of type {transform_config.type}.."
)
transformer_config = (
transform_config.config if transform_config.config is not None else {}
)
transformer_instance = transformer_class.create(transformer_config, ctx)
except Exception as e:
raise Exception(
f"Caught exception while attempting to instantiate Transformer with type {transformer_type}"
) from e
if transformer_instance is None:
raise Exception(
f"Failed to create transformer with type {transformer_type}. Transformer create method returned 'None'."
)
return transformer_instance
def create_action(action_config: ActionConfig, ctx: PipelineContext) -> Action:
action_type = action_config.type
action_instance = None
try:
logger.debug(
f"Attempting to instantiate new Action of type {action_config.type}.."
)
action_class = action_registry.get(action_type)
action_config_dict = (
action_config.config if action_config.config is not None else {}
)
action_instance = action_class.create(action_config_dict, ctx)
except Exception as e:
raise Exception(
f"Caught exception while attempting to instantiate Action with type {action_type}. "
) from e
if action_instance is None:
raise Exception(
f"Failed to create action with type {action_type}. Action create method returned 'None'."
)
return action_instance
def normalize_directory_name(name: str) -> str:
# Lower-case the name and replace any character that is not alphanumeric, underscore, or hyphen with an underscore.
return re.sub(r"[^\w\-_]", "_", name.lower())
def get_transformer_name(transformer: Transformer) -> str:
# TODO: Would be better to compute this using the transformer registry itself.
return type(transformer).__name__

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,219 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import json
import logging
import sys
from typing import Any, List, Optional, cast
from acryl.executor.dispatcher.default_dispatcher import DefaultDispatcher
from acryl.executor.execution.reporting_executor import (
ReportingExecutor,
ReportingExecutorConfig,
)
from acryl.executor.execution.task import TaskConfig
from acryl.executor.request.execution_request import ExecutionRequest
from acryl.executor.request.signal_request import SignalRequest
from acryl.executor.secret.datahub_secret_store import DataHubSecretStoreConfig
from acryl.executor.secret.secret_store import SecretStoreConfig
from pydantic import BaseModel
from datahub.metadata.schema_classes import MetadataChangeLogClass
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import METADATA_CHANGE_LOG_EVENT_V1_TYPE
from datahub_actions.pipeline.pipeline_context import PipelineContext
logger = logging.getLogger(__name__)
DATAHUB_EXECUTION_REQUEST_ENTITY_NAME = "dataHubExecutionRequest"
DATAHUB_EXECUTION_REQUEST_INPUT_ASPECT_NAME = "dataHubExecutionRequestInput"
DATAHUB_EXECUTION_REQUEST_SIGNAL_ASPECT_NAME = "dataHubExecutionRequestSignal"
APPLICATION_JSON_CONTENT_TYPE = "application/json"
def _is_importable(path: str) -> bool:
return "." in path or ":" in path
def import_path(path: str) -> Any:
"""
Import an item from a package, where the path is formatted as 'package.module.submodule.ClassName'
or 'package.module.submodule:ClassName.classmethod'. The dot-based format assumes that the bit
after the last dot is the item to be fetched. In cases where the item to be imported is embedded
within another type, the colon-based syntax can be used to disambiguate.
"""
assert _is_importable(path), "path must be in the appropriate format"
if ":" in path:
module_name, object_name = path.rsplit(":", 1)
else:
module_name, object_name = path.rsplit(".", 1)
item = importlib.import_module(module_name)
for attr in object_name.split("."):
item = getattr(item, attr)
return item
class ExecutorConfig(BaseModel):
executor_id: Optional[str] = None
task_configs: Optional[List[TaskConfig]] = None
# Listens to new Execution Requests & dispatches them to the appropriate handler.
class ExecutorAction(Action):
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
config = ExecutorConfig.parse_obj(config_dict or {})
return cls(config, ctx)
def __init__(self, config: ExecutorConfig, ctx: PipelineContext):
self.ctx = ctx
executors = []
executor_config = self._build_executor_config(config, ctx)
executors.append(ReportingExecutor(executor_config))
# Construct execution request dispatcher
self.dispatcher = DefaultDispatcher(executors)
def act(self, event: EventEnvelope) -> None:
"""This method listens for ExecutionRequest changes to execute in schedule and trigger events"""
if event.event_type == METADATA_CHANGE_LOG_EVENT_V1_TYPE:
orig_event = cast(MetadataChangeLogClass, event.event)
if (
orig_event.get("entityType") == DATAHUB_EXECUTION_REQUEST_ENTITY_NAME
and orig_event.get("changeType") == "UPSERT"
):
if (
orig_event.get("aspectName")
== DATAHUB_EXECUTION_REQUEST_INPUT_ASPECT_NAME
):
logger.debug("Received execution request input. Processing...")
self._handle_execution_request_input(orig_event)
elif (
orig_event.get("aspectName")
== DATAHUB_EXECUTION_REQUEST_SIGNAL_ASPECT_NAME
):
logger.debug("Received execution request signal. Processing...")
self._handle_execution_request_signal(orig_event)
def _handle_execution_request_input(self, orig_event):
entity_urn = orig_event.get("entityUrn")
entity_key = orig_event.get("entityKeyAspect")
# Get the run id to use.
exec_request_id = None
if entity_key is not None:
exec_request_key = json.loads(
entity_key.get("value")
) # this becomes the run id.
exec_request_id = exec_request_key.get("id")
elif entity_urn is not None:
urn_parts = entity_urn.split(":")
exec_request_id = urn_parts[len(urn_parts) - 1]
# Decode the aspect json into something more readable :)
exec_request_input = json.loads(orig_event.get("aspect").get("value"))
# Build an Execution Request
exec_request = ExecutionRequest(
executor_id=exec_request_input.get("executorId"),
exec_id=exec_request_id,
name=exec_request_input.get("task"),
args=exec_request_input.get("args"),
)
# Try to dispatch the execution request
try:
self.dispatcher.dispatch(exec_request)
except Exception:
logger.error("ERROR", exc_info=sys.exc_info())
def _handle_execution_request_signal(self, orig_event):
entity_urn = orig_event.get("entityUrn")
if (
orig_event.get("aspect").get("contentType") == APPLICATION_JSON_CONTENT_TYPE
and entity_urn is not None
):
# Decode the aspect json into something more readable :)
signal_request_input = json.loads(orig_event.get("aspect").get("value"))
# Build a Signal Request
urn_parts = entity_urn.split(":")
exec_id = urn_parts[len(urn_parts) - 1]
signal_request = SignalRequest(
executor_id=signal_request_input.get("executorId"),
exec_id=exec_id,
signal=signal_request_input.get("signal"),
)
# Try to dispatch the signal request
try:
self.dispatcher.dispatch_signal(signal_request)
except Exception:
logger.error("ERROR", exc_info=sys.exc_info())
def _build_executor_config(
self, config: ExecutorConfig, ctx: PipelineContext
) -> ReportingExecutorConfig:
if config.task_configs:
task_configs = config.task_configs
else:
# Build default task config
task_configs = [
TaskConfig(
name="RUN_INGEST",
type="acryl.executor.execution.sub_process_ingestion_task.SubProcessIngestionTask",
configs=dict({}),
),
TaskConfig(
name="TEST_CONNECTION",
type="acryl.executor.execution.sub_process_test_connection_task.SubProcessTestConnectionTask",
configs={},
),
]
if not ctx.graph:
raise Exception(
"Invalid configuration provided to action. DataHub Graph Client Required. Try including the 'datahub' block in your configuration."
)
graph = ctx.graph.graph
# Build default executor config
local_executor_config = ReportingExecutorConfig(
id=config.executor_id or "default",
task_configs=task_configs,
secret_stores=[
SecretStoreConfig(type="env", config=dict({})),
SecretStoreConfig(
type="datahub",
# TODO: Once SecretStoreConfig is updated to accept arbitrary types
# and not just dicts, we can just pass in the DataHubSecretStoreConfig
# object directly.
config=DataHubSecretStoreConfig(graph_client=graph).dict(),
),
],
graph_client=graph,
)
return local_executor_config
def close(self) -> None:
# TODO: Handle closing action ingestion processing.
pass
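# Sketch of the two path formats accepted by import_path above, demonstrated
# against the standard library (import_path itself is assumed to be in scope,
# e.g. copied from the module above).
json_dumps = import_path("json.dumps")                       # dot form: last segment is the attribute
from_keys = import_path("collections:OrderedDict.fromkeys")  # colon form walks nested attributes
assert json_dumps({"ok": True}) == '{"ok": true}'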

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,52 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
from typing import Optional
from pydantic import BaseModel
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
logger = logging.getLogger(__name__)
class HelloWorldConfig(BaseModel):
# Whether to print the message in upper case.
to_upper: Optional[bool] = None
# A basic example of a DataHub action that prints all
# events received to the console.
class HelloWorldAction(Action):
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = HelloWorldConfig.parse_obj(config_dict or {})
return cls(action_config, ctx)
def __init__(self, config: HelloWorldConfig, ctx: PipelineContext):
self.config = config
def act(self, event: EventEnvelope) -> None:
print("Hello world! Received event:")
message = json.dumps(json.loads(event.as_json()), indent=4)
if self.config.to_upper:
print(message.upper())
else:
print(message)
def close(self) -> None:
pass
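# Usage sketch (illustrative): the pipeline framework normally instantiates actions itself,
# but for a quick local check create() can be called directly. `ctx` is assumed to be a
# PipelineContext provided by the framework, and the config dict mirrors HelloWorldConfig.
#
#   action = HelloWorldAction.create({"to_upper": True}, ctx)
#   action.act(event_envelope)  # prints the event JSON in upper case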

View File

@ -0,0 +1,54 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any, Callable
from datahub.metadata.schema_classes import MetadataChangeLogClass
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import METADATA_CHANGE_LOG_EVENT_V1_TYPE
class MCLProcessor:
"""
A utility class to register and process MetadataChangeLog events.
"""
def __init__(self) -> None:
self.entity_aspect_processors: dict[str, dict[str, Callable]] = {}
def is_mcl(self, event: EventEnvelope) -> bool:
return event.event_type == METADATA_CHANGE_LOG_EVENT_V1_TYPE
def register_processor(
self, entity_type: str, aspect: str, processor: Callable
) -> None:
if entity_type not in self.entity_aspect_processors:
self.entity_aspect_processors[entity_type] = {}
self.entity_aspect_processors[entity_type][aspect] = processor
def process(self, event: EventEnvelope) -> Any:
if isinstance(event.event, MetadataChangeLogClass):
entity_type = event.event.entityType
aspect = event.event.aspectName
if (
entity_type in self.entity_aspect_processors
and aspect in self.entity_aspect_processors[entity_type]
):
return self.entity_aspect_processors[entity_type][aspect](
entity_urn=event.event.entityUrn,
aspect_name=event.event.aspectName,
aspect_value=event.event.aspect,
previous_aspect_value=event.event.previousAspectValue,
)
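# Registration sketch (illustrative names, not part of this module): callers register one
# callable per (entity type, aspect) pair and route incoming MetadataChangeLog envelopes
# through process(), as DocPropagationAction does further below.
#
#   processor = MCLProcessor()
#   processor.register_processor("schemaField", "documentation", handle_doc_change)
#   if processor.is_mcl(event_envelope):
#       directive = processor.process(event_envelope)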

View File

@ -0,0 +1,169 @@
import json
import logging
import re
from typing import Dict, List, Optional, Set, Union, cast
from pydantic import BaseModel, Field
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
ChangeTypeClass,
MetadataChangeLogClass,
MetadataChangeProposalClass,
)
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import METADATA_CHANGE_LOG_EVENT_V1_TYPE
from datahub_actions.pipeline.pipeline_context import PipelineContext
logger = logging.getLogger(__name__)
class MetadataChangeEmitterConfig(BaseModel):
gms_server: Optional[str] = None
gms_auth_token: Optional[str] = None
aspects_to_exclude: Optional[List[str]] = None
aspects_to_include: Optional[List[str]] = None
entity_type_to_exclude: List[str] = Field(default_factory=list)
extra_headers: Optional[Dict[str, str]] = None
urn_regex: Optional[str] = None
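# Illustrative config sketch (all values are hypothetical placeholders): this mirrors how
# create() below builds the config via pydantic's parse_obj. Any aspects_to_exclude entries
# are merged with DEFAULT_ASPECTS_EXCLUDE_SET defined on the action.
#
#   example_config = MetadataChangeEmitterConfig.parse_obj({
#       "gms_server": "https://target-datahub.example.com/gms",
#       "gms_auth_token": "<personal-access-token>",
#       "aspects_to_exclude": ["datasetProfile"],
#       "urn_regex": "^urn:li:dataset:.*",
#   })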
class MetadataChangeSyncAction(Action):
rest_emitter: DatahubRestEmitter
aspects_exclude_set: Set
# By default, we exclude the following aspects: each DataHub instance has its own encryption keys
# for tokens and secrets, so these values cannot be decrypted even if they are synced to another
# instance. Execution request aspects are likewise not synced, because the ingestion recipe may
# contain DataHub secrets that the target instance could not decrypt.
DEFAULT_ASPECTS_EXCLUDE_SET = {
"dataHubAccessTokenInfo",
"dataHubAccessTokenKey",
"dataHubSecretKey",
"dataHubSecretValue",
"dataHubExecutionRequestInput",
"dataHubExecutionRequestKey",
"dataHubExecutionRequestResult",
}
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = MetadataChangeEmitterConfig.parse_obj(config_dict or {})
return cls(action_config, ctx)
def __init__(self, config: MetadataChangeEmitterConfig, ctx: PipelineContext):
self.config = config
assert isinstance(self.config.gms_server, str)
self.rest_emitter = DatahubRestEmitter(
gms_server=self.config.gms_server,
token=self.config.gms_auth_token,
extra_headers=self.config.extra_headers,
)
self.aspects_exclude_set = (
self.DEFAULT_ASPECTS_EXCLUDE_SET.union(set(self.config.aspects_to_exclude))
if self.config.aspects_to_exclude
else self.DEFAULT_ASPECTS_EXCLUDE_SET
)
self.aspects_include_set = self.config.aspects_to_include
extra_headers_keys = (
list(self.config.extra_headers.keys())
if self.config.extra_headers
else None
)
logger.info(
f"MetadataChangeSyncAction configured to emit mcp to gms server {self.config.gms_server} with extra headers {extra_headers_keys} and aspects to exclude {self.aspects_exclude_set} and aspects to include {self.aspects_include_set}"
)
self.urn_regex = self.config.urn_regex
def act(self, event: EventEnvelope) -> None:
"""
Listens for MetadataChangeLog events, converts them to MetadataChangeProposals,
and emits them to another DataHub instance.
"""
# MetadataChangeProposal only supports UPSERT type for now
if event.event_type == METADATA_CHANGE_LOG_EVENT_V1_TYPE:
orig_event = cast(MetadataChangeLogClass, event.event)
logger.debug(f"received orig_event {orig_event}")
regexUrn = self.urn_regex
if regexUrn is None:
urn_match = re.match(".*", "default match")
elif orig_event.entityUrn is not None:
urn_match = re.match(regexUrn, orig_event.entityUrn)
else:
logger.warning(f"event missing entityUrn: {orig_event}")
urn_match = None
aspect_name = orig_event.get("aspectName")
logger.info(f"urn_match {urn_match} for entityUrn {orig_event.entityUrn}")
if (
(
(
self.aspects_include_set is not None
and aspect_name in self.aspects_include_set
)
or (
self.aspects_include_set is None
and aspect_name not in self.aspects_exclude_set
)
)
and (
orig_event.get("entityType")
not in self.config.entity_type_to_exclude
if self.config.entity_type_to_exclude
else True
)
and urn_match is not None
):
mcp = self.buildMcp(orig_event)
if mcp is not None:
logger.debug(f"built mcp {mcp}")
self.emit(mcp)
else:
logger.debug(
f"skip emitting mcp for aspect {orig_event.get('aspectName')} or entityUrn {orig_event.entityUrn} or entity type {orig_event.get('entityType')} on exclude list"
)
def buildMcp(
self, orig_event: MetadataChangeLogClass
) -> Union[MetadataChangeProposalClass, None]:
try:
changeType = orig_event.get("changeType")
if changeType == ChangeTypeClass.RESTATE or changeType == "RESTATE":
changeType = ChangeTypeClass.UPSERT
mcp = MetadataChangeProposalClass(
entityType=orig_event.get("entityType"),
changeType=changeType,
entityUrn=orig_event.get("entityUrn"),
entityKeyAspect=orig_event.get("entityKeyAspect"),
aspectName=orig_event.get("aspectName"),
aspect=orig_event.get("aspect"),
)
return mcp
except Exception as ex:
logger.error(
f"error when building mcp from mcl {json.dumps(orig_event.to_obj(), indent=4)}"
)
logger.error(f"exception: {ex}")
return None
def emit(self, mcp: MetadataChangeProposalClass) -> None:
# Create an emitter to DataHub over REST
try:
# For unit-testing purposes, test_connection is invoked here rather than in __init__.
# An empty rest_emitter.server_config means test_connection() has not been called yet.
if not self.rest_emitter.server_config:
self.rest_emitter.test_connection()
logger.info(
f"emitting the mcp: entityType {mcp.entityType}, changeType {mcp.changeType}, urn {mcp.entityUrn}, aspect name {mcp.aspectName}"
)
self.rest_emitter.emit_mcp(mcp)
logger.info("successfully emit the mcp")
except Exception as ex:
logger.error(
f"error when emitting mcp, {json.dumps(mcp.to_obj(), indent=4)}"
)
logger.error(f"exception: {ex}")
def close(self) -> None:
pass

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,847 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import time
from typing import Iterable, List, Optional, Tuple
from pydantic import Field
from datahub.configuration.common import ConfigEnum
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
AuditStampClass,
DocumentationAssociationClass,
DocumentationClass,
EditableSchemaMetadataClass,
EntityChangeEventClass as EntityChangeEvent,
GenericAspectClass,
MetadataAttributionClass,
MetadataChangeLogClass,
)
from datahub.metadata.urns import DatasetUrn
from datahub.utilities.urns.urn import Urn, guess_entity_type
from datahub_actions.action.action import Action
from datahub_actions.api.action_graph import AcrylDataHubGraph
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.plugin.action.mcl_utils import MCLProcessor
from datahub_actions.plugin.action.propagation.propagation_utils import (
DirectionType,
PropagationConfig,
PropagationDirective,
RelationshipType,
SourceDetails,
get_unique_siblings,
)
from datahub_actions.plugin.action.stats_util import (
ActionStageReport,
EventProcessingStats,
)
logger = logging.getLogger(__name__)
class DocPropagationDirective(PropagationDirective):
doc_string: Optional[str] = Field(
default=None, description="Documentation string to be propagated."
)
class ColumnPropagationRelationships(ConfigEnum):
UPSTREAM = "upstream"
DOWNSTREAM = "downstream"
SIBLING = "sibling"
class DocPropagationConfig(PropagationConfig):
"""
Configuration model for documentation propagation.
Attributes:
enabled (bool): Indicates whether documentation propagation is enabled or not. Default is True.
columns_enabled (bool): Indicates whether column documentation propagation is enabled or not. Default is True.
datasets_enabled (bool): Indicates whether dataset level documentation propagation is enabled or not. Default is False.
Example:
config = DocPropagationConfig(enabled=True)
"""
enabled: bool = Field(
True,
description="Indicates whether documentation propagation is enabled or not.",
)
columns_enabled: bool = Field(
True,
description="Indicates whether column documentation propagation is enabled or not.",
)
# TODO: Currently this flag does nothing. Datasets are NOT supported for docs propagation.
datasets_enabled: bool = Field(
False,
description="Indicates whether dataset level documentation propagation is enabled or not.",
)
column_propagation_relationships: List[ColumnPropagationRelationships] = Field(
[
ColumnPropagationRelationships.SIBLING,
ColumnPropagationRelationships.DOWNSTREAM,
ColumnPropagationRelationships.UPSTREAM,
],
description="Relationships for column documentation propagation.",
)
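# Illustrative sketch: restrict column-level propagation to downstream lineage only. The
# field names come from the model above; the instance itself is an example, not a default.
#
#   downstream_only = DocPropagationConfig(
#       columns_enabled=True,
#       column_propagation_relationships=[ColumnPropagationRelationships.DOWNSTREAM],
#   )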
def get_field_path(schema_field_urn: str) -> str:
urn = Urn.from_string(schema_field_urn)
return urn.get_entity_id()[1]
def get_field_doc_from_dataset(
graph: AcrylDataHubGraph, dataset_urn: str, schema_field_urn: str
) -> Optional[str]:
editableSchemaMetadata = graph.graph.get_aspect(
dataset_urn, EditableSchemaMetadataClass
)
if editableSchemaMetadata is not None:
if editableSchemaMetadata.editableSchemaFieldInfo is not None:
field_info = [
x
for x in editableSchemaMetadata.editableSchemaFieldInfo
if x.fieldPath == get_field_path(schema_field_urn)
]
if field_info:
return field_info[0].description
return None
ECE_EVENT_TYPE = "EntityChangeEvent_v1"
class DocPropagationAction(Action):
def __init__(self, config: DocPropagationConfig, ctx: PipelineContext):
super().__init__()
self.action_urn: str
if not ctx.pipeline_name.startswith("urn:li:dataHubAction"):
self.action_urn = f"urn:li:dataHubAction:{ctx.pipeline_name}"
else:
self.action_urn = ctx.pipeline_name
self.config: DocPropagationConfig = config
self.last_config_refresh: float = 0
self.ctx = ctx
self.mcl_processor = MCLProcessor()
self.actor_urn = "urn:li:corpuser:__datahub_system"
self.mcl_processor.register_processor(
"schemaField",
"documentation",
self.process_schema_field_documentation,
)
self.refresh_config()
self._stats = ActionStageReport()
self._stats.start()
assert self.ctx.graph
self._rate_limited_emit_mcp = self.config.get_rate_limited_emit_mcp(
self.ctx.graph.graph
)
def name(self) -> str:
return "DocPropagator"
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = DocPropagationConfig.parse_obj(config_dict or {})
logger.info(f"Doc Propagation Config action configured with {action_config}")
return cls(action_config, ctx)
def should_stop_propagation(
self, source_details: SourceDetails
) -> Tuple[bool, str]:
"""
Check if the propagation should be stopped based on the source details.
Return result and reason.
"""
if source_details.propagation_started_at and (
int(time.time() * 1000.0) - source_details.propagation_started_at
>= self.config.max_propagation_time_millis
):
return (True, "Propagation time exceeded.")
if (
source_details.propagation_depth
and source_details.propagation_depth >= self.config.max_propagation_depth
):
return (True, "Propagation depth exceeded.")
return False, ""
def get_propagation_relationships(
self, entity_type: str, source_details: Optional[SourceDetails]
) -> List[Tuple[RelationshipType, DirectionType]]:
possible_relationships = []
if entity_type == "schemaField":
if (source_details is not None) and (
source_details.propagation_relationship
and source_details.propagation_direction
):
restricted_relationship = source_details.propagation_relationship
restricted_direction = source_details.propagation_direction
else:
restricted_relationship = None
restricted_direction = None
for relationship in self.config.column_propagation_relationships:
if relationship == ColumnPropagationRelationships.UPSTREAM:
if (
restricted_relationship == RelationshipType.LINEAGE
and restricted_direction == DirectionType.DOWN
): # Skip upstream if the propagation has been restricted to downstream
continue
possible_relationships.append(
(RelationshipType.LINEAGE, DirectionType.UP)
)
elif relationship == ColumnPropagationRelationships.DOWNSTREAM:
if (
restricted_relationship == RelationshipType.LINEAGE
and restricted_direction == DirectionType.UP
): # Skip downstream if the propagation has been restricted to upstream
continue
possible_relationships.append(
(RelationshipType.LINEAGE, DirectionType.DOWN)
)
elif relationship == ColumnPropagationRelationships.SIBLING:
possible_relationships.append(
(RelationshipType.SIBLING, DirectionType.ALL)
)
logger.debug(f"Possible relationships: {possible_relationships}")
return possible_relationships
def process_schema_field_documentation(
self,
entity_urn: str,
aspect_name: str,
aspect_value: GenericAspectClass,
previous_aspect_value: Optional[GenericAspectClass],
) -> Optional[DocPropagationDirective]:
"""
Process changes in the documentation aspect of schemaField entities.
Produce a directive to propagate the documentation.
Business Logic checks:
- If the documentation is sourced by this action, then we propagate
it.
- If the documentation is not sourced by this action, then we log a
warning and propagate it.
- If we have exceeded the maximum depth of propagation or maximum
time for propagation, then we stop propagation and don't return a directive.
"""
if (
aspect_name != "documentation"
or guess_entity_type(entity_urn) != "schemaField"
):
# not a documentation aspect or not a schemaField entity
return None
logger.debug("Processing 'documentation' MCL")
if self.config.columns_enabled:
current_docs = DocumentationClass.from_obj(json.loads(aspect_value.value))
old_docs = (
None
if previous_aspect_value is None
else DocumentationClass.from_obj(
json.loads(previous_aspect_value.value)
)
)
if current_docs.documentations:
# get the most recently updated documentation with attribution
current_documentation_instance = sorted(
[doc for doc in current_docs.documentations if doc.attribution],
key=lambda x: x.attribution.time if x.attribution else 0,
)[-1]
assert current_documentation_instance.attribution
if (
current_documentation_instance.attribution.source is None
or current_documentation_instance.attribution.source
!= self.action_urn
):
logger.warning(
f"Documentation is not sourced by this action which is unexpected. Will be propagating for {entity_urn}"
)
source_details = (
(current_documentation_instance.attribution.sourceDetail)
if current_documentation_instance.attribution
else {}
)
source_details_parsed: SourceDetails = SourceDetails.parse_obj(
source_details
)
should_stop_propagation, reason = self.should_stop_propagation(
source_details_parsed
)
if should_stop_propagation:
logger.warning(f"Stopping propagation for {entity_urn}. {reason}")
return None
else:
logger.debug(f"Propagating documentation for {entity_urn}")
propagation_relationships = self.get_propagation_relationships(
entity_type="schemaField", source_details=source_details_parsed
)
origin_entity = (
source_details_parsed.origin
if source_details_parsed.origin
else entity_urn
)
if old_docs is None or not old_docs.documentations:
return DocPropagationDirective(
propagate=True,
doc_string=current_documentation_instance.documentation,
operation="ADD",
entity=entity_urn,
origin=origin_entity,
via=entity_urn,
actor=self.actor_urn,
propagation_started_at=source_details_parsed.propagation_started_at,
propagation_depth=(
source_details_parsed.propagation_depth + 1
if source_details_parsed.propagation_depth
else 1
),
relationships=propagation_relationships,
)
else:
old_docs_instance = sorted(
old_docs.documentations,
key=lambda x: x.attribution.time if x.attribution else 0,
)[-1]
if (
current_documentation_instance.documentation
!= old_docs_instance.documentation
):
return DocPropagationDirective(
propagate=True,
doc_string=current_documentation_instance.documentation,
operation="MODIFY",
entity=entity_urn,
origin=origin_entity,
via=entity_urn,
actor=self.actor_urn,
propagation_started_at=source_details_parsed.propagation_started_at,
propagation_depth=(
source_details_parsed.propagation_depth + 1
if source_details_parsed.propagation_depth
else 1
),
relationships=propagation_relationships,
)
return None
def should_propagate(
self, event: EventEnvelope
) -> Optional[DocPropagationDirective]:
if self.mcl_processor.is_mcl(event):
return self.mcl_processor.process(event)
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
assert self.ctx.graph is not None
semantic_event = event.event
if (
semantic_event.category == "DOCUMENTATION"
and self.config is not None
and self.config.enabled
):
logger.debug("Processing EntityChangeEvent Documentation Change")
if self.config.columns_enabled and (
semantic_event.entityType == "schemaField"
):
if semantic_event.parameters:
parameters = semantic_event.parameters
else:
parameters = semantic_event._inner_dict.get(
"__parameters_json", {}
)
doc_string = parameters.get("description")
origin = parameters.get("origin")
origin = origin or semantic_event.entityUrn
via = (
semantic_event.entityUrn
if origin != semantic_event.entityUrn
else None
)
logger.debug(f"Origin: {origin}")
logger.debug(f"Via: {via}")
logger.debug(f"Doc string: {doc_string}")
logger.debug(f"Semantic event {semantic_event}")
if doc_string:
return DocPropagationDirective(
propagate=True,
doc_string=doc_string,
operation=semantic_event.operation,
entity=semantic_event.entityUrn,
origin=origin,
via=via,  # via is the current entity only when the doc originated elsewhere
actor=(
semantic_event.auditStamp.actor
if semantic_event.auditStamp
else self.actor_urn
),
propagation_started_at=int(time.time() * 1000.0),
propagation_depth=1, # we start at 1 because this is the first propagation
relationships=self.get_propagation_relationships(
entity_type="schemaField",
source_details=None,
),
)
return None
def modify_docs_on_columns(
self,
graph: AcrylDataHubGraph,
operation: str,
schema_field_urn: str,
dataset_urn: str,
field_doc: Optional[str],
context: SourceDetails,
) -> Optional[MetadataChangeProposalWrapper]:
if context.origin == schema_field_urn:
# No need to propagate to self
return None
try:
DatasetUrn.from_string(dataset_urn)
except Exception as e:
logger.error(
f"Invalid dataset urn {dataset_urn}. {e}. Skipping documentation propagation."
)
return None
auditStamp = AuditStampClass(
time=int(time.time() * 1000.0), actor=self.actor_urn
)
source_details = context.for_metadata_attribution()
attribution: MetadataAttributionClass = MetadataAttributionClass(
source=self.action_urn,
time=auditStamp.time,
actor=self.actor_urn,
sourceDetail=source_details,
)
documentations = graph.graph.get_aspect(schema_field_urn, DocumentationClass)
if documentations:
mutation_needed = False
action_sourced = False
# we check if there are any existing documentations generated by
# this action and sourced from the same origin, if so, we update them
# otherwise, we add a new documentation entry sourced by this action
for doc_association in documentations.documentations[:]:
if doc_association.attribution and doc_association.attribution.source:
source_details_parsed: SourceDetails = SourceDetails.parse_obj(
doc_association.attribution.sourceDetail
)
if doc_association.attribution.source == self.action_urn and (
source_details_parsed.origin == context.origin
):
action_sourced = True
if doc_association.documentation != field_doc:
mutation_needed = True
if operation == "ADD" or operation == "MODIFY":
doc_association.documentation = field_doc or ""
doc_association.attribution = attribution
elif operation == "REMOVE":
documentations.documentations.remove(doc_association)
if not action_sourced:
documentations.documentations.append(
DocumentationAssociationClass(
documentation=field_doc or "",
attribution=attribution,
)
)
mutation_needed = True
else:
# no docs found, create a new one
# we don't check editableSchemaMetadata because our goal is to
# propagate documentation to downstream entities
# UI will handle resolving priorities and conflicts
documentations = DocumentationClass(
documentations=[
DocumentationAssociationClass(
documentation=field_doc or "",
attribution=attribution,
)
]
)
mutation_needed = True
if mutation_needed:
logger.debug(
f"Will emit documentation change proposal for {schema_field_urn} with {field_doc}"
)
return MetadataChangeProposalWrapper(
entityUrn=schema_field_urn,
aspect=documentations,
)
return None
def refresh_config(self, event: Optional[EventEnvelope] = None) -> None:
"""
Fetches important configuration flags from the global settings entity to
override client-side settings.
If not found, it will use the client-side values.
"""
now = time.time()
try:
if now - self.last_config_refresh > 60 or self._is_settings_change(event):
assert self.ctx.graph
entity_dict = self.ctx.graph.graph.get_entity_raw(
"urn:li:globalSettings:0", ["globalSettingsInfo"]
)
if entity_dict:
global_settings = entity_dict.get("aspects", {}).get(
"globalSettingsInfo"
)
if global_settings:
doc_propagation_config = global_settings.get("value", {}).get(
"docPropagation"
)
if doc_propagation_config:
if doc_propagation_config.get("enabled") is not None:
logger.info(
"Overwriting the asset-level config using globalSettings"
)
self.config.enabled = doc_propagation_config.get(
"enabled"
)
if (
doc_propagation_config.get("columnPropagationEnabled")
is not None
):
logger.info(
"Overwriting the column-level config using globalSettings"
)
self.config.columns_enabled = (
doc_propagation_config.get(
"columnPropagationEnabled"
)
)
except Exception:
# We don't want to fail the pipeline if we can't fetch the config
logger.warning(
"Error fetching global settings for doc propagation. Will try again in 1 minute.",
exc_info=True,
)
self.last_config_refresh = now
def _is_settings_change(self, event: Optional[EventEnvelope]) -> bool:
if event and isinstance(event.event, MetadataChangeLogClass):
entity_type = event.event.entityType
if entity_type == "globalSettings":
return True
return False
def _only_one_upstream_field(
self,
graph: AcrylDataHubGraph,
downstream_field: str,
upstream_field: str,
) -> bool:
"""
Check whether the given upstream_field is the only schemaField upstream of downstream_field.
TODO: Cache upstreams; this fetch-upstreams call is made for every downstream that must be propagated to.
"""
upstreams = graph.get_upstreams(entity_urn=downstream_field)
# Deduplicate via a set (upstream edges can repeat), then keep a list so we can index into it
upstream_fields = list(
{x for x in upstreams if guess_entity_type(x) == "schemaField"}
)
# If we found no upstreams for the downstream field, simply skip.
if not upstream_fields:
logger.debug(
f"No upstream fields found. Skipping propagation to downstream {downstream_field}"
)
return False
result = len(upstream_fields) == 1 and upstream_fields[0] == upstream_field
if not result:
logger.warning(
f"Failed check for single upstream: Found upstream fields {upstream_fields} for downstream {downstream_field}. Expecting only one upstream field: {upstream_field}"
)
return result
def act(self, event: EventEnvelope) -> None:
assert self.ctx.graph
for mcp in self.act_async(event):
self._rate_limited_emit_mcp(mcp)
def act_async(
self, event: EventEnvelope
) -> Iterable[MetadataChangeProposalWrapper]:
"""
Process the event asynchronously and return the change proposals
"""
self.refresh_config(event)
if not self.config.enabled or not self.config.columns_enabled:
logger.warning("Doc propagation is disabled. Skipping event")
return
else:
logger.debug(f"Processing event {event}")
if not self._stats.event_processing_stats:
self._stats.event_processing_stats = EventProcessingStats()
stats = self._stats.event_processing_stats
stats.start(event)
try:
doc_propagation_directive = self.should_propagate(event)
logger.debug(
f"Doc propagation directive for {event}: {doc_propagation_directive}"
)
if (
doc_propagation_directive is not None
and doc_propagation_directive.propagate
):
self._stats.increment_assets_processed(doc_propagation_directive.entity)
context = SourceDetails(
origin=doc_propagation_directive.origin,
via=doc_propagation_directive.via,
propagated=True,
actor=doc_propagation_directive.actor,
propagation_started_at=doc_propagation_directive.propagation_started_at,
propagation_depth=doc_propagation_directive.propagation_depth,
)
assert self.ctx.graph
logger.debug(f"Doc Propagation Directive: {doc_propagation_directive}")
# TODO: Put each mechanism behind a config flag to be controlled
# externally.
lineage_downstream = (
RelationshipType.LINEAGE,
DirectionType.DOWN,
) in doc_propagation_directive.relationships
lineage_upstream = (
RelationshipType.LINEAGE,
DirectionType.UP,
) in doc_propagation_directive.relationships
lineage_any = (
RelationshipType.LINEAGE,
DirectionType.ALL,
) in doc_propagation_directive.relationships
logger.debug(
f"Lineage Downstream: {lineage_downstream}, Lineage Upstream: {lineage_upstream}, Lineage Any: {lineage_any}"
)
if lineage_downstream or lineage_any:
# Step 1: Propagate to downstream entities
yield from self._propagate_to_downstreams(
doc_propagation_directive, context
)
if lineage_upstream or lineage_any:
# Step 2: Propagate to upstream entities
yield from self._propagate_to_upstreams(
doc_propagation_directive, context
)
if (
RelationshipType.SIBLING,
DirectionType.ALL,
) in doc_propagation_directive.relationships:
# Step 3: Propagate to sibling entities
yield from self._propagate_to_siblings(
doc_propagation_directive, context
)
stats.end(event, success=True)
except Exception:
logger.error(f"Error processing event {event}:", exc_info=True)
stats.end(event, success=False)
def _propagate_to_downstreams(
self, doc_propagation_directive: DocPropagationDirective, context: SourceDetails
) -> Iterable[MetadataChangeProposalWrapper]:
"""
Propagate the documentation to downstream entities.
"""
assert self.ctx.graph
downstreams = self.ctx.graph.get_downstreams(
entity_urn=doc_propagation_directive.entity
)
logger.debug(
f"Downstreams: {downstreams} for {doc_propagation_directive.entity}"
)
entity_urn = doc_propagation_directive.entity
propagated_context = SourceDetails.parse_obj(context.dict())
propagated_context.propagation_relationship = RelationshipType.LINEAGE
propagated_context.propagation_direction = DirectionType.DOWN
propagated_entities_this_hop_count = 0
if guess_entity_type(entity_urn) == "schemaField":
downstream_fields = {
x for x in downstreams if guess_entity_type(x) == "schemaField"
}
for field in downstream_fields:
schema_field_urn = Urn.from_string(field)
parent_urn = schema_field_urn.get_entity_id()[0]
field_path = schema_field_urn.get_entity_id()[1]
logger.debug(
f"Will {doc_propagation_directive.operation} documentation {doc_propagation_directive.doc_string} for {field_path} on {parent_urn}"
)
parent_entity_type = guess_entity_type(parent_urn)
if parent_entity_type == "dataset":
if self._only_one_upstream_field(
self.ctx.graph,
downstream_field=str(schema_field_urn),
upstream_field=entity_urn,
):
if (
propagated_entities_this_hop_count
>= self.config.max_propagation_fanout
):
logger.warning(
f"Exceeded max propagation fanout of {self.config.max_propagation_fanout}. Skipping propagation to downstream {field}"
)
# No need to propagate to more downstreams
return
maybe_mcp = self.modify_docs_on_columns(
self.ctx.graph,
doc_propagation_directive.operation,
field,
parent_urn,
field_doc=doc_propagation_directive.doc_string,
context=propagated_context,
)
if maybe_mcp:
propagated_entities_this_hop_count += 1
yield maybe_mcp
elif parent_entity_type == "chart":
logger.warning(
"Charts are expected to have fields that are dataset schema fields. Skipping for now..."
)
self._stats.increment_assets_impacted(field)
elif guess_entity_type(entity_urn) == "dataset":
logger.debug(
"Dataset level documentation propagation is not yet supported!"
)
def _propagate_to_upstreams(
self, doc_propagation_directive: DocPropagationDirective, context: SourceDetails
) -> Iterable[MetadataChangeProposalWrapper]:
"""
Propagate the documentation to upstream entities.
"""
assert self.ctx.graph
upstreams = self.ctx.graph.get_upstreams(
entity_urn=doc_propagation_directive.entity
)
logger.debug(f"Upstreams: {upstreams} for {doc_propagation_directive.entity}")
entity_urn = doc_propagation_directive.entity
propagated_context = SourceDetails.parse_obj(context.dict())
propagated_context.propagation_relationship = RelationshipType.LINEAGE
propagated_context.propagation_direction = DirectionType.UP
propagated_entities_this_hop_count = 0
if guess_entity_type(entity_urn) == "schemaField":
upstream_fields = {
x for x in upstreams if guess_entity_type(x) == "schemaField"
}
# We only propagate to the upstream field if there is only one
# upstream field
if len(upstream_fields) == 1:
for field in upstream_fields:
schema_field_urn = Urn.from_string(field)
parent_urn = schema_field_urn.get_entity_id()[0]
field_path = schema_field_urn.get_entity_id()[1]
logger.debug(
f"Will {doc_propagation_directive.operation} documentation {doc_propagation_directive.doc_string} for {field_path} on {parent_urn}"
)
parent_entity_type = guess_entity_type(parent_urn)
if parent_entity_type == "dataset":
if (
propagated_entities_this_hop_count
>= self.config.max_propagation_fanout
):
logger.warning(
f"Exceeded max propagation fanout of {self.config.max_propagation_fanout}. Skipping propagation to upstream {field}"
)
# No need to propagate to more upstreams
return
maybe_mcp = self.modify_docs_on_columns(
self.ctx.graph,
doc_propagation_directive.operation,
field,
parent_urn,
field_doc=doc_propagation_directive.doc_string,
context=propagated_context,
)
if maybe_mcp:
propagated_entities_this_hop_count += 1
yield maybe_mcp
elif parent_entity_type == "chart":
logger.warning(
"Charts are expected to have fields that are dataset schema fields. Skipping for now..."
)
self._stats.increment_assets_impacted(field)
elif guess_entity_type(entity_urn) == "dataset":
logger.debug(
"Dataset level documentation propagation is not yet supported!"
)
def _propagate_to_siblings(
self, doc_propagation_directive: DocPropagationDirective, context: SourceDetails
) -> Iterable[MetadataChangeProposalWrapper]:
"""
Propagate the documentation to sibling entities.
"""
assert self.ctx.graph
entity_urn = doc_propagation_directive.entity
siblings = get_unique_siblings(self.ctx.graph, entity_urn)
propagated_context = SourceDetails.parse_obj(context.dict())
propagated_context.propagation_relationship = RelationshipType.SIBLING
propagated_context.propagation_direction = DirectionType.ALL
logger.debug(f"Siblings: {siblings} for {doc_propagation_directive.entity}")
for sibling in siblings:
if (
guess_entity_type(entity_urn) == "schemaField"
and guess_entity_type(sibling) == "schemaField"
):
parent_urn = Urn.from_string(sibling).get_entity_id()[0]
self._stats.increment_assets_impacted(sibling)
maybe_mcp = self.modify_docs_on_columns(
self.ctx.graph,
doc_propagation_directive.operation,
schema_field_urn=sibling,
dataset_urn=parent_urn,
field_doc=doc_propagation_directive.doc_string,
context=propagated_context,
)
if maybe_mcp:
yield maybe_mcp
def close(self) -> None:
return

View File

@ -0,0 +1,289 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import time
from abc import abstractmethod
from enum import Enum
from functools import wraps
from typing import Any, Dict, Iterable, List, Optional, Tuple
from pydantic import validator
from pydantic.fields import Field
from pydantic.main import BaseModel
from ratelimit import limits, sleep_and_retry
import datahub.metadata.schema_classes as models
from datahub.configuration.common import ConfigModel
from datahub.emitter.mce_builder import make_schema_field_urn
from datahub.ingestion.graph.client import DataHubGraph
from datahub.ingestion.graph.filters import SearchFilterRule
from datahub.metadata.schema_classes import MetadataAttributionClass
from datahub.utilities.str_enum import StrEnum
from datahub.utilities.urns.urn import Urn, guess_entity_type
from datahub_actions.api.action_graph import AcrylDataHubGraph
SYSTEM_ACTOR = "urn:li:corpuser:__datahub_system"
class RelationshipType(StrEnum):
LINEAGE = "lineage" # signifies all types of lineage
HIERARCHY = "hierarchy" # signifies all types of hierarchy
SIBLING = "sibling" # signifies all types of sibling
class DirectionType(StrEnum):
UP = "up" # signifies upstream or parent (depending on relationship type)
DOWN = "down" # signifies downstream or child (depending on relationship type)
ALL = "all" # signifies all directions
class PropagationDirective(BaseModel):
propagate: bool
operation: str
relationships: List[Tuple[RelationshipType, DirectionType]]
entity: str = Field(
description="Entity that currently triggered the propagation directive",
)
origin: str = Field(
description="Origin entity for the association. This is the entity that triggered the propagation.",
)
via: Optional[str] = Field(
None,
description="Via entity for the association. This is the direct entity that the propagation came through.",
)
actor: Optional[str] = Field(
None,
description="Actor that triggered the propagation through the original association.",
)
propagation_started_at: Optional[int] = Field(
None,
description="Timestamp (in millis) when the original propagation event happened.",
)
propagation_depth: Optional[int] = Field(
default=0,
description="Depth of propagation. This is used to track the depth of the propagation.",
)
class SourceDetails(BaseModel):
origin: Optional[str] = Field(
None,
description="Origin entity for the documentation. This is the entity that triggered the documentation propagation.",
)
via: Optional[str] = Field(
None,
description="Via entity for the documentation. This is the direct entity that the documentation was propagated through.",
)
propagated: Optional[str] = Field(
None,
description="Indicates whether the metadata element was propagated.",
)
actor: Optional[str] = Field(
None,
description="Actor that triggered the metadata propagation.",
)
propagation_started_at: Optional[int] = Field(
None,
description="Timestamp when the metadata propagation event happened.",
)
propagation_depth: Optional[int] = Field(
default=0,
description="Depth of metadata propagation.",
)
propagation_relationship: Optional[RelationshipType] = Field(
None,
description="The relationship that the metadata was propagated through.",
)
propagation_direction: Optional[DirectionType] = Field(
None,
description="The direction that the metadata was propagated through.",
)
@validator("propagated", pre=True)
def convert_boolean_to_lowercase_string(cls, v: Any) -> Optional[str]:
if isinstance(v, bool):
return str(v).lower()
return v
@validator("propagation_depth", "propagation_started_at", pre=True)
def convert_to_int(cls, v: Any) -> Optional[int]:
if v is not None:
return int(v)
return v
def for_metadata_attribution(self) -> Dict[str, str]:
"""
Convert the SourceDetails object to a dictionary that can be used in
Metadata Attribution MCPs.
"""
result = {}
for k, v in self.dict(exclude_none=True).items():
if isinstance(v, Enum):
result[k] = v.value # Use the enum's value
elif isinstance(v, int):
result[k] = str(v) # Convert int to string
else:
result[k] = str(v) # Convert everything else to string
return result
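# Worked example (hypothetical urn): boolean and integer fields are normalised to strings,
# which is the shape MetadataAttributionClass.sourceDetail expects.
#
#   SourceDetails(
#       origin="urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
#       propagated=True,
#       propagation_depth=2,
#   ).for_metadata_attribution()
#   # -> {"origin": "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
#   #     "propagated": "true", "propagation_depth": "2"}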
class PropagationConfig(ConfigModel):
"""
Base class for all propagation configs
"""
max_propagation_depth: int = 5
max_propagation_fanout: int = 1000
max_propagation_time_millis: int = 1000 * 60 * 60 * 1 # 1 hour
rate_limit_propagated_writes: int = 15000 # 15000 writes per 15 seconds (default)
rate_limit_propagated_writes_period: int = 15 # Every 15 seconds
def get_rate_limited_emit_mcp(self, emitter: DataHubGraph) -> Any:
"""
Returns a rate limited emitter that can be used to emit metadata for propagation
"""
@sleep_and_retry
@limits(
calls=self.rate_limit_propagated_writes,
period=self.rate_limit_propagated_writes_period,
)
@wraps(emitter.emit_mcp)
def wrapper(*args, **kwargs):
return emitter.emit_mcp(*args, **kwargs)
return wrapper
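# Usage sketch (mirrors DocPropagationAction.__init__ earlier; `graph` is assumed to be the
# DataHubGraph client taken from the pipeline context):
#
#   config = PropagationConfig(rate_limit_propagated_writes=100, rate_limit_propagated_writes_period=1)
#   emit = config.get_rate_limited_emit_mcp(graph)
#   emit(mcp)  # sleep_and_retry blocks once more than 100 writes land in a one-second window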
def get_attribution_and_context_from_directive(
action_urn: str,
propagation_directive: PropagationDirective,
actor: str = SYSTEM_ACTOR,
time: Optional[int] = None,
) -> Tuple[MetadataAttributionClass, str]:
"""
Given a propagation directive, return the attribution and context for
the directive.
Attribution is the official way to track the source of metadata in
DataHub.
Context is the older way to track the source of metadata in DataHub.
We populate both to ensure compatibility with older versions of DataHub.
"""
if time is None:
# A default argument would be frozen at import time, and the parameter shadows the
# `time` module, so resolve "now" lazily via a local alias.
from time import time as _now
time = int(_now() * 1000.0)
source_detail: dict[str, str] = {
"origin": propagation_directive.origin,
"propagated": "true",
"propagation_depth": str(propagation_directive.propagation_depth),
"propagation_started_at": str(
propagation_directive.propagation_started_at
if propagation_directive.propagation_started_at
else time
),
}
if propagation_directive.relationships:
source_detail["propagation_relationship"] = propagation_directive.relationships[
0
][0].value
source_detail["propagation_direction"] = propagation_directive.relationships[0][
1
].value
if propagation_directive.actor:
source_detail["actor"] = propagation_directive.actor
else:
source_detail["actor"] = actor
if propagation_directive.via:
source_detail["via"] = propagation_directive.via
context_dict: dict[str, str] = {}
context_dict.update(source_detail)
return (
MetadataAttributionClass(
time=time,
actor=actor,
source=action_urn,
sourceDetail=source_detail,
),
json.dumps(context_dict),
)
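# Illustrative outcome (hypothetical urns): a directive whose relationships start with
# (RelationshipType.LINEAGE, DirectionType.DOWN) yields a sourceDetail roughly like
#   {"origin": "<schemaField urn>", "propagated": "true", "propagation_depth": "1",
#    "propagation_started_at": "<millis>", "propagation_relationship": "lineage",
#    "propagation_direction": "down", "actor": "urn:li:corpuser:__datahub_system"},
# and the same mapping serialised with json.dumps becomes the legacy context string.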
class SelectedAsset(BaseModel):
"""
A selected asset is a data structure that represents an asset that has been
selected for processing by a propagator.
"""
urn: str # URN of the asset that has been selected
target_entity_type: str # entity type that is being targeted by the propagator. e.g. schemaField even if asset is of type dataset
class ComposablePropagator:
@abstractmethod
def asset_filters(self) -> Dict[str, Dict[str, List[SearchFilterRule]]]:
"""
Returns a dictionary of asset filters that are used to filter the assets
based on the configuration of the action.
"""
pass
@abstractmethod
def process_one_asset(
self, asset: SelectedAsset, operation: str
) -> Iterable[PropagationDirective]:
"""
Given an asset, returns a list of propagation directives
:param asset: The selected asset. Its target_entity_type can differ from the
entity type of the asset urn, e.g. we might process a dataset while the
target entity type is a column (schemaField).
:param operation: The operation that triggered the propagation (ADD /
REMOVE)
:return: A list of PropagationDirective objects
"""
pass
def get_unique_siblings(graph: AcrylDataHubGraph, entity_urn: str) -> list[str]:
"""
Get unique siblings for the entity urn
"""
if guess_entity_type(entity_urn) == "schemaField":
parent_urn = Urn.from_string(entity_urn).get_entity_id()[0]
entity_field_path = Urn.from_string(entity_urn).get_entity_id()[1]
# Does my parent have siblings?
siblings: Optional[models.SiblingsClass] = graph.graph.get_aspect(
parent_urn,
models.SiblingsClass,
)
if siblings and siblings.siblings:
other_siblings = [x for x in siblings.siblings if x != parent_urn]
if len(other_siblings) == 1:
target_sibling = other_siblings[0]
# now we need to find the schema field in this sibling that
# matches us
if guess_entity_type(target_sibling) == "dataset":
schema_fields = graph.graph.get_aspect(
target_sibling, models.SchemaMetadataClass
)
if schema_fields:
for schema_field in schema_fields.fields:
if schema_field.fieldPath == entity_field_path:
# we found the sibling field
schema_field_urn = make_schema_field_urn(
target_sibling, schema_field.fieldPath
)
return [schema_field_urn]
return []

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,146 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
from dataclasses import dataclass
from typing import Dict, List
from pydantic import SecretStr
from ratelimit import limits, sleep_and_retry
from requests import sessions
from slack_bolt import App
from datahub.configuration.common import ConfigModel
from datahub.metadata.schema_classes import EntityChangeEventClass as EntityChangeEvent
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.utils.datahub_util import DATAHUB_SYSTEM_ACTOR_URN
from datahub_actions.utils.social_util import (
StructuredMessage,
get_message_from_entity_change_event,
get_welcome_message,
pretty_any_text,
)
logger = logging.getLogger(__name__)
@sleep_and_retry
@limits(calls=1, period=1)
def post_message(client, token, channel, text):
client.chat_postMessage(
token=token,
channel=channel,
text=text,
)
@dataclass
class SlackNotification:
@staticmethod
def get_payload(message: StructuredMessage) -> List[Dict]:
return [
{
"type": "section",
"text": {"type": "mrkdwn", "text": message.title},
},
{"type": "divider"},
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "\n".join(
[
f"*{k}*: {pretty_any_text(v, channel='slack')}"
for k, v in message.properties.items()
]
),
},
},
{"type": "divider"},
]
class SlackNotificationConfig(ConfigModel):
# default webhook posts to #actions-dev-slack-notifications on Acryl Data Slack space
bot_token: SecretStr
signing_secret: SecretStr
default_channel: str
base_url: str = "http://localhost:9002/"
suppress_system_activity: bool = True
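# Illustrative config sketch (hypothetical secrets and channel; in practice these are usually
# resolved from a secret store rather than written inline):
#
#   slack_config = SlackNotificationConfig(
#       bot_token="xoxb-<bot-token>",
#       signing_secret="<signing-secret>",
#       default_channel="#datahub-notifications",
#       base_url="https://datahub.example.com/",
#   )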
class SlackNotificationAction(Action):
def name(self):
return "SlackNotificationAction"
def close(self) -> None:
pass
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = SlackNotificationConfig.parse_obj(config_dict or {})
logger.info(f"Slack notification action configured with {action_config}")
return cls(action_config, ctx)
def __init__(self, action_config: SlackNotificationConfig, ctx: PipelineContext):
self.action_config = action_config
self.ctx = ctx
self.session = sessions.Session()
# Initializes your app with your bot token and signing secret
self.app = App(
token=self.action_config.bot_token.get_secret_value(),
signing_secret=self.action_config.signing_secret.get_secret_value(),
)
self.app.client.chat_postMessage(
token=self.action_config.bot_token.get_secret_value(),
channel=self.action_config.default_channel,
blocks=SlackNotification.get_payload(
get_welcome_message(self.action_config.base_url)
),
)
def act(self, event: EventEnvelope) -> None:
try:
message = json.dumps(json.loads(event.as_json()), indent=4)
logger.debug(f"Received event: {message}")
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
if (
event.event.auditStamp.actor == DATAHUB_SYSTEM_ACTOR_URN
and self.action_config.suppress_system_activity
):
return None
semantic_message = get_message_from_entity_change_event(
event.event,
self.action_config.base_url,
self.ctx.graph.graph if self.ctx.graph else None,
channel="slack",
)
if semantic_message:
post_message(
client=self.app.client,
token=self.action_config.bot_token.get_secret_value(),
channel=self.action_config.default_channel,
text=semantic_message,
)
else:
logger.debug("Skipping message because it didn't match our filter")
except Exception:
logger.debug("Failed to process event", exc_info=True)

View File

@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,130 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from sqlalchemy import create_engine
from datahub.emitter.mce_builder import dataset_urn_to_key
from datahub.ingestion.api.closeable import Closeable
from datahub.ingestion.source.snowflake.snowflake_config import SnowflakeConfig
from datahub.metadata.schema_classes import GlossaryNodeInfoClass, GlossaryTermInfoClass
from datahub.utilities.urns.urn import Urn
from datahub_actions.api.action_graph import AcrylDataHubGraph
logger: logging.Logger = logging.getLogger(__name__)
class SnowflakeTagHelper(Closeable):
def __init__(self, config: SnowflakeConfig):
self.config: SnowflakeConfig = config
url = self.config.get_sql_alchemy_url()
self.engine = create_engine(url, **self.config.get_options())
@staticmethod
def get_term_name_from_id(term_urn: str, graph: AcrylDataHubGraph) -> str:
term_id = Urn.from_string(term_urn).get_entity_id_as_string()
if term_id.count("-") == 4:
# needs resolution
term_info = graph.graph.get_aspect(term_urn, GlossaryTermInfoClass)
assert term_info
assert term_info.name
term_name = term_info.name
parent = term_info.parentNode
while parent:
parent_id = Urn.from_string(parent).get_entity_id_as_string()
node_info = graph.graph.get_aspect(parent, GlossaryNodeInfoClass)
assert node_info
if parent_id.count("-") == 4:
parent_name = node_info.name
parent = node_info.parentNode
else:
# terminate
parent_name = parent_id
parent = None
term_name = f"{parent_name}.{term_name}"
else:
term_name = term_id
return term_name
@staticmethod
def get_label_urn_to_tag(label_urn: str, graph: AcrylDataHubGraph) -> str:
label_urn_parsed = Urn.from_string(label_urn)
if label_urn_parsed.get_type() == "tag":
return label_urn_parsed.get_entity_id_as_string()
elif label_urn_parsed.get_type() == "glossaryTerm":
# if this looks like a guid, we want to resolve to human friendly names
term_name = SnowflakeTagHelper.get_term_name_from_id(label_urn, graph)
if term_name is not None:
# terms use `.` for separation, replace with _
return term_name.replace(".", "_").replace(" ", "_")
else:
raise ValueError(f"Invalid tag or term urn {label_urn}")
else:
raise Exception(
f"Unexpected label type: neither tag or term {label_urn_parsed.get_type()}"
)
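# Illustrative mappings (hypothetical urns):
#   get_label_urn_to_tag("urn:li:tag:pii", graph)             -> "pii"
#   get_label_urn_to_tag("urn:li:glossaryTerm:<guid>", graph) -> e.g. "Classification_Sensitive"
#     (GUID-style term ids are resolved to their human-readable "Parent.Term" path, and then
#      '.' and ' ' are replaced with '_')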
def apply_tag_or_term(
self, dataset_urn: str, tag_or_term_urn: str, graph: AcrylDataHubGraph
) -> None:
dataset_key = dataset_urn_to_key(dataset_urn)
assert dataset_key is not None
if dataset_key.platform != "snowflake":
return
tag = self.get_label_urn_to_tag(tag_or_term_urn, graph)
assert tag is not None
name_tokens = dataset_key.name.split(".")
assert len(name_tokens) == 3
self.run_query(
name_tokens[0],
name_tokens[1],
f"CREATE TAG IF NOT EXISTS {tag} COMMENT = 'Replicated Tag {tag_or_term_urn} from DataHub';",
)
self.run_query(
name_tokens[0],
name_tokens[1],
f'ALTER TABLE {name_tokens[2]} SET TAG {tag}="{tag_or_term_urn}";',
)
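# Worked example (hypothetical names): for a Snowflake dataset named "analytics.public.orders"
# and the DataHub tag urn "urn:li:tag:pii", the two calls above issue roughly:
#   USE analytics.public;
#   CREATE TAG IF NOT EXISTS pii COMMENT = 'Replicated Tag urn:li:tag:pii from DataHub';
#   ALTER TABLE orders SET TAG pii="urn:li:tag:pii";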
def remove_tag_or_term(
self, dataset_urn: str, tag_urn: str, graph: AcrylDataHubGraph
) -> None:
dataset_key = dataset_urn_to_key(dataset_urn)
assert dataset_key is not None
if dataset_key.platform != "snowflake":
return
tag = self.get_label_urn_to_tag(tag_urn, graph)
assert tag is not None
name_tokens = dataset_key.name.split(".")
assert len(name_tokens) == 3
self.run_query(
name_tokens[0],
name_tokens[1],
f"ALTER TABLE {name_tokens[2]} UNSET TAG {tag};",
)
def run_query(self, database: str, schema: str, query: str) -> None:
try:
self.engine.execute(f"USE {database}.{schema};")
self.engine.execute(query)
logger.info(f"Successfully executed query {query}")
except Exception:
logger.warning(f"Failed to execute Snowflake query: {query}", exc_info=True)
def close(self) -> None:
return

View File

@ -0,0 +1,121 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Optional
from datahub.configuration.common import ConfigModel
from datahub.ingestion.source.snowflake.snowflake_config import SnowflakeV2Config
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import EntityChangeEvent
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.plugin.action.snowflake.snowflake_util import SnowflakeTagHelper
from datahub_actions.plugin.action.tag.tag_propagation_action import (
TagPropagationAction,
TagPropagationConfig,
)
from datahub_actions.plugin.action.term.term_propagation_action import (
TermPropagationAction,
TermPropagationConfig,
)
logger = logging.getLogger(__name__)
class SnowflakeTagPropagatorConfig(ConfigModel):
snowflake: SnowflakeV2Config
tag_propagation: Optional[TagPropagationConfig] = None
term_propagation: Optional[TermPropagationConfig] = None
class SnowflakeTagPropagatorAction(Action):
def __init__(self, config: SnowflakeTagPropagatorConfig, ctx: PipelineContext):
self.config: SnowflakeTagPropagatorConfig = config
self.ctx = ctx
self.snowflake_tag_helper = SnowflakeTagHelper(self.config.snowflake)
logger.info("[Config] Snowflake tag sync enabled")
if self.config.tag_propagation:
logger.info("[Config] Will propagate DataHub Tags")
if self.config.tag_propagation.tag_prefixes:
logger.info(
f"[Config] Tag prefixes: {self.config.tag_propagation.tag_prefixes}"
)
self.tag_propagator = TagPropagationAction(self.config.tag_propagation, ctx)
if self.config.term_propagation:
logger.info("[Config] Will propagate Glossary Terms")
self.term_propagator = TermPropagationAction(
self.config.term_propagation, ctx
)
def close(self) -> None:
self.snowflake_tag_helper.close()
return
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
config = SnowflakeTagPropagatorConfig.parse_obj(config_dict or {})
return cls(config, ctx)
@staticmethod
def is_snowflake_urn(urn: str) -> bool:
return urn.startswith("urn:li:dataset:(urn:li:dataPlatform:snowflake")
def name(self) -> str:
return "SnowflakeTagPropagator"
def act(self, event: EventEnvelope) -> None:
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
assert self.ctx.graph is not None
semantic_event = event.event
if not self.is_snowflake_urn(semantic_event.entityUrn):
return
entity_to_apply = None
tag_to_apply = None
if self.tag_propagator is not None:
tag_propagation_directive = self.tag_propagator.should_propagate(
event=event
)
if (
tag_propagation_directive is not None
and tag_propagation_directive.propagate
):
entity_to_apply = tag_propagation_directive.entity
tag_to_apply = tag_propagation_directive.tag
if self.term_propagator is not None:
term_propagation_directive = self.term_propagator.should_propagate(
event=event
)
if (
term_propagation_directive is not None
and term_propagation_directive.propagate
):
entity_to_apply = term_propagation_directive.entity
tag_to_apply = term_propagation_directive.term
if entity_to_apply is not None:
assert tag_to_apply
logger.info(
f"Will {semantic_event.operation.lower()} {tag_to_apply} on Snowflake {entity_to_apply}"
)
if semantic_event.operation == "ADD":
self.snowflake_tag_helper.apply_tag_or_term(
entity_to_apply, tag_to_apply, self.ctx.graph
)
else:
self.snowflake_tag_helper.remove_tag_or_term(
entity_to_apply, tag_to_apply, self.ctx.graph
)
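
As a quick illustration of the URN gate above, `is_snowflake_urn` only matches Snowflake dataset URNs; the URNs below are made-up examples, not part of the diff.
```python
# Uses the SnowflakeTagPropagatorAction class defined above.
assert SnowflakeTagPropagatorAction.is_snowflake_urn(
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"
)
assert not SnowflakeTagPropagatorAction.is_snowflake_urn(
    "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.orders,PROD)"
)
```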


@ -0,0 +1,204 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional
import pydantic
from pydantic import BaseModel
from datahub.ingestion.api.report import Report, SupportsAsObj
from datahub.utilities.str_enum import StrEnum
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import (
ENTITY_CHANGE_EVENT_V1_TYPE,
METADATA_CHANGE_LOG_EVENT_V1_TYPE,
EntityChangeEvent,
MetadataChangeLogEvent,
)
from datahub_actions.pipeline.pipeline_context import PipelineContext
class EventProcessingStats(BaseModel):
"""
A class to represent the event-oriented processing stats for a pipeline.
Note: Might be merged into ActionStats in the future.
"""
last_seen_event_time: Optional[str] = pydantic.Field(
None, description="The event time of the last event we processed"
)
last_event_processed_time: Optional[str] = pydantic.Field(
None, description="The time at which we processed the last event"
)
last_seen_event_time_success: Optional[str] = pydantic.Field(
None, description="The event time of the last event we processed successfully"
)
last_event_processed_time_success: Optional[str] = pydantic.Field(
None, description="The time at which we processed the last event successfully"
)
last_seen_event_time_failure: Optional[str] = pydantic.Field(
None, description="The event time of the last event we processed unsuccessfully"
)
last_event_processed_time_failure: Optional[str] = pydantic.Field(
None, description="The time at which we processed the last event unsuccessfully"
)
@classmethod
def _get_event_time(cls, event: EventEnvelope) -> Optional[str]:
"""
Get the event time from the event.
"""
if event.event_type == ENTITY_CHANGE_EVENT_V1_TYPE:
if isinstance(event.event, EntityChangeEvent):
return (
datetime.fromtimestamp(
event.event.auditStamp.time / 1000.0, tz=timezone.utc
).isoformat()
if event.event.auditStamp
else None
)
elif event.event_type == METADATA_CHANGE_LOG_EVENT_V1_TYPE:
if isinstance(event.event, MetadataChangeLogEvent):
return (
datetime.fromtimestamp(
event.event.auditHeader.time / 1000.0, tz=timezone.utc
).isoformat()
if event.event.auditHeader
else None
)
return None
def start(self, event: EventEnvelope) -> None:
"""
Update the stats based on the event.
"""
self.last_event_processed_time = datetime.now(tz=timezone.utc).isoformat()
self.last_seen_event_time = self._get_event_time(event)
def end(self, event: EventEnvelope, success: bool) -> None:
"""
Update the stats based on the event.
"""
if success:
self.last_seen_event_time_success = (
self._get_event_time(event) or self.last_seen_event_time_success
)
self.last_event_processed_time_success = datetime.now(
timezone.utc
).isoformat()
else:
self.last_seen_event_time_failure = (
self._get_event_time(event) or self.last_seen_event_time_failure
)
self.last_event_processed_time_failure = datetime.now(
timezone.utc
).isoformat()
def __str__(self) -> str:
return json.dumps(self.dict(), indent=2)
class StageStatus(StrEnum):
SUCCESS = "success"
FAILURE = "failure"
RUNNING = "running"
STOPPED = "stopped"
class ActionStageReport(BaseModel):
# All stats here are only for the current run of the current stage.
# Attributes that should be aggregated across runs should be prefixed with "total_".
# Only ints can be aggregated.
start_time: int = 0
end_time: int = 0
total_assets_to_process: int = -1 # -1 if unknown
total_assets_processed: int = 0
total_actions_executed: int = 0
total_assets_impacted: int = 0
event_processing_stats: Optional[EventProcessingStats] = None
status: Optional[StageStatus] = None
def start(self) -> None:
self.start_time = int(datetime.now().timestamp() * 1000)
self.status = StageStatus.RUNNING
def end(self, success: bool) -> None:
self.end_time = int(datetime.now().timestamp() * 1000)
self.status = StageStatus.SUCCESS if success else StageStatus.FAILURE
def increment_assets_processed(self, asset: str) -> None:
# TODO: If we want to track unique assets, use a counting set.
# For now, just increment
self.total_assets_processed += 1
def increment_assets_impacted(self, asset: str) -> None:
# TODO: If we want to track unique assets, use a counting set.
# For now, just increment
self.total_assets_impacted += 1
def as_obj(self) -> dict:
return Report.to_pure_python_obj(self)
    def aggregatable_stats(self) -> Dict[str, Any]:
all_items = self.dict()
stats = {k: v for k, v in all_items.items() if k.startswith("total_")}
# If total_assets_to_process is unknown, don't include it.
if self.total_assets_to_process == -1:
stats.pop("total_assets_to_process")
# Add a few additional special cases of aggregatable stats.
if self.event_processing_stats:
for key, value in self.event_processing_stats.dict().items():
if value is not None:
stats[f"event_processing_stats.{key}"] = str(value)
return stats
class ReportingAction(Action, abc.ABC):
def __init__(self, ctx: PipelineContext):
super().__init__()
self.ctx = ctx
self.action_urn: str
if "urn:li:dataHubAction:" in ctx.pipeline_name:
# The pipeline name might get a prefix before the urn:li:... part.
# We need to remove that prefix to get the urn:li:dataHubAction part.
action_urn_part = ctx.pipeline_name.split("urn:li:dataHubAction:")[1]
self.action_urn = f"urn:li:dataHubAction:{action_urn_part}"
else:
self.action_urn = f"urn:li:dataHubAction:{ctx.pipeline_name}"
@abc.abstractmethod
def get_report(self) -> ActionStageReport:
pass
assert isinstance(ActionStageReport(), SupportsAsObj)
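
To make the reporting hooks concrete, here is a minimal, hypothetical subclass (not part of this diff) showing how `ActionStageReport` and `EventProcessingStats` are typically wired together; the class name, asset URN, and the body of `act()` are placeholders.
```python
class ExampleReportingAction(ReportingAction):
    """Hypothetical action illustrating the reporting classes defined above."""

    def __init__(self, ctx: PipelineContext):
        super().__init__(ctx)
        self.report = ActionStageReport()
        self.report.event_processing_stats = EventProcessingStats()
        self.report.start()

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
        return cls(ctx)

    def act(self, event: EventEnvelope) -> None:
        stats = self.report.event_processing_stats
        assert stats is not None
        stats.start(event)
        try:
            # ... the actual work of the action would happen here ...
            self.report.increment_assets_processed("urn:li:dataset:placeholder")
            stats.end(event, success=True)
        except Exception:
            stats.end(event, success=False)
            raise

    def get_report(self) -> ActionStageReport:
        return self.report

    def close(self) -> None:
        self.report.end(success=True)
```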


@ -0,0 +1,37 @@
# Tag Sync Action
The Tag Sync (or Tag Propagation) Action allows you to propagate tags from your assets to downstream entities. For example, you can apply a tag (like `critical`) to a dataset and have it propagate down to all of its downstream datasets.
## Configurability
You can control which tags are propagated downstream using a prefix system. For example, you can specify that only tags starting with `tier:` should be propagated downstream.
## Additions and Removals
The action supports both additions and removals of tags.
### Example Config
```yaml
name: "tag_propagation"
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
filter:
event_type: "EntityChangeEvent_v1"
action:
type: "tag_propagation"
config:
tag_prefixes:
- classification
datahub:
server: "http://localhost:8080"
```
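Under the hood, the action resolves downstream datasets and applies the tag with a propagation context through the DataHub graph client (see `tag_propagation_action.py` in this diff). A rough sketch, where `graph` is assumed to be the pipeline's graph client and the URNs are placeholders:
```python
upstream = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder
tag_urn = "urn:li:tag:classification.confidential"  # placeholder

for downstream in graph.get_downstreams(upstream):
    graph.add_tags_to_dataset(
        downstream,
        [tag_urn],
        context={"propagated": True, "origin": upstream},
    )
```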
## Caveats
- Tag Propagation is currently only supported for downstream datasets. Tags will not propagate to downstream dashboards or charts. Let us know if this is an important feature for you.


@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@ -0,0 +1,162 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import List, Optional
from pydantic import BaseModel, Field, validator
from datahub.configuration.common import ConfigModel
from datahub.emitter.mce_builder import make_tag_urn
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.event.event_registry import EntityChangeEvent
from datahub_actions.pipeline.pipeline_context import PipelineContext
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
class TagPropagationConfig(ConfigModel):
"""
Configuration model for tag propagation.
Attributes:
enabled (bool): Indicates whether tag propagation is enabled or not. Default is True.
tag_prefixes (Optional[List[str]]): Optional list of tag prefixes to restrict tag propagation.
If provided, only tags with prefixes in this list will be propagated. Default is None,
meaning all tags will be propagated.
Note:
Tag propagation allows tags to be automatically propagated to downstream entities.
Enabling tag propagation can help maintain consistent metadata across connected entities.
The `enabled` attribute controls whether tag propagation is enabled or disabled.
The `tag_prefixes` attribute can be used to specify a list of tag prefixes that define which tags
should be propagated. If no prefixes are specified (default), all tags will be propagated.
Example:
config = TagPropagationConfig(enabled=True, tag_prefixes=["urn:li:tag:"])
"""
enabled: bool = Field(
True,
description="Indicates whether tag propagation is enabled or not.",
)
tag_prefixes: Optional[List[str]] = Field(
None,
description="Optional list of tag prefixes to restrict tag propagation.",
examples=[
"urn:li:tag:classification",
],
)
@validator("tag_prefixes", each_item=True)
def tag_prefix_should_start_with_urn(cls, v: str) -> str:
if v:
return make_tag_urn(v)
return v
class TagPropagationDirective(BaseModel):
propagate: bool
tag: str
operation: str
entity: str
class TagPropagationAction(Action):
def __init__(self, config: TagPropagationConfig, ctx: PipelineContext):
self.config: TagPropagationConfig = config
self.ctx = ctx
@classmethod
def create(cls, config_dict, ctx):
config = TagPropagationConfig.parse_obj(config_dict or {})
logger.info(f"TagPropagationAction configured with {config}")
return cls(config, ctx)
def name(self) -> str:
return "TagPropagator"
def should_propagate(
self, event: EventEnvelope
) -> Optional[TagPropagationDirective]:
"""
Return a tag urn to propagate or None if no propagation is desired
"""
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
assert self.ctx.graph is not None
semantic_event = event.event
if semantic_event.category == "TAG" and (
semantic_event.operation == "ADD"
or semantic_event.operation == "REMOVE"
):
assert semantic_event.modifier, "tag urn should be present"
propagate = self.config.enabled
if self.config.tag_prefixes:
                    propagate = any(
                        semantic_event.modifier.startswith(prefix)
                        for prefix in self.config.tag_prefixes
                    )
if not propagate:
logger.debug(f"Not propagating {semantic_event.modifier}")
if propagate:
return TagPropagationDirective(
propagate=True,
tag=semantic_event.modifier,
operation=semantic_event.operation,
entity=semantic_event.entityUrn,
)
else:
return TagPropagationDirective(
propagate=False,
tag=semantic_event.modifier,
                        operation=semantic_event.operation,
entity=semantic_event.entityUrn,
)
return None
def act(self, event: EventEnvelope) -> None:
tag_propagation_directive = self.should_propagate(event)
if tag_propagation_directive is not None:
if tag_propagation_directive.propagate:
# find downstream lineage
assert self.ctx.graph
entity_urn: str = tag_propagation_directive.entity
downstreams = self.ctx.graph.get_downstreams(entity_urn)
logger.info(
f"Detected {len(downstreams)} downstreams for {entity_urn}: {downstreams}"
)
logger.info(
f"Detected {tag_propagation_directive.tag} {tag_propagation_directive.operation} on {tag_propagation_directive.entity}"
)
# apply tags to downstreams
for d in downstreams:
self.ctx.graph.add_tags_to_dataset(
d,
[tag_propagation_directive.tag],
context={
"propagated": True,
"origin": tag_propagation_directive.entity,
},
)
else:
logger.debug(f"Not propagating {tag_propagation_directive.tag}")
def close(self) -> None:
return
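
A small sketch (not part of the diff) of the prefix normalization performed by the validator above: bare prefixes are converted into tag URNs via `make_tag_urn`.
```python
from datahub_actions.plugin.action.tag.tag_propagation_action import TagPropagationConfig

config = TagPropagationConfig.parse_obj({"tag_prefixes": ["classification"]})
# The validator runs make_tag_urn on each entry, turning the bare prefix into a tag URN.
assert config.tag_prefixes == ["urn:li:tag:classification"]
```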


@ -0,0 +1,103 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import pymsteams
from pydantic import SecretStr
from ratelimit import limits, sleep_and_retry
from datahub.configuration.common import ConfigModel
from datahub.metadata.schema_classes import EntityChangeEventClass as EntityChangeEvent
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.utils.datahub_util import DATAHUB_SYSTEM_ACTOR_URN
from datahub_actions.utils.social_util import (
get_message_from_entity_change_event,
get_welcome_message,
pretty_any_text,
)
logger = logging.getLogger(__name__)
@sleep_and_retry
@limits(calls=1, period=1) # 1 call per second
def post_message(message_card, message):
message_card.text(message)
message_card.send()
class TeamsNotificationConfig(ConfigModel):
webhook_url: SecretStr
base_url: str = "http://localhost:9002/"
suppress_system_activity: bool = True
class TeamsNotificationAction(Action):
def name(self):
return "TeamsNotificationAction"
def close(self) -> None:
pass
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = TeamsNotificationConfig.parse_obj(config_dict or {})
logger.info(f"Teams notification action configured with {action_config}")
return cls(action_config, ctx)
def _new_card(self):
return pymsteams.connectorcard(
self.action_config.webhook_url.get_secret_value()
)
def __init__(self, action_config: TeamsNotificationConfig, ctx: PipelineContext):
self.action_config = action_config
self.ctx = ctx
welcome_card = self._new_card()
structured_message = get_welcome_message(self.action_config.base_url)
welcome_card.title(structured_message.title)
message_section = pymsteams.cardsection()
for k, v in structured_message.properties.items():
message_section.addFact(k, pretty_any_text(v, channel="teams"))
welcome_card.addSection(message_section)
post_message(welcome_card, structured_message.text)
def act(self, event: EventEnvelope) -> None:
try:
message = json.dumps(json.loads(event.as_json()), indent=4)
logger.debug(f"Received event: {message}")
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
if (
event.event.auditStamp.actor == DATAHUB_SYSTEM_ACTOR_URN
and self.action_config.suppress_system_activity
):
return None
semantic_message = get_message_from_entity_change_event(
event.event,
self.action_config.base_url,
self.ctx.graph.graph if self.ctx.graph else None,
channel="teams",
)
message_card = self._new_card()
post_message(message_card, semantic_message)
else:
logger.debug("Skipping message because it didn't match our filter")
except Exception as e:
logger.debug("Failed to process event", e)


@ -0,0 +1,56 @@
# Glossary Term Propagation Action
The Glossary Term Propagation Action allows you to propagate glossary terms from your assets into downstream entities.
## Use Cases
Classify datasets (or fields of datasets) and have that metadata propagate to downstream datasets with minimal manual work.
## Functionality
Propagation can be controlled via a specified list of terms or a specified list of term groups.
### Target Terms
- Given a list of "target terms", the propagation action will detect application of the target term to any field or dataset and propagate it down (as a dataset-level tag) on all downstream datasets. For example, given a target term of `Classification.Confidential` (the default), if you apply `Classification.Confidential` term to a dataset (at the dataset level or a field-level), this action will find all the downstream datasets and apply the `Classification.Confidential` tag to them at the dataset level. Note that downstream application is only at the dataset level, regardless of whether the primary application was at the field level or the dataset level.
- This action also supports term linkage. If you apply a term that is linked to the target term via inheritance, then this action will detect that application and propagate it downstream as well. For example, if the term `PersonalInformation.Email` inherits `Classification.Confidential` (the target term), and if you apply the `PersonalInformation.Email` term to a dataset (or a field in the dataset), it will be picked up by the action, and the `PersonalInformation.Email` term will be applied at the dataset level to all the downstream entities.
### Term Groups
- Given a list of "term groups", the propagation action will only propagate terms that belong to these term groups.
### Additions and Removals
The action supports propagation of term additions and removals.
## Configurability
You can control which terms are treated as target terms. Linkage to a target term is defined by your business glossary, which is completely under your control.
### Example Config
```yaml
name: "term_propagation"
source:
type: "kafka"
config:
connection:
bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
filter:
event_type: "EntityChangeEvent_v1"
action:
type: "term_propagation"
config:
target_terms:
- Classification
term_groups:
- "Personal Information"
datahub:
server: "http://localhost:8080"
```
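Internally, the action treats a term as propagatable if it is a target term itself or is linked to one via an `IsA` relationship, and then writes the term to each downstream dataset with a propagation context (see `term_propagation_action.py` in this diff). A rough sketch, where `graph` is assumed to be the pipeline's graph client and the URNs are placeholders:
```python
target_term = "urn:li:glossaryTerm:Classification.Confidential"   # placeholder
applied_term = "urn:li:glossaryTerm:PersonalInformation.Email"    # placeholder
upstream = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"  # placeholder

if applied_term == target_term or graph.check_relationship(target_term, applied_term, "IsA"):
    for downstream in graph.get_downstreams(entity_urn=upstream):
        graph.add_terms_to_dataset(
            downstream,
            [applied_term],
            context={"propagated": True, "origin": upstream},
        )
```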
## Caveats
- Term Propagation is currently only supported for downstream datasets. Terms will not propagate to downstream dashboards or charts. Let us know if this is an important feature for you.


@ -0,0 +1,13 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@ -0,0 +1,192 @@
# Copyright 2021 Acryl Data, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import List, Optional
from pydantic import BaseModel, Field
from datahub.configuration.common import ConfigModel
from datahub.metadata.schema_classes import EntityChangeEventClass as EntityChangeEvent
from datahub_actions.action.action import Action
from datahub_actions.event.event_envelope import EventEnvelope
from datahub_actions.pipeline.pipeline_context import PipelineContext
from datahub_actions.plugin.action.utils.term_resolver import GlossaryTermsResolver
logger = logging.getLogger(__name__)
class TermPropagationDirective(BaseModel):
propagate: bool
term: str
operation: str
entity: str
class TermPropagationConfig(ConfigModel):
"""
Configuration model for term propagation.
Attributes:
enabled (bool): Indicates whether term propagation is enabled or not. Default is True.
        target_terms (Optional[List[str]]): Optional list of target terms to restrict term propagation.
            If provided, only these specific terms, and terms related to them via the `isA` relationship, will be propagated.
            Default is None, meaning all terms will be propagated.
term_groups (Optional[List[str]]): Optional list of term groups to restrict term propagation.
If provided, only terms within these groups will be propagated. Default is None, meaning all term groups will be propagated.
Note:
Term propagation allows terms to be automatically propagated to downstream entities.
Enabling term propagation can help maintain consistent metadata across connected entities.
The `enabled` attribute controls whether term propagation is enabled or disabled.
The `target_terms` attribute can be used to specify a set of specific terms or all terms related to these specific terms that should be propagated.
The `term_groups` attribute can be used to specify a list of term groups to restrict propagation to.
Example:
config = TermPropagationConfig(enabled=True, target_terms=["urn:li:glossaryTerm:Sensitive"])
"""
enabled: bool = Field(
True,
description="Indicates whether term propagation is enabled or not.",
)
target_terms: Optional[List[str]] = Field(
None,
description="Optional target terms to restrict term propagation to this and all terms related to these terms.",
examples=[
"urn:li:glossaryTerm:Sensitive",
],
)
term_groups: Optional[List[str]] = Field(
None,
description="Optional list of term groups to restrict term propagation.",
examples=[
"Group1",
"Group2",
],
)
class TermPropagationAction(Action):
def __init__(self, config: TermPropagationConfig, ctx: PipelineContext):
self.config = config
self.ctx = ctx
self.term_resolver = GlossaryTermsResolver(graph=self.ctx.graph)
if self.config.target_terms:
logger.info(
f"[Config] Will propagate terms that inherit from terms {self.config.target_terms}"
)
resolved_terms = []
for t in self.config.target_terms:
if t.startswith("urn:li:glossaryTerm"):
resolved_terms.append(t)
else:
resolved_term = self.term_resolver.get_glossary_term_urn(t)
if not resolved_term:
raise Exception(f"Failed to resolve term by name {t}")
resolved_terms.append(resolved_term)
self.config.target_terms = resolved_terms
logger.info(
f"[Config] Will propagate terms that inherit from terms {self.config.target_terms}"
)
if self.config.term_groups:
resolved_nodes = []
for node in self.config.term_groups:
if node.startswith("urn:li:glossaryNode"):
resolved_nodes.append(node)
else:
resolved_node = self.term_resolver.get_glossary_node_urn(node)
if not resolved_node:
raise Exception(f"Failed to resolve node by name {node}")
resolved_nodes.append(resolved_node)
self.config.term_groups = resolved_nodes
logger.info(
f"[Config] Will propagate all terms in groups {self.config.term_groups}"
)
def name(self) -> str:
return "TermPropagator"
@classmethod
def create(cls, config_dict: dict, ctx: PipelineContext) -> "Action":
action_config = TermPropagationConfig.parse_obj(config_dict or {})
logger.info(f"Term Propagation Config action configured with {action_config}")
return cls(action_config, ctx)
def should_propagate(
self, event: EventEnvelope
) -> Optional[TermPropagationDirective]:
if event.event_type == "EntityChangeEvent_v1":
assert isinstance(event.event, EntityChangeEvent)
assert self.ctx.graph is not None
semantic_event = event.event
if (
semantic_event.category == "GLOSSARY_TERM"
and self.config is not None
and self.config.enabled
):
assert semantic_event.modifier
for target_term in self.config.target_terms or [
semantic_event.modifier
]:
# a cheap way to handle optionality and always propagate if config is not set
# Check which terms have connectivity to the target term
                    if (
                        semantic_event.modifier == target_term  # term has been directly applied
                        or self.ctx.graph.check_relationship(  # term is indirectly associated
                            target_term,
                            semantic_event.modifier,
                            "IsA",
                        )
                    ):
return TermPropagationDirective(
propagate=True,
term=semantic_event.modifier,
operation=semantic_event.operation,
entity=semantic_event.entityUrn,
)
return None
def act(self, event: EventEnvelope) -> None:
"""This method responds to changes to glossary terms and propagates them to downstream entities"""
term_propagation_directive = self.should_propagate(event)
if (
term_propagation_directive is not None
and term_propagation_directive.propagate
):
assert self.ctx.graph
# find downstream lineage
downstreams = self.ctx.graph.get_downstreams(
entity_urn=term_propagation_directive.entity
)
# apply terms to downstreams
for dataset in downstreams:
self.ctx.graph.add_terms_to_dataset(
dataset,
[term_propagation_directive.term],
context={
"propagated": True,
"origin": term_propagation_directive.entity,
},
)
logger.info(
f"Will add term {term_propagation_directive.term} to {dataset}"
)
def close(self) -> None:
return

Some files were not shown because too many files have changed in this diff.