145 Commits

Author SHA1 Message Date
Jake Poznanski
e06fd622c3 Adjusting tagging pipeline v2 2025-05-10 17:43:56 +00:00
Jake Poznanski
623c66c85c Fixing up tagging pipeline 2025-05-10 17:41:43 +00:00
Jake Poznanski
1854ae1269 A bit more work on tagging 2025-05-09 19:31:07 +00:00
Jake Poznanski
72bcfd8f31 doing some extra pii tagging steps 2025-05-09 15:40:22 +00:00
Jake Poznanski
424052df63 Outputting some nice reference docs to check pii 2025-05-08 21:27:55 +00:00
Jake Poznanski
d18f3f734f More pii tag checking 2025-05-08 20:07:21 +00:00
Jake Poznanski
80645c886e Hypothesis checker 2025-05-08 17:58:50 +00:00
Jake Poznanski
3aba3a5c10 Committing script to get stats on PII tagging 2025-05-08 17:02:36 +00:00
Jake Poznanski
9e5965a95e Some PII filter 2025-05-06 21:22:27 +00:00
Jake Poznanski
d671be6823 Working on some dataset filtering 2025-05-06 20:49:39 +00:00
Jake Poznanski
da21074477 More nits 2025-05-05 20:43:03 +00:00
Jake Poznanski
a2ec95e0f5 Testing out to see where we stand on qwen2.5 2025-05-05 17:15:09 +00:00
Jake Poznanski
791983c09b Tweaking some more pii detection 2025-05-01 17:09:05 +00:00
Jake Poznanski
5cc084887a Rich tagger with bigger model 2025-05-01 09:33:27 -07:00
Jake Poznanski
4ed00d097b Fixes for rich tagging 2025-04-30 14:38:35 -07:00
Jake Poznanski
472ee108d7 Lints 2025-04-30 21:18:59 +00:00
Jake Poznanski
8ef7e56c86 Trying a new rich tagging pipeline for PII 2025-04-30 21:18:22 +00:00
Jake Poznanski
0a320e9870 Some helper scripts for Aman 2025-04-30 18:47:10 +00:00
Jake Poznanski
f8808478bd Adding some small changes to the tagging pipeline 2025-04-29 11:12:03 -07:00
Jake Poznanski
66d293c178 Decent resume/cv tagging 2025-04-28 15:57:20 -07:00
Jake Poznanski
1f66b96ffd Adding openai dependency for benchmarking 2025-04-25 18:18:37 +00:00
Jake Poznanski
8ec7dbe2e0 Script updates 2025-04-25 18:00:41 +00:00
Jake Poznanski
83002a0de7 Reinit credentials 2025-04-24 20:43:54 +00:00
Jake Poznanski
2d5e1838f4 Small corrections 2025-04-24 20:31:59 +00:00
Jake Poznanski
df71dc38ce Small fix for cluster usage 2025-04-24 20:24:06 +00:00
Jake Poznanski
67a01cfcc8 Fixups for tagging pipeline 2025-04-24 20:14:42 +00:00
Jake Poznanski
c326fae03c Refactoring tagging bigly 2025-04-24 10:18:30 -07:00
Jake Poznanski
479b2c1b2d Working on a tagger 2025-04-23 15:54:49 -07:00
Jake Poznanski
717ed811e1 Cleanup 2025-04-23 14:47:00 -07:00
Jake Poznanski
97ae48c66a Making some more progress 2025-04-23 14:46:16 -07:00
Jake Poznanski
7d8e9d181a Fixing up tagging pipeline 2025-04-23 19:56:13 +00:00
Jake Poznanski
12100b420d Adding some manual structure to be filled in 2025-04-23 18:39:31 +00:00
Jake Poznanski
ee8c506d92 Example of a basic empty pipeline that I'm hoping to extend for tagging 2025-04-23 18:27:26 +00:00
mhamada-ai2
01644c4a49 Update scan_dolmadocs.py: Instruction text updates and public release question update 2025-04-22 16:16:21 -07:00
Jake Poznanski
246490f960 Lint fixes 2025-04-22 21:33:52 +00:00
Jake Poznanski
967210f23b Adjustments to task 2025-04-22 21:33:39 +00:00
Jake Poznanski
3dffeeac22 Saving prolific PID 2025-04-22 21:16:41 +00:00
Jake Poznanski
eabbe279fb Lint fixes 2025-04-16 20:14:20 +00:00
Jake Poznanski
e16f66d6c5 Working on annotation for dolma docs release 2025-04-16 19:29:45 +00:00
Jake Poznanski
9a67f50539 Doing some work on annotations again... 2025-04-15 22:27:07 +00:00
Jake Poznanski
1d0c560455 Upping version to fix issue with work queue and delimited paths 2025-04-15 18:50:13 +00:00
Jake Poznanski
786b14aef5 Final adjustments 2025-04-14 23:27:27 +00:00
Jake Poznanski
4d8a8affdb Adjusting prolific script 2025-04-14 23:21:28 +00:00
Jake Poznanski
dc2512c2f0 Adjusted annotation script 2025-04-14 20:27:06 +00:00
Jake Poznanski
ee41449ff6 Instructions updated in annotation tool 2025-04-14 19:07:13 +00:00
Jake Poznanski
590a92ec2f Ruff fix 2025-04-10 21:50:14 +00:00
Jake Poznanski
a13a50143a Formatting, fixes to annotation tool 2025-04-08 22:30:59 +00:00
Jake Poznanski
a74800f528 New flowchart based annotation tool 2025-04-08 21:04:56 +00:00
Jake Poznanski
cdc7fae4f9 Adjusting annotation script 2025-04-08 20:50:00 +00:00
Jake Poznanski
474e0ef6ed Lint fixes, adjusting qwen2.5 vl prompt 2025-04-07 21:19:36 -07:00