1195 Commits

Author SHA1 Message Date
Jake Poznanski
bb3fe14543 Pareto plot for paper 2025-05-15 23:57:18 +00:00
Jake Poznanski
4b4ba454ba Update README.md 2025-05-15 16:17:29 -07:00
Jake Poznanski
f0768bba3e Merge branch 'main' of https://github.com/allenai/olmocr 2025-05-15 22:50:30 +00:00
Jake Poznanski
c4a0fb9af5 Adding back in proper CI estimation 2025-05-15 22:50:29 +00:00
Aman Rangapur
d047bc6712 Updated README.md 2025-05-15 11:34:07 -07:00
Jake Poznanski
d17210f40d Lint fix 2025-05-14 19:54:19 +00:00
Jake Poznanski
ffee4c9740 Big bug fix, moving the prompt to match how training was done, 2.3 point boost on olmocr-bench 2025-05-14 19:51:00 +00:00
Jake Poznanski
28966b9f14 Adding CDF plots 2025-05-14 16:57:56 +00:00
Jake Poznanski
2e8753af26 Docling runner based on CLI, but it's too slow to use. PII rule fixes 2025-05-14 16:31:56 +00:00
Jake Poznanski
74ef2b6f65 Fixes for some pii taggers 2025-05-13 16:19:50 +00:00
Jake Poznanski
b3b405d077 dedupe script 2025-05-12 17:02:35 +00:00
Jake Poznanski
e06fd622c3 Adjusting tagging pipeline v2 2025-05-10 17:43:56 +00:00
Jake Poznanski
1538163f6f Merge branch 'main' of https://github.com/allenai/olmocr 2025-05-10 17:41:44 +00:00
Jake Poznanski
623c66c85c Fixing up tagging pipeline 2025-05-10 17:41:43 +00:00
Jake Poznanski
1c59130b55 Update README.md 2025-05-09 14:51:18 -07:00
Jake Poznanski
225b705eef Update README.md 2025-05-09 14:48:49 -07:00
Jake Poznanski
1854ae1269 A bit more work on tagging 2025-05-09 19:31:07 +00:00
Jake Poznanski
72bcfd8f31 doing some extra pii tagging steps 2025-05-09 15:40:22 +00:00
Jake Poznanski
9871e066b4 Merge branch 'main' of https://github.com/allenai/olmocr 2025-05-08 21:27:56 +00:00
Jake Poznanski
424052df63 Outputting some nice reference docs to check pii 2025-05-08 21:27:55 +00:00
Jake Poznanski
d18f3f734f More pii tag checking 2025-05-08 20:07:21 +00:00
Jake Poznanski
80645c886e Hypothesis checker 2025-05-08 17:58:50 +00:00
Jake Poznanski
03db04cb7e Fixing handling of new lines in some test cases 2025-05-08 17:21:06 +00:00
Jake Poznanski
3aba3a5c10 Committing script to get stats on PII tagging 2025-05-08 17:02:36 +00:00
Aman Rangapur
6f62e05b1f Merge pull request #188 from allenai/amanr/miners: added checker for `hea_foo` and miner to get `old_scans` img's 2025-05-07 11:41:29 -07:00
Jake Poznanski
9e5965a95e Some PII filter 2025-05-06 21:22:27 +00:00
Jake Poznanski
ef083bf845 Stats fix 2025-05-06 21:21:06 +00:00
Jake Poznanski
d671be6823 Working on some dataset filtering 2025-05-06 20:49:39 +00:00
Jake Poznanski
da21074477 More nits 2025-05-05 20:43:03 +00:00
Jake Poznanski
88270e9307 More work on qwen25 finetune 2025-05-05 20:39:28 +00:00
Jake Poznanski
a2ec95e0f5 Testing out to see where we stand on qwen2.5 2025-05-05 17:15:09 +00:00
aman-17
57720564ee fixed lint and style 2025-05-02 16:24:03 -07:00
aman-17
281ca51916 added checker for hea_foo and miner to get old scans img's 2025-05-02 16:22:45 -07:00
Jake Poznanski
97e4992a3f Merge branch 'main' of https://github.com/allenai/olmocr 2025-05-02 21:51:24 +00:00
Jake Poznanski
dcbe6543b8 Report for benchmarking 2025-05-02 21:51:23 +00:00
Jake Poznanski
18de822269 Update README.md 2025-05-01 13:31:19 -07:00
Jake Poznanski
791983c09b Tweaking some more pii detection 2025-05-01 17:09:05 +00:00
Jake Poznanski
5cc084887a Rich tagger with bigger model 2025-05-01 09:33:27 -07:00
Jake Poznanski
4ed00d097b Fixes for rich tagging 2025-04-30 14:38:35 -07:00
Jake Poznanski
472ee108d7 Lints 2025-04-30 21:18:59 +00:00
Jake Poznanski
8ef7e56c86 Trying a new rich tagging pipeline for PII 2025-04-30 21:18:22 +00:00
Jake Poznanski
0a320e9870 Some helper scripts for Aman 2025-04-30 18:47:10 +00:00
Jake Poznanski
1067f80160 Update README.md 2025-04-29 15:43:43 -07:00
Jake Poznanski
4e9e13e56f Option in benchmark to output tests which fail on all models for debugging 2025-04-29 14:07:07 -07:00
Jake Poznanski
e51362bcc2 Showing benchmark scores per category, speed improvements 2025-04-29 13:44:05 -07:00
Jake Poznanski
f8808478bd Adding some small changes to the tagging pipeline 2025-04-29 11:12:03 -07:00
Jake Poznanski
66d293c178 Decent resume/cv tagging 2025-04-28 15:57:20 -07:00
Jake Poznanski
1f66b96ffd Adding openai dependency for benchmarking 2025-04-25 18:18:37 +00:00
Jake Poznanski
689bcd9e91 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-25 18:00:43 +00:00
Jake Poznanski
8ec7dbe2e0 Script updates 2025-04-25 18:00:41 +00:00