Jake Poznanski
|
bb3fe14543
|
Pareto plot for paper
|
2025-05-15 23:57:18 +00:00 |
|
Jake Poznanski
|
4b4ba454ba
|
Update README.md
|
2025-05-15 16:17:29 -07:00 |
|
Jake Poznanski
|
f0768bba3e
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-05-15 22:50:30 +00:00 |
|
Jake Poznanski
|
c4a0fb9af5
|
Adding back in proper CI estimation
|
2025-05-15 22:50:29 +00:00 |
|
Aman Rangapur
|
d047bc6712
|
Updated README.md
|
2025-05-15 11:34:07 -07:00 |
|
Jake Poznanski
|
d17210f40d
|
Lint fix
|
2025-05-14 19:54:19 +00:00 |
|
Jake Poznanski
|
ffee4c9740
|
Big bug fix: moving the prompt to match how training was done, 2.3-point boost on olmocr-bench
|
2025-05-14 19:51:00 +00:00 |
|
Jake Poznanski
|
28966b9f14
|
Adding CDF plots
|
2025-05-14 16:57:56 +00:00 |
|
Jake Poznanski
|
2e8753af26
|
Docling runner based on CLI, but it's too slow to use. PII rule fixes
|
2025-05-14 16:31:56 +00:00 |
|
Jake Poznanski
|
74ef2b6f65
|
Fixes for some pii taggers
|
2025-05-13 16:19:50 +00:00 |
|
Jake Poznanski
|
b3b405d077
|
dedupe script
|
2025-05-12 17:02:35 +00:00 |
|
Jake Poznanski
|
e06fd622c3
|
Adjusting tagging pipeline v2
|
2025-05-10 17:43:56 +00:00 |
|
Jake Poznanski
|
1538163f6f
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-05-10 17:41:44 +00:00 |
|
Jake Poznanski
|
623c66c85c
|
Fixing up tagging pipeline
|
2025-05-10 17:41:43 +00:00 |
|
Jake Poznanski
|
1c59130b55
|
Update README.md
|
2025-05-09 14:51:18 -07:00 |
|
Jake Poznanski
|
225b705eef
|
Update README.md
|
2025-05-09 14:48:49 -07:00 |
|
Jake Poznanski
|
1854ae1269
|
A bit more work on tagging
|
2025-05-09 19:31:07 +00:00 |
|
Jake Poznanski
|
72bcfd8f31
|
doing some extra pii tagging steps
|
2025-05-09 15:40:22 +00:00 |
|
Jake Poznanski
|
9871e066b4
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-05-08 21:27:56 +00:00 |
|
Jake Poznanski
|
424052df63
|
Outputting some nice reference docs to check pii
|
2025-05-08 21:27:55 +00:00 |
|
Jake Poznanski
|
d18f3f734f
|
More pii tag checking
|
2025-05-08 20:07:21 +00:00 |
|
Jake Poznanski
|
80645c886e
|
Hypothesis checker
|
2025-05-08 17:58:50 +00:00 |
|
Jake Poznanski
|
03db04cb7e
|
Fixing handling of new lines in some test cases
|
2025-05-08 17:21:06 +00:00 |
|
Jake Poznanski
|
3aba3a5c10
|
Committing script to get stats on PII tagging
|
2025-05-08 17:02:36 +00:00 |
|
Aman Rangapur
|
6f62e05b1f
|
Merge pull request #188 from allenai/amanr/miners
added checker for `hea_foo` and miner to get `old_scans` img's
|
2025-05-07 11:41:29 -07:00 |
|
Jake Poznanski
|
9e5965a95e
|
Some PII filter
|
2025-05-06 21:22:27 +00:00 |
|
Jake Poznanski
|
ef083bf845
|
Stats fix
|
2025-05-06 21:21:06 +00:00 |
|
Jake Poznanski
|
d671be6823
|
Working on some dataset filtering
|
2025-05-06 20:49:39 +00:00 |
|
Jake Poznanski
|
da21074477
|
More nits
|
2025-05-05 20:43:03 +00:00 |
|
Jake Poznanski
|
88270e9307
|
More work on qwen25 finetune
|
2025-05-05 20:39:28 +00:00 |
|
Jake Poznanski
|
a2ec95e0f5
|
Testing out to see where we stand on qwen2.5
|
2025-05-05 17:15:09 +00:00 |
|
aman-17
|
57720564ee
|
fixed lint and style
|
2025-05-02 16:24:03 -07:00 |
|
aman-17
|
281ca51916
|
added checker for hea_foo and miner to get old scans img's
|
2025-05-02 16:22:45 -07:00 |
|
Jake Poznanski
|
97e4992a3f
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-05-02 21:51:24 +00:00 |
|
Jake Poznanski
|
dcbe6543b8
|
Report for benchmarking
|
2025-05-02 21:51:23 +00:00 |
|
Jake Poznanski
|
18de822269
|
Update README.md
|
2025-05-01 13:31:19 -07:00 |
|
Jake Poznanski
|
791983c09b
|
Tweaking some more pii detection
|
2025-05-01 17:09:05 +00:00 |
|
Jake Poznanski
|
5cc084887a
|
Rich tagger with bigger model
|
2025-05-01 09:33:27 -07:00 |
|
Jake Poznanski
|
4ed00d097b
|
Fixes for rich tagging
|
2025-04-30 14:38:35 -07:00 |
|
Jake Poznanski
|
472ee108d7
|
Lints
|
2025-04-30 21:18:59 +00:00 |
|
Jake Poznanski
|
8ef7e56c86
|
Trying a new rich tagging pipeline for PII
|
2025-04-30 21:18:22 +00:00 |
|
Jake Poznanski
|
0a320e9870
|
Some helper scripts for Aman
|
2025-04-30 18:47:10 +00:00 |
|
Jake Poznanski
|
1067f80160
|
Update README.md
|
2025-04-29 15:43:43 -07:00 |
|
Jake Poznanski
|
4e9e13e56f
|
Option in benchmark to output tests which fail on all models for debugging
|
2025-04-29 14:07:07 -07:00 |
|
Jake Poznanski
|
e51362bcc2
|
Showing benchmark scores per category, speed improvements
|
2025-04-29 13:44:05 -07:00 |
|
Jake Poznanski
|
f8808478bd
|
Adding some small changes to the tagging pipeline
|
2025-04-29 11:12:03 -07:00 |
|
Jake Poznanski
|
66d293c178
|
Decent resume/cv tagging
|
2025-04-28 15:57:20 -07:00 |
|
Jake Poznanski
|
1f66b96ffd
|
Adding openai dependency for benchmarking
|
2025-04-25 18:18:37 +00:00 |
|
Jake Poznanski
|
689bcd9e91
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-04-25 18:00:43 +00:00 |
|
Jake Poznanski
|
8ec7dbe2e0
|
Script updates
|
2025-04-25 18:00:41 +00:00 |
|