1015 Commits

Author SHA1 Message Date
aman-17
cdbf908071 added PII checker 2025-05-12 12:49:27 -07:00
aman-17
57720564ee fixed lint and style 2025-05-02 16:24:03 -07:00
aman-17
281ca51916 added checker for hea_foo and miner to get old scans img's 2025-05-02 16:22:45 -07:00
Jake Poznanski
97e4992a3f Merge branch 'main' of https://github.com/allenai/olmocr 2025-05-02 21:51:24 +00:00
Jake Poznanski
dcbe6543b8 Report for benchmarking 2025-05-02 21:51:23 +00:00
Jake Poznanski
18de822269 Update README.md 2025-05-01 13:31:19 -07:00
Jake Poznanski
791983c09b Tweaking some more pii detection 2025-05-01 17:09:05 +00:00
Jake Poznanski
5cc084887a Rich tagger with bigger model 2025-05-01 09:33:27 -07:00
Jake Poznanski
4ed00d097b Fixes for rich tagging 2025-04-30 14:38:35 -07:00
Jake Poznanski
472ee108d7 Lints 2025-04-30 21:18:59 +00:00
Jake Poznanski
8ef7e56c86 Trying a new rich tagging pipeline for PII 2025-04-30 21:18:22 +00:00
Jake Poznanski
0a320e9870 Some helper scripts for Aman 2025-04-30 18:47:10 +00:00
Jake Poznanski
1067f80160 Update README.md 2025-04-29 15:43:43 -07:00
Jake Poznanski
4e9e13e56f Option in benchmark to output tests which fail on all models for debugging 2025-04-29 14:07:07 -07:00
Jake Poznanski
e51362bcc2 Showing benchmark scores per category, speed improvements 2025-04-29 13:44:05 -07:00
Jake Poznanski
f8808478bd Adding some small changes to the tagging pipeline 2025-04-29 11:12:03 -07:00
Jake Poznanski
66d293c178 Decent resume/cv tagging 2025-04-28 15:57:20 -07:00
Jake Poznanski
1f66b96ffd Adding openai dependecy for benchmarking 2025-04-25 18:18:37 +00:00
Jake Poznanski
689bcd9e91 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-25 18:00:43 +00:00
Jake Poznanski
8ec7dbe2e0 Script updates 2025-04-25 18:00:41 +00:00
Aman Rangapur
a7db2bd160 Merge pull request #183 from allenai/amanr/bench_checkers 2025-04-24 17:20:48 -07:00
    Added checker for old_scans_math
aman-17
c7220ce460 Merge remote-tracking branch 'origin/main' into amanr/bench_checkers 2025-04-24 17:09:53 -07:00
    merge from main
Jake Poznanski
83002a0de7 Reinit credentials 2025-04-24 20:43:54 +00:00
Jake Poznanski
2d5e1838f4 Small corrections 2025-04-24 20:31:59 +00:00
Jake Poznanski
df71dc38ce Small fix for cluster usage 2025-04-24 20:24:06 +00:00
Jake Poznanski
67a01cfcc8 FIxups for tagging pipeline 2025-04-24 20:14:42 +00:00
Jake Poznanski
c326fae03c Refactoring tagging bigly 2025-04-24 10:18:30 -07:00
Jake Poznanski
811d267bd5 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-04-23 15:55:04 -07:00
Jake Poznanski
479b2c1b2d Working on a tagger 2025-04-23 15:54:49 -07:00
Jake Poznanski
717ed811e1 Cleanup 2025-04-23 14:47:00 -07:00
Jake Poznanski
97ae48c66a Making some more progress 2025-04-23 14:46:16 -07:00
aman-17
2a4522e7e5 fixed minor bug 2025-04-23 14:41:09 -07:00
aman-17
076f3e2e04 fixed style 2025-04-23 14:38:19 -07:00
aman-17
b095be0fed added checker for old_scans_math 2025-04-23 14:37:42 -07:00
Aman Rangapur
85b40f46ce Updated bench README.md 2025-04-23 13:53:24 -07:00
    Cleaned old scans tests and removed [] and other symbols.
Jake Poznanski
7d8e9d181a Fixing up tagging pipeline 2025-04-23 19:56:13 +00:00
Jake Poznanski
12100b420d Adding some manual structure to be filled in 2025-04-23 18:39:31 +00:00
Jake Poznanski
ee8c506d92 Example of a basic empty pipeline that I'm hoping to extend for tagging 2025-04-23 18:27:26 +00:00
Jake Poznanski
582518f1e8 Merge pull request #181 from mhamada-ai2/patch-1 2025-04-23 09:48:08 -07:00
    Update scan_dolmadocs.py
mhamada-ai2
01644c4a49 Update scan_dolmadocs.py 2025-04-22 16:16:21 -07:00
    Instruction text updates and public release question update
Jake Poznanski
887efac133 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-22 21:33:53 +00:00
Jake Poznanski
246490f960 Lint fixes 2025-04-22 21:33:52 +00:00
Jake Poznanski
967210f23b Adjustments to task 2025-04-22 21:33:39 +00:00
Jake Poznanski
3dffeeac22 Saving prolific PID 2025-04-22 21:16:41 +00:00
Aman Rangapur
622279850d Merge pull request #179 from allenai/amanr/long_tiny_text 2025-04-22 14:00:26 -07:00
    Added Miner for long tiny text
Jake Poznanski
b20a4886f9 README for benchmark 2025-04-22 20:35:11 +00:00
aman-17
0926dacc59 fixed style 2025-04-21 17:42:32 -07:00
aman-17
6845517761 added miner 2025-04-21 17:41:16 -07:00
Jake Poznanski
b897bf1414 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-18 15:47:32 +00:00
Jake Poznanski
f0992b95e1 Better staggering of downloads 2025-04-18 15:47:31 +00:00