1191 Commits

Author SHA1 Message Date
Jake Poznanski
df71dc38ce Small fix for cluster usage 2025-04-24 20:24:06 +00:00
Jake Poznanski
67a01cfcc8 Fixups for tagging pipeline 2025-04-24 20:14:42 +00:00
Jake Poznanski
c326fae03c Refactoring tagging bigly 2025-04-24 10:18:30 -07:00
Jake Poznanski
811d267bd5 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-04-23 15:55:04 -07:00
Jake Poznanski
479b2c1b2d Working on a tagger 2025-04-23 15:54:49 -07:00
Jake Poznanski
717ed811e1 Cleanup 2025-04-23 14:47:00 -07:00
Jake Poznanski
97ae48c66a Making some more progress 2025-04-23 14:46:16 -07:00
aman-17
2a4522e7e5 fixed minor bug 2025-04-23 14:41:09 -07:00
aman-17
076f3e2e04 fixed style 2025-04-23 14:38:19 -07:00
aman-17
b095be0fed added checker for old_scans_math 2025-04-23 14:37:42 -07:00
Aman Rangapur
85b40f46ce
Updated bench README.md
Cleaned old scans tests and removed [] and other symbols.
2025-04-23 13:53:24 -07:00
Jake Poznanski
7d8e9d181a Fixing up tagging pipeline 2025-04-23 19:56:13 +00:00
Jake Poznanski
12100b420d Adding some manual structure to be filled in 2025-04-23 18:39:31 +00:00
Jake Poznanski
ee8c506d92 Example of a basic empty pipeline that I'm hoping to extend for tagging 2025-04-23 18:27:26 +00:00
Jake Poznanski
582518f1e8
Merge pull request #181 from mhamada-ai2/patch-1
Update scan_dolmadocs.py
2025-04-23 09:48:08 -07:00
mhamada-ai2
01644c4a49
Update scan_dolmadocs.py
Instruction text updates and public release question update
2025-04-22 16:16:21 -07:00
Jake Poznanski
887efac133 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-22 21:33:53 +00:00
Jake Poznanski
246490f960 Lint fixes 2025-04-22 21:33:52 +00:00
Jake Poznanski
967210f23b Adjustments to task 2025-04-22 21:33:39 +00:00
Jake Poznanski
3dffeeac22 Saving prolific PID 2025-04-22 21:16:41 +00:00
Aman Rangapur
622279850d
Merge pull request #179 from allenai/amanr/long_tiny_text
Added Miner for long tiny text
2025-04-22 14:00:26 -07:00
Jake Poznanski
b20a4886f9 README for benchmark 2025-04-22 20:35:11 +00:00
aman-17
0926dacc59 fixed style 2025-04-21 17:42:32 -07:00
aman-17
6845517761 added miner 2025-04-21 17:41:16 -07:00
Jake Poznanski
b897bf1414 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-18 15:47:32 +00:00
Jake Poznanski
f0992b95e1 Better staggering of downloads 2025-04-18 15:47:31 +00:00
Jake Poznanski
dd92c75c1f Fixing CI again 2025-04-17 14:43:46 -07:00
Jake Poznanski
cd79b202ed Fixing gh actions 2025-04-17 14:32:43 -07:00
Jake Poznanski
8f46b6e966 Running more tests in CI 2025-04-17 14:26:06 -07:00
Jake Poznanski
6fefc98f77 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-04-17 13:51:51 -07:00
Jake Poznanski
5aa6a9f1a3 Fixing olmocr_pipeline in converter 2025-04-17 13:51:49 -07:00
Jake Poznanski
858cf69507 Bumping version 2025-04-17 17:00:01 +00:00
Jake Poznanski
10cb6aad26 Updating pipeline to take cloud storage model names and paths, as well as local directory 2025-04-17 09:59:28 -07:00
Jake Poznanski
e3617130ae Update README.md 2025-04-16 18:46:29 -07:00
Jake Poznanski
ac8c5369c9 Update README.md 2025-04-16 18:43:32 -07:00
Jake Poznanski
df657575b6 Update README.md 2025-04-16 17:02:32 -07:00
Jake Poznanski
ca6e1427c1 Adding some extra unit tests on some math cases I wasn't sure of 2025-04-16 23:44:48 +00:00
Jake Poznanski
7a638c74c9 Adding some more options to prompt chatgpt 2025-04-16 22:47:28 +00:00
Jake Poznanski
eabbe279fb Lint fixes 2025-04-16 20:14:20 +00:00
Jake Poznanski
7f822607c0
Merge pull request #173 from allenai/amanr/olmocr-bench-old_scans
Added files of old_scans and old_scans_math for bench
2025-04-16 13:01:02 -07:00
Jake Poznanski
e16f66d6c5 Working on annotation for dolma docs release 2025-04-16 19:29:45 +00:00
aman-17
b85b71c2a5 removed old_scans 2025-04-15 15:44:26 -07:00
Jake Poznanski
9a67f50539 Doing some work on annotations again... 2025-04-15 22:27:07 +00:00
aman-17
2622a09a45 renamed processing_old_scans to mine_old_scans 2025-04-15 15:20:42 -07:00
aman-17
48de825e3a added old_scans_math miner 2025-04-15 15:18:31 -07:00
aman-17
c72b8fb47c fixed style and lint 2025-04-15 15:14:00 -07:00
aman-17
bc89f90216 removed convert file 2025-04-15 15:12:35 -07:00
aman-17
8abc475a0b added old_scans and old_scans math miners and review app 2025-04-15 15:11:20 -07:00
Jake Poznanski
1d0c560455 Upping version to fix issue with work queue and delimited paths 2025-04-15 18:50:13 +00:00
aman-17
7703f0c9fa update 2025-04-14 19:40:17 -07:00