387 Commits

Author SHA1 Message Date
aman-17
b5bd179128 Merge remote-tracking branch 'origin/main' into amanr/pp-doc-layout (merge from main) 2025-03-17 12:35:46 -07:00
aman-17
8f356a18d4 added pp_doc 2025-03-17 12:34:45 -07:00
Jake Poznanski
aee030c42b Fixing sample dataset, outputting some reports for debugging. Math is good enough for now 2025-03-17 10:59:02 -07:00
Jake Poznanski
dd725636a3 Bump version to v0.1.60 for release 2025-03-17 08:59:18 -07:00
Jake Poznanski
baa00825b0 Don't go down too low in temp 2025-03-17 08:48:19 -07:00
Jake Poznanski
f2951f3f78 Lints 2025-03-17 08:47:57 -07:00
Jake Poznanski
1e42e5ea9a Faster and nicer equation cache 2025-03-17 08:47:06 -07:00
Jake Poznanski
1f8cc59b22 Pipeline scales temperature automatically, increases performance ~2% 2025-03-14 22:27:51 -07:00
Jake Poznanski
4768ac4be5 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-14 22:32:39 +00:00
Jake Poznanski
0968bd17ce Mine headers footers 2025-03-14 22:32:38 +00:00
Jake Poznanski
7b4026233c Benchmark script supports rel paths 2025-03-14 13:22:12 -07:00
Jake Poznanski
1270ca336a lints 2025-03-14 17:53:43 +00:00
Jake Poznanski
d7361c436e Basic convert script 2025-03-14 10:35:46 -07:00
Jake Poznanski
142a9cbd20 Convert script to support broader folder structures 2025-03-14 10:12:21 -07:00
Jake Poznanski
98c4283eef Cap max workers to hopefully improve stability 2025-03-14 10:08:30 -07:00
Jake Poznanski
5f3ef510ab Faster equation cache and checking, cleanup data script 2025-03-14 16:40:16 +00:00
Jake Poznanski
9f38a8a602 Lints 2025-03-13 22:29:27 +00:00
Jake Poznanski
5009bb31f1 Lints 2025-03-13 22:26:53 +00:00
Jake Poznanski
acb0df32a8 Fixes 2025-03-13 13:15:45 -07:00
Jake Poznanski
3eec2a855b Mining math 2025-03-13 13:11:01 -07:00
Jake Poznanski
95f03e1e42 More small tests 2025-03-13 12:50:52 -07:00
Jake Poznanski
d30a070234 Tests 2025-03-13 12:34:56 -07:00
Jake Poznanski
269650299c Much faster and responsive math bench 2025-03-13 10:42:25 -07:00
Jake Poznanski
980121feea Loading tests much faster in parallel 2025-03-13 10:20:09 -07:00
Jake Poznanski
7729e5a9d7 Graphical pdf test from github 2025-03-13 09:33:40 -07:00
Jake Poznanski
154a07c211 Math miner looks decent 2025-03-12 16:04:03 -07:00
Jake Poznanski
d0b9b5b7a8 Fixes for math mining 2025-03-12 15:49:07 -07:00
Jake Poznanski
09fd299242 Mining 2025-03-12 14:24:47 -07:00
Jake Poznanski
3f92265a81 Math miner working decently 2025-03-12 14:13:17 -07:00
Jake Poznanski
5387a79a2f More tests for olmocrbench 2025-03-12 11:59:11 -07:00
Jake Poznanski
189104bc90 Fixing escaped html bug in mathml parsing 2025-03-12 11:18:32 -07:00
Jake Poznanski
770bc364ed Fixes for multipage 2025-03-12 11:09:18 -07:00
Jake Poznanski
0553443301 Convert scripts and other fun 2025-03-12 11:04:20 -07:00
Jake Poznanski
8b3a9e4201 Fixes for multipage runners 2025-03-12 10:29:49 -07:00
Jake Poznanski
743e48e4ad More fixes 2025-03-11 04:59:19 +00:00
Jake Poznanski
b2fe82db9b Working on math compares 2025-03-11 04:50:51 +00:00
Jake Poznanski
bc3a94583a Adding some tests 2025-03-11 03:57:12 +00:00
Jake Poznanski
35cc6f110c A few fixes for text comparisons and normalized chars 2025-03-11 03:42:51 +00:00
Jake Poznanski
4709156ce5 Leaving with some more data, but still cases to investigate 2025-03-10 15:53:01 -07:00
Jake Poznanski
07be9ea6e3 More math testing 2025-03-10 21:55:33 +00:00
Jake Poznanski
e39c3e4613 New method for comparing equations 2025-03-10 21:47:49 +00:00
Jake Poznanski
fff40506cc More test documents 2025-03-10 17:09:42 +00:00
Jake Poznanski
0ba56c0fa9 Adjusting repeat test to be the "baseline" test which also looks for disallowed characters 2025-03-10 16:53:07 +00:00
Jake Poznanski
a2b5ca8d41 Better markdown table parsing 2025-03-10 16:40:30 +00:00
Jake Poznanski
3fef3f914f Gemini support, some debugging stuff 2025-03-10 16:26:48 +00:00
Jake Poznanski
fc857f9c6d Starting on math dataset 2025-03-07 21:30:37 +00:00
Jake Poznanski
d006e8f331 Working on equation matching 2025-03-06 16:09:26 -08:00
Jake Poznanski
7003e9cfe1 Working on a better compare function 2025-03-06 15:16:20 -08:00
Jake Poznanski
e144200276 Fix markdown parsing for mistral 2025-03-06 13:41:51 -08:00
Jake Poznanski
bdc0d75799 Adding mistral ocr to eval 2025-03-06 13:29:56 -08:00