746 Commits

Author SHA1 Message Date
Jake Poznanski
d0b9b5b7a8 Fixes for math mining 2025-03-12 15:49:07 -07:00
Jake Poznanski
09fd299242 Mining 2025-03-12 14:24:47 -07:00
Jake Poznanski
3f92265a81 Math miner working decently 2025-03-12 14:13:17 -07:00
Jake Poznanski
5387a79a2f More tests for olmocrbench 2025-03-12 11:59:11 -07:00
Jake Poznanski
189104bc90 Fixing escaped html bug in mathml parsing 2025-03-12 11:18:32 -07:00
Jake Poznanski
770bc364ed Fixes for multipage 2025-03-12 11:09:18 -07:00
Jake Poznanski
0553443301 Convert scripts and other fun 2025-03-12 11:04:20 -07:00
Jake Poznanski
8b3a9e4201 Fixes for multipage runners 2025-03-12 10:29:49 -07:00
Jake Poznanski
743e48e4ad More fixes 2025-03-11 04:59:19 +00:00
Jake Poznanski
b2fe82db9b Working on math compares 2025-03-11 04:50:51 +00:00
Jake Poznanski
bc3a94583a Adding some tests 2025-03-11 03:57:12 +00:00
Jake Poznanski
35cc6f110c A few fixes for text comparisons and normalized chars 2025-03-11 03:42:51 +00:00
Jake Poznanski
4709156ce5 Leaving with some more data, but still cases to investigate 2025-03-10 15:53:01 -07:00
Jake Poznanski
07be9ea6e3 More math testing 2025-03-10 21:55:33 +00:00
Jake Poznanski
e39c3e4613 New method for comparing equations 2025-03-10 21:47:49 +00:00
Jake Poznanski
fff40506cc More test documents 2025-03-10 17:09:42 +00:00
Jake Poznanski
0ba56c0fa9 Adjusting repeat test to be the "baseline" test which also looks for disallowed characters 2025-03-10 16:53:07 +00:00
Jake Poznanski
a2b5ca8d41 Better markdown table parsing 2025-03-10 16:40:30 +00:00
Jake Poznanski
3fef3f914f Gemini support, some debugging stuff 2025-03-10 16:26:48 +00:00
Jake Poznanski
fc857f9c6d Starting on math dataset 2025-03-07 21:30:37 +00:00
Jake Poznanski
d006e8f331 Working on equation matching 2025-03-06 16:09:26 -08:00
Jake Poznanski
7003e9cfe1 Working on a better compare function 2025-03-06 15:16:20 -08:00
Jake Poznanski
e144200276 Fix markdown parsing for mistral 2025-03-06 13:41:51 -08:00
Jake Poznanski
bdc0d75799 Adding mistral ocr to eval 2025-03-06 13:29:56 -08:00
Jake Poznanski
4053ea58a4 Work on image matching 2025-03-06 13:11:08 -08:00
Jake Poznanski
b03d840238 Better error handling on eqn rendering 2025-03-06 11:59:20 -08:00
Jake Poznanski
438e68ec68 Some more math stuff 2025-03-06 11:00:50 -08:00
Jake Poznanski
7f36ac86f3 First math tests 2025-03-06 10:34:05 -08:00
Jake Poznanski
b62ccc25dd Equation rendering code, first pass 2025-03-06 09:59:36 -08:00
Jake Poznanski
9be696fa30 Adding a trailing repetition test 2025-03-06 08:56:16 -08:00
Jake Poznanski
07466e1ae4 Stats tests 2025-03-06 08:18:05 -08:00
Jake Poznanski
eeb2733c9e Marker rerun, stats changes 2025-03-06 07:55:44 -08:00
Jake Poznanski
50e55f45ab Conversion fixes 2025-03-05 15:31:45 -08:00
Jake Poznanski
fb0a729fe6 Better convert script 2025-03-05 14:31:39 -08:00
Jake Poznanski
fa68c6b6ce Better conversion script, run on more things 2025-03-05 14:16:29 -08:00
Jake Poznanski
c9ecd8e040 Need those chat templates 2025-03-05 14:01:14 -08:00
Jake Poznanski
5611d79bb2 Model runners 2025-03-05 13:55:40 -08:00
Jake Poznanski
5cb32c3289 Convert script work with server backends 2025-03-05 13:33:39 -08:00
Jake Poznanski
87875b3e2f Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-03-05 12:33:02 -08:00
Jake Poznanski
2982526a10 Convert scripts for benchmark 2025-03-05 12:03:34 -08:00
Jake Poznanski
dbbe6cea11 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-05 19:37:10 +00:00
Jake Poznanski
abeaf028fd Docker file builds faster now 2025-03-05 19:37:09 +00:00
Jake Poznanski
1545a6d515 Adding more work on diffs 2025-03-04 15:08:59 -08:00
Jake Poznanski
004486f014 Nice tables support 2025-03-04 14:22:03 -08:00
Jake Poznanski
3a0bcb6afd Better table tests 2025-03-04 14:04:50 -08:00
Jake Poznanski
748fd62e8a Adding basic table relative tests 2025-03-04 13:34:33 -08:00
Jake Poznanski
76476f9992 Synth rendering ideas 2025-03-04 09:59:51 -08:00
Jake Poznanski
c4f6b11834 Fixing the mine diffs script, but it still doesn't work great 2025-03-04 09:11:53 -08:00
Jake Poznanski
fcb1eab98f Consistent ordering on convert, with data dir script 2025-03-04 08:39:35 -08:00
Jake Poznanski
ecac3847e4 Making a nicer warning message when waiting for sglang server 2025-03-04 08:28:15 -08:00