aman-17 | b5bd179128 | Merge remote-tracking branch 'origin/main' into amanr/pp-doc-layout; merge from main | 2025-03-17 12:35:46 -07:00 |
aman-17 | 8f356a18d4 | added pp_doc | 2025-03-17 12:34:45 -07:00 |
Jake Poznanski | aee030c42b | Fixing sample dataset, outputting some reports for debugging. Math is good enough for now | 2025-03-17 10:59:02 -07:00 |
Jake Poznanski | dd725636a3 | Bump version to v0.1.60 for release | 2025-03-17 08:59:18 -07:00 |
Jake Poznanski | baa00825b0 | Don't go down too low in temp | 2025-03-17 08:48:19 -07:00 |
Jake Poznanski | f2951f3f78 | Lints | 2025-03-17 08:47:57 -07:00 |
Jake Poznanski | 1e42e5ea9a | Faster and nicer equation cache | 2025-03-17 08:47:06 -07:00 |
Jake Poznanski | 1f8cc59b22 | Pipeline scales temperature automatically, increases performance ~2% | 2025-03-14 22:27:51 -07:00 |
Jake Poznanski | 4768ac4be5 | Merge branch 'main' of https://github.com/allenai/olmocr | 2025-03-14 22:32:39 +00:00 |
Jake Poznanski | 0968bd17ce | Mine headers footers | 2025-03-14 22:32:38 +00:00 |
Jake Poznanski | 7b4026233c | Benchmark script supports rel paths | 2025-03-14 13:22:12 -07:00 |
Jake Poznanski | 1270ca336a | lints | 2025-03-14 17:53:43 +00:00 |
Jake Poznanski | d7361c436e | Basic convert script | 2025-03-14 10:35:46 -07:00 |
Jake Poznanski | 142a9cbd20 | Convert script to support broader folder structures | 2025-03-14 10:12:21 -07:00 |
Jake Poznanski | 98c4283eef | Cap max workers to hopefully improve stability | 2025-03-14 10:08:30 -07:00 |
Jake Poznanski | 5f3ef510ab | Faster equation cache and checking, cleanup data script | 2025-03-14 16:40:16 +00:00 |
Jake Poznanski | 9f38a8a602 | Lints | 2025-03-13 22:29:27 +00:00 |
Jake Poznanski | 5009bb31f1 | Lints | 2025-03-13 22:26:53 +00:00 |
Jake Poznanski | acb0df32a8 | Fixes | 2025-03-13 13:15:45 -07:00 |
Jake Poznanski | 3eec2a855b | Mining math | 2025-03-13 13:11:01 -07:00 |
Jake Poznanski | 95f03e1e42 | More small tests | 2025-03-13 12:50:52 -07:00 |
Jake Poznanski | d30a070234 | Tests | 2025-03-13 12:34:56 -07:00 |
Jake Poznanski | 269650299c | Much faster and responsive math bench | 2025-03-13 10:42:25 -07:00 |
Jake Poznanski | 980121feea | Loading tests much faster in parallel | 2025-03-13 10:20:09 -07:00 |
Jake Poznanski | 7729e5a9d7 | Graphical pdf test from github | 2025-03-13 09:33:40 -07:00 |
Jake Poznanski | 154a07c211 | Math miner looks decent | 2025-03-12 16:04:03 -07:00 |
Jake Poznanski | d0b9b5b7a8 | Fixes for math mining | 2025-03-12 15:49:07 -07:00 |
Jake Poznanski | 09fd299242 | Mining | 2025-03-12 14:24:47 -07:00 |
Jake Poznanski | 3f92265a81 | Math miner working decently | 2025-03-12 14:13:17 -07:00 |
Jake Poznanski | 5387a79a2f | More tests for olmocrbench | 2025-03-12 11:59:11 -07:00 |
Jake Poznanski | 189104bc90 | Fixing escaped html bug in mathml parsing | 2025-03-12 11:18:32 -07:00 |
Jake Poznanski | 770bc364ed | Fixes for multipage | 2025-03-12 11:09:18 -07:00 |
Jake Poznanski | 0553443301 | Convert scripts and other fun | 2025-03-12 11:04:20 -07:00 |
Jake Poznanski | 8b3a9e4201 | Fixes for multipage runners | 2025-03-12 10:29:49 -07:00 |
Jake Poznanski | 743e48e4ad | More fixes | 2025-03-11 04:59:19 +00:00 |
Jake Poznanski | b2fe82db9b | Working on math compares | 2025-03-11 04:50:51 +00:00 |
Jake Poznanski | bc3a94583a | Adding some tests | 2025-03-11 03:57:12 +00:00 |
Jake Poznanski | 35cc6f110c | A few fixes for text comparisons and normalized chars | 2025-03-11 03:42:51 +00:00 |
Jake Poznanski | 4709156ce5 | Leaving with some more data, but still cases to investigate | 2025-03-10 15:53:01 -07:00 |
Jake Poznanski | 07be9ea6e3 | More math testing | 2025-03-10 21:55:33 +00:00 |
Jake Poznanski | e39c3e4613 | New method for comparing equations | 2025-03-10 21:47:49 +00:00 |
Jake Poznanski | fff40506cc | More test documents | 2025-03-10 17:09:42 +00:00 |
Jake Poznanski | 0ba56c0fa9 | Adjusting repeat test to be the "baseline" test which also looks for disallowed characters | 2025-03-10 16:53:07 +00:00 |
Jake Poznanski | a2b5ca8d41 | Better markdown table parsing | 2025-03-10 16:40:30 +00:00 |
Jake Poznanski | 3fef3f914f | Gemini support, some debugging stuff | 2025-03-10 16:26:48 +00:00 |
Jake Poznanski | fc857f9c6d | Starting on math dataset | 2025-03-07 21:30:37 +00:00 |
Jake Poznanski | d006e8f331 | Working on equation matching | 2025-03-06 16:09:26 -08:00 |
Jake Poznanski | 7003e9cfe1 | Working on a better compare function | 2025-03-06 15:16:20 -08:00 |
Jake Poznanski | e144200276 | Fix markdown parsing for mistral | 2025-03-06 13:41:51 -08:00 |
Jake Poznanski | bdc0d75799 | Adding mistral ocr to eval | 2025-03-06 13:29:56 -08:00 |