851 Commits

Author SHA1 Message Date
Jake Poznanski
cc8e4b1863 Adding native support to convert pngs and jpgs to pdfs so the pipeline can work on them 2025-03-31 10:59:38 -07:00
Jake Poznanski
0892b1829b Merge pull request #138 from xcvil/sglang_server (feat: avoid sglang server starting with empty queue) 2025-03-28 11:45:46 -07:00
Jake Poznanski
abcf7f083a Lints 2025-03-27 22:22:25 +00:00
Jake Poznanski
cd5a93d8d5 Rendering pdfs with playwright and chromium 2025-03-27 22:18:23 +00:00
Xiaochen Zheng
ee687e25d6 Update suggested changes for qsize check 2025-03-27 23:09:50 +01:00
Jake Poznanski
9749e9559d Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-27 21:56:31 +00:00
Jake Poznanski
731aa73c70 Better synth miner script 2025-03-27 21:56:29 +00:00
Jake Poznanski
bb3395e739 Merge pull request #132 from xcvil/sglang_port (Add argparser argument for configuring SGLang server port) 2025-03-27 12:52:58 -07:00
Jake Poznanski
42be0ccd0c Too much debug spew 2025-03-26 21:03:03 +00:00
Jake Poznanski
d45c0323a4 Better equation rendering checker with more tests. 2025-03-26 18:49:48 +00:00
Jake Poznanski
b8e3034847 Trying a change to the render script 2025-03-26 18:26:06 +00:00
Jake Poznanski
2141f18f10 Adding a katex test case that should be fixed 2025-03-25 18:36:47 +00:00
Jake Poznanski
4d6a97f9fb Style fix, a few notes 2025-03-25 18:25:07 +00:00
Jake Poznanski
c36d8fd967 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-03-25 09:49:24 -07:00
Jake Poznanski
223d05aca4 Adding basic prompt template 2025-03-25 09:48:21 -07:00
xcvil
6d766307be feat: avoid sglang server starting with empty queue 2025-03-23 23:45:28 +01:00
Jake Poznanski
3edae0ac71 Normalizing out markdown stuff 2025-03-21 18:30:09 +00:00
Jake Poznanski
2417e61136 Mediod 2025-03-21 17:55:22 +00:00
Jake Poznanski
03285d90a3 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-21 17:51:31 +00:00
Jake Poznanski
1f77aab75a Some early code for mining html templates of pages, pick mediod code 2025-03-21 17:51:29 +00:00
Jake Poznanski
85054d64f1 Outputting with no document anchoring 2025-03-20 15:31:33 -07:00
Jake Poznanski
57a83d807a Simplify repo 2025-03-20 13:34:07 -07:00
Jake Poznanski
f2f0be182e Convert script supports no document anchoring mode, and parallel pipeline 2025-03-20 13:31:38 -07:00
Jake Poznanski
58276b04cb Mining reading order checkpoint, convert script to use images 2025-03-20 19:49:39 +00:00
Jake Poznanski
f79bd0d248 Cleanup review app 2025-03-20 16:36:10 +00:00
Jake Poznanski
063d4f556a Review page 2025-03-19 23:28:37 +00:00
Jake Poznanski
449900a303 Tests 2025-03-19 23:06:18 +00:00
Jake Poznanski
9e3b554f12 More html table parsing goodness 2025-03-19 21:06:52 +00:00
Jake Poznanski
2944d3b6ef More fixes 2025-03-19 20:52:00 +00:00
Jake Poznanski
16ab1a4f37 Progress on more complicated header and footers 2025-03-19 20:42:04 +00:00
Jake Poznanski
1e13ddef5a Sorting results 2025-03-19 18:57:53 +00:00
Jake Poznanski
c25e9cb084 Adding some fixes 2025-03-19 18:57:00 +00:00
Jake Poznanski
3005ebd67d Normalization 2025-03-19 18:46:07 +00:00
Jake Poznanski
8ec1ebe5ed Normalization 2025-03-19 18:40:03 +00:00
Jake Poznanski
cb4dfeba36 Fix 2025-03-19 18:33:48 +00:00
Jake Poznanski
a4605e4efc Fixing normalizing during table cell comparison 2025-03-19 18:29:42 +00:00
Jake Poznanski
17979118ba Lints 2025-03-19 18:01:53 +00:00
Jake Poznanski
b307f5a116 More robust markdown parsing 2025-03-19 18:01:02 +00:00
Jake Poznanski
53444571e9 Tests 2025-03-19 17:53:45 +00:00
Jake Poznanski
cac5ef13a9 Tests for the tests 2025-03-19 17:44:49 +00:00
Jake Poznanski
196654ed25 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-19 17:32:24 +00:00
Jake Poznanski
0a3a5efe07 Lints 2025-03-19 17:32:22 +00:00
Jake Poznanski
0afacd6ac7 Less duped tests 2025-03-19 17:32:06 +00:00
Jake Poznanski
9855f70fee Some work on table dataset 2025-03-19 17:25:22 +00:00
Jake Poznanski
14e3f6e97b Small edits 2025-03-19 09:27:41 -07:00
Jake Poznanski
46ffbe9324 smolDocling support for benchmark 2025-03-19 08:36:31 -07:00
xcvil
a6a0f21c8b Add argparser argument for configuring SGLang server port 2025-03-19 14:00:28 +01:00
Jake Poznanski
bc41ba92e7 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-18 22:35:46 +00:00
Jake Poznanski
ad82e5526f Adding url reference for tests, some mining and cleanup scripts 2025-03-18 22:35:44 +00:00
Aman Rangapur
f1945c1ecf Merge pull request #127 from allenai/amanr/pp-doc-layout (Headers and Footers pdf's) 2025-03-18 14:35:20 -07:00