877 Commits

Author SHA1 Message Date
Jake Poznanski
e856e9de1d Test mining not including line numbers 2025-04-02 23:07:32 +00:00
Jake Poznanski
2614fc9050 Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-02 21:46:35 +00:00
Jake Poznanski
a96f1541c4 Hopefully avoiding comparison issues now 2025-04-02 21:46:34 +00:00
Jake Poznanski
46ca990663 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-04-02 14:46:13 -07:00
Jake Poznanski
0d94d15341 Test validation 2025-04-02 14:46:07 -07:00
Jake Poznanski
b8b780faca More mining of synthetic tests code 2025-04-02 21:39:50 +00:00
Jake Poznanski
360b1be07c Better filtering of tests 2025-04-02 21:24:00 +00:00
Jake Poznanski
6d3a7d634e Adding autorender if katex into synthetic pipeline 2025-04-02 21:14:14 +00:00
Jake Poznanski
4604b59661 Synth mining 2025-04-02 20:25:16 +00:00
Jake Poznanski
69b0222697 Improving miner script 2025-04-02 20:12:06 +00:00
Jake Poznanski
841ce72c19 Miner improvements 2025-04-02 18:49:43 +00:00
Jake Poznanski
97376493fd More tests 2025-04-02 18:39:51 +00:00
Jake Poznanski
748ab95751 Miner unit tests for duplicate absent tests 2025-04-02 18:12:05 +00:00
Jake Poznanski
594f47306b Synth miner coming together more 2025-04-02 18:02:39 +00:00
Jake Poznanski
fb8b23d506 Small adjustments to synthetic data pipeline 2025-04-02 17:46:48 +00:00
Jake Poznanski
678c000685 Nicer claude prompt for synth data gen 2025-04-01 22:42:09 +00:00
Jake Poznanski
5c98a47eaa Mining upgrades 2025-04-01 22:22:19 +00:00
Jake Poznanski
a34b158ebf Lints 2025-04-01 20:05:55 +00:00
Jake Poznanski
83ae61014c Scan dolma docs improvements for PII review 2025-04-01 20:03:15 +00:00
Jake Poznanski
bc78e0d8a0 Adding feedback 2025-04-01 18:35:04 +00:00
Jake Poznanski
213252f048 A few improvements to the dolma doc viewer script 2025-04-01 18:25:40 +00:00
Jake Poznanski
3ca39abd9b Merge branch 'main' of https://github.com/allenai/olmocr 2025-04-01 18:11:09 +00:00
Jake Poznanski
7e46626452 Update README.md 2025-03-31 13:50:07 -07:00
Jake Poznanski
0d21ade0d8 Unused import 2025-03-31 13:30:20 -07:00
Jake Poznanski
b64fd19db3 Cleaning up code for image to pdf conversion 2025-03-31 13:28:30 -07:00
Jake Poznanski
cc8e4b1863 Adding native support to convert pngs and jpgs to pdfs so the pipeline can work on them 2025-03-31 10:59:38 -07:00
Jake Poznanski
0892b1829b Merge pull request #138 from xcvil/sglang_server (feat: avoid sglang server starting with empty queue) 2025-03-28 11:45:46 -07:00
Jake Poznanski
9b119c81bd First attempt at mining actual test cases 2025-03-27 22:43:08 +00:00
Jake Poznanski
abcf7f083a Lints 2025-03-27 22:22:25 +00:00
Jake Poznanski
cd5a93d8d5 Rendering pdfs with playwright and chromium 2025-03-27 22:18:23 +00:00
Xiaochen Zheng
ee687e25d6 Update suggested changes for qsize check 2025-03-27 23:09:50 +01:00
Jake Poznanski
9749e9559d Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-27 21:56:31 +00:00
Jake Poznanski
731aa73c70 Better synth miner script 2025-03-27 21:56:29 +00:00
Jake Poznanski
bb3395e739 Merge pull request #132 from xcvil/sglang_port (Add argparser argument for configuring SGLang server port) 2025-03-27 12:52:58 -07:00
Jake Poznanski
42be0ccd0c Too much debug spew 2025-03-26 21:03:03 +00:00
Jake Poznanski
d45c0323a4 Better equation rendering checker with more tests. 2025-03-26 18:49:48 +00:00
Jake Poznanski
b8e3034847 Trying a change to the render script 2025-03-26 18:26:06 +00:00
Jake Poznanski
2141f18f10 Adding a katex test case that should be fixed 2025-03-25 18:36:47 +00:00
Jake Poznanski
4d6a97f9fb Style fix, a few notes 2025-03-25 18:25:07 +00:00
Jake Poznanski
c36d8fd967 Merge branch 'main' of https://github.com/allenai/olmocr into main 2025-03-25 09:49:24 -07:00
Jake Poznanski
223d05aca4 Adding basic prompt template 2025-03-25 09:48:21 -07:00
xcvil
6d766307be feat: avoid sglang server starting with empty queue 2025-03-23 23:45:28 +01:00
Jake Poznanski
3edae0ac71 Normalizing out markdown stuff 2025-03-21 18:30:09 +00:00
Jake Poznanski
2417e61136 Mediod 2025-03-21 17:55:22 +00:00
Jake Poznanski
03285d90a3 Merge branch 'main' of https://github.com/allenai/olmocr 2025-03-21 17:51:31 +00:00
Jake Poznanski
1f77aab75a Some early code for mining html templates of pages, pick mediod code 2025-03-21 17:51:29 +00:00
Jake Poznanski
85054d64f1 Outputting with no document anchoring 2025-03-20 15:31:33 -07:00
Jake Poznanski
57a83d807a Simplify repo 2025-03-20 13:34:07 -07:00
Jake Poznanski
f2f0be182e Convert script supports no document anchoring mode, and parallel pipeline 2025-03-20 13:31:38 -07:00
Jake Poznanski
58276b04cb Mining reading order checkpoint, convert script to use images 2025-03-20 19:49:39 +00:00