Jake Poznanski
|
e856e9de1d
|
Test mining not including line numbers
|
2025-04-02 23:07:32 +00:00 |
|
Jake Poznanski
|
2614fc9050
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-04-02 21:46:35 +00:00 |
|
Jake Poznanski
|
a96f1541c4
|
Hopefully avoiding comparison issues now
|
2025-04-02 21:46:34 +00:00 |
|
Jake Poznanski
|
46ca990663
|
Merge branch 'main' of https://github.com/allenai/olmocr into main
|
2025-04-02 14:46:13 -07:00 |
|
Jake Poznanski
|
0d94d15341
|
Test validation
|
2025-04-02 14:46:07 -07:00 |
|
Jake Poznanski
|
b8b780faca
|
More mining of synthetic tests code
|
2025-04-02 21:39:50 +00:00 |
|
Jake Poznanski
|
360b1be07c
|
Better filtering of tests
|
2025-04-02 21:24:00 +00:00 |
|
Jake Poznanski
|
6d3a7d634e
|
Adding autorender if katex into synthetic pipeline
|
2025-04-02 21:14:14 +00:00 |
|
Jake Poznanski
|
4604b59661
|
Synth mining
|
2025-04-02 20:25:16 +00:00 |
|
Jake Poznanski
|
69b0222697
|
Improving miner script
|
2025-04-02 20:12:06 +00:00 |
|
Jake Poznanski
|
841ce72c19
|
Miner improvements
|
2025-04-02 18:49:43 +00:00 |
|
Jake Poznanski
|
97376493fd
|
More tests
|
2025-04-02 18:39:51 +00:00 |
|
Jake Poznanski
|
748ab95751
|
Miner unit tests for duplicate absent tests
|
2025-04-02 18:12:05 +00:00 |
|
Jake Poznanski
|
594f47306b
|
Synth miner coming together more
|
2025-04-02 18:02:39 +00:00 |
|
Jake Poznanski
|
fb8b23d506
|
Small adjustments to synthetic data pipeline
|
2025-04-02 17:46:48 +00:00 |
|
Jake Poznanski
|
678c000685
|
Nicer Claude prompt for synth data gen
|
2025-04-01 22:42:09 +00:00 |
|
Jake Poznanski
|
5c98a47eaa
|
Mining upgrades
|
2025-04-01 22:22:19 +00:00 |
|
Jake Poznanski
|
a34b158ebf
|
Lints
|
2025-04-01 20:05:55 +00:00 |
|
Jake Poznanski
|
83ae61014c
|
Scan dolma docs improvements for PII review
|
2025-04-01 20:03:15 +00:00 |
|
Jake Poznanski
|
bc78e0d8a0
|
Adding feedback
|
2025-04-01 18:35:04 +00:00 |
|
Jake Poznanski
|
213252f048
|
A few improvements to the dolma doc viewer script
|
2025-04-01 18:25:40 +00:00 |
|
Jake Poznanski
|
3ca39abd9b
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-04-01 18:11:09 +00:00 |
|
Jake Poznanski
|
7e46626452
|
Update README.md
|
2025-03-31 13:50:07 -07:00 |
|
Jake Poznanski
|
0d21ade0d8
|
Unused import
|
2025-03-31 13:30:20 -07:00 |
|
Jake Poznanski
|
b64fd19db3
|
Cleaning up code for image to pdf conversion
|
2025-03-31 13:28:30 -07:00 |
|
Jake Poznanski
|
cc8e4b1863
|
Adding native support to convert pngs and jpgs to pdfs so the pipeline can work on them
|
2025-03-31 10:59:38 -07:00 |
|
Jake Poznanski
|
0892b1829b
|
Merge pull request #138 from xcvil/sglang_server
feat: avoid sglang server starting with empty queue
|
2025-03-28 11:45:46 -07:00 |
|
Jake Poznanski
|
9b119c81bd
|
First attempt at mining actual test cases
|
2025-03-27 22:43:08 +00:00 |
|
Jake Poznanski
|
abcf7f083a
|
Lints
|
2025-03-27 22:22:25 +00:00 |
|
Jake Poznanski
|
cd5a93d8d5
|
Rendering pdfs with playwright and chromium
|
2025-03-27 22:18:23 +00:00 |
|
Xiaochen Zheng
|
ee687e25d6
|
Update suggested changes for qsize check
|
2025-03-27 23:09:50 +01:00 |
|
Jake Poznanski
|
9749e9559d
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-27 21:56:31 +00:00 |
|
Jake Poznanski
|
731aa73c70
|
Better synth miner script
|
2025-03-27 21:56:29 +00:00 |
|
Jake Poznanski
|
bb3395e739
|
Merge pull request #132 from xcvil/sglang_port
Add argparser argument for configuring SGLang server port
|
2025-03-27 12:52:58 -07:00 |
|
Jake Poznanski
|
42be0ccd0c
|
Too much debug spew
|
2025-03-26 21:03:03 +00:00 |
|
Jake Poznanski
|
d45c0323a4
|
Better equation rendering checker with more tests.
|
2025-03-26 18:49:48 +00:00 |
|
Jake Poznanski
|
b8e3034847
|
Trying a change to the render script
|
2025-03-26 18:26:06 +00:00 |
|
Jake Poznanski
|
2141f18f10
|
Adding a katex test case that should be fixed
|
2025-03-25 18:36:47 +00:00 |
|
Jake Poznanski
|
4d6a97f9fb
|
Style fix, a few notes
|
2025-03-25 18:25:07 +00:00 |
|
Jake Poznanski
|
c36d8fd967
|
Merge branch 'main' of https://github.com/allenai/olmocr into main
|
2025-03-25 09:49:24 -07:00 |
|
Jake Poznanski
|
223d05aca4
|
Adding basic prompt template
|
2025-03-25 09:48:21 -07:00 |
|
xcvil
|
6d766307be
|
feat: avoid sglang server starting with empty queue
|
2025-03-23 23:45:28 +01:00 |
|
Jake Poznanski
|
3edae0ac71
|
Normalizing out markdown stuff
|
2025-03-21 18:30:09 +00:00 |
|
Jake Poznanski
|
2417e61136
|
Medoid
|
2025-03-21 17:55:22 +00:00 |
|
Jake Poznanski
|
03285d90a3
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-21 17:51:31 +00:00 |
|
Jake Poznanski
|
1f77aab75a
|
Some early code for mining html templates of pages, pick medoid code
|
2025-03-21 17:51:29 +00:00 |
|
Jake Poznanski
|
85054d64f1
|
Outputting with no document anchoring
|
2025-03-20 15:31:33 -07:00 |
|
Jake Poznanski
|
57a83d807a
|
Simplify repo
|
2025-03-20 13:34:07 -07:00 |
|
Jake Poznanski
|
f2f0be182e
|
Convert script supports no document anchoring mode, and parallel pipeline
|
2025-03-20 13:31:38 -07:00 |
|
Jake Poznanski
|
58276b04cb
|
Mining reading order checkpoint, convert script to use images
|
2025-03-20 19:49:39 +00:00 |
|