Jake Poznanski
|
cc8e4b1863
|
Adding native support to convert pngs and jpgs to pdfs so the pipeline can work on them
|
2025-03-31 10:59:38 -07:00 |
|
Jake Poznanski
|
0892b1829b
|
Merge pull request #138 from xcvil/sglang_server
feat: avoid sglang server starting with empty queue
|
2025-03-28 11:45:46 -07:00 |
|
Jake Poznanski
|
abcf7f083a
|
Lints
|
2025-03-27 22:22:25 +00:00 |
|
Jake Poznanski
|
cd5a93d8d5
|
Rendering pdfs with playwright and chromium
|
2025-03-27 22:18:23 +00:00 |
|
Xiaochen Zheng
|
ee687e25d6
|
Update suggested changes for qsize check
|
2025-03-27 23:09:50 +01:00 |
|
Jake Poznanski
|
9749e9559d
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-27 21:56:31 +00:00 |
|
Jake Poznanski
|
731aa73c70
|
Better synth miner script
|
2025-03-27 21:56:29 +00:00 |
|
Jake Poznanski
|
bb3395e739
|
Merge pull request #132 from xcvil/sglang_port
Add argparser argument for configuring SGLang server port
|
2025-03-27 12:52:58 -07:00 |
|
Jake Poznanski
|
42be0ccd0c
|
Too much debug spew
|
2025-03-26 21:03:03 +00:00 |
|
Jake Poznanski
|
d45c0323a4
|
Better equation rendering checker with more tests.
|
2025-03-26 18:49:48 +00:00 |
|
Jake Poznanski
|
b8e3034847
|
Trying a change to the render script
|
2025-03-26 18:26:06 +00:00 |
|
Jake Poznanski
|
2141f18f10
|
Adding a katex test case that should be fixed
|
2025-03-25 18:36:47 +00:00 |
|
Jake Poznanski
|
4d6a97f9fb
|
Style fix, a few notes
|
2025-03-25 18:25:07 +00:00 |
|
Jake Poznanski
|
c36d8fd967
|
Merge branch 'main' of https://github.com/allenai/olmocr into main
|
2025-03-25 09:49:24 -07:00 |
|
Jake Poznanski
|
223d05aca4
|
Adding basic prompt template
|
2025-03-25 09:48:21 -07:00 |
|
xcvil
|
6d766307be
|
feat: avoid sglang server starting with empty queue
|
2025-03-23 23:45:28 +01:00 |
|
Jake Poznanski
|
3edae0ac71
|
Normalizing out markdown stuff
|
2025-03-21 18:30:09 +00:00 |
|
Jake Poznanski
|
2417e61136
|
Medoid
|
2025-03-21 17:55:22 +00:00 |
|
Jake Poznanski
|
03285d90a3
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-21 17:51:31 +00:00 |
|
Jake Poznanski
|
1f77aab75a
|
Some early code for mining HTML templates of pages, pick medoid code
|
2025-03-21 17:51:29 +00:00 |
|
Jake Poznanski
|
85054d64f1
|
Outputting with no document anchoring
|
2025-03-20 15:31:33 -07:00 |
|
Jake Poznanski
|
57a83d807a
|
Simplify repo
|
2025-03-20 13:34:07 -07:00 |
|
Jake Poznanski
|
f2f0be182e
|
Convert script supports no document anchoring mode, and parallel pipeline
|
2025-03-20 13:31:38 -07:00 |
|
Jake Poznanski
|
58276b04cb
|
Mining reading order checkpoint, convert script to use images
|
2025-03-20 19:49:39 +00:00 |
|
Jake Poznanski
|
f79bd0d248
|
Cleanup review app
|
2025-03-20 16:36:10 +00:00 |
|
Jake Poznanski
|
063d4f556a
|
Review page
|
2025-03-19 23:28:37 +00:00 |
|
Jake Poznanski
|
449900a303
|
Tests
|
2025-03-19 23:06:18 +00:00 |
|
Jake Poznanski
|
9e3b554f12
|
More html table parsing goodness
|
2025-03-19 21:06:52 +00:00 |
|
Jake Poznanski
|
2944d3b6ef
|
More fixes
|
2025-03-19 20:52:00 +00:00 |
|
Jake Poznanski
|
16ab1a4f37
|
Progress on more complicated header and footers
|
2025-03-19 20:42:04 +00:00 |
|
Jake Poznanski
|
1e13ddef5a
|
Sorting results
|
2025-03-19 18:57:53 +00:00 |
|
Jake Poznanski
|
c25e9cb084
|
Adding some fixes
|
2025-03-19 18:57:00 +00:00 |
|
Jake Poznanski
|
3005ebd67d
|
Normalization
|
2025-03-19 18:46:07 +00:00 |
|
Jake Poznanski
|
8ec1ebe5ed
|
Normalization
|
2025-03-19 18:40:03 +00:00 |
|
Jake Poznanski
|
cb4dfeba36
|
Fix
|
2025-03-19 18:33:48 +00:00 |
|
Jake Poznanski
|
a4605e4efc
|
Fixing normalizing during table cell comparison
|
2025-03-19 18:29:42 +00:00 |
|
Jake Poznanski
|
17979118ba
|
Lints
|
2025-03-19 18:01:53 +00:00 |
|
Jake Poznanski
|
b307f5a116
|
More robust markdown parsing
|
2025-03-19 18:01:02 +00:00 |
|
Jake Poznanski
|
53444571e9
|
Tests
|
2025-03-19 17:53:45 +00:00 |
|
Jake Poznanski
|
cac5ef13a9
|
Tests for the tests
|
2025-03-19 17:44:49 +00:00 |
|
Jake Poznanski
|
196654ed25
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-19 17:32:24 +00:00 |
|
Jake Poznanski
|
0a3a5efe07
|
Lints
|
2025-03-19 17:32:22 +00:00 |
|
Jake Poznanski
|
0afacd6ac7
|
Less duped tests
|
2025-03-19 17:32:06 +00:00 |
|
Jake Poznanski
|
9855f70fee
|
Some work on table dataset
|
2025-03-19 17:25:22 +00:00 |
|
Jake Poznanski
|
14e3f6e97b
|
Small edits
|
2025-03-19 09:27:41 -07:00 |
|
Jake Poznanski
|
46ffbe9324
|
smolDocling support for benchmark
|
2025-03-19 08:36:31 -07:00 |
|
xcvil
|
a6a0f21c8b
|
Add argparser argument for configuring SGLang server port
|
2025-03-19 14:00:28 +01:00 |
|
Jake Poznanski
|
bc41ba92e7
|
Merge branch 'main' of https://github.com/allenai/olmocr
|
2025-03-18 22:35:46 +00:00 |
|
Jake Poznanski
|
ad82e5526f
|
Adding url reference for tests, some mining and cleanup scripts
|
2025-03-18 22:35:44 +00:00 |
|
Aman Rangapur
|
f1945c1ecf
|
Merge pull request #127 from allenai/amanr/pp-doc-layout
Headers and Footers PDFs
|
2025-03-18 14:35:20 -07:00 |
|