139 Commits

Author SHA1 Message Date
Jake Poznanski
a957ab2aaf Adding an adjustment to how blank pages test is run, skipping image tags 2025-09-08 17:18:51 +00:00
Jake Poznanski
0710debf75 Cleaner front matter reward 2025-08-27 19:49:42 +00:00
Jake Poznanski
d70208d98a Moving test code around, adding format reward since some runs stop outputting the front matter thing in grpo training 2025-08-27 18:22:05 +00:00
Jake Poznanski
8383865392 Fixing up subscripts and superscripts in synth data 2025-08-27 18:15:36 +00:00
Jake Poznanski
d36357f3db Some fixes to validating math which was not working otherwise 2025-08-22 20:40:14 +00:00
Jake Poznanski
dcc932dc2c Markdown cleanup 2025-08-22 17:21:13 +00:00
Jake Poznanski
d2bec31595 Markdown front matter corrector 2025-08-22 16:43:36 +00:00
Jake Poznanski
0fd7d07e73 GRPO reward fixups 2025-08-21 18:33:11 +00:00
Jake Poznanski
1dd6ff9b03 Olmocr bench grpo stuff 2025-08-21 18:17:07 +00:00
Jake Poznanski
cc918ca03e Setting up GRPO trainer 2025-08-20 22:18:38 +00:00
Jake Poznanski
41201b6317 Lints 2025-08-19 21:30:41 +00:00
Jake Poznanski
768cb33937 Better filtering coming in 2025-08-19 21:22:54 +00:00
Jake Poznanski
84a0c432e7 Adding some filtering rules and tests for them 2025-08-19 18:14:15 +00:00
Jake Poznanski
93411a80a0 Lint fixes 2025-08-13 20:21:04 +00:00
Jake Poznanski
05330150ad New work queue code is cleaner 2025-08-13 20:20:27 +00:00
Jake Poznanski
6216896102 Accidentally committed too many files 2025-08-04 20:41:21 +00:00
Jake Poznanski
0536c0e9b8 Lint fixes 2025-08-04 18:21:47 +00:00
Jake Poznanski
08b263ba46 Cumulative rotation support 2025-08-04 18:21:31 +00:00
Jake Poznanski
ed8a5d10cf Ok fixed rotation stuff finally 2025-08-04 17:53:48 +00:00
Jake Poznanski
e0158df210 Adding test file 2025-08-04 17:21:40 +00:00
Jake Poznanski
6cdcb06ae7 Removing some dead code and adding tests 2025-08-04 16:54:42 +00:00
Jake Poznanski
a8d5299433 Trying to add a test for rotation correction 2025-08-04 16:24:13 +00:00
Jake Poznanski
56296d6927 Bringing back a few files 2025-07-23 04:49:13 +00:00
Jake Poznanski
b588ae27d2 Removing sglang tests, switch to vllm 2025-06-17 16:07:16 +00:00
Jake Poznanski
5faf570e30 Format fixes 2025-05-29 23:23:02 +00:00
Jake Poznanski
f8fd234093 Idea to improve retry performance 2025-05-28 18:27:40 +00:00
Jake Poznanski
63aee2c1e5 Code cleanup, version bump, remove unused permutation test 2025-05-16 21:25:32 +00:00
Jake Poznanski
1854ae1269 A bit more work on tagging 2025-05-09 19:31:07 +00:00
Jake Poznanski
03db04cb7e Fixing handling of new lines in some test cases 2025-05-08 17:21:06 +00:00
Jake Poznanski
8f46b6e966 Running more tests in CI 2025-04-17 14:26:06 -07:00
Jake Poznanski
1d0c560455 Upping version to fix issue with work queue and delimited paths 2025-04-15 18:50:13 +00:00
Jake Poznanski
79e2677319 Hmm, these should be passing! 2025-03-14 02:52:13 +00:00
Jake Poznanski
f5d92bdb14 Trying to get new CI to work 2025-03-14 02:43:55 +00:00
Chris Wilhelm
c585415797 for now, only process one pdf in the ci script 2025-03-13 15:48:47 -07:00
Chris Wilhelm
9b958e65f1 moves what happens where around a bit and updates readme 2025-03-13 15:31:55 -07:00
Chris Wilhelm
29b9054749 basic docker image and test 2025-03-13 15:31:55 -07:00
aman-17
0130a970c2 fixed style 2025-02-25 08:57:02 -08:00
Jake Poznanski
58bdfa512b CI 2025-02-14 20:51:04 +00:00
Jake Poznanski
25ec87b66d CI 2025-02-14 20:46:55 +00:00
Jake Poznanski
c05e01532c Hopefully CI runs now 2025-02-14 20:42:19 +00:00
Jake Poznanski
91eef279b3 Adding some gnarly 1 pager pdfs from kyle 2025-02-11 18:45:42 +00:00
aman-17
a036133fdd resolved all the mypy, black and isort issues and updated readme 2025-02-07 16:05:00 -08:00
Jake Poznanski
9bf3d35cdb Comment fix 2025-01-30 16:02:08 -08:00
Jake Poznanski
2ab7cb280c Removing pymupdf 2025-01-30 15:51:54 -08:00
Jake Poznanski
72f4b9a590 Project setup 2025-01-30 15:33:04 -08:00
Jake Poznanski
cdd830235f Shortened some sample docs 2025-01-30 15:28:31 -08:00
Jake Poznanski
10094ffc19 Even newer mypy crashes still 2025-01-30 14:32:08 -08:00
Jake Poznanski
fb402297ce Isort and black update 2025-01-29 15:42:34 -08:00
Jake Poznanski
dcaca8aa90 Black formatting 2025-01-29 15:30:39 -08:00
Jake Poznanski
4a1762d455 isort 2025-01-29 15:25:10 -08:00