1554 Commits

Author SHA1 Message Date
Jake Poznanski
768cb33937 Better filtering coming in 2025-08-19 21:22:54 +00:00
Jake Poznanski
1cafa779a3 More filtering stages 2025-08-19 20:09:41 +00:00
Jake Poznanski
4d837b7db2 More filter rules 2025-08-19 20:01:42 +00:00
Jake Poznanski
17d131fce0 Some more filtering stuff 2025-08-19 18:54:04 +00:00
Jake Poznanski
a3d23d7de1 Adding a part of code to dataloader so you can see what is getting filtered out of your dataset 2025-08-19 18:45:01 +00:00
Jake Poznanski
84a0c432e7 Adding some filtering rules and tests for them 2025-08-19 18:14:15 +00:00
Jake Poznanski
cd09e190b5 Fixes 2025-08-19 17:50:23 +00:00
Jake Poznanski
798335c88e Setting pipeline to use new prompt too 2025-08-19 17:46:23 +00:00
Jake Poznanski
f2db62b0f8 Train a run with adjusted prompt 2025-08-19 17:45:41 +00:00
Jake Poznanski
1be5cea567 Merge branch 'main' into jakep/new_data 2025-08-19 17:41:45 +00:00
Jake Poznanski
702f8996a9 2epoch 2025-08-16 21:34:16 +00:00
Jake Poznanski
c075f3071f New configs for new data 2025-08-16 17:31:42 +00:00
Jake Poznanski
cffbb82b0b Fix for iabooks 2025-08-16 17:26:51 +00:00
Jake Poznanski
0a9c82927f Adding strip 2025-08-16 17:05:09 +00:00
Jake Poznanski
c492615355 Bump version to v0.3.3 for release v0.3.3 2025-08-15 19:45:17 +00:00
Jake Poznanski
cee12ccc9f New version 2025-08-15 19:45:07 +00:00
Jake Poznanski
76405b53db Lints 2025-08-15 19:44:47 +00:00
Jake Poznanski
69c33abfcc Trying to keep queue loaded more 2025-08-15 18:44:45 +00:00
Jake Poznanski
7c98673972 Pipeline fixes for OMP_NUM_THREADS 2025-08-15 18:30:00 +00:00
Jake Poznanski
b9238b8638 Fix for floaty amount 2025-08-14 22:27:26 +00:00
Jake Poznanski
618777c17e Bump version to v0.3.2 for release v0.3.2 2025-08-14 20:58:11 +00:00
Jake Poznanski
5532493ec8 Pipeline should be improved to limit CPU usage on page renders 2025-08-14 20:57:57 +00:00
Jake Poznanski
3a36ee239d Cleanup 2025-08-14 20:13:52 +00:00
Jake Poznanski
a863d04e6e Cleanup page rendering cpu limits 2025-08-14 20:11:26 +00:00
Jake Poznanski
482030f286 Script to process batch outputs 2025-08-14 19:54:29 +00:00
Jake Poznanski
53c0e57e4a openai batch data writer 2025-08-14 19:40:36 +00:00
Jake Poznanski
6d2c1a646a Olmocr mix to batch format 2025-08-14 18:24:47 +00:00
Jake Poznanski
2049abd8ff prompt stuff 2025-08-14 18:08:43 +00:00
Jake Poznanski
807257f43a Better prompts 2025-08-14 18:04:47 +00:00
Jake Poznanski
1f50a6b6bd Trying out some new prompts 2025-08-14 17:44:56 +00:00
Jake Poznanski
0dd4fe83f4 Bump version to v0.3.1 for release v0.3.1 2025-08-14 16:52:35 +00:00
Jake Poznanski
7e8f9e43d8 New version 2025-08-14 16:50:49 +00:00
Jake Poznanski
7a36c98e26 Merge branch 'main' into jakep/new_data 2025-08-14 16:45:00 +00:00
Jake Poznanski
0a8cd93c0a Better queue management again 2025-08-14 16:37:11 +00:00
Jake Poznanski
38679243d7 Removing extra files 2025-08-14 16:17:59 +00:00
Jake Poznanski
dc5c45e144 Deps 2025-08-14 16:10:29 +00:00
Jake Poznanski
7b3b93589d VLLM bump 2025-08-14 16:08:45 +00:00
Jake Poznanski
4431b4886f Better tracking of semaphore release on bigger jobs 2025-08-14 16:05:21 +00:00
Jake Poznanski
4efd3f5d9e AI2 Internal budgeting 2025-08-13 22:16:18 +00:00
Jake Poznanski
9f8df232b6 Readme updates 2025-08-13 22:03:03 +00:00
Jake Poznanski
36ca700669 Bump version to v0.3.0 for release v0.3.0 2025-08-13 21:41:30 +00:00
Jake Poznanski
3e5351c028 version bump 2025-08-13 21:41:22 +00:00
Jake Poznanski
894c617ea4 Merge pull request #303 from allenai/jakep/olmocr_v03: olmOCR v.0.3.0 2025-08-13 14:39:54 -07:00
Jake Poznanski
463cef7ea2 New default model 2025-08-13 20:57:15 +00:00
Jake Poznanski
e86267a01c Making local results directory properly 2025-08-13 20:40:04 +00:00
Jake Poznanski
11302feb8c Move open cv2 import only into experimental data loader class 2025-08-13 20:28:31 +00:00
Jake Poznanski
93411a80a0 Lint fixes 2025-08-13 20:21:04 +00:00
Jake Poznanski
05330150ad New work queue code is cleaner 2025-08-13 20:20:27 +00:00
Jake Poznanski
9a8fa335ae One more scheme to try 2025-08-13 18:21:58 +00:00
Jake Poznanski
ffb0c6abc5 Adding some more quant schemes 2025-08-13 18:00:38 +00:00