582 Commits

Author SHA1 Message Date
Jake Poznanski
6a22900b8a Allow for sampling anchor and other params 2024-10-23 22:26:12 +00:00
Jake Poznanski
999f64dd46 Adding empty anchor support 2024-10-23 22:17:20 +00:00
Jake Poznanski
f8c5aac5a0 Some cleanup 2024-10-23 21:51:54 +00:00
Jake Poznanski
a1a4798ce7 Some crazy idea I had to simplify futures and memory limits 2024-10-23 21:51:37 +00:00
Jake Poznanski
f6ac591fe9 vllm benchmarker 2024-10-23 18:14:50 +00:00
Jake Poznanski
4047258277 Fixing one old bug to make update_static atomic 2024-10-23 17:51:22 +00:00
Jake Poznanski
38dc5a2a0f Refactored to have a more efficient batchwriter, and also not allow too many running futures 2024-10-23 16:28:46 +00:00
Jake Poznanski
d99096e9a2 Adding vllm profile script for reference 2024-10-22 20:00:34 +00:00
Jake Poznanski
0a5c5068b4 index 2024-10-22 16:03:06 +00:00
Jake Poznanski
7c7867626f Fix pipeline bug with indexing 2024-10-22 15:47:11 +00:00
Jake Poznanski
31becaf7e4 S2orc dataset extractor 2024-10-21 21:28:44 +00:00
Jake Poznanski
302eee3da5 Yay matches between birr and hf 2024-10-21 16:58:30 +00:00
Jake Poznanski
f44dbd15ef Small fixes 2024-10-21 16:45:06 +00:00
Jake Poznanski
a4822718ea train more steps 2024-10-19 14:12:44 +00:00
Jake Poznanski
c9ac48bd9d Try to save at the last second only 2024-10-19 02:07:57 +00:00
Jake Poznanski
9d35d3ca8f Birr tokenization test 2024-10-18 23:02:37 +00:00
Jake Poznanski
77f0b9fa84 help text 2024-10-18 22:39:25 +00:00
Jake Poznanski
7dbcbc154b Birr tests that don't do anything but help me understand the universe 2024-10-18 22:39:17 +00:00
Jake Poznanski
492a3f6bef Adding parameters for taget image and anchor text sizes 2024-10-18 21:47:30 +00:00
Jake Poznanski
1c8602c0ff Removing rotation invalid ones to see what happens 2024-10-17 22:41:44 +00:00
Jake Poznanski
dd4f9670b5 Filter refactor 2024-10-17 22:36:38 +00:00
Jake Poznanski
3ecbeae6dc Trying save to s3 but with threaded saver 2024-10-17 21:39:01 +00:00
Jake Poznanski
5ba78edc39 Fix 2024-10-17 20:57:12 +00:00
Jake Poznanski
89fcff233a Fixing saving bug again 2024-10-17 20:37:28 +00:00
Jake Poznanski
7d4cff53b5 Nice test for picking proper page in birrpipelie 2024-10-17 20:26:02 +00:00
Jake Poznanski
a4d76206ff Choosing proper page 2024-10-17 20:18:06 +00:00
Jake Poznanski
529d51d57d Put LR back, need to save larger checkpoints to weka to prevent timeouts 2024-10-17 19:46:25 +00:00
Jake Poznanski
e141c91e5e Try lora run higher LR 2024-10-17 17:12:35 +00:00
Jake Poznanski
2826bcad18 Yay all unit tests pass cleanly now too 2024-10-17 17:05:55 +00:00
Jake Poznanski
124aaf5fe0 Hmm, cant repro failing anchor case 2024-10-17 17:00:02 +00:00
Jake Poznanski
1c42a08d06 Fixes to prevent errors later in dataloading 2024-10-17 02:28:43 +00:00
Jake Poznanski
f13bcad943 Adding check that pdfs are valid in the new anchor text generation format 2024-10-16 23:31:40 +00:00
Jake Poznanski
5018d591f6 will try lower lr 2024-10-16 23:27:00 +00:00
Jake Poznanski
5c36c22bf7 Prepping for more training 2024-10-16 23:01:40 +00:00
Jake Poznanski
063be21287 New image 2024-10-16 14:46:28 -07:00
Jake Poznanski
90cb80fd65 Docker update 2024-10-16 21:40:39 +00:00
Jake Poznanski
277723fa2c Adding cache 2024-10-16 21:18:52 +00:00
Jake Poznanski
87182ab573 Ensuring unique names 2024-10-16 20:44:23 +00:00
Jake Poznanski
4884b8288b Full dataset 2024-10-16 13:30:25 -07:00
Jake Poznanski
51f1669451 fix 2024-10-16 13:30:06 -07:00
Jake Poznanski
d94713e73e Truncation handled in a custom collator 2024-10-16 13:28:12 -07:00
Jake Poznanski
cbc667ce78 Prepping to train 2024-10-16 13:18:24 -07:00
Jake Poznanski
9d647b13b8 fix 2024-10-16 11:58:35 -07:00
Jake Poznanski
446773dbc8 First part of new dataloader 2024-10-16 11:54:06 -07:00
Jake Poznanski
202d81cece Merge branch 'main' of https://github.com/allenai/pdelfin into main 2024-10-16 11:38:33 -07:00
Jake Poznanski
e2552b2f28 Adding test case 2024-10-16 11:38:31 -07:00
Jake Poznanski
d4f64ed82a Config work 2024-10-16 18:37:52 +00:00
Jake Poznanski
3c1b7de293 Refactoring of train dataloaders 2024-10-16 18:26:25 +00:00
Jake Poznanski
23d129fd2c Organizing around a new style of dataloader 2024-10-16 18:06:27 +00:00
Jake Poznanski
a2546e0b04 more stuff 2024-10-16 17:06:03 +00:00