1195 Commits

Author SHA1 Message Date
Jake Poznanski
1cf3cd8caa Had to switch to conda env override for gantry due to cu118 compat 2024-09-23 22:35:42 +00:00
Jake Poznanski
cb0b97a16a Gantry requirements 2024-09-23 15:08:39 -07:00
Jake Poznanski
15793975dd Merge branch 'main' of https://github.com/allenai/pdelfin 2024-09-23 21:42:27 +00:00
Jake Poznanski
0691e1a77f chmodding 2024-09-23 21:42:26 +00:00
Jake Poznanski
a30ca16e1f Script adjustment 2024-09-23 14:41:35 -07:00
Jake Poznanski
79feb986a6 Merge branch 'main' of https://github.com/allenai/pdelfin into main 2024-09-23 14:32:12 -07:00
Jake Poznanski
a3feca01fc Setting up for a real train run 2024-09-23 14:32:10 -07:00
Jake Poznanski
d589b5651d Merge branch 'main' of https://github.com/allenai/pdelfin 2024-09-23 21:19:26 +00:00
Jake Poznanski
9ae26472d3 Silver dataset adjustments 2024-09-23 21:19:24 +00:00
Jake Poznanski
0812b0dd77 Prepping for gantry 2024-09-23 14:04:22 -07:00
Jake Poznanski
f78d021f50 Should be merging the LORA adapters back into the model for the final checkpoint 2024-09-23 12:55:01 -07:00
Jake Poznanski
5967a525fd Flash attention and mixed precision training, works quite a bit faster 2024-09-23 11:26:18 -07:00
Jake Poznanski
a7782255d5 Merge branch 'main' of https://github.com/allenai/pdelfin into main 2024-09-23 10:44:39 -07:00
Jake Poznanski
45e5823823 Much happier gpu utilization 2024-09-23 10:44:25 -07:00
Jake Poznanski
5535e3ab2e Moving the openai data generation stuff to this repo now 2024-09-23 17:20:18 +00:00
Jake Poznanski
dc71b28ddd No need to save tokenizer 2024-09-23 10:06:04 -07:00
Jake Poznanski
5916239cd8 typos 2024-09-23 09:43:36 -07:00
Jake Poznanski
ea3af0143c Loading dataset from config now 2024-09-23 09:40:24 -07:00
Jake Poznanski
ab9458b913 Basic LORA trainer, doesn't seem to make any speed difference 2024-09-23 09:08:00 -07:00
Jake Poznanski
3ed14a9ea5 Prepping new training stuff 2024-09-23 08:53:56 -07:00
Jake Poznanski
b915e7de00 Smaller config for now, fixing a few requirements 2024-09-23 08:20:08 -07:00
Jake Poznanski
256d77c232 Hoping to get a basic hf Trainer to run 2024-09-20 15:53:11 -07:00
Jake Poznanski
55035b02c9 Tries to run a forward pass but OOMs 2024-09-20 15:05:23 -07:00
Jake Poznanski
4eddb1b45f Okay, reasonably happy with the dataprep pipeline 2024-09-20 13:04:47 -07:00
Jake Poznanski
a47afe5c8d Adding test to make sure the training and inference time tokenization stays identical, currently failing 2024-09-20 12:01:05 -07:00
Jake Poznanski
fcb67ebd61 Prepping data to be in a trainable format 2024-09-20 09:25:54 -07:00
Jake Poznanski
dc86a99a97 Pyproject dependency cleanup 2024-09-20 08:22:10 -07:00
Jake Poznanski
962fb7eb6d merge 2024-09-20 15:10:47 +00:00
Jake Poznanski
0cc2b5d7cf Pyproject stuff 2024-09-20 15:09:45 +00:00
Jake Poznanski
0f2c42a6d3 Fixing formatting in pyproject 2024-09-20 08:01:48 -07:00
Jake Poznanski
84e68f313e Basic forward generation pass with openai dataset and qwen2vl 2024-09-19 22:16:59 +00:00
Jake Poznanski
7d2c447dd3 Importing core training config stuff from dolma refine 2024-09-19 21:55:07 +00:00
Jake Poznanski
bab32aa9b3 Formatting 2024-09-18 22:52:42 +00:00
Jake Poznanski
f4d18cb287 Dataloader capable of loading 38k rows reasonably fast 2024-09-18 22:48:38 +00:00
Jake Poznanski
d22b311340 Starting to write dataloader for visual lm data 2024-09-18 21:42:09 +00:00
Jake Poznanski
fb4fc4229e Fixing close file warning 2024-09-17 20:31:32 +00:00
Jake Poznanski
af2126df99 450tok/sec/core with smollm that appears to work well 2024-09-17 19:59:02 +00:00
Jake Poznanski
2f71cb9232 Using SmolLM, seems a lot better and is able to pass some tests 2024-09-17 18:47:27 +00:00
Jake Poznanski
57e80aacd2 Testing coherence with distilgpt2, but it doesn't work great 2024-09-17 16:58:45 +00:00
Jake Poznanski
cb9b6efb3c Trying distilgpt2 instead of kenlm 2024-09-17 16:50:01 +00:00
Jake Poznanski
01bc0b2f10 Moving a whole bunch of code over, still broken 2024-09-17 16:26:55 +00:00
Jake Poznanski
a534a0180d Moving pdf filter code over with tests 2024-09-17 15:16:58 +00:00
Jake Poznanski
9662718bfd Running personalize script on template 2024-09-17 15:06:59 +00:00
Jake Poznanski
7d71e2d643 Update README.md 2024-09-17 07:58:39 -07:00
Jake Poznanski
68b2c0e8d6 Initial commit 2024-09-17 07:53:43 -07:00