114 Commits

Author SHA1 Message Date
Jake Poznanski
61dd7bb61f Fix for map in iterable mode 2024-09-26 20:44:47 +00:00
Jake Poznanski
cf1aa0176e Proper use of iterable_dataset 2024-09-26 19:55:54 +00:00
Jake Poznanski
9cbc128553 Sampling some sequence lengths 2024-09-25 09:05:11 -07:00
Jake Poznanski
4eddb1b45f Okay, reasonably happy with the dataprep pipeline 2024-09-20 13:04:47 -07:00
Jake Poznanski
a47afe5c8d Adding test to make sure the training and inference time tokenization stays identical, currently failing 2024-09-20 12:01:05 -07:00
Jake Poznanski
bab32aa9b3 Formatting 2024-09-18 22:52:42 +00:00
Jake Poznanski
f4d18cb287 Dataloader capable of loading 38k rows reasonably fast 2024-09-18 22:48:38 +00:00
Jake Poznanski
d22b311340 Starting to write dataloader for visual lm data 2024-09-18 21:42:09 +00:00
Jake Poznanski
af2126df99 450tok/sec/core with smollm that appears to work well 2024-09-17 19:59:02 +00:00
Jake Poznanski
2f71cb9232 Using SmolLM, seems a lot better and is able to pass some tests 2024-09-17 18:47:27 +00:00
Jake Poznanski
57e80aacd2 Testing coherence with distilgpt2, but it doesn't work great 2024-09-17 16:58:45 +00:00
Jake Poznanski
01bc0b2f10 Moving a whole bunch of code over, still broken 2024-09-17 16:26:55 +00:00
Jake Poznanski
a534a0180d Moving pdf filter code over with tests 2024-09-17 15:16:58 +00:00
Jake Poznanski
68b2c0e8d6 Initial commit 2024-09-17 07:53:43 -07:00