15 Commits

Author SHA1 Message Date
Jake Poznanski
230c8a9f9a Trying new run that will rewrite the prompts as it goes 2024-10-08 22:10:18 +00:00
Jake Poznanski
a90eb94951 Fix dataloader bug 2024-10-07 15:25:48 -07:00
Jake Poznanski
3d36545fa5 loading fix for parquets again... 2024-10-07 14:48:53 -07:00
Jake Poznanski
7416b42023 Adding support for parquet datasets which are precached 2024-10-07 21:14:33 +00:00
Jake Poznanski
dc26541da2 Starting code to build parquets... 2024-10-07 20:59:43 +00:00
Jake Poznanski
d8e459c9f3 Weird issue with surrogate pairs in json 2024-10-07 09:04:13 -07:00
Jake Poznanski
98020cabbb Allow loading files locally 2024-10-07 07:49:16 -07:00
Jake Poznanski
decfd7fbc1 Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high. 2024-09-27 22:54:07 +00:00
Jake Poznanski
22b765e6be Going back to non iterable dataset, so shuffling works better, applying a light filter 2024-09-27 15:48:56 +00:00
Jake Poznanski
86813fe210 Filtering off the weird tail ends of the distribution to make training smoother 2024-09-25 09:49:03 -07:00
Jake Poznanski
5916239cd8 typos 2024-09-23 09:43:36 -07:00
Jake Poznanski
ea3af0143c Loading dataset from config now 2024-09-23 09:40:24 -07:00
Jake Poznanski
bab32aa9b3 Formatting 2024-09-18 22:52:42 +00:00
Jake Poznanski
f4d18cb287 Dataloader capable of loading 38k rows reasonably fast 2024-09-18 22:48:38 +00:00
Jake Poznanski
d22b311340 Starting to write dataloader for visual lm data 2024-09-18 21:42:09 +00:00