20 Commits

Author SHA1 Message Date
Jake Poznanski
b2894d0280 Massive refactor from pdelfin to olmocr 2025-01-27 18:30:41 +00:00
Jake Poznanski
3c1b7de293 Refactoring of train dataloaders 2024-10-16 18:26:25 +00:00
Jake Poznanski
23d129fd2c Organizing around a new style of dataloader 2024-10-16 18:06:27 +00:00
Jake Poznanski
96682b2ecb Refactoring 2024-10-16 16:18:27 +00:00
Jake Poznanski
4bf6e7a430 Refactoring 2024-10-09 18:11:18 +00:00
Jake Poznanski
230c8a9f9a Trying new run that will rewrite the prompts as it goes 2024-10-08 22:10:18 +00:00
Jake Poznanski
ebd40f9084 Hopefully fixing dataloader for now 2024-10-07 12:59:27 -07:00
Jake Poznanski
d8e459c9f3 Weird issue with surrogate pairs in json 2024-10-07 09:04:13 -07:00
Jake Poznanski
98020cabbb Allow loading files locally 2024-10-07 07:49:16 -07:00
Jake Poznanski
1686790ac8 Checking filtering logic 2024-10-02 22:45:40 +00:00
Jake Poznanski
decfd7fbc1 Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high. 2024-09-27 22:54:07 +00:00
Jake Poznanski
22b765e6be Going back to non iterable dataset, so shuffling works better, applying a light filter 2024-09-27 15:48:56 +00:00
Jake Poznanski
c00e40d1c4 More fixes 2024-09-26 23:10:07 +00:00
Jake Poznanski
d098a87ed2 Column name fix 2024-09-26 22:29:19 +00:00
Jake Poznanski
61dd7bb61f Fix for map in iterable mode 2024-09-26 20:44:47 +00:00
Jake Poznanski
cf1aa0176e Proper use of iterable_dataset 2024-09-26 19:55:54 +00:00
Jake Poznanski
9cbc128553 Sampling some sequence lengths 2024-09-25 09:05:11 -07:00
Jake Poznanski
bab32aa9b3 Formatting 2024-09-18 22:52:42 +00:00
Jake Poznanski
f4d18cb287 Dataloader capable of loading 38k rows reasonably fast 2024-09-18 22:48:38 +00:00
Jake Poznanski
d22b311340 Starting to write dataloader for visual lm data 2024-09-18 21:42:09 +00:00