olmocr

mirror of https://github.com/allenai/olmocr.git synced 2025-11-07 05:39:49 +00:00

Author	SHA1	Message	Date
Jake Poznanski	61dd7bb61f	Fix for map in iterable mode	2024-09-26 20:44:47 +00:00
Jake Poznanski	cf1aa0176e	Proper use of iterable_dataset	2024-09-26 19:55:54 +00:00
Jake Poznanski	9cbc128553	Sampling some sequence lengths	2024-09-25 09:05:11 -07:00
Jake Poznanski	4eddb1b45f	Okay, reasonably happy with the dataprep pipeline	2024-09-20 13:04:47 -07:00
Jake Poznanski	a47afe5c8d	Adding test to make sure the training and inference time tokenization stays identical, currently failing	2024-09-20 12:01:05 -07:00
Jake Poznanski	bab32aa9b3	Formatting	2024-09-18 22:52:42 +00:00
Jake Poznanski	f4d18cb287	Dataloader capable of loading 38k rows reasonably fast	2024-09-18 22:48:38 +00:00
Jake Poznanski	d22b311340	Starting to write dataloader for visual lm data	2024-09-18 21:42:09 +00:00
Jake Poznanski	af2126df99	450tok/sec/core with smollm that appears to work well	2024-09-17 19:59:02 +00:00
Jake Poznanski	2f71cb9232	Using SmolLM, seems a lot better and is able to pass some tests	2024-09-17 18:47:27 +00:00
Jake Poznanski	57e80aacd2	Testing coherence with distilgpt2, but it doesn't work great	2024-09-17 16:58:45 +00:00
Jake Poznanski	01bc0b2f10	Moving a whole bunch of code over, still broken	2024-09-17 16:26:55 +00:00
Jake Poznanski	a534a0180d	Moving pdf filter code over with tests	2024-09-17 15:16:58 +00:00
Jake Poznanski	68b2c0e8d6	Initial commit	2024-09-17 07:53:43 -07:00
