olmocr

mirror of https://github.com/allenai/olmocr.git synced 2025-09-16 03:57:35 +00:00

Author	SHA1	Message	Date
Jake Poznanski	7416b42023	Adding support for parquet datasets which are precached	2024-10-07 21:14:33 +00:00
Jake Poznanski	dc26541da2	Starting code to build parquets...	2024-10-07 20:59:43 +00:00
Jake Poznanski	4557a5b296	Typo	2024-10-07 13:03:31 -07:00
Jake Poznanski	e973de7ba9	Typo	2024-10-07 13:01:43 -07:00
Jake Poznanski	ebd40f9084	Hopefully fixing dataloader for now	2024-10-07 12:59:27 -07:00
Jake Poznanski	44bcdc771b	Hopefully can use weka for the train datasets now	2024-10-07 16:14:28 +00:00
Jake Poznanski	d8e459c9f3	Weird issue with surrogate pairs in json	2024-10-07 09:04:13 -07:00
Jake Poznanski	98020cabbb	Allow loading files locally	2024-10-07 07:49:16 -07:00
Jake Poznanski	568dd48509	Prepping for qwen2vl full training run	2024-10-05 04:04:45 +00:00
Jake Poznanski	974ddd3773	I'm pretty sure we only need to save on rank0	2024-10-03 11:30:44 -07:00
Jake Poznanski	8f1fa4f796	Running a mini config again with metric	2024-10-03 11:12:30 -07:00
Jake Poznanski	046d4a4534	Adding eval on start and seed params	2024-10-03 10:54:25 -07:00
Jake Poznanski	2227605bfb	Mini train config	2024-10-03 10:32:15 -07:00
Jake Poznanski	0ddaf9023d	Getting ready to launch a new training run	2024-10-02 23:04:56 +00:00
Jake Poznanski	1686790ac8	Checking filtering logic	2024-10-02 22:45:40 +00:00
Jake Poznanski	b340ae5092	A few notes, starting to test dataloader with new structured response format	2024-10-02 22:17:15 +00:00
Jake Poznanski	b6543a4f65	Qwen checkpoint fixer script	2024-10-01 16:02:10 +00:00
Jake Poznanski	e179453cc5	Fixing qwen checkpoint script	2024-09-30 20:34:06 +00:00
Jake Poznanski	da1982acb8	Refactoring prompts into their own new folder	2024-09-30 18:48:17 +00:00
Jake Poznanski	8ec9e35f22	dataprep issue	2024-09-28 04:31:11 +00:00
Jake Poznanski	e53f782b0f	Datasetdict fix	2024-09-28 03:38:29 +00:00
Jake Poznanski	decfd7fbc1	Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high.	2024-09-27 22:54:07 +00:00
Jake Poznanski	22b765e6be	Going back to non iterable dataset, so shuffling works better, applying a light filter	2024-09-27 15:48:56 +00:00
Jake Poznanski	65a9c9981e	Hopefuly will train now	2024-09-27 15:16:12 +00:00
Jake Poznanski	e864b9d88f	weird dataloader stuff now	2024-09-27 02:53:59 +00:00
Jake Poznanski	37f10051f6	typo	2024-09-27 01:19:21 +00:00
Jake Poznanski	c00e40d1c4	More fixes	2024-09-26 23:10:07 +00:00
Jake Poznanski	d098a87ed2	Column name fix	2024-09-26 22:29:19 +00:00
Jake Poznanski	84e9da637c	Removing lambda due to pickling errors	2024-09-26 21:39:08 +00:00
Jake Poznanski	61dd7bb61f	Fix for map in iterable mode	2024-09-26 20:44:47 +00:00
Jake Poznanski	49efa5cb40	Typo	2024-09-26 19:57:53 +00:00
Jake Poznanski	cf1aa0176e	Proper use of iterable_dataset	2024-09-26 19:55:54 +00:00
Jake Poznanski	05fdb81da2	map and filter on iterable dataset	2024-09-26 19:01:34 +00:00
Jake Poznanski	7707bc08da	trying cheaper optimizer to solve ooms	2024-09-25 22:56:05 +00:00
Jake Poznanski	385c1bf9a7	Lora config	2024-09-25 22:07:04 +00:00
Jake Poznanski	24b30b2333	Prepping for 7b training	2024-09-25 20:51:25 +00:00
Jake Poznanski	3a5b438a6f	Lora misconfiguration	2024-09-25 10:48:39 -07:00
Jake Poznanski	86813fe210	Filtering off the weird tail ends of the distribution to make training smoother	2024-09-25 09:49:03 -07:00
Jake Poznanski	5f313266a4	Adding linear layers from visual network to target modules LORA	2024-09-25 09:09:24 -07:00
Jake Poznanski	b2341ed4f4	Merge branch 'main' of https://github.com/allenai/pdelfin into main	2024-09-25 09:05:46 -07:00
Jake Poznanski	9cbc128553	Sampling some sequence lengths	2024-09-25 09:05:11 -07:00
Jake Poznanski	d0deac5ea7	Lora config	2024-09-25 08:34:58 -07:00
Jake Poznanski	07c0323c91	Adding lora config to try to address OOMs	2024-09-25 07:57:01 -07:00
Jake Poznanski	ea0226c499	More flexibility in dataloader dims	2024-09-24 19:47:13 -07:00
Jake Poznanski	f6905c39ea	Hopefully the last changes	2024-09-24 15:52:34 -07:00
Jake Poznanski	ea731055d7	More realistic configuration	2024-09-24 14:50:23 -07:00
Jake Poznanski	0442a33209	New images work much better now, and device map fix	2024-09-24 12:58:18 -07:00
Jake Poznanski	bf1239deea	Use mini dataset now for testing	2024-09-24 10:55:03 -07:00
Jake Poznanski	596fc55628	Enabling model eval	2024-09-24 10:48:53 -07:00
Jake Poznanski	5a0bcb7b1d	batch inference slowness	2024-09-24 09:13:47 -07:00

123 Commits