693 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Jake Poznanski | 51f1669451 | fix | 2024-10-16 13:30:06 -07:00 |
| Jake Poznanski | d94713e73e | Truncation handled in a custom collator | 2024-10-16 13:28:12 -07:00 |
| Jake Poznanski | cbc667ce78 | Prepping to train | 2024-10-16 13:18:24 -07:00 |
| Jake Poznanski | 9d647b13b8 | fix | 2024-10-16 11:58:35 -07:00 |
| Jake Poznanski | 446773dbc8 | First part of new dataloader | 2024-10-16 11:54:06 -07:00 |
| Jake Poznanski | 202d81cece | Merge branch 'main' of https://github.com/allenai/pdelfin into main | 2024-10-16 11:38:33 -07:00 |
| Jake Poznanski | e2552b2f28 | Adding test case | 2024-10-16 11:38:31 -07:00 |
| Jake Poznanski | d4f64ed82a | Config work | 2024-10-16 18:37:52 +00:00 |
| Jake Poznanski | 3c1b7de293 | Refactoring of train dataloaders | 2024-10-16 18:26:25 +00:00 |
| Jake Poznanski | 23d129fd2c | Organizing around a new style of dataloader | 2024-10-16 18:06:27 +00:00 |
| Jake Poznanski | a2546e0b04 | more stuff | 2024-10-16 17:06:03 +00:00 |
| Jake Poznanski | a7cd7467c3 | mathjax | 2024-10-16 16:45:07 +00:00 |
| Jake Poznanski | baa82a4a9a | Fixing links, rendering tables | 2024-10-16 16:37:08 +00:00 |
| Jake Poznanski | 19e56ec7ce | dolma viewer runs much faster now | 2024-10-16 16:21:25 +00:00 |
| Jake Poznanski | 96682b2ecb | Refactoring | 2024-10-16 16:18:27 +00:00 |
| Jake Poznanski | 2cd863ddce | Dolma viewer improvements | 2024-10-16 16:05:44 +00:00 |
| Jake Poznanski | 35558dbddc | Make the prompt hint randomly select lines | 2024-10-16 16:05:07 +00:00 |
| Jake Poznanski | 9eb252f8f6 | Better tracking of completion_errors | 2024-10-15 22:43:31 +00:00 |
| Jake Poznanski | 4ef14ec813 | More stats | 2024-10-15 22:26:31 +00:00 |
| Jake Poznanski | 4a280e55df | Nicer dolma viewer | 2024-10-15 21:03:28 +00:00 |
| Jake Poznanski | 42cf6a639f | Dolma viewer | 2024-10-15 18:37:31 +00:00 |
| Jake Poznanski | b8cd414022 | tiny fix | 2024-10-15 16:54:19 +00:00 |
| Jake Poznanski | a7fae0e659 | fix | 2024-10-15 16:36:54 +00:00 |
| Jake Poznanski | 4669eb7134 | Adjusting workflow so I can do s2 pdfs | 2024-10-15 16:22:55 +00:00 |
| Jake Poznanski | 6d61ae4aa8 | Some pipeline cleanup stuff | 2024-10-15 16:02:08 +00:00 |
| Jake Poznanski | fc8fcfaeba | Fixing dataloader hopefully | 2024-10-15 15:13:25 +00:00 |
| Jake Poznanski | 6d53683001 | More stats hopefully running faster | 2024-10-14 21:37:14 +00:00 |
| Jake Poznanski | 350061906e | Adding nicer output stats | 2024-10-14 20:48:33 +00:00 |
| Jake Poznanski | 194af5ff52 | Robustness | 2024-10-14 20:31:37 +00:00 |
| Jake Poznanski | 1ed9e4c947 | Runs to the end now | 2024-10-14 20:28:54 +00:00 |
| Jake Poznanski | 879b974af2 | More and more fixes | 2024-10-14 20:06:07 +00:00 |
| Jake Poznanski | 77a850d7ef | Tracking rounds of inference better | 2024-10-14 18:42:50 +00:00 |
| Jake Poznanski | af992bd603 | More refactoring | 2024-10-14 18:23:22 +00:00 |
| Jake Poznanski | cd8e28e459 | Pipeline working hopefully soon | 2024-10-14 18:19:17 +00:00 |
| Jake Poznanski | f2f578cca9 | More pipeline code | 2024-10-14 17:23:09 +00:00 |
| Jake Poznanski | 39333f2c96 | New pipeline stuff | 2024-10-14 17:09:11 +00:00 |
| Jake Poznanski | 4d6eaf654d | Merge branch 'main' of https://github.com/allenai/pdelfin | 2024-10-14 16:30:51 +00:00 |
| Jake Poznanski | 89d4ee2145 | Pipeline work | 2024-10-14 16:30:49 +00:00 |
| Jake Poznanski | 7b161533e2 | Code to do local inference on fine tuned models for testing | 2024-10-14 08:38:18 -07:00 |
| Jake Poznanski | 5a7377af30 | Refactoring | 2024-10-11 22:57:49 +00:00 |
| Jake Poznanski | 4fd6066600 | gpt cleanup | 2024-10-11 22:41:09 +00:00 |
| Jake Poznanski | a45f86e4a4 | More cleanup | 2024-10-11 22:37:32 +00:00 |
| Jake Poznanski | 53fdb6108c | More pipeline code | 2024-10-11 21:50:09 +00:00 |
| Jake Poznanski | 10b7a58d28 | fix | 2024-10-11 20:22:58 +00:00 |
| Jake Poznanski | f477a68621 | dbmanager | 2024-10-11 16:24:29 +00:00 |
| Jake Poznanski | 2dccc4be3b | Oops removing print | 2024-10-11 16:23:14 +00:00 |
| Jake Poznanski | aea3f7f1fe | Fix for anchor generation on pdfs with no text elements | 2024-10-11 15:01:01 +00:00 |
| Jake Poznanski | af03358c47 | assemble | 2024-10-10 22:36:09 +00:00 |
| Jake Poznanski | 312847acac | Ok, finally working nicely to build the page index | 2024-10-10 22:30:09 +00:00 |
| Jake Poznanski | 312ee8d953 | pipeline script | 2024-10-10 22:13:43 +00:00 |