14 Commits

Author SHA1 Message Date
Jake Poznanski fd17652d55 Trying to make it faster 2024-11-15 11:06:50 -08:00
Jake Poznanski 999f64dd46 Adding empty anchor support 2024-10-23 22:17:20 +00:00
Jake Poznanski 3c1b7de293 Refactoring of train dataloaders 2024-10-16 18:26:25 +00:00
Jake Poznanski 35558dbddc Make the prompt hint randomly select lines 2024-10-16 16:05:07 +00:00
Jake Poznanski 6d53683001 More stats hopefully running faster 2024-10-14 21:37:14 +00:00
Jake Poznanski aea3f7f1fe Fix for anchor generation on pdfs with no text elements 2024-10-11 15:01:01 +00:00
Jake Poznanski dc6440d068 Cleaning up anchor text to deal with abnormally long lines 2024-10-09 16:29:20 +00:00
Jake Poznanski 97291b3f6a Anchor is fixed to sample text elements better 2024-10-08 21:51:43 +00:00
Jake Poznanski c8a4d14c57 Adding image merging to pdf report/hint/anchor 2024-10-08 21:23:21 +00:00
Jake Poznanski 5d35461dd2 Fix for unicode errors in big datasets for the future 2024-10-07 17:01:59 +00:00
Jake Poznanski 73fb81ef6c Review page size option, fixing mkdirs in convertsilver script 2024-10-02 15:53:21 +00:00
Jake Poznanski 6ef8226347 Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text 2024-10-01 23:15:53 +00:00
Jake Poznanski 09e8840c56 coherency based anchor text 2024-10-01 20:19:03 +00:00
Jake Poznanski 28fe314539 prepping anchor text generation code 2024-10-01 19:59:48 +00:00