26 Commits

Author SHA1 Message Date
Jake Poznanski 5faf570e30 Format fixes 2025-05-29 23:23:02 +00:00
Jake Poznanski f8fd234093 Idea to improve retry performance 2025-05-28 18:27:40 +00:00
Jake Poznanski 58bdfa512b CI 2025-02-14 20:51:04 +00:00
Jake Poznanski 91eef279b3 Adding some gnarly 1 pager pdfs from kyle 2025-02-11 18:45:42 +00:00
Jake Poznanski dcaca8aa90 Black formatting 2025-01-29 15:30:39 -08:00
Jake Poznanski 4a1762d455 isort 2025-01-29 15:25:10 -08:00
Jake Poznanski 0628d3161f Some unit test cleanup 2025-01-29 15:15:10 -08:00
Jake Poznanski b2894d0280 Massive refactor from pdelfin to olmocr 2025-01-27 18:30:41 +00:00
Jake Poznanski 0d1fc08081 Small fixes 2025-01-10 19:38:42 +00:00
Jake Poznanski ba8eba245b Unit tests fixes 2024-11-25 09:13:13 -08:00
Jake Poznanski c9e1a4c540 More tests 2024-11-20 19:37:00 +00:00
Jake Poznanski 96984fcd77 Fix a reliability issue 2024-11-18 09:03:24 -08:00
Jake Poznanski 85e0e2a61b Fixing issues with pdf parsing 2024-10-30 16:26:02 +00:00
Jake Poznanski 999f64dd46 Adding empty anchor support 2024-10-23 22:17:20 +00:00
Jake Poznanski 124aaf5fe0 Hmm, cant repro failing anchor case 2024-10-17 17:00:02 +00:00
Jake Poznanski 2cd863ddce Dolma viewer improvements 2024-10-16 16:05:44 +00:00
Jake Poznanski 6d53683001 More stats hopefully running faster 2024-10-14 21:37:14 +00:00
Jake Poznanski a90feda42f bugfixes 2024-10-09 20:20:06 +00:00
Jake Poznanski dc6440d068 Cleaning up anchor text to deal with abnormally long lines 2024-10-09 16:29:20 +00:00
Jake Poznanski 97291b3f6a Anchor is fixed to sample text elements better 2024-10-08 21:51:43 +00:00
Jake Poznanski c8a4d14c57 Adding image merging to pdf report/hint/anchor 2024-10-08 21:23:21 +00:00
Jake Poznanski 5d35461dd2 Fix for unicode errors in big datasets for the future 2024-10-07 17:01:59 +00:00
Jake Poznanski b340ae5092 A few notes, starting to test dataloader with new structured response format 2024-10-02 22:17:15 +00:00
Jake Poznanski 0071cbd788 Appears as if the report method works really well, might need one last step to detect rotated pages 2024-10-02 16:44:39 +00:00
Jake Poznanski 6ef8226347 Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text 2024-10-01 23:15:53 +00:00
Jake Poznanski e42cecf96c Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes 2024-10-01 22:10:58 +00:00