Jake Poznanski | 5faf570e30 | Format fixes | 2025-05-29 23:23:02 +00:00 |
Jake Poznanski | f8fd234093 | Idea to improve retry performance | 2025-05-28 18:27:40 +00:00 |
Jake Poznanski | 58bdfa512b | CI | 2025-02-14 20:51:04 +00:00 |
Jake Poznanski | 91eef279b3 | Adding some gnarly 1 pager pdfs from kyle | 2025-02-11 18:45:42 +00:00 |
Jake Poznanski | dcaca8aa90 | Black formatting | 2025-01-29 15:30:39 -08:00 |
Jake Poznanski | 4a1762d455 | isort | 2025-01-29 15:25:10 -08:00 |
Jake Poznanski | 0628d3161f | Some unit test cleanup | 2025-01-29 15:15:10 -08:00 |
Jake Poznanski | b2894d0280 | Massive refactor from pdelfin to olmocr | 2025-01-27 18:30:41 +00:00 |
Jake Poznanski | 0d1fc08081 | Small fixes | 2025-01-10 19:38:42 +00:00 |
Jake Poznanski | ba8eba245b | Unit tests fixes | 2024-11-25 09:13:13 -08:00 |
Jake Poznanski | c9e1a4c540 | More tests | 2024-11-20 19:37:00 +00:00 |
Jake Poznanski | 96984fcd77 | Fix a reliability issue | 2024-11-18 09:03:24 -08:00 |
Jake Poznanski | 85e0e2a61b | Fixing issues with pdf parsing | 2024-10-30 16:26:02 +00:00 |
Jake Poznanski | 999f64dd46 | Adding empty anchor support | 2024-10-23 22:17:20 +00:00 |
Jake Poznanski | 124aaf5fe0 | Hmm, cant repro failing anchor case | 2024-10-17 17:00:02 +00:00 |
Jake Poznanski | 2cd863ddce | Dolma viewer improvements | 2024-10-16 16:05:44 +00:00 |
Jake Poznanski | 6d53683001 | More stats hopefully running faster | 2024-10-14 21:37:14 +00:00 |
Jake Poznanski | a90feda42f | bugfixes | 2024-10-09 20:20:06 +00:00 |
Jake Poznanski | dc6440d068 | Cleaning up anchor text to deal with abnormally long lines | 2024-10-09 16:29:20 +00:00 |
Jake Poznanski | 97291b3f6a | Anchor is fixed to sample text elements better | 2024-10-08 21:51:43 +00:00 |
Jake Poznanski | c8a4d14c57 | Adding image merging to pdf report/hint/anchor | 2024-10-08 21:23:21 +00:00 |
Jake Poznanski | 5d35461dd2 | Fix for unicode errors in big datasets for the future | 2024-10-07 17:01:59 +00:00 |
Jake Poznanski | b340ae5092 | A few notes, starting to test dataloader with new structured response format | 2024-10-02 22:17:15 +00:00 |
Jake Poznanski | 0071cbd788 | Appears as if the report method works really well, might need one last step to detect rotated pages | 2024-10-02 16:44:39 +00:00 |
Jake Poznanski | 6ef8226347 | Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text | 2024-10-01 23:15:53 +00:00 |
Jake Poznanski | e42cecf96c | Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes | 2024-10-01 22:10:58 +00:00 |