13 Commits

Author SHA1 Message Date
Jake Poznanski
2ab7cb280c Removing pymupdf 2025-01-30 15:51:54 -08:00
Jake Poznanski
fb402297ce Isort and black update 2025-01-29 15:42:34 -08:00
Jake Poznanski
dcaca8aa90 Black formatting 2025-01-29 15:30:39 -08:00
Jake Poznanski
4a1762d455 isort 2025-01-29 15:25:10 -08:00
Jake Poznanski
b2894d0280 Massive refactor from pdelfin to olmocr 2025-01-27 18:30:41 +00:00
Jake Poznanski
b7c80cd17f Fix up some tests but I don't see why this isn't working 2024-10-10 16:58:40 +00:00
Jake Poznanski
09e8840c56 coherency based anchor text 2024-10-01 20:19:03 +00:00
Jake Poznanski
bab32aa9b3 Formatting 2024-09-18 22:52:42 +00:00
Jake Poznanski
d22b311340 Starting to write dataloader for visual lm data 2024-09-18 21:42:09 +00:00
Jake Poznanski
af2126df99 450tok/sec/core with smollm that appears to work well 2024-09-17 19:59:02 +00:00
Jake Poznanski
2f71cb9232 Using SmolLM, seems a lot better and is able to pass some tests 2024-09-17 18:47:27 +00:00
Jake Poznanski
57e80aacd2 Testing coherence with distilgpt2, but it doesn't work great 2024-09-17 16:58:45 +00:00
Jake Poznanski
01bc0b2f10 Moving a whole bunch of code over, still broken 2024-09-17 16:26:55 +00:00