Jake Poznanski
|
1686790ac8
|
Checking filtering logic
|
2024-10-02 22:45:40 +00:00 |
|
Jake Poznanski
|
b340ae5092
|
A few notes, starting to test dataloader with new structured response format
|
2024-10-02 22:17:15 +00:00 |
|
Jake Poznanski
|
0071cbd788
|
Appears as if the report method works really well, might need one last step to detect rotated pages
|
2024-10-02 16:44:39 +00:00 |
|
Jake Poznanski
|
6ef8226347
|
Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text
|
2024-10-01 23:15:53 +00:00 |
|
Jake Poznanski
|
e42cecf96c
|
Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes
|
2024-10-01 22:10:58 +00:00 |
|
Jake Poznanski
|
09e8840c56
|
coherency based anchor text
|
2024-10-01 20:19:03 +00:00 |
|
Jake Poznanski
|
decfd7fbc1
|
Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high.
|
2024-09-27 22:54:07 +00:00 |
|
Jake Poznanski
|
22b765e6be
|
Going back to non iterable dataset, so shuffling works better, applying a light filter
|
2024-09-27 15:48:56 +00:00 |
|
Jake Poznanski
|
c00e40d1c4
|
More fixes
|
2024-09-26 23:10:07 +00:00 |
|
Jake Poznanski
|
d098a87ed2
|
Column name fix
|
2024-09-26 22:29:19 +00:00 |
|
Jake Poznanski
|
61dd7bb61f
|
Fix for map in iterable mode
|
2024-09-26 20:44:47 +00:00 |
|
Jake Poznanski
|
cf1aa0176e
|
Proper use of iterable_dataset
|
2024-09-26 19:55:54 +00:00 |
|
Jake Poznanski
|
9cbc128553
|
Sampling some sequence lengths
|
2024-09-25 09:05:11 -07:00 |
|
Jake Poznanski
|
4eddb1b45f
|
Okay, reasonably happy with the dataprep pipeline
|
2024-09-20 13:04:47 -07:00 |
|
Jake Poznanski
|
a47afe5c8d
|
Adding test to make sure the traning and inference time tokenization stays identical, currenlty failing
|
2024-09-20 12:01:05 -07:00 |
|
Jake Poznanski
|
bab32aa9b3
|
Formatting
|
2024-09-18 22:52:42 +00:00 |
|
Jake Poznanski
|
f4d18cb287
|
Dataloader capabable of loading 38k rows reasonably fast
|
2024-09-18 22:48:38 +00:00 |
|
Jake Poznanski
|
d22b311340
|
Starting to write dataloader for visual lm data
|
2024-09-18 21:42:09 +00:00 |
|
Jake Poznanski
|
af2126df99
|
450tok/sec/core with smollm that appears to work well
|
2024-09-17 19:59:02 +00:00 |
|
Jake Poznanski
|
2f71cb9232
|
Using SmolLM, seems a lot better and is able to pass some tests
|
2024-09-17 18:47:27 +00:00 |
|
Jake Poznanski
|
57e80aacd2
|
Testing coherence with distilgpt2, but it doesn't work great
|
2024-09-17 16:58:45 +00:00 |
|
Jake Poznanski
|
01bc0b2f10
|
Moving a whole bunch of code over, still broken
|
2024-09-17 16:26:55 +00:00 |
|
Jake Poznanski
|
a534a0180d
|
Moving pdf filter code over with tests
|
2024-09-17 15:16:58 +00:00 |
|
Jake Poznanski
|
68b2c0e8d6
|
Initial commit
|
2024-09-17 07:53:43 -07:00 |
|