Jake Poznanski
|
7d4cff53b5
|
Nice test for picking proper page in birrpipeline
|
2024-10-17 20:26:02 +00:00 |
|
Jake Poznanski
|
2826bcad18
|
Yay all unit tests pass cleanly now too
|
2024-10-17 17:05:55 +00:00 |
|
Jake Poznanski
|
124aaf5fe0
|
Hmm, can't repro failing anchor case
|
2024-10-17 17:00:02 +00:00 |
|
Jake Poznanski
|
202d81cece
|
Merge branch 'main' of https://github.com/allenai/pdelfin into main
|
2024-10-16 11:38:33 -07:00 |
|
Jake Poznanski
|
e2552b2f28
|
Adding test case
|
2024-10-16 11:38:31 -07:00 |
|
Jake Poznanski
|
3c1b7de293
|
Refactoring of train dataloaders
|
2024-10-16 18:26:25 +00:00 |
|
Jake Poznanski
|
23d129fd2c
|
Organizing around a new style of dataloader
|
2024-10-16 18:06:27 +00:00 |
|
Jake Poznanski
|
a2546e0b04
|
more stuff
|
2024-10-16 17:06:03 +00:00 |
|
Jake Poznanski
|
96682b2ecb
|
Refactoring
|
2024-10-16 16:18:27 +00:00 |
|
Jake Poznanski
|
2cd863ddce
|
Dolma viewer improvements
|
2024-10-16 16:05:44 +00:00 |
|
Jake Poznanski
|
6d53683001
|
More stats hopefully running faster
|
2024-10-14 21:37:14 +00:00 |
|
Jake Poznanski
|
7b161533e2
|
Code to do local inference on fine tuned models for testing
|
2024-10-14 08:38:18 -07:00 |
|
Jake Poznanski
|
2864f907e1
|
Dataloader fix with nicer tests
|
2024-10-10 16:58:45 +00:00 |
|
Jake Poznanski
|
b7c80cd17f
|
Fix up some tests but I don't see why this isn't working
|
2024-10-10 16:58:40 +00:00 |
|
Jake Poznanski
|
a90feda42f
|
bugfixes
|
2024-10-09 20:20:06 +00:00 |
|
Jake Poznanski
|
4bf6e7a430
|
Refactoring
|
2024-10-09 18:11:18 +00:00 |
|
Jake Poznanski
|
dc6440d068
|
Cleaning up anchor text to deal with abnormally long lines
|
2024-10-09 16:29:20 +00:00 |
|
Jake Poznanski
|
230c8a9f9a
|
Trying new run that will rewrite the prompts as it goes
|
2024-10-08 22:10:18 +00:00 |
|
Jake Poznanski
|
97291b3f6a
|
Anchor is fixed to sample text elements better
|
2024-10-08 21:51:43 +00:00 |
|
Jake Poznanski
|
c8a4d14c57
|
Adding image merging to pdf report/hint/anchor
|
2024-10-08 21:23:21 +00:00 |
|
Jake Poznanski
|
ebd40f9084
|
Hopefully fixing dataloader for now
|
2024-10-07 12:59:27 -07:00 |
|
Jake Poznanski
|
5d35461dd2
|
Fix for unicode errors in big datasets for the future
|
2024-10-07 17:01:59 +00:00 |
|
Jake Poznanski
|
d8e459c9f3
|
Weird issue with surrogate pairs in json
|
2024-10-07 09:04:13 -07:00 |
|
Jake Poznanski
|
98020cabbb
|
Allow loading files locally
|
2024-10-07 07:49:16 -07:00 |
|
Jake Poznanski
|
1686790ac8
|
Checking filtering logic
|
2024-10-02 22:45:40 +00:00 |
|
Jake Poznanski
|
b340ae5092
|
A few notes, starting to test dataloader with new structured response format
|
2024-10-02 22:17:15 +00:00 |
|
Jake Poznanski
|
0071cbd788
|
Appears as if the report method works really well, might need one last step to detect rotated pages
|
2024-10-02 16:44:39 +00:00 |
|
Jake Poznanski
|
6ef8226347
|
Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text
|
2024-10-01 23:15:53 +00:00 |
|
Jake Poznanski
|
e42cecf96c
|
Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes
|
2024-10-01 22:10:58 +00:00 |
|
Jake Poznanski
|
09e8840c56
|
coherency based anchor text
|
2024-10-01 20:19:03 +00:00 |
|
Jake Poznanski
|
decfd7fbc1
|
Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high.
|
2024-09-27 22:54:07 +00:00 |
|
Jake Poznanski
|
22b765e6be
|
Going back to non iterable dataset, so shuffling works better, applying a light filter
|
2024-09-27 15:48:56 +00:00 |
|
Jake Poznanski
|
c00e40d1c4
|
More fixes
|
2024-09-26 23:10:07 +00:00 |
|
Jake Poznanski
|
d098a87ed2
|
Column name fix
|
2024-09-26 22:29:19 +00:00 |
|
Jake Poznanski
|
61dd7bb61f
|
Fix for map in iterable mode
|
2024-09-26 20:44:47 +00:00 |
|
Jake Poznanski
|
cf1aa0176e
|
Proper use of iterable_dataset
|
2024-09-26 19:55:54 +00:00 |
|
Jake Poznanski
|
9cbc128553
|
Sampling some sequence lengths
|
2024-09-25 09:05:11 -07:00 |
|
Jake Poznanski
|
4eddb1b45f
|
Okay, reasonably happy with the dataprep pipeline
|
2024-09-20 13:04:47 -07:00 |
|
Jake Poznanski
|
a47afe5c8d
|
Adding test to make sure the training and inference time tokenization stays identical, currently failing
|
2024-09-20 12:01:05 -07:00 |
|
Jake Poznanski
|
bab32aa9b3
|
Formatting
|
2024-09-18 22:52:42 +00:00 |
|
Jake Poznanski
|
f4d18cb287
|
Dataloader capable of loading 38k rows reasonably fast
|
2024-09-18 22:48:38 +00:00 |
|
Jake Poznanski
|
d22b311340
|
Starting to write dataloader for visual lm data
|
2024-09-18 21:42:09 +00:00 |
|
Jake Poznanski
|
af2126df99
|
450tok/sec/core with smollm that appears to work well
|
2024-09-17 19:59:02 +00:00 |
|
Jake Poznanski
|
2f71cb9232
|
Using SmolLM, seems a lot better and is able to pass some tests
|
2024-09-17 18:47:27 +00:00 |
|
Jake Poznanski
|
57e80aacd2
|
Testing coherence with distilgpt2, but it doesn't work great
|
2024-09-17 16:58:45 +00:00 |
|
Jake Poznanski
|
01bc0b2f10
|
Moving a whole bunch of code over, still broken
|
2024-09-17 16:26:55 +00:00 |
|
Jake Poznanski
|
a534a0180d
|
Moving pdf filter code over with tests
|
2024-09-17 15:16:58 +00:00 |
|
Jake Poznanski
|
68b2c0e8d6
|
Initial commit
|
2024-09-17 07:53:43 -07:00 |
|