Jake Poznanski
|
4d6eaf654d
|
Merge branch 'main' of https://github.com/allenai/pdelfin
|
2024-10-14 16:30:51 +00:00 |
|
Jake Poznanski
|
89d4ee2145
|
Pipeline work
|
2024-10-14 16:30:49 +00:00 |
|
Jake Poznanski
|
7b161533e2
|
Code to do local inference on fine tuned models for testing
|
2024-10-14 08:38:18 -07:00 |
|
Jake Poznanski
|
5a7377af30
|
Refactoring
|
2024-10-11 22:57:49 +00:00 |
|
Jake Poznanski
|
4fd6066600
|
gpt cleanup
|
2024-10-11 22:41:09 +00:00 |
|
Jake Poznanski
|
a45f86e4a4
|
More cleanup
|
2024-10-11 22:37:32 +00:00 |
|
Jake Poznanski
|
53fdb6108c
|
More pipeline code
|
2024-10-11 21:50:09 +00:00 |
|
Jake Poznanski
|
10b7a58d28
|
fix
|
2024-10-11 20:22:58 +00:00 |
|
Jake Poznanski
|
f477a68621
|
dbmanager
|
2024-10-11 16:24:29 +00:00 |
|
Jake Poznanski
|
2dccc4be3b
|
Oops removing print
|
2024-10-11 16:23:14 +00:00 |
|
Jake Poznanski
|
aea3f7f1fe
|
Fix for anchor generation on pdfs with no text elements
|
2024-10-11 15:01:01 +00:00 |
|
Jake Poznanski
|
af03358c47
|
assemble
|
2024-10-10 22:36:09 +00:00 |
|
Jake Poznanski
|
312847acac
|
Ok, finally working nicely to build the page index
|
2024-10-10 22:30:09 +00:00 |
|
Jake Poznanski
|
312ee8d953
|
pipeline script
|
2024-10-10 22:13:43 +00:00 |
|
Jake Poznanski
|
49b5b233c3
|
Working on new pipeline script
|
2024-10-10 22:10:26 +00:00 |
|
Jake Poznanski
|
a8b50ae8fa
|
Preloading the datasets directly
|
2024-10-10 19:57:51 +00:00 |
|
Jake Poznanski
|
85f2dc6d26
|
Fixes
|
2024-10-10 18:52:42 +00:00 |
|
Jake Poznanski
|
2864f907e1
|
Dataloader fix with nicer tests
|
2024-10-10 16:58:45 +00:00 |
|
Jake Poznanski
|
b7c80cd17f
|
Fix up some tests but I don't see why this isn't working
|
2024-10-10 16:58:40 +00:00 |
|
Jake Poznanski
|
3245990216
|
Faster eval script
|
2024-10-10 15:22:33 +00:00 |
|
Jake Poznanski
|
931f48c3d1
|
Allow eval script to support one more type of jsonls, runpipeline multiglobs, other fixes
|
2024-10-09 23:39:13 +00:00 |
|
Jake Poznanski
|
c6bdf69d8f
|
First stab at document assembly
|
2024-10-09 22:19:16 +00:00 |
|
Jake Poznanski
|
847064f46f
|
Taking notes, starting on document assembly
|
2024-10-09 22:14:28 +00:00 |
|
Jake Poznanski
|
8e5809da71
|
runpipeline
|
2024-10-09 20:29:59 +00:00 |
|
Jake Poznanski
|
a90feda42f
|
bugfixes
|
2024-10-09 20:20:06 +00:00 |
|
Jake Poznanski
|
c2909f314e
|
run pipeline
|
2024-10-09 19:55:45 +00:00 |
|
Jake Poznanski
|
954b19a5d4
|
Stuff
|
2024-10-09 19:55:04 +00:00 |
|
Jake Poznanski
|
991b213cf5
|
Refactoring, starting to write run_pipeline
|
2024-10-09 18:48:31 +00:00 |
|
Jake Poznanski
|
4bf6e7a430
|
Refactoring
|
2024-10-09 18:11:18 +00:00 |
|
Jake Poznanski
|
0c56dec704
|
Adding diff to tinyhost
|
2024-10-09 17:53:26 +00:00 |
|
Jake Poznanski
|
400e92180b
|
Unifying some of the pdf rendering stuff
|
2024-10-09 16:57:13 +00:00 |
|
Jake Poznanski
|
dc6440d068
|
Cleaning up anchor text to deal with abnormally long lines
|
2024-10-09 16:29:20 +00:00 |
|
Jake Poznanski
|
b6b74b7832
|
Rewriting prompts to eval with new model
|
2024-10-09 16:04:39 +00:00 |
|
Jake Poznanski
|
7c19a9a856
|
fix
|
2024-10-08 23:54:17 +00:00 |
|
Jake Poznanski
|
ad10add6c1
|
try lower lr
|
2024-10-08 23:52:56 +00:00 |
|
Jake Poznanski
|
230c8a9f9a
|
Trying new run that will rewrite the prompts as it goes
|
2024-10-08 22:10:18 +00:00 |
|
Jake Poznanski
|
97291b3f6a
|
Anchor is fixed to sample text elements better
|
2024-10-08 21:51:43 +00:00 |
|
Jake Poznanski
|
c8a4d14c57
|
Adding image merging to pdf report/hint/anchor
|
2024-10-08 21:23:21 +00:00 |
|
Jake Poznanski
|
57d9a21eeb
|
Adding prompt length histogram to a script
|
2024-10-08 18:22:56 +00:00 |
|
Jake Poznanski
|
adc702c918
|
Fixing wandb key
|
2024-10-08 18:16:39 +00:00 |
|
Jake Poznanski
|
085937859f
|
Lower lr
|
2024-10-08 17:52:00 +00:00 |
|
Jake Poznanski
|
4b30dd867b
|
Fixing eval script, working FSDP config
|
2024-10-08 16:56:07 +00:00 |
|
Jake Poznanski
|
f5fd9ff53a
|
Trying grad checkpoint
|
2024-10-08 16:11:31 +00:00 |
|
Jake Poznanski
|
4fb7e9b184
|
Updated eval script
|
2024-10-08 16:09:25 +00:00 |
|
Jake Poznanski
|
fb4e585e9f
|
Trying out non-lora training
|
2024-10-08 15:20:37 +00:00 |
|
Jake Poznanski
|
ec09408ca9
|
Filtering based on cpu count
|
2024-10-07 15:40:29 -07:00 |
|
Jake Poznanski
|
a90eb94951
|
Fix dataloader bug
|
2024-10-07 15:25:48 -07:00 |
|
Jake Poznanski
|
3d36545fa5
|
loading fix for parquets again...
|
2024-10-07 14:48:53 -07:00 |
|
Jake Poznanski
|
fdcd77eadd
|
typo
|
2024-10-07 14:32:47 -07:00 |
|
Jake Poznanski
|
7416b42023
|
Adding support for parquet datasets which are precached
|
2024-10-07 21:14:33 +00:00 |
|