Jake Poznanski
|
dc26541da2
|
Starting code to build parquets...
|
2024-10-07 20:59:43 +00:00 |
|
Jake Poznanski
|
4557a5b296
|
Typo
|
2024-10-07 13:03:31 -07:00 |
|
Jake Poznanski
|
e973de7ba9
|
Typo
|
2024-10-07 13:01:43 -07:00 |
|
Jake Poznanski
|
ebd40f9084
|
Hopefully fixing dataloader for now
|
2024-10-07 12:59:27 -07:00 |
|
Jake Poznanski
|
5d35461dd2
|
Fix for unicode errors in big datasets for the future
|
2024-10-07 17:01:59 +00:00 |
|
Jake Poznanski
|
44bcdc771b
|
Hopefully can use weka for the train datasets now
|
2024-10-07 16:14:28 +00:00 |
|
Jake Poznanski
|
d8e459c9f3
|
Weird issue with surrogate pairs in json
|
2024-10-07 09:04:13 -07:00 |
|
Jake Poznanski
|
98020cabbb
|
Allow loading files locally
|
2024-10-07 07:49:16 -07:00 |
|
Jake Poznanski
|
13123ddea4
|
Pinning datasets to work around weird issue
|
2024-10-06 03:56:27 +00:00 |
|
Jake Poznanski
|
568dd48509
|
Prepping for qwen2vl full training run
|
2024-10-05 04:04:45 +00:00 |
|
Jake Poznanski
|
6065da268b
|
Hopefully working better
|
2024-10-04 18:06:04 +00:00 |
|
Jake Poznanski
|
a2ff849a78
|
checkpoint on new runner for openai batches
|
2024-10-04 17:32:35 +00:00 |
|
Jake Poznanski
|
2da901d433
|
new better runopenaibatch script
|
2024-10-04 16:58:38 +00:00 |
|
Jake Poznanski
|
35ec67c427
|
Hopefully finishing touches
|
2024-10-04 16:10:19 +00:00 |
|
Jake Poznanski
|
db36608b42
|
Fix
|
2024-10-04 16:05:08 +00:00 |
|
Jake Poznanski
|
f25cb6c261
|
Fixes
|
2024-10-04 15:54:00 +00:00 |
|
Jake Poznanski
|
4630f7b1cb
|
Bugfixes
|
2024-10-04 15:35:52 +00:00 |
|
Jake Poznanski
|
e87729a653
|
New send silver script for testing
|
2024-10-04 15:27:43 +00:00 |
|
Jake Poznanski
|
6e1094ee8a
|
Support for more evals and output formats
|
2024-10-03 20:19:52 +00:00 |
|
Jake Poznanski
|
974ddd3773
|
I'm pretty sure we only need to save on rank0
|
2024-10-03 11:30:44 -07:00 |
|
Jake Poznanski
|
8f1fa4f796
|
Running a mini config again with metric
|
2024-10-03 11:12:30 -07:00 |
|
Jake Poznanski
|
046d4a4534
|
Adding eval on start and seed params
|
2024-10-03 10:54:25 -07:00 |
|
Jake Poznanski
|
2227605bfb
|
Mini train config
|
2024-10-03 10:32:15 -07:00 |
|
Jake Poznanski
|
4505a49420
|
Pinning to normal transformers version now
|
2024-10-03 09:00:53 -07:00 |
|
Jake Poznanski
|
78e3a94173
|
Adding pluto ib
|
2024-10-03 15:33:17 +00:00 |
|
Jake Poznanski
|
0ddaf9023d
|
Getting ready to launch a new training run
|
2024-10-02 23:04:56 +00:00 |
|
Jake Poznanski
|
1686790ac8
|
Checking filtering logic
|
2024-10-02 22:45:40 +00:00 |
|
Jake Poznanski
|
b340ae5092
|
A few notes, starting to test dataloader with new structured response format
|
2024-10-02 22:17:15 +00:00 |
|
Jake Poznanski
|
8315162a25
|
Merge branch 'main' of https://github.com/allenai/pdelfin
|
2024-10-02 20:48:58 +00:00 |
|
Jake Poznanski
|
6d8e638152
|
Readme
|
2024-10-02 20:48:39 +00:00 |
|
Jake Poznanski
|
ad1d818816
|
Update README.md
|
2024-10-02 13:42:43 -07:00 |
|
Jake Poznanski
|
68b9ee8c90
|
Small prompt fix
|
2024-10-02 20:19:03 +00:00 |
|
Jake Poznanski
|
a5c27212f0
|
Need more token output due to structured outputs
|
2024-10-02 19:54:54 +00:00 |
|
Jake Poznanski
|
d05832ebee
|
Fixes and evals for structured outputs
|
2024-10-02 19:51:15 +00:00 |
|
Jake Poznanski
|
802632c49f
|
Building openai prompt with structured output
|
2024-10-02 18:10:47 +00:00 |
|
Jake Poznanski
|
be00ccf321
|
Switching buildsilver to use new anchor code
|
2024-10-02 17:29:44 +00:00 |
|
Jake Poznanski
|
0071cbd788
|
Appears as if the report method works really well, might need one last step to detect rotated pages
|
2024-10-02 16:44:39 +00:00 |
|
Jake Poznanski
|
5703a59e50
|
Fix for voting on multiple docs in the same eval page
|
2024-10-02 16:31:59 +00:00 |
|
Jake Poznanski
|
73fb81ef6c
|
Review page size option, fixing mkdirs in convertsilver script
|
2024-10-02 15:53:21 +00:00 |
|
Jake Poznanski
|
276465aab1
|
Adding flag to allow skipping filter
|
2024-10-02 15:46:12 +00:00 |
|
Jake Poznanski
|
549e07bed0
|
filtering out stupid ads
|
2024-10-02 15:36:41 +00:00 |
|
Jake Poznanski
|
6ef8226347
|
Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text
|
2024-10-01 23:15:53 +00:00 |
|
Jake Poznanski
|
e42cecf96c
|
Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes
|
2024-10-01 22:10:58 +00:00 |
|
Jake Poznanski
|
09e8840c56
|
coherency based anchor text
|
2024-10-01 20:19:03 +00:00 |
|
Jake Poznanski
|
28fe314539
|
prepping anchor text generation code
|
2024-10-01 19:59:48 +00:00 |
|
Jake Poznanski
|
7795f65a53
|
Fixing bug where we were not showing all the worst alignments
|
2024-10-01 16:56:15 +00:00 |
|
Jake Poznanski
|
9d6e2faf95
|
Runeval is much improved now
|
2024-10-01 16:46:35 +00:00 |
|
Jake Poznanski
|
8a66ecee25
|
Script to rerun openai prompts on the same data
|
2024-10-01 16:25:16 +00:00 |
|
Jake Poznanski
|
f99f6a6729
|
Prompt utils
|
2024-10-01 16:02:24 +00:00 |
|
Jake Poznanski
|
b6543a4f65
|
Qwen checkpoint fixer script
|
2024-10-01 16:02:10 +00:00 |
|