582 Commits

Author SHA1 Message Date
Jake Poznanski
0ddaf9023d Getting ready to launch a new training run 2024-10-02 23:04:56 +00:00
Jake Poznanski
1686790ac8 Checking filtering logic 2024-10-02 22:45:40 +00:00
Jake Poznanski
b340ae5092 A few notes, starting to test dataloader with new structured response format 2024-10-02 22:17:15 +00:00
Jake Poznanski
8315162a25 Merge branch 'main' of https://github.com/allenai/pdelfin 2024-10-02 20:48:58 +00:00
Jake Poznanski
6d8e638152 Readme 2024-10-02 20:48:39 +00:00
Jake Poznanski
ad1d818816 Update README.md 2024-10-02 13:42:43 -07:00
Jake Poznanski
68b9ee8c90 Small prompt fix 2024-10-02 20:19:03 +00:00
Jake Poznanski
a5c27212f0 Need more token output due to structured outputs 2024-10-02 19:54:54 +00:00
Jake Poznanski
d05832ebee Fixes and evals for structured outputs 2024-10-02 19:51:15 +00:00
Jake Poznanski
802632c49f Building openai prompt with structured output 2024-10-02 18:10:47 +00:00
Jake Poznanski
be00ccf321 Switching buildsilver to use new anchor code 2024-10-02 17:29:44 +00:00
Jake Poznanski
0071cbd788 Appears as if the report method works really well, might need one last step to detect rotated pages 2024-10-02 16:44:39 +00:00
Jake Poznanski
5703a59e50 Fix for voting on multiple docs in the same eval page 2024-10-02 16:31:59 +00:00
Jake Poznanski
73fb81ef6c Review page size option, fixing mkdirs in convertsilver script 2024-10-02 15:53:21 +00:00
Jake Poznanski
276465aab1 Adding flag to allow skipping filter 2024-10-02 15:46:12 +00:00
Jake Poznanski
549e07bed0 filtering out stupid ads 2024-10-02 15:36:41 +00:00
Jake Poznanski
6ef8226347 Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text 2024-10-01 23:15:53 +00:00
Jake Poznanski
e42cecf96c Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes 2024-10-01 22:10:58 +00:00
Jake Poznanski
09e8840c56 coherency based anchor text 2024-10-01 20:19:03 +00:00
Jake Poznanski
28fe314539 prepping anchor text generation code 2024-10-01 19:59:48 +00:00
Jake Poznanski
7795f65a53 Fixing bug where we were not showing all the worst alignments 2024-10-01 16:56:15 +00:00
Jake Poznanski
9d6e2faf95 Runeval is much improved now 2024-10-01 16:46:35 +00:00
Jake Poznanski
8a66ecee25 Script to rerun openai prompts on the same data 2024-10-01 16:25:16 +00:00
Jake Poznanski
f99f6a6729 Prompt utils 2024-10-01 16:02:24 +00:00
Jake Poznanski
b6543a4f65 Qwen checkpoint fixer script 2024-10-01 16:02:10 +00:00
Jake Poznanski
2c7323d1c4 Convert silver adjustments 2024-09-30 22:41:51 +00:00
Jake Poznanski
80bb0cbc23 Open ai to openai comparison now supported, new prompts 2024-09-30 22:08:30 +00:00
Jake Poznanski
e179453cc5 Fixing qwen checkpoint script 2024-09-30 20:34:06 +00:00
Jake Poznanski
963e946233 Convertsilver birr script can go in and out of S3 now 2024-09-30 20:06:45 +00:00
Jake Poznanski
b856b4551f Fixes to convertsilver to birr script 2024-09-30 19:54:30 +00:00
Jake Poznanski
da1982acb8 Refactoring prompts into their own new folder 2024-09-30 18:48:17 +00:00
Jake Poznanski
d74f9a352b Send silver script tries to open file first, before sending any API requests 2024-09-30 18:41:50 +00:00
Jake Poznanski
1216d9c7c9 retrieve silver script reports errors better 2024-09-30 18:41:33 +00:00
Jake Poznanski
b4e9d6a2b8 Buildsilver script supports reservoir sampling so it can sample 100M+ paths now efficiently 2024-09-30 18:41:18 +00:00
Jake Poznanski
8ec9e35f22 dataprep issue 2024-09-28 04:31:11 +00:00
Jake Poznanski
e53f782b0f Datasetdict fix 2024-09-28 03:38:29 +00:00
Jake Poznanski
decfd7fbc1 Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high. 2024-09-27 22:54:07 +00:00
Jake Poznanski
22b765e6be Going back to non iterable dataset, so shuffling works better, applying a light filter 2024-09-27 15:48:56 +00:00
Jake Poznanski
65a9c9981e Hopefully will train now 2024-09-27 15:16:12 +00:00
Jake Poznanski
e864b9d88f weird dataloader stuff now 2024-09-27 02:53:59 +00:00
Jake Poznanski
37f10051f6 typo 2024-09-27 01:19:21 +00:00
Jake Poznanski
c00e40d1c4 More fixes 2024-09-26 23:10:07 +00:00
Jake Poznanski
d098a87ed2 Column name fix 2024-09-26 22:29:19 +00:00
Jake Poznanski
84e9da637c Removing lambda due to pickling errors 2024-09-26 21:39:08 +00:00
Jake Poznanski
61dd7bb61f Fix for map in iterable mode 2024-09-26 20:44:47 +00:00
Jake Poznanski
49efa5cb40 Typo 2024-09-26 19:57:53 +00:00
Jake Poznanski
cf1aa0176e Proper use of iterable_dataset 2024-09-26 19:55:54 +00:00
Jake Poznanski
05fdb81da2 map and filter on iterable dataset 2024-09-26 19:01:34 +00:00
Jake Poznanski
f14e910175 bnb 2024-09-26 03:30:35 +00:00
Jake Poznanski
7707bc08da trying cheaper optimizer to solve ooms 2024-09-25 22:56:05 +00:00