1 Commits

Author SHA1 Message Date
cragwolfe
e9690b2738
feat: utility script to process large PDFs through the API by script (#3591)
Adds the bash script `process-pdf-parallel-through-api.sh`, which splits
a PDF into smaller parts (splits) and processes them through the API
concurrently. The script is re-entrant: if any splits fail to process,
rerunning the script retries only those split(s).

Note: requires the `qpdf` command line utility.
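
The split-and-resume behavior described above can be sketched roughly as follows. This is an illustration only, not the code in `process-pdf-parallel-through-api.sh`: the helper names `page_ranges` and `process_split_if_needed` are hypothetical, and the choice to treat a non-empty output file as "already processed" is an assumption.

```bash
# Hypothetical sketch; not taken from process-pdf-parallel-through-api.sh.

# Print one "start end" page range per line for a document of
# total_pages pages, cut into splits of at most split_size pages.
page_ranges() (
  total_pages=$1
  split_size=$2
  start=1
  while [ "$start" -le "$total_pages" ]; do
    end=$((start + split_size - 1))
    if [ "$end" -gt "$total_pages" ]; then
      end=$total_pages
    fi
    echo "$start $end"
    start=$((end + 1))
  done
)

# Re-entrancy: run the given command only when its output file is
# missing or empty, so a rerun skips splits that already succeeded.
process_split_if_needed() (
  out_json=$1
  shift
  if [ -s "$out_json" ]; then
    echo "Skipping processing for $out_json as it already exists."
    return 0
  fi
  # Write to a temporary file first so that a failed run never leaves
  # a partial output that a later run would mistake for success.
  if "$@" > "$out_json.tmp"; then
    mv "$out_json.tmp" "$out_json"
  else
    rm -f "$out_json.tmp"
    return 1
  fi
)

# Example wiring (assumes qpdf is installed; file names are made up):
#   page_ranges "$(qpdf --show-npages in.pdf)" 6 | while read -r s e; do
#     qpdf in.pdf --pages . "$s-$e" -- "part_${s}_to_${e}.pdf"
#   done
```

With 16 pages and a split size of 6, `page_ranges 16 6` yields the ranges 1 to 6, 7 to 12, and 13 to 16.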

The command-line output below shows a run in which only one split had
to be reprocessed through the API to create the final
`layout-parser-paper_combined.json` output.

```
$ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \
  ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_1_to_6.json as it already exists.
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists.
Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json
Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json
```
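
One simple way to cap how many splits are in flight at once is to launch background jobs in fixed-size batches and `wait` between batches. The sketch below is an assumption about how a `BATCH_SIZE`-style limit could work, not a description of the actual script; `run_in_batches` is a hypothetical helper that reads one shell command per line from standard input.

```bash
# Hypothetical sketch; the real script's batching logic may differ.

# Read one shell command per line from stdin and run the commands in
# the background, at most batch_size at a time.
run_in_batches() (
  batch_size=$1
  running=0
  while IFS= read -r cmd; do
    sh -c "$cmd" &
    running=$((running + 1))
    if [ "$running" -ge "$batch_size" ]; then
      wait        # block until every job in the current batch finishes
      running=0
    fi
  done
  wait            # catch the final, possibly partial, batch
)
```

For example, `printf '%s\n' 'sleep 1' 'sleep 1' 'sleep 1' | run_in_batches 2` runs the first two commands together and the third afterwards. Batching is simpler than a sliding window of jobs, at the cost of each batch waiting for its slowest member.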

Bonus change: `unstructured-get-json.sh` now points to the standard
hosted Serverless API, while still allowing use of the Free API via
`--freemium`.
2024-09-10 11:40:35 -07:00