unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
cragwolfe	c6b8ed4290	chore: allow changing default output dir for unstructured-get-json.sh (#3973 )	2025-03-31 22:18:57 -07:00
cragwolfe	19fc1fcc72	feat: convenience unstructured-get-json.sh update (#3971 ) * script now supports: * the --vlm flag, to process the document with the VLM strategy * optionally takes --vlm-model, --vlm-provider args * optionally also writes .html outputs by converting unstructured .json output * optionally opens those .html outputs in a browser Tested with: ``` unstructured-get-json.sh --write-html --open-html --fast layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --hi-res layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --ocr-only layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai --vlm-model gemini-2.0-flash-001 layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf ``` [layout-parser-paper-p2.pdf](https://github.com/user-attachments/files/19514007/layout-parser-paper-p2.pdf)	2025-03-31 09:45:01 -07:00
cragwolfe	238f985dda	feat: add --images support to unstructured-get-json.sh (#3888 ) E.g., now can run: ```bash # extracts base64 encoded image data for `Table` and `Image` elements $ unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf # also extracts `Title` elements (see screenshot) $ IMAGE_BLOCK_TYPES='"title","table","image"' unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf ``` It was discovered during testing that "narrativetext" does not work, probably due to camel casing of NarrativeText 😬 ![image](https://github.com/user-attachments/assets/e6414a57-81e1-4560-b1b2-dce3b1c2c804)	2025-01-27 16:09:13 -08:00
cragwolfe	9445a2dd01	chore: fix CHANGELOG formatting (#3800 ) Fixes formatting in CHANGELOG.md where most of the page was bold and indented. (verify the branch version here: https://github.com/Unstructured-IO/unstructured/blob/crag/tables-tweak/CHANGELOG.md) Bonus tweak: u-table-inspect.sh is more robust to adding borders for visualizations	2024-11-26 15:38:42 -08:00
cragwolfe	e9690b2738	feat: utility script to process large PDFs through the API by script (#3591 ) Adds the bash script `process-pdf-parallel-through-api.sh` that allows splitting up a PDF into smaller parts (splits) to be processed through the API concurrently, and is re-entrant. If any of the parts splits fail to process, one can attempt reprocessing those split(s) by rerunning the script. Note: requires the `qpdf` command line utility. The below command line output shows the scenario where just one split had to be reprocessed through the API to create the final `layout-parser-paper_combined.json` output. ``` $ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \ ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf > % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-pars\ er-paper_pages_1_to_6.json as it already exists. Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists. Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json ``` Bonus change to `unstructured-get-json.sh` to point to the standard hosted Serverless API, but allow using the Free API with --freemium.	2024-09-10 11:40:35 -07:00
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
cragwolfe	d7456ab6d2	feat: convenience script to view tables (#2124 ) Executive Summary Eyeballing or saving html in a Table element (in the `metadata.text_as_html` field) takes some manual effort. This script provides a quick way to do so given an unstructured .json file that adheres to the usual schema (i.e., that's returned by the Unstructured API). Testing Instructions Get some unstructured output that includes a table. E.g. [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13407404/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` ./unstructured-get-json.sh --tables --hi-res \ 124_PDFsam_Basel\ III\ -\ Finalising\ post-crisis\ reforms.pdf ```` Then use this the following script to view the structure and content of the tables: (note that output file was copied to the clipboard from prior command): ``` ./u-tables-inspect.sh \ "<snip>/tmp/unst-outputs/124_PDFsam_Basel III - Finalising post-crisis reforms.pdf-hi-res.json" ```	2023-11-21 22:18:39 -08:00
cragwolfe	5fa40850f4	feat: convenience script to post files to the API (#2083 ) Usage: ./unstructured-get-json.sh [options] <file>" Options: --api-key KEY Specify the API key for authentication. Set the env var $UNST_API_KEY to skip providing this option. --hi-res hi_res strategy: Enable high-resolution processing, with layout segmentation and OCR --fast fast strategy: No OCR, just extract embedded text --ocr-only ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation. --tables Enable table extraction: tables are represented as html in metadata --coordinates Include coordinates in the output --trace Enable trace logging for debugging, useful to cut and paste the executed curl call --verbose Enable verbose logging including printing first 8 elements to stdout --s3 Write the resulting output to s3 (like a pastebin) --help Display this help and exit. Arguments: <file> File to send to the API. The script requires a <file>, the document to post to the Unstructured API. The .json result is written to ~/tmp/unst-outputs/ -- this path is echoed and copied to your clipboard.	2023-11-15 22:58:28 -08:00

9 Commits