7 Commits

Author SHA1 Message Date
ryannikolaidis
7d157c1ede
test: add benchmark script (#638) 2023-06-05 09:14:43 -07:00
Mallori Harrell
34d563c1fc
feat: Create spacy notebook example (#593)
* add new notebook for spacy
2023-05-17 15:42:15 -05:00
Yuming Long
5b6f11bb88
Chore(ingest): Add --partition-strategy parameter in CLI (#582)
* change strategy arg default to auto in partition

* passing --partition-strategy down

* add strategy="hi_res" to test (default changed)

* made an error on param name, added note
2023-05-15 19:26:53 +00:00
Matt Robinson
aa01cdfc7a
fix: group together text from the same bounding box in partition_pdf with fast strategy (#542)
* switch to using PDF objects

* linting, linting, linting

* couple more tweaks

* added test for chevron-page

* version and changelog

* linting, linting, linting

* now processing 4 files
2023-05-03 18:33:24 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
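Note on #488: the control flow described in that message can be sketched roughly as follows. This is a minimal illustration, not the project's actual test script; `run_ingest_then_compare`, `ingest_cmd`, and `compare_outputs` are hypothetical names introduced here.

```python
import subprocess
import sys


def run_ingest_then_compare(ingest_cmd, compare_outputs):
    """Run an ingest command and compare outputs only if it succeeded.

    Hypothetical helper: `ingest_cmd` is an argv list, `compare_outputs`
    is a callable returning True when outputs match the stored fixtures.
    Returns an (exit_code, message) tuple.
    """
    result = subprocess.run(ingest_cmd)
    if result.returncode != 0:
        # The ingest itself failed: report that and stop, without any
        # hint about mismatched outputs or regenerating fixtures.
        return result.returncode, f"ingest failed with exit code {result.returncode}"
    if not compare_outputs():
        # Only a successful ingest can meaningfully have mismatched outputs.
        return 1, "outputs differ from fixtures; regenerate fixtures if the change is expected"
    return 0, "ok"


if __name__ == "__main__":
    # Simulate a failing ingest with a child process that exits with code 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_ingest_then_compare(failing, lambda: False))
```

The point of the ordering is that the fixture comparison is never reached when the ingest command exits non-zero, so the only message shown is the one about the real failure.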
Austin Walker
4af4d33423
feat: add --partition-by-api and --partition-host to unstructured-ingest (#443)
* Add --partition-by-api and --partition-host args to ingest

* Fix error in make check

* Bump changelog

* Add a test ingest script

Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.

* Remove the content type workaround
2023-04-11 22:05:07 -07:00