unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-31 10:03:07 +00:00

Author	SHA1	Message	Date
cragwolfe	aaea6358f6	build(deps): bump pip (#558 )	2023-05-08 23:08:10 -07:00
ryannikolaidis	2fc4d37454	chore: pin inference version, bump deps, and update openssl (#551 )	2023-05-08 17:02:55 -07:00
Matt Robinson	3d3f3df3ec	enhancement: add "ocr_only" strategy for PDFs (#553 ) * add tests for validating strategy * refactor into determine_pdf_strategy function * refactor pdf strategies into strategies * remove commented out code * remove unreachable code * add in handling for image types * a little more refactoring * import ocr partitioning for images * catch warnings, partition type for valid strategies * fallback to ocr_only from fast * fallback logic for hi_res * test for fallback to ocr only * fallback logic for ocr_only * more tests for fallback logic * update doc strings * version and changelog * linting, linting, linting * update docs to include notes about strategy * fix typos * change back patched filename 0.6.4	2023-05-08 17:21:24 +00:00
Trevor Bossert	1ac72c6ee8	Fixes issue where detectron2 would not install on OSX (#552 ) * Fixes issue where detectron2 would not install on OSX Tested on Apple silicon based MacBook Pro. This installs tensorboard which is required on OSX and arm based cpu’s for detectron2. * Improve Arch detection for tensorboard * remove makefile from commands in readme pin tensorboard version	2023-05-05 17:16:28 -07:00
Matt Robinson	0fc0571c02	fix(ci): don't skip deploy for tags (#549 )	2023-05-05 09:51:41 -04:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> 0.6.3	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	981805e435	feat: `stage_for_baseplate` function (#546 ) * added a staging brick for baseplate * added a test for baseplate * update documentation * version and changelog	2023-05-04 11:05:38 -04:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	7e43a25f07	feat: add `partition_multiple_via_api` function (#539 ) * added function for multiple files via api * make multiple work with files * updated docs strings * changelog and version * docs and contextlib for open files * tests for partition multiple * add tests for error conditions * add output example	2023-05-03 15:06:06 -04:00
Matt Robinson	3c3c59a726	build(deps): add pdfminer.six to dependencies (#537 )	2023-05-02 15:36:12 +00:00
Matt Robinson	19488bf15f	ci: only build docs on tags (#538 ) * ci: only build docs on tags * add branch for docs builds	2023-05-02 15:15:23 +00:00
dependabot[bot]	61209b34bd	build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530 ) Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2. - [Release notes](https://github.com/aio-libs/yarl/releases) - [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst) - [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2) --- updated-dependencies: - dependency-name: yarl dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-01 18:18:45 -04:00
Matt Robinson	e805ed465d	docs: add slack and github links back into docs page (#535 ) * stars and github link to top of page * wording updates * remove unnecessary font weight change * remove next arrows * buttons to bottom on sidebar	2023-05-01 18:17:52 -04:00
Matt Robinson	22ebfa6714	docs: add download badges to README (#536 ) * downloads badge * total downloads	2023-05-01 18:17:31 -04:00
dependabot[bot]	7f9ec8108d	build(deps): bump importlib-metadata in /requirements (#531 ) Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0. - [Release notes](https://github.com/python/importlib_metadata/releases) - [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst) - [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0) --- updated-dependencies: - dependency-name: importlib-metadata dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-01 21:49:26 +00:00
dependabot[bot]	8ed1627928	build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528 ) Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1. - [Release notes](https://github.com/huggingface/huggingface_hub/releases) - [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1) --- updated-dependencies: - dependency-name: huggingface-hub dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-05-01 17:22:21 -04:00
Matt Robinson	1b8e9a353a	version bump for release 0.6.2	2023-04-26 16:29:16 -04:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	4156cb12e0	feat: `partition_via_api` helper function (#518 ) * added function for partitioning via api * added tests for api function * changelog and version * add docs for partition_via_api	2023-04-26 09:05:35 -04:00
JaeyongLee	be8e6da884	fix: correct return types in `exceeds_caps_ratio` (#489 ) * fix: fix text_type.py exceeds_cap_ratio() returns There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected * Update text_type.py exceeds_cap_ratio() .. * Update text_type.py .. * Update CHANGELOG.md .. * linting, linting, linting ... * update tests * more test fixes * Update text_type.py .. * bump version and changelog * add punctuation check --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-04-24 10:45:09 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated parameter names that control whether we try to infer table structure. 0.6.1	2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso	ba59ad6b3a	chore: add copy-protected pdf to sample-docs (#512 )	2023-04-21 18:02:38 +00:00
Matt Robinson	a7a9ccd3a4	ci: separate job for ingest tests (#511 ) * separate job for ingest tests * remove lint from description	2023-04-21 13:31:36 -04:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf. 0.6.0	2023-04-21 17:01:29 +00:00
Mallori Harrell	5d1e61cb3f	feat: add msg attachment support (#510 ) * add msg function and fix bug in eml attachment function	2023-04-21 11:14:46 -05:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
natygyoon	db2f70dbc4	sync version-sync.sh with other repos (#508 )	2023-04-21 05:48:38 +09:00
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	43854e367a	docs: fix incomplete hi_res docs (#505 )	2023-04-20 09:43:33 -04:00
Amanda Cameron	db6e5b41b8	chore: updating readme with api announcement (#499 ) * updating readme	2023-04-19 11:59:26 -07:00
Matt Robinson	87c6d5e679	build: version bump for 0.5.13 release (#501 ) 0.5.13	2023-04-19 14:35:45 -04:00
Matt Robinson	4e1cc5ab3d	fix: add slack to fixture update script (#500 )	2023-04-19 18:16:44 +00:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
Shukri	396295fc04	fix: formatting error in sphinx docs (#498 ) * fix: formatting error in sphinx docs	2023-04-17 23:13:09 -07:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
Shukri	8d4308af43	doc: typo (#495 ) XML/HTML Depenedencies -> XML/HTML Dependencies	2023-04-17 20:26:50 -07:00
qued	3a61046307	fix: Fix typo in function call (#491 ) Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.	2023-04-17 23:37:50 +00:00
cragwolfe	5657378602	test: avoid misleading output in ingest tests (#488 ) Previously, if there was an error (non-zero exit code) in an ingest test script, the script would still complete and echo a warning about mismatched outputs and how to regenerate the fixtures. However, this statement is irrelevant and misleading: if the ingest failed with a non-zero exit code in the first place, that is the failure that should be debugged -- don't confuse the user with a comment about outputs.	2023-04-17 21:57:44 +00:00
pravin-unstructured	4020da56ad	Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484 )	2023-04-17 11:53:25 -04:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	a11563fe63	fix: update ingest test fixtures, disable biomed test (#486 ) * Update test fixtures that should have been updated in prior commit * Disable biomed ingest tests for now, they fail more often than not * Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.	2023-04-15 00:07:09 +00:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
cragwolfe	46ac2a2226	build(CI): add access token for github-ingest test (#482 ) Avoids the occasional CI test failures in test-ingest-github.sh that were due to rate-limited non-auth'ed requests against a GitHub repo.	2023-04-14 11:14:21 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
jonvet	7f0f33ddb0	fix: encode xml string if document_tree is `None` in `_read_xml` (#477 ) * fix: encode xml string if document_tree is `None` in `_read_xml` * don't encode text in test	2023-04-13 09:09:58 -04:00

459 Commits