unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-26 17:53:24 +00:00

Author	SHA1	Message	Date
Matt Robinson	0fc0571c02	fix(ci): don't skip deploy for tags (#549 )	2023-05-05 09:51:41 -04:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> 0.6.3	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	981805e435	feat: `stage_for_baseplate` function (#546 ) * added a staging brick for baseplate * added a test for baseplate * update documentation * version and changelog	2023-05-04 11:05:38 -04:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	7e43a25f07	feat: add `partition_multiple_via_api` function (#539 ) * added function for multiple files via api * make multiple work with files * updated docs strings * changelog and version * docs and contextlib for open files * tests for partition multiple * add tests for error conditions * add output example	2023-05-03 15:06:06 -04:00
Matt Robinson	3c3c59a726	build(deps): add pdfminer.six to dependencies (#537 )	2023-05-02 15:36:12 +00:00
Matt Robinson	19488bf15f	ci: only build docs on tags (#538 ) * ci: only build docs on tags * add branch for docs builds	2023-05-02 15:15:23 +00:00
dependabot[bot]	61209b34bd	build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530 ) Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2. - [Release notes](https://github.com/aio-libs/yarl/releases) - [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst) - [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2) --- updated-dependencies: - dependency-name: yarl dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-01 18:18:45 -04:00
Matt Robinson	e805ed465d	docs: add slack and github links back into docs page (#535 ) * stars and github link to top of page * wording updates * remove unnecessary font weight change * remove next arrows * buttons to bottom on sidebar	2023-05-01 18:17:52 -04:00
Matt Robinson	22ebfa6714	docs: add download badges to README (#536 ) * downloads badge * total downloads	2023-05-01 18:17:31 -04:00
dependabot[bot]	7f9ec8108d	build(deps): bump importlib-metadata in /requirements (#531 ) Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0. - [Release notes](https://github.com/python/importlib_metadata/releases) - [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst) - [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0) --- updated-dependencies: - dependency-name: importlib-metadata dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-01 21:49:26 +00:00
dependabot[bot]	8ed1627928	build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528 ) Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1. - [Release notes](https://github.com/huggingface/huggingface_hub/releases) - [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1) --- updated-dependencies: - dependency-name: huggingface-hub dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-05-01 17:22:21 -04:00
Matt Robinson	1b8e9a353a	version bump for release 0.6.2	2023-04-26 16:29:16 -04:00
Matt Robinson	9fdc310358	fix: update `detect_filetype` for JSONs with text/plain MIME type (#520 ) * check to see if text file is a json * add json check into filetype detection * added test for updated file detection logic * bytes/strings handling * changelog and version bump	2023-04-26 13:52:47 -04:00
Matt Robinson	4156cb12e0	feat: `partition_via_api` helper function (#518 ) * added function for partitioning via api * added tests for api function * changelog and version * add docs for partition_via_api	2023-04-26 09:05:35 -04:00
JaeyongLee	be8e6da884	fix: correct return types in `exceeds_caps_ratio` (#489 ) * fix: fix text_type.py exceeds_cap_ratio() returns There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected * Update text_type.py exceeds_cap_ratio() .. * Update text_type.py .. * Update CHANGELOG.md .. * linting, linting, linting ... * update tests * more test fixes * Update text_type.py .. * bump version and changelog * add punctuation check --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-04-24 10:45:09 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	5b6640a55a	chore: change table param name (#513 ) Updated parameter names that controls whether we try to infer table structure. 0.6.1	2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso	ba59ad6b3a	chore: add copy-protected pdf to sample-docs (#512 )	2023-04-21 18:02:38 +00:00
Matt Robinson	a7a9ccd3a4	ci: separate job for ingest tests (#511 ) * separate job for ingest tests * remove lint from description	2023-04-21 13:31:36 -04:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf. 0.6.0	2023-04-21 17:01:29 +00:00
Mallori Harrell	5d1e61cb3f	feat: add msg attachment support (#510 ) * add msg function and fix bug in eml attachment function	2023-04-21 11:14:46 -05:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
natygyoon	db2f70dbc4	sync version-sync.sh with other repos (#508 )	2023-04-21 05:48:38 +09:00
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	43854e367a	docs: fix incomplete hi_res docs (#505 )	2023-04-20 09:43:33 -04:00
Amanda Cameron	db6e5b41b8	chore: updating readme with api announcement (#499 ) * updating readme	2023-04-19 11:59:26 -07:00
Matt Robinson	87c6d5e679	build: version bump for 0.5.13 release (#501 ) 0.5.13	2023-04-19 14:35:45 -04:00
Matt Robinson	4e1cc5ab3d	fix: add slack to fixture update script (#500 )	2023-04-19 18:16:44 +00:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
Shukri	396295fc04	fix: formatting error in sphinx docs (#498 ) * fix: formatting error in sphinx docs	2023-04-17 23:13:09 -07:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests, including an xfail test for #492.	2023-04-17 23:11:21 -07:00
Shukri	8d4308af43	doc: typo (#495 ) XML/HTML Depenedencies -> XML/HTML Dependencies	2023-04-17 20:26:50 -07:00
qued	3a61046307	fix: Fix typo in function call (#491 ) Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.	2023-04-17 23:37:50 +00:00
cragwolfe	5657378602	test: avoid misleading output in ingest tests (#488 ) Previously, if there was an error (non-zero exit code) in an ingest test script, the script would still complete and echo a warning about mismatched outputs and how to regenerate the fixtures. However, this statement is irrelevant and misleading: if the ingest failed with a non-zero exit code in the first place, that is the failure that should be debugged -- don't confuse the user with a comment about outputs.	2023-04-17 21:57:44 +00:00
pravin-unstructured	4020da56ad	Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484 )	2023-04-17 11:53:25 -04:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	a11563fe63	fix: update ingest test fixtures, disable biomed test (#486 ) * Update test fixtures that should have been updated in prior commit * Disable biomed ingest tests for now, they fail more often than not * Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.	2023-04-15 00:07:09 +00:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
cragwolfe	46ac2a2226	build(CI): add access token for github-ingest test (#482 ) Avoids the occasional CI test failures in test-ingest-github.sh that were due to rate-limited non-auth'ed requests against a GitHub repo.	2023-04-14 11:14:21 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
jonvet	7f0f33ddb0	fix: encode xml string if document_tree is `None` in `_read_xml` (#477 ) * fix: encode xml string if document_tree is `None` in `_read_xml` * don't encode text in test	2023-04-13 09:09:58 -04:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata 0.5.12	2023-04-12 18:31:01 +00:00
qued	2110a266c8	fix: fix github issue formatting (#471 ) Attempting to fix formatting of github issues transferred to Jira. The old format was attempting to use double-slashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo. Now attempting to use formatted text in the yaml with \|. This worked in the test repo, but I guess that's no guarantee.	2023-04-12 16:59:12 +00:00
Austin Walker	4af4d33423	feat: add --partition-by-api and --partition-host to unstructured-ingest (#443 ) * Add --partition-by-api and --partition-host args to ingest * Fix error in make check * Bump changelog * Add a test ingest script Also add a workaround for the test causing 400s from our api. Seems we need to make sure unstructured-api can handle getting a file.content_type of None. * Remove the content type workaround	2023-04-11 22:05:07 -07:00
cragwolfe	ba4dadaa98	build: skip biomed ingest tests 90% of time due to ftp connectivity (#467 )	2023-04-11 11:27:38 -07:00

805 Commits