unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-20 19:54:49 +00:00

Author	SHA1	Message	Date
fran-unstructured	26da51c765	docs: Add source code links to bricks' docs (#923 ) Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>	2023-07-13 17:27:47 +00:00
Matt Robinson	9b830693bd	fix: adds to list of extensions to check if a file has a plain text MIME type (#916 ) * added .txt, .text, and .tab to text file list * changelog and version	2023-07-12 20:07:43 +00:00
fran-unstructured	f7b3c0f741	docs: adds connectors' documentation (#917 ) * Add connectors documentation * Add connectors documentation with corrections and index.rst update * Add connectors documentation - add API information --------- Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-07-12 14:56:09 -04:00
dependabot[bot]	f490b82d5b	build(deps): bump praw from 7.7.0 to 7.7.1 in /requirements (#922 ) Bumps [praw](https://github.com/praw-dev/praw) from 7.7.0 to 7.7.1. - [Release notes](https://github.com/praw-dev/praw/releases) - [Changelog](https://github.com/praw-dev/praw/blob/master/CHANGES.rst) - [Commits](https://github.com/praw-dev/praw/compare/v7.7.0...v7.7.1) --- updated-dependencies: - dependency-name: praw dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 14:55:19 -04:00
dependabot[bot]	a7d6edc528	build(deps): bump google-api-python-client in /requirements (#921 ) Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.92.0 to 2.93.0. - [Release notes](https://github.com/googleapis/google-api-python-client/releases) - [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md) - [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.92.0...v2.93.0) --- updated-dependencies: - dependency-name: google-api-python-client dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 14:55:03 -04:00
Matt Robinson	a583d47b84	docs: update table and API documentation (#919 ) * more detailed api docs * add table docs * remove rtf/epubs comment * remove confusing request_kwargs verbiage * add missing a	2023-07-12 12:59:59 -04:00
dependabot[bot]	1fa944ec87	build(deps): bump black from 23.3.0 to 23.7.0 in /requirements (#920 ) Bumps [black](https://github.com/psf/black) from 23.3.0 to 23.7.0. - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/compare/23.3.0...23.7.0) --- updated-dependencies: - dependency-name: black dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 15:30:57 +00:00
dependabot[bot]	80bdd60b32	build(deps): bump protobuf from 3.20.3 to 4.23.4 in /requirements (#910 ) Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.3 to 4.23.4. - [Release notes](https://github.com/protocolbuffers/protobuf/releases) - [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl) - [Commits](https://github.com/protocolbuffers/protobuf/compare/v3.20.3...v4.23.4) --- updated-dependencies: - dependency-name: protobuf dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-12 10:41:02 -04:00
Roman Isecke	8b233b4f62	Set the version to 0.8.1 (#914 )	2023-07-11 10:27:54 -04:00
Emily Chen	2635b0be07	Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913 )	2023-07-10 21:47:41 -07:00
Matt Robinson	b3936893b8	build: add python 3.11 to CI (#908 ) * remove argilla; bump reqs * enable py 3.11 * add 3.11 to setup.py * make pip-compile * ignore cli mypy errors * install argilla * fix constraints * install argilla * changelog and version * skip argilla in docker * dont import argilla in docker * skip all of argilla if in container * only import argilla if outside docker * more docker skips * remove weird pypi settings	2023-07-10 18:52:25 +00:00
Trevor Bossert	66f2d4b280	Add both arm and amd builds to manifests (#899 )	2023-07-10 10:15:15 -07:00
John	6173362620	fix: detect list items in MS Word documents (#909 ) * fix merge conflict * update changelog and version	2023-07-10 15:29:08 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Matt Robinson	f51ae45050	fix: grab all metadata fields in `convert_to_dataframe` (#893 ) * add all fieldnames to dataframe * drop empty columns in convert_to_dataframe * test for maintaining metadata * version and changelog	2023-07-07 20:04:35 +00:00
dependabot[bot]	c8e6f0e141	build(deps): bump elasticsearch from 8.8.0 to 8.8.2 in /requirements (#898 ) Bumps [elasticsearch](https://github.com/elastic/elasticsearch-py) from 8.8.0 to 8.8.2. - [Release notes](https://github.com/elastic/elasticsearch-py/releases) - [Commits](https://github.com/elastic/elasticsearch-py/compare/v8.8.0...v8.8.2) --- updated-dependencies: - dependency-name: elasticsearch dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 19:19:45 +00:00
dependabot[bot]	05d51cfb4f	build(deps): bump ruff from 0.0.275 to 0.0.277 in /requirements (#897 ) Bumps [ruff](https://github.com/astral-sh/ruff) from 0.0.275 to 0.0.277. - [Release notes](https://github.com/astral-sh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md) - [Commits](https://github.com/astral-sh/ruff/compare/v0.0.275...v0.0.277) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 13:51:52 -04:00
dependabot[bot]	7f9532f8b3	build(deps): bump lxml from 4.9.2 to 4.9.3 in /requirements (#896 ) Bumps [lxml](https://github.com/lxml/lxml) from 4.9.2 to 4.9.3. - [Release notes](https://github.com/lxml/lxml/releases) - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt) - [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.2...lxml-4.9.3) --- updated-dependencies: - dependency-name: lxml dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 13:51:05 -04:00
dependabot[bot]	1619a0b6a2	build(deps): bump google-api-python-client in /requirements (#895 ) Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.91.0 to 2.92.0. - [Release notes](https://github.com/googleapis/google-api-python-client/releases) - [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md) - [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.91.0...v2.92.0) --- updated-dependencies: - dependency-name: google-api-python-client dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-07 13:50:25 -04:00
Roman Isecke	5e1150184c	Add optional param for model name when partitioning pdfs (#890 ) * Add optional param for model name when partitioning pdfs * Pull in latest inference changes * Fix linting 0.8.0	2023-07-07 11:16:55 -04:00
Christine Straub	47bc4009a8	fix: adjust threshold for encoding detection (#894 ) * chore: add example doc * fix: adjust encoding recognition threshold value in `detect_file_encoding` * test: add test cases for German characters * chore: update changelog & version	2023-07-07 09:25:03 -04:00
Matt Robinson	52aced8677	fix: validate encodings from email headers (#881 ) * add validate encoding function * remove extraneous file * added test case for malformed encoding * version and changelog	2023-07-06 13:49:27 +00:00
cragwolfe	209054f0db	build(image): revert docker build tweak for arm64 (#887 ) arm64 Images (and amd64 ones) now building again in CI 😐 .	2023-07-06 06:46:40 +00:00
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879 ) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Nathan Chappell	24dad24f87	chore: changed type IO to IO[bytes] (#878 ) Co-authored-by: Nathan Chappell <nchappell@mono.software> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-05 16:37:31 -04:00
John	dc6d7d7268	feat: add metadata_filename parameter across all partition functions (#811 ) * fix conflicts * add tests and clean metadata_filename in partitions * fix test_email and remove comments * make tidy/check * update changelog and version * fix tests * make tidy again	2023-07-05 16:02:22 -04:00
Austin Walker	8d2e7c0746	fix: Fix KeyError in isd_to_elements (#876 )	2023-07-05 19:09:18 +00:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
Johnny Lim	6ec177e7c6	Add a missing space in a warning message in filetype.py (#873 ) Adds a missing space in a warning message in the filetype.py file.	2023-07-01 20:54:39 +00:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
cragwolfe	cb2866b159	build(image): docker build tweak for arm64 (#871 ) Fixes issue where arm64 docker builds were failing and preventing images from being published.	2023-06-30 20:49:31 -07:00
Trevor Bossert	6249e1553e	New base image with security patches (#869 ) * New base image with security patches * Bump version * remove line from changelog not code related 0.7.12	2023-06-30 19:14:06 -07:00
David Potter	bec733cdf8	feat: add Dropbox connector (#844 )	2023-06-30 17:08:27 -07:00
John	e9fdbb0943	feat: add include_metadata across all partition functions (#853 ) * add include_metadata kwarg and tests to parsers add exclude_metadata to docx add test for doc to exclude metadata add include_metadata kwarg to email add include_metadata kwarg to epub add include_metadata kwarg to json add exclude_metadata tests to md add include_metadata kwarg and tests for msg parse add include_metadata kwarg and tests for odt parse add include_metadata kwarg and tests for org parse add include_metadata kwarg and tests for ppt and pptx parse add include_metadata kwarg and tests for rst parse add include_metadata kwarg and tests for rtf parse add include_metadata tests for text parse add include_metadata tests for tsv parse add include_metadata tests for xlsx parse add include_metadata tests for xml parse * WIP add include_metadata to partition_pdf * add include_metadata tests to partition_pdf * make tidy/check * update changelog and version * change test asserts and move docstring logic to process_metadata * make tidy * fix tests asserts * linting, linting, linting * sync versions * skip api call test not on main --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-06-30 10:44:46 -04:00
qued	350bb1dad5	enhancement: clean pdf elements (bump unstructured-inference) (#790 ) More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4) Make large model available (from unstructured-inference bump to 0.5.3) Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2) --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructured.io> 0.7.11	2023-06-29 18:35:06 -07:00
ryannikolaidis	642562beb5	fix: skip test with api call when run outside CI (#862 )	2023-06-30 00:47:51 +00:00
ryannikolaidis	62e20442df	chore: refactor ingest tests (#814 ) - Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth - Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory - Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository - Construct paths as reusable variables declared at top of scripts - Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance) - OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind - Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true - Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)	2023-06-29 23:13:41 +00:00
Matt Robinson	c581a33c8a	feat: attachment processing for emails (#855 ) * process attachments for email * add attachment processing to msg * fix up metadata for attachments * add test for processing email attachments * added test for processing msg attachments * update docs * tests for error conditions * version and changelog	2023-06-29 18:01:12 -04:00
ryannikolaidis	92e55eb89e	fix: add api key to image publishing workflow tests (#854 )	2023-06-29 20:46:07 +00:00
ryannikolaidis	4f891fff63	fix: ingest download skipping (#847 )	2023-06-29 20:04:37 +00:00
ryannikolaidis	8ea5f6939e	fix: parameterized ingest test overwriting (#838 ) * sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh * fixes incorrect expected results * update expected results to properly parse Korean text * bonus: installs language pack for Korean in CI and ingest fixture workflows	2023-06-29 18:37:09 +00:00
ryannikolaidis	60fe231f08	fix: use api key where needed in tests (#843 ) * passes api key for unstructured-api to unit and ingest tests as needed. * adds check for env var CI to otherwise skip tests that require an api key	2023-06-29 17:31:01 +00:00
Roman Isecke	9882c2b83f	Avoid setting metadata in constructor signature for elements (#837 ) Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification). Bonus refactor for PageBreak to have text values of "". --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-29 03:14:05 +00:00
Matt Robinson	44411ecc59	enhancement: `max_partition` kwarg for limiting element size (#818 ) * add max partition size logic * work splitting logic into split_by_paragraph * pass through max_partition to other functions * added test for splitting long document * add type hint * add documentation * version and changelog * ingest-test-fixtures-update * Update ingest test fixtures (#819) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * retrigger ci * ingest-test-fixtures-update * ingest-test-fixtures-update * Update ingest test fixtures (#821) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * update default for partition_xml * update version for release * update msg doc string --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> 0.7.10	2023-06-28 15:26:01 -04:00
Matt Robinson	38457777fa	fix: ignore escaped commas in CSV checks (#832 ) * fix file content checking bug * skip counting commas in quotes for csv detection * add test for comma count * change file content grab to -1 * version and changelog * add csv to extension check * add file to tests * ingest-test-fixtures-update * Update ingest test fixtures (#833) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * fix typo * fix changelog wording --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 17:22:23 +00:00
Matt Robinson	06077b09ee	fix: don't detect line breaks as list items (#831 ) * add negative lookahead to bullet pattern * version and changelog * update paragraph pattern * add list item assert	2023-06-28 12:49:12 -04:00
qued	773d9a4f37	feat: choose model (#824 ) Added the ability to select the hi_res model via the environment variable UNSTRUCTURED_HI_RES_MODEL_NAME. Variable must be a string that matches up with a model name defined in unstructured_inference. Also removed code related to old unstructured_inference API which has been removed from currently pinned version of unstructured-inference and is no longer running as a service.	2023-06-28 04:06:08 +00:00
shreyanid	433d6af1bc	fix: format Arabic and Hebrew annotated encodings (#823 ) * add modified arabic and hebrew encodings * added calls to format_encoding_str so encoding is checked before use * added formatting to detect_filetype() * explicitly provided default value for null encoding parameter * fixed format of annotated encodings list * adding hebrew base64 test file * small lint fixes * update changelog * bump version to -dev2	2023-06-27 18:15:02 -07:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642 ) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests for html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
ryannikolaidis	078e2aa116	ci: fix arm build issue with docker driver (#810 )	2023-06-26 22:39:00 +00:00
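
Commit 9882c2b83f in the list above stops setting metadata in the element constructor signature because a default object placed there can be reused (and modified) across instances. The following is a minimal sketch of that general Python pitfall, not the library's code: `Metadata`, `SharedDefaultElement`, and `FreshDefaultElement` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an element metadata object."""

    filename: Optional[str] = None


class SharedDefaultElement:
    # The default Metadata() is built once, when the function is defined,
    # so every call that omits `metadata` receives the same object.
    def __init__(self, text: str, metadata: Metadata = Metadata()):
        self.text = text
        self.metadata = metadata


class FreshDefaultElement:
    # Using None as a sentinel creates a new Metadata for each instance.
    def __init__(self, text: str, metadata: Optional[Metadata] = None):
        self.text = text
        self.metadata = metadata if metadata is not None else Metadata()


a, b = SharedDefaultElement("a"), SharedDefaultElement("b")
a.metadata.filename = "a.pdf"
shared_leak = b.metadata.filename  # b observes the change made through a

c, d = FreshDefaultElement("c"), FreshDefaultElement("d")
c.metadata.filename = "c.pdf"
fresh_leak = d.metadata.filename  # d keeps its own, untouched Metadata
```

With the shared default, `a.metadata` and `b.metadata` are the same object, so a change made through one element is visible through the other. The `None` sentinel keeps each element's metadata independent.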

929 Commits