unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-04 06:50:02 +00:00

Author	SHA1	Message	Date
Jason Scheirer	196efa09b1	chore: Add encoding param to ingest (#955 ) * Add encoding param to ingest	2023-07-24 10:06:13 -07:00
John	676c50a6ec	feat: add min_partition kwarg to that combines elements below a specified threshold (#926 ) * add min_partition * functioning _split_content_to_fit_min_max * create test and make tidy/check * fix rebase issues * fix type hinting, remove unused code, add tests * various changes and refactoring of methods * add test, refactor, change var names for debugging purposes * update test * make tidy/check * give more descriptive var names and add comments * update xml partition via partition_text and create test * fix <pre> bug for test_partition_html_with_pre_tag * make tidy * refactor and fix tests * make tidy/check * ingest-test-fixtures-update * change list comprehension to for loop * fix error check	2023-07-24 15:57:24 +00:00
qued	d0329126ef	chore: remove outdated error message (#935 ) There's an issue in unstructured-inference about these blocks trapping unrelated import errors. The fix for that would be to narrow the scope of the traps, but I think this is made redundant by the requires_dependencies decorator, so I removed it completely.	2023-07-22 05:10:26 +00:00
Emily Chen	050cfafb70	Add subsection for docs; prioritize getting started with container (#962 )	2023-07-21 17:29:58 -07:00
Amanda Cameron	35e529f2d4	updating api key link (#960 )	2023-07-21 13:05:40 -07:00
Jack Retterer	708714dab5	docs: fixed typo in Installation guide (#945 )	2023-07-21 13:33:44 +00:00
Yuming Long	208148abe7	Chore: update require api key in readme (#952 )	2023-07-20 16:10:03 +00:00
Ronny H	31511793cb	Update README and API doc for Chipper announcement (#940 ) Update README and API doc for Chipper model beta version announcement	2023-07-19 13:00:37 -07:00
Emily Chen	4b1e5a8057	Publicly document OneDrive connector (#949 )	2023-07-18 16:37:44 -07:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
Chris Watts	bf47dc10ae	feat: add slide notes to pptx (#942 ) * Add slide notes to pptx * Add include_slide_notes flag for pptx * Update CHANGELOG.md 0.8.2-dev2 - Add slide notes to pptx * Fix lint error * Fix pptx.py lint	2023-07-17 17:52:34 -04:00
ryannikolaidis	3b33331082	docs: fix readme word docs typo (#946 )	2023-07-17 20:04:50 +00:00
Matt Robinson	0d332743eb	fix: enable passing filters to `partition_doc` for libreoffice conversion (#934 ) * add optional filter to docx conversion * add filters to tests * changelog and version * update filter for power point	2023-07-17 13:54:44 -04:00
Yuming Long	067eb5701f	Fix: docker build with missing dependency (#931 ) * pip -compile * test trigger * Revert "test trigger" This reverts commit 69d4c8cd9f285f6ef4bf445f5fb27b5c62e1391c. * version conflict and pip compile	2023-07-14 22:20:11 +00:00
Matt Robinson	685e33f890	build: remove docs-build branch (#933 )	2023-07-14 16:23:47 -04:00
Christine Straub	5b7ae29876	fix: 521 pdf2image memory error (#924 ) Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy. * Add functionality to convert a PDF in small chunks of pages at a time * Add functionality to write images to computer storage temporarily instead of keeping them in memory * Set the file's current position to the beginning after reading the file in convert_to_bytes	2023-07-14 15:08:33 -05:00
fran-unstructured	dd4bb752e2	docs: Add Unstructured API documentation (#928 ) * fran-unstructured/Add unstructured API documentation * fran-unstructured/add api docs to index.rst * fran-unstructured/add api docs with changes requested 0.8.1-docs-rebuild	2023-07-14 18:28:57 +00:00
rvztz	ce20c3f2bc	feat: add OneDrive connector (#834 )	2023-07-13 20:57:54 +00:00
fran-unstructured	26da51c765	docs: Add source code links to bricks' docs (#923 ) Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>	2023-07-13 17:27:47 +00:00
Matt Robinson	9b830693bd	fix: adds to list of extensions to check if a file has a plain text MIME type (#916 ) * added .txt, .text, and .tab to text file list * changelog and version	2023-07-12 20:07:43 +00:00
fran-unstructured	f7b3c0f741	docs: adds connectors' documentation (#917 ) * Add connectors documentation * Add connectors documentation with corrections and index.rst update * Add connectors documentation - add API information --------- Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-07-12 14:56:09 -04:00
dependabot[bot]	f490b82d5b	build(deps): bump praw from 7.7.0 to 7.7.1 in /requirements (#922 ) Bumps [praw](https://github.com/praw-dev/praw) from 7.7.0 to 7.7.1. - [Release notes](https://github.com/praw-dev/praw/releases) - [Changelog](https://github.com/praw-dev/praw/blob/master/CHANGES.rst) - [Commits](https://github.com/praw-dev/praw/compare/v7.7.0...v7.7.1) --- updated-dependencies: - dependency-name: praw dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 14:55:19 -04:00
dependabot[bot]	a7d6edc528	build(deps): bump google-api-python-client in /requirements (#921 ) Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.92.0 to 2.93.0. - [Release notes](https://github.com/googleapis/google-api-python-client/releases) - [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md) - [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.92.0...v2.93.0) --- updated-dependencies: - dependency-name: google-api-python-client dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 14:55:03 -04:00
Matt Robinson	a583d47b84	docs: update table and API documentation (#919 ) * more detailed api docs * add table docs * remove rtf/epubs comment * remove confusing request_kwargs verbiage * add missing a	2023-07-12 12:59:59 -04:00
dependabot[bot]	1fa944ec87	build(deps): bump black from 23.3.0 to 23.7.0 in /requirements (#920 ) Bumps [black](https://github.com/psf/black) from 23.3.0 to 23.7.0. - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/compare/23.3.0...23.7.0) --- updated-dependencies: - dependency-name: black dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-12 15:30:57 +00:00
dependabot[bot]	80bdd60b32	build(deps): bump protobuf from 3.20.3 to 4.23.4 in /requirements (#910 ) Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.3 to 4.23.4. - [Release notes](https://github.com/protocolbuffers/protobuf/releases) - [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl) - [Commits](https://github.com/protocolbuffers/protobuf/compare/v3.20.3...v4.23.4) --- updated-dependencies: - dependency-name: protobuf dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-12 10:41:02 -04:00
Roman Isecke	8b233b4f62	Set the version to 0.8.1 (#914 )	2023-07-11 10:27:54 -04:00
Emily Chen	2635b0be07	Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913 )	2023-07-10 21:47:41 -07:00
Matt Robinson	b3936893b8	build: add python 3.11 to CI (#908 ) * remove argilla; bump reqs * enable py 3.11 * add 3.11 to setup.py * make pip-compile * ignore cli mypy errors * install argilla * fix constraints * install argilla * changelog and version * skip argilla in docker * dont import argilla in docker * skip all of argilla if in container * only import argilla if outside docker * more docker skips * remove weird pypi settings	2023-07-10 18:52:25 +00:00
Trevor Bossert	66f2d4b280	Add both arm and amd builds to manifests (#899 )	2023-07-10 10:15:15 -07:00
John	6173362620	fix: detect list items in MS Word documents (#909 ) * fix merge conflict * update changelog and version	2023-07-10 15:29:08 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Matt Robinson	f51ae45050	fix: grab all metadata fields in `convert_to_dataframe` (#893 ) * add all fieldnames to dataframe * drop empty columns in convert_to_dataframe * test for maintaining metadata * version and changelog	2023-07-07 20:04:35 +00:00
dependabot[bot]	c8e6f0e141	build(deps): bump elasticsearch from 8.8.0 to 8.8.2 in /requirements (#898 ) Bumps [elasticsearch](https://github.com/elastic/elasticsearch-py) from 8.8.0 to 8.8.2. - [Release notes](https://github.com/elastic/elasticsearch-py/releases) - [Commits](https://github.com/elastic/elasticsearch-py/compare/v8.8.0...v8.8.2) --- updated-dependencies: - dependency-name: elasticsearch dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 19:19:45 +00:00
dependabot[bot]	05d51cfb4f	build(deps): bump ruff from 0.0.275 to 0.0.277 in /requirements (#897 ) Bumps [ruff](https://github.com/astral-sh/ruff) from 0.0.275 to 0.0.277. - [Release notes](https://github.com/astral-sh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md) - [Commits](https://github.com/astral-sh/ruff/compare/v0.0.275...v0.0.277) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 13:51:52 -04:00
dependabot[bot]	7f9532f8b3	build(deps): bump lxml from 4.9.2 to 4.9.3 in /requirements (#896 ) Bumps [lxml](https://github.com/lxml/lxml) from 4.9.2 to 4.9.3. - [Release notes](https://github.com/lxml/lxml/releases) - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt) - [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.2...lxml-4.9.3) --- updated-dependencies: - dependency-name: lxml dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-07 13:51:05 -04:00
dependabot[bot]	1619a0b6a2	build(deps): bump google-api-python-client in /requirements (#895 ) Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.91.0 to 2.92.0. - [Release notes](https://github.com/googleapis/google-api-python-client/releases) - [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md) - [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.91.0...v2.92.0) --- updated-dependencies: - dependency-name: google-api-python-client dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-07-07 13:50:25 -04:00
Roman Isecke	5e1150184c	Add optional param for model name when partitioning pdfs (#890 ) * Add optional param for model name when partitioning pdfs * Pull in latest inference changes * Fix linting 0.8.0	2023-07-07 11:16:55 -04:00
Christine Straub	47bc4009a8	fix: adjust threshold for encoding detection (#894 ) * chore: add example doc * fix: adjust encoding recognition threshold value in `detect_file_encoding` * test: add test cases for German characters * chore: update changelog & version	2023-07-07 09:25:03 -04:00
Matt Robinson	52aced8677	fix: validate encodings from email headers (#881 ) * add validate encoding function * remove extraneous file * added test case for malformed encoding * version and changelog	2023-07-06 13:49:27 +00:00
cragwolfe	209054f0db	build(image): revert docker build tweak for arm64 (#887 ) arm64 Images (and amd64 ones) now building again in CI 😐 .	2023-07-06 06:46:40 +00:00
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879 ) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Nathan Chappell	24dad24f87	chore: changed type IO to IO[bytes] (#878 ) Co-authored-by: Nathan Chappell <nchappell@mono.software> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-07-05 16:37:31 -04:00
John	dc6d7d7268	feat: add metadata_filename parameter across all partition functions (#811 ) * fix conflicts * add tests and clean metadata_filename in partitions * fix test_email and remove comments * make tidy/check * update changelog and version * fix tests * make tidy again	2023-07-05 16:02:22 -04:00
Austin Walker	8d2e7c0746	fix: Fix KeyError in isd_to_elements (#876 )	2023-07-05 19:09:18 +00:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
Johnny Lim	6ec177e7c6	Add a missing space in a warning message in filetype.py (#873 ) Adds a missing space in a warning message in the filetype.py file.	2023-07-01 20:54:39 +00:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
cragwolfe	cb2866b159	build(image): docker build tweak for arm64 (#871 ) Fixes issue where arm64 docker builds were failing and preventing images from being published.	2023-06-30 20:49:31 -07:00
Trevor Bossert	6249e1553e	New base image with security patches (#869 ) * New base image with security patches * Bump version * remove line from changelog not code related 0.7.12	2023-06-30 19:14:06 -07:00

1447 Commits