unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-26 01:36:49 +00:00

Author	SHA1	Message	Date
Klaijan	675a10ea69	fix: update test_json to not use auto partition (#1187 ) Update `test_json` to not use auto partition due to dependencies. Previously, running `test_json` required a full requirements installation to read file types including, but not limited to, docx and pptx, so the test raised an error with the base installation. This fix also adds checks to other test files to verify their invariance with `elements_to_json`.	2023-08-29 16:59:26 -04:00
Klaijan	4b830e3b05	fix: return ocr coordinates points as tuple (#1219 ) The `add_pytesseract_bbox_to_elements` function returned `metadata.coordinates.points` as a `List`, whereas other strategies returned a `Tuple`. Changed accordingly for consistency. Previously: ``` element.metadata.coordinates.points = [ (x1, y1), (x2, y2), (x3, y3), (x4, y4), ] ``` Currently: ``` element.metadata.coordinates.points = ( (x1, y1), (x2, y2), (x3, y3), (x4, y4), ) ```	2023-08-28 13:31:55 -04:00
Matt Robinson	07f76275f1	feat: detect PGP encrypted content in `partition_email` and `partition_msg` (#1205 ) ### Summary Closes #1018. Enables `partition_email` and `partition_msg` to detect if an email has PGP encrypted content. Based on the specification in [RFC 2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based on the example email in the spec. If PGP encrypted content is detected, a warning is emitted and an empty list of elements is returned. ### Testing ```python from unstructured.partition.email import partition_email filename = "example-docs/eml/fake-encrypted.eml" partition_email(filename=filename) ``` ```python from unstructured.partition.msg import partition_msg filename = "example-docs/fake-encrypted.msg" partition_msg(filename=filename) ```	2023-08-25 17:09:25 -07:00
John	5872fa23c3	Extract coordinates from PDFs and images when using OCR only strategy (#1163 ) ### Summary Closes #983 Creates new function `add_pytesseract_bbox_to_elements` Fixes typos in docstrings ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw png_filename="example-docs/english-and-korean.png" png_elements = partition_image(filename=png_filename, strategy="ocr_only") png_image = Image.open(png_filename) draw = ImageDraw.Draw(png_image) draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2) draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2) output = "example-docs/english-and-korean-box.png" png_image.save(output) png_image.close() ```	2023-08-25 05:32:12 +00:00
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197 ) ### Summary Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents. ### Testing The elements in the following example should be in the correct order. ```python from unstructured.partition.html import partition_html html_text = """ <pre>The Big Brown Bear</pre> <div>The big brown bear is growling.</div> <pre>The big brown bear is sleeping.</pre> <div>The Big Blue Bear</div> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ```	2023-08-25 04:14:48 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Klaijan	1524841cd9	feat: supports multipage tiff (#1131 ) Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and - confirms that the function reads all the pages in the TIFF. - page number is added to the metadata This PR is branched from and developed on top of 6d6be99 commit.	2023-08-24 15:12:50 +00:00
Matt Robinson	cdae53cc29	chore: deprecation warning for `file_filename` (#1191 ) ### Summary Closes #1007. Adds a deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`. Also catches a warning in `ebooklib` that we do not want to emit in `unstructured`. ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/winter-sports.epub" # Should not emit a warning with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should emit a warning with open(filename, "rb") as f: elements = partition(file=f, file_filename="test.epub") # Should be test.epub elements[0].metadata.filename # Should raise an error with open(filename, "rb") as f: elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub") ```	2023-08-24 07:02:47 +00:00
Charles	1ddf542e14	fix: Don't call extractable_elements if strategy is ocr_only (#1160 ) - fixes #1079 where partitioning is happening twice in the case of `strategy="ocr_only"` - only calls `extractable_elements` if we can predetermine that `ocr_only` is not a possible strategy even if it was the intended strategy. - Adds additional assertion test that `_partition_pdf_or_image_with_ocr` is not called when falling back to `fast` from `ocr_only`	2023-08-22 19:43:33 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file.	2023-08-22 16:05:02 -07:00
Matt Robinson	ad595d32f6	enhancement: tell users to install missing extras (#1167 ) ### Summary Updates `partition` to let users know to install the appropriate extras if they're missing. Prior to this PR, users would get an exception stating that `partition_pdf` (or whichever function requires extras) does not exist. ### Testing First `pip uninstall ebooklib`. Then run ```python from unstructured.partition.auto import partition partition(filename="example-docs/winter-sports.epub") ``` The error should look like ```python ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]" ```	2023-08-22 03:00:21 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
John	69edffb0c0	bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134 ) ### Summary Closes #1027 The msg test in question was no longer failing after removing the quick-fix and comment explaining the issue. However, the test was not functioning as intended. Test was refactored to appropriately test `metadata_last_modified` of attachments. `partition_msg` was then updated to pass `metadata_last_modified` to `attachment_partitioner`. The same was done for email partitioning. ### Testing ``` from unstructured.partition.text import partition_text from unstructured.partition.msg import partition_msg from unstructured.partition.email import partition_email filename="example-docs/fake-email-attachment.msg" elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") # previously, these were different values because last_modified wasn't being updated in attachments elements[1].metadata.last_modified elements[-1].text elements[-1].metadata.last_modified email_filename="example-docs/eml/fake-email-attachment.eml" email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00") email_elements[1].metadata.last_modified email_elements[-1].text email_elements[-1].metadata.last_modified ```	2023-08-18 23:21:11 +00:00
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com>	2023-08-18 20:59:08 +00:00
cragwolfe	1456f06b2d	chore: skip consistently failing test in main (#1150 ) The reason this test is failing is the API is returning "fast" results when "hi_res" is requested, which is being tracked in this ticket: https://github.com/Unstructured-IO/unstructured-api/issues/188 . This failure was only showing up on the `main` branch, per the commented out `pytest` skips.	2023-08-18 10:06:17 -07:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True (#1125 ) Closes Github issue #1121. Adds the `include_header` kwarg to `partition_xlsx` and changes the default behavior to True.	2023-08-17 04:16:23 +00:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts	2023-08-16 04:33:06 +00:00
Mike Lay	79a1eb8683	Handle inline and lacking filename (#1109 ) Handle Content-Disposition: inline and attachment without filename * Add new email test example and test with Content-Disposition: inline. * Move attachment_info above for loop so it is always defined * Check if item is inline as well as attachment as these both lack an = character to split on * Create filename if filename is not specified and write file. * Update list_attachments with new filename	2023-08-14 18:38:53 +00:00
Mike Lay	2e0ab86c6a	Fix attachments with `=` in filename (#1110 ) Fix attachments with = in filename * Limit split to first match of = to prevent creating a list of more than two parts * Add example email with attachment name and test for issue	2023-08-13 20:35:18 -07:00
Christine Straub	fc2699ff06	Fix/1057 etree parser error tsv (#1106 ) * feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji * chore: update changelog & version	2023-08-14 01:22:36 +00:00
Christine Straub	4a3176885f	Fix/1057 etree parser error xlsx (#1094 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version * feat: add functionality to switch html text parser based on whether the html text contains emoji * chore: update changelog & version * fix lint errors * test: revert the `EXPECTED_XLS_TEXT_LEN` value back * feat: always use `soupparser_fromstring` to parse `html text` * fix lint error	2023-08-13 12:20:33 -07:00
cragwolfe	02af625b93	chore: fix fickle test to not be so time sensitive (#1105 )	2023-08-13 10:58:46 -07:00
John	f63a66dbef	Capture section and chapter in the metadata for epubs under `epub_section` (#1005 ) Capture section and chapter in the metadata for epubs under epub_section. Closes Github issue #459	2023-08-12 21:02:06 +00:00
Matt Robinson	fa5a3dbd81	feat: `unique_element_ids` kwarg for UUID elements (#1085 ) * added kwarg for unique elements * test for unique ids * update docs * changelog and version	2023-08-11 11:02:37 +00:00
Christine Straub	d26ab1deac	fix: etree parser error (#1077 ) * feat: add functionality to check if a string contains any emoji characters * feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji * chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use * chore: update changelog & version * chore: update changelog & version * chore: update dependencies * test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename` * chore: update changelog & version	2023-08-10 23:28:57 +00:00
cragwolfe	6779918406	build(release): bump unstructured-inference (#1074 ) * build(release): bump unstructured-inference Related to downstream issue: Unstructured-IO/unstructured-api#182 And upstream PR: Unstructured-IO/unstructured-inference#165 --------- Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>	2023-08-10 20:57:46 +00:00
Klaijan	ad386af8b5	Klaijan/auto paragraph grouper (#994 ) * add auto_paragraph_grouper. add line break pattern. * combine group_broken_paragraph and blank_line_grouper function * fix make check errors * fix make check errors * fix make check errors * fix make check errors * run make tidy to fix errors * tidy core.py and text.py * fix blank-line breaker to extends the result and replace new line with space * fix function name typo * call group_broken_paragraphs for blank_line_grouper * edit function name from one_line_grouper to new_line_grouper for consistency * edit threshold from 0.5 to 0.1 * edit threshold from 0.5 to 0.1 * Revert "call group_broken_paragraphs for blank_line_grouper" This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9. * revert to commit 8fb93b7 and change threshold from 0.5 to 0.1 * edit test_text assertion. remove all BULLETS_PATTERN. * Update ingest test fixtures (#1052) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * edit test case in test_xml_partition * update assertion on test_auto --------- Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local> Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com> Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-07 18:37:18 -04:00
kravetsmic	25ca5744cf	feat: optionally ignore header and footer tags in partition html (#1013 ) --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 21:56:33 +00:00
Christine Straub	b76d2ee745	feat: track emphasized text msword (#1048 ) * feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph * chore: add docstring * chore: fix lint errors * feat: ignore spaces when extracting emphasized texts from a paragraph * feat: add functionality to track emphasized text (`bold/italic` formatting) from table * test: add test case for grabbing emphasized texts from element metadata * chore: fix lint errors * chore: update changelog & version * Update ingest test fixtures (#1047)	2023-08-04 17:04:12 -04:00
kravetsmic	bef93aef6e	fix: email addresses shouldn't be flagged as titles (#957 ) * feat: add func for checking on EmailAddress type * feat: add EmailAddress type * feat: add check for email type * feat: add test for checking EmailAddress type * feat: update existing example files with email * feat: add new example files with email in the text * fix: apply linter * feat: update changelog file * feat: add test for is_email_address function * don't push * fix: clean up code * apply linter * fix: clean up * fix: remove file changes * fix: remove not used files for email address test * fix: remove not necessary tests * clean up * fix: apply linter * fix: update CHANGELOG * fix: change version * fix: fix msg test * fix: apply linter for tests * fix: remove spaces * fix: apply linter with longer line * feat: update documentation * fix: remove duplicates * Update getting_started.rst --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 11:28:36 -04:00
Hynek Kydlíček	47b20119c3	fix: extract emojis with `partition_xlsx` (#1009 ) * 🐛 fixed emoji xlsx bug * update version and changelog * check if beautifulsoup exists * update docs * fix html parser call * fix failing attachment test * ✅ added emoji test, added requirement fixed dependency * 🐛 dependency * 🐛 correct dependency * linting, linting, linting * check for bs4 * skip auto xls filename test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 10:14:08 -04:00
Matt Robinson	a1ef6248bf	fix: simplify `min_partition` logic for `partition_text` (#1032 ) * min simplify first pass * update tests * better max partition default * version and changelog	2023-08-04 13:32:42 +00:00
Matt Robinson	f4ddf53590	feat: track emphasized text in `partition_html` (#1034 ) * Feat/965 track emphasized text html (#1021) * feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML * feat: add `include_tail_text` parameter to `_construct_text` * test: add test case for `_get_emphasized_texts_from_tag` * test: add `emphasized_texts` to metadata * chore: update changelog & version * fix tests * fix lint errors * chore: update changelog * chore: small comment updates * feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag * chore: update changelog * Update ingest test fixtures (#1026) Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io> * ingest-test-fixtures-update * Update ingest test fixtures (#1035) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-08-03 16:24:25 +00:00
Chris Pappalardo	f8a7ae2953	fixed filename metadata bug when using file and file_filename (#1002 )	2023-08-02 18:14:15 -07:00
shreyanid	a23d75a292	Set default strategy for images to be "hi_res" (#968 ) Set default strategy for images (not PDFs) to be hi_res.	2023-08-02 09:22:20 -07:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986 ) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * constraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-compile * make tidy * make tidy	2023-08-01 11:31:13 -04:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
Yuming Long	d46c1c2d83	Chore: Pass table support param to partition image (#973 ) * add param and test in image table extraction * version and changelog * need to publish this one for api repo * add new param skip_infer_table_types * use warning * clean up with mapping * add test for tsv * fix test fail * weird change from merge * doc nit * don't use mapping * correct conflict	2023-07-27 13:33:36 -04:00
Matt Robinson	15618e8346	fix: handling for empty tables in word docs and powerpoints (#982 ) * fix table index error * changelog and version	2023-07-27 11:07:27 -04:00
Matt Robinson	d9aed66b65	feat: add document date for remaining file types (#930 ) (#969 ) * feat: add document date for remaining file types (#930) * feat: add functions for getting modification date * feat: add date field to metadata from csv file * feat: add tests for csv partition * feat: add date field to metadata from html file * feat: add tests for html partition * fix: return file name only if possible * feat: add csv tests * fix: renaming * feat: add field metadata_date as date of last mod * feat: add tests for partition_docx * feat: add field metadata_date to .doc file * feat: add tests for partition_doc * feat: add metadata_date to .epub file * feat: add tests for partition_epub * fix: fix test mocking * feat: add metadata_date for image partition * feat: add test for image partition * feat: add coordinate system argument * feat: add date to element metadata * feat: add metadata_date for JSON partition * feat: add test for JSON partition * fix: rename variable * feat: add metadata_date for md partition * feat: add test for md partition * feat: update doc string * feat: add metadata_date for .odt partition * feat: update .odt string * feat: add metadata_date for .org partition * feat: add tests for .org partition * feat: add metadata_date for .pdf partition * feat: add tests for .pdf partition * feat: add metadata_date for .pptx partition * feat: add metadata_date for .ppt partition * feat: add tests for .ppt partition * feat: add tests for .pptx partition * feat: add metadata_date for .rst partition * feat: add tests for .rst partition * fix: get modification date after file checking * feat: add tests for .rtf partition * feat: add tests for .rtf partition * feat: add metadata_date for .txt partition * fix: rename argument * feat: add tests for .txt partition * feat: update doc string rst partition function * feat: add metadata_date for .tsv partition * feat: add tests for .tsv partition * feat: add metadata_date for .xlsx partition * feat: add tests for .xlsx partition * fix: clean up * feat: add tests for .xml partition * feat: add tests for .xml partition * fix: use `or ` instead of `if` * fix: fix epub tests * fix: remove not used code * fix: add try block for getting file name * fix: applying linter changes * fix: fix test_partition_file * feat: add metadata_date for email * feat: add test for email partition * feat: add metadata_date for msg * feat: add tests for msg partition * feat: update CHANGELOG file * fix: update partitions doc string * don't push * fix: clean up code * linting, linting, linting * remove unnecessary example doc * update version and changelog * ingest-test-fixtures-update * set metadata date in test --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> * ingest-test-fixtures-update * Update ingest test fixtures (#970) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * Revert "Update ingest test fixtures (#970)" This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2. * remove date from metadata in outputs * update docstring ordering * remove print * remove print * remove print * linting, linting, linting * fix version and test * fix changelog * fix changelog * update version --------- Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-07-26 15:10:14 -04:00
shreyanid	71a24b2887	Update `partition_via_api` to not post a strategy value if not user specified (#967 ) * remove default strategy * working on test * fixed test, coordinates param needed to be included * nits * update changelog * lint * update requirements	2023-07-26 09:56:39 -07:00
Roman Isecke	b39e0d7354	Roman/expose dpi param (#966 ) * Bump inference version * Pass through the dpi param if available * Update CHANGELOG * Check dpi param passed in via unit test * Bump inference version * Fix unit test around file info to work on mac as well	2023-07-26 09:26:06 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
Matt Robinson	d694cd53bf	refactor: simplifies JSON detection and add tests (#975 ) * refactor json detection * version and changelog * fix mock in test	2023-07-25 19:59:45 +00:00
Matt Robinson	6e852cbe70	feat: track links from anchor tags in `partition_html` (#959 ) * track tags in html * pass through links as metadata * add test for grabbing links * one more link * changelog and version * update docs * fix tests * update empty link assertion * ingest-test-fixtures-update * Update ingest test fixtures (#961)	2023-07-24 18:28:56 +00:00
John	676c50a6ec	feat: add min_partition kwarg that combines elements below a specified threshold (#926 ) * add min_partition * functioning _split_content_to_fit_min_max * create test and make tidy/check * fix rebase issues * fix type hinting, remove unused code, add tests * various changes and refactoring of methods * add test, refactor, change var names for debugging purposes * update test * make tidy/check * give more descriptive var names and add comments * update xml partition via partition_text and create test * fix <pre> bug for test_partition_html_with_pre_tag * make tidy * refactor and fix tests * make tidy/check * ingest-test-fixtures-update * change list comprehension to for loop * fix error check	2023-07-24 15:57:24 +00:00
Matt Robinson	0d332743eb	fix: enable passing filters to `partition_doc` for libreoffice conversion (#934 ) * add optional filter to docx conversion * add filters to tests * changelog and version * update filter for power point	2023-07-17 13:54:44 -04:00
Christine Straub	5b7ae29876	fix: 521 pdf2image memory error (#924 ) Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy. * Add functionality to convert a PDF in small chunks of pages at a time * Add functionality to write images to computer storage temporarily instead of keeping them in memory * Set the file's current position to the beginning after reading the file in convert_to_bytes	2023-07-14 15:08:33 -05:00
John	6173362620	fix: detect list items in MS Word documents (#909 ) * fix merge conflict * update changelog and version	2023-07-10 15:29:08 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
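
Commits 2e0ab86c6a (#1110) and 79a1eb8683 (#1109) above describe two related fixes to `Content-Disposition` handling: splitting only on the first `=` so that filenames containing `=` survive, and tolerating items such as `inline` or `attachment` that contain no `=` at all. The following is a minimal sketch of that idea, not the library's actual code; `split_disposition_param` is a hypothetical helper name, and splitting the header on `;` is a simplifying assumption that would not hold for filenames containing a semicolon.

```python
from typing import Tuple


def split_disposition_param(item: str) -> Tuple[str, str]:
    """Split one Content-Disposition item into (key, value).

    Hypothetical helper for illustration only.
    """
    item = item.strip()
    if "=" not in item:
        # Items like "inline" or "attachment" carry no value to split off.
        return item.lower(), ""
    # maxsplit=1 keeps any later "=" characters inside the value.
    key, value = item.split("=", 1)
    return key.strip().lower(), value.strip().strip('"')


header = 'attachment; filename="report=final.txt"'
params = [split_disposition_param(part) for part in header.split(";")]
print(params)  # [('attachment', ''), ('filename', 'report=final.txt')]
```

Without the `maxsplit` argument, `"filename=report=final.txt".split("=")` would yield three parts, so unpacking it into a key and a value would fail.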

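
Commit cdae53cc29 (#1191) in the log above deprecates the `file_filename` kwarg in favor of `metadata_filename`: using the new kwarg is silent, using the old one warns, and passing both raises an error. A rough sketch of that pattern follows. The helper name `resolve_metadata_filename`, the warning text, and the choice of `ValueError` are assumptions made for illustration; the commit message does not say which exception type the library raises.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Return the filename to record in metadata (hypothetical helper)."""
    if metadata_filename is not None and file_filename is not None:
        # Assumption: the exception type is not stated in the commit message.
        raise ValueError(
            "Only one of metadata_filename and file_filename may be specified."
        )
    if file_filename is not None:
        warnings.warn(
            "The file_filename kwarg is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename


print(resolve_metadata_filename(metadata_filename="test.epub"))  # test.epub
```

Setting `stacklevel=2` attributes the warning to the caller's line rather than to the helper, which is usually what a user needs in order to find the deprecated call site.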

193 Commits
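
Commit 5b7ae29876 (#924) in the log above addresses a `pdf2image` memory error by converting a PDF a few pages at a time instead of all at once. The sketch below shows only the page-range arithmetic behind that approach; `page_chunks` and its default chunk size are hypothetical and not taken from the repository. Each yielded pair could then be passed to a converter that accepts a first and last page.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-indexed (first_page, last_page) ranges.

    Hypothetical helper; the chunk size of 10 is an arbitrary default.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        # Clamp the final chunk so it never runs past the last page.
        yield first, min(first + chunk_size - 1, total_pages)


print(list(page_chunks(25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```

Processing one range at a time bounds peak memory by the chunk size rather than by the document length, at the cost of more conversion calls.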