unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-19 11:57:32 +00:00

Author	SHA1	Message	Date
rvztz	424852ab39	feat: adds data source properties to Sharepoint and Outlook (#1278 )	2023-09-20 09:13:35 +00:00
Ryan Nikolaidis	8c1d03e5cf	update slack invite	2023-09-20 00:02:03 -07:00
rvztz	2f52df180f	Adds data source properties to onedrive, reddit and slack (#1281 )	2023-09-20 04:26:36 +00:00
Amanda Cameron	e359afafbe	fix: coordinates bug on pdf parsing (#1462 ) Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460 We were raising an error with invalid coordinates, which prevented us from continuing to return the element and continue parsing the pdf. Now instead of raising the error we'll return early. to test: ``` from unstructured.partition.auto import partition elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> 0.10.16	2023-09-19 19:25:31 -07:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
rvztz	9a3e24fcbb	Adds data source properties to elasticsearch, wikipedia and google-drive (#1282 )	2023-09-19 20:25:38 +00:00
rvztz	92e18c3f58	feat: adds data source properties to airtable, confluence and discord (#1283 )	2023-09-19 18:05:27 +00:00
Yuming Long	f962a1e57d	fix: fix ingest paddle hanging issue (#1441 ) ## Summary Ingest tests are hitting a paddle OOM issue, which causes the tests to hang forever. The fix here is to remove paddle from CI and set both OCR env variables, `TABLE_OCR` and `ENTIRE_PAGE_OCR`, to `tesseract`. (A follow-up PR will investigate why this is failing.) ## Test Please check the ingest tests in CI.	2023-09-19 17:20:23 +00:00
shreyanid	eb8ce89137	chore: function to map between standard and Tesseract language codes (#1421 ) ### Summary In order to convert between incompatible language codes from packages used for OCR, this change adds a function to map between any standard language codes and tesseract OCR specific codes. Users can input language information to `languages` in any Tesseract-supported langcode or any ISO 639 standard language code. ### Details - Introduces the [python-iso639](https://pypi.org/project/python-iso639/) package for matching standard language codes. Recompiles all dependencies. - If a language is not already supplied by the user as a Tesseract specific langcode, supplies all possible script/orthography variants of the language to the Tesseract OCR agent. ### Test Added many unit tests for a variety of language combinations, special cases, and variants. For general testing, call partition functions with any lang codes in the languages parameter (Tesseract or standard). for example, ``` from unstructured.partition.auto import partition elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"]) print("\n\n".join([str(el) for el in elements])) ``` should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract	2023-09-18 08:42:02 -07:00
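The mapping described in #1421 can be illustrated with a minimal sketch. The table and the function name `to_tesseract_langs` are hypothetical; the PR itself uses the python-iso639 package for matching rather than a hand-written dict, and the pass-through of unknown codes is an assumption made here for brevity.

```python
from typing import Dict, List

# Hypothetical lookup table, for illustration only. The PR matches standard
# codes with python-iso639 instead of a hard-coded dict.
STANDARD_TO_TESSERACT: Dict[str, List[str]] = {
    "en": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}


def to_tesseract_langs(languages: List[str]) -> str:
    """Expand each input code to its Tesseract variants and join with '+'."""
    codes: List[str] = []
    for lang in languages:
        # Assumption: a code missing from the table is already a Tesseract code.
        for code in STANDARD_TO_TESSERACT.get(lang, [lang]):
            if code not in codes:  # keep first occurrence, drop duplicates
                codes.append(code)
    return "+".join(codes)


print(to_tesseract_langs(["en", "chi"]))
```

With the inputs from the commit message, this produces the same `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert` string that the PR says is supplied to Tesseract.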
qued	3a07d1e6b4	chore: Fix typos in changelog (#1442 )	2023-09-18 10:39:36 -05:00
Amanda Cameron	a9f18eddb8	chore: adding test case for odt tables (#1434 ) ODT table extraction is happening! Just added to an existing example-doc and an accompanying test case.	2023-09-16 22:29:44 -07:00
Yao You	b534b2a6cd	Chore: bump inference package version to 0.5.28 and new release (#1355 ) This bump removes the preprocessing before table structure extraction and improves the OCR results for tables. --------- Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> 0.10.15	2023-09-15 18:26:15 -07:00
Trevor Bossert	09a0958f90	Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350 ) Testing instructions on Apple silicon ``` make docker-build docker run -it unstructured:dev bash python3 ``` Then run the test in this PR https://unstructured-ai.atlassian.net/browse/CORE-1269 You should get output like shown in ticket Run the same process on your local machine (not inside docker) with same test to verify the non aarch64 paddlepaddle got installed correctly --------- Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-09-15 17:05:48 -07:00
cragwolfe	36d026cb1b	chore: update CHANGELOG.md bullets (#1436 ) add "why does it matter" for a couple of bullets	2023-09-15 16:52:01 -07:00
John	6187dc0976	update links in integrations.rst (#1418 ) A number of the links in integrations.rst don't seem to lead to the intended section in the unstructured documentation. For example: ```See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details``` It seems this link should direct to here instead: https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate	2023-09-15 16:50:55 -07:00
Roman Isecke	333558494e	roman/delta lake dest connector (#1385 ) ### Description Add delta table downstream destination connector Closes https://github.com/Unstructured-IO/unstructured/issues/1415	2023-09-15 22:13:39 +00:00
cragwolfe	98d3541909	Update CHANGELOG.md (#1435 ) Update a bullet to reflect: What was the problem? What was fixed? Why does it matter?	2023-09-15 15:26:49 -05:00
John	de4d496fcf	Fix bbox coordinates for ocr_only strategy (#1325 ) ### Summary Duplicate PR of #1259 because of issues with checks. Closes #1227, which found that `nan` values were present in the coordinates being generated for some elements. This breaks logic out from `add_pytesseract_bbox_to_elements` to new functions `_get_element_box` and `convert_multiple_coordinates_to_new_system`. It also updates the logic to check that the current bounding box matches the first character of the element's text (so as to avoid the `~` characters that `pytesseract.image_to_boxes` includes, but that are not present in `pytesseract.image_to_string`). ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw filename="example-docs/layout-parser-paper-with-table.jpg" elements = partition_image(filename=filename, strategy="ocr_only") image = Image.open(filename) draw = ImageDraw.Draw(image) for i, element in enumerate(elements): print(i, element.metadata.coordinates) if element.metadata.coordinates: draw.polygon(element.metadata.coordinates.points, outline="red", width=2) output = "example-docs/box-layout-parser-paper-with-table.jpg" image.save(output) image.close() ``` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-09-15 15:11:16 -05:00
qued	0d61c98481	fix: Pass partition_image kwargs downstream (#1426 ) `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call. #### Testing: ```python from unstructured.partition.image import partition_image output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx") output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox") # These shouldn't be the same, since they were produced using different models. assert output1 != output2 ``` The assertion should fail on `main`, but pass on this branch.	2023-09-15 15:09:58 -05:00
Sebastian Laverde Alfonso	fe11ab4235	feat: improved mapping for missing chipper elements (#1431 ) This PR updates [TYPE_TO_TEXT_ELEMENT_MAP](`bd33a52ee0/unstructured/documents/elements.py (L551)`) with additional mapping for `chipper` elements: ``` "Threading": NarrativeText, "Form": NarrativeText, "FieldName": Title, "Value": NarrativeText, "Link": NarrativeText, ```	2023-09-15 20:05:40 +00:00
Amanda Cameron	50db2abd9f	fix: updating element types (#1394 ) This PR adds an arg to the html partition flow called `source_format`. If it is anything other than "html", we will return non-HTML elements to conform with the file type we received. Addresses: https://github.com/Unstructured-IO/unstructured/issues/726	2023-09-15 11:51:22 -05:00
Sebastian Laverde Alfonso	40b1d0d092	feat: improved chipper elements mapping and new category_depth metadata (#1308 ) Two changes: 1. Improved mapping of `chipper` element types `Headline` (to `Title`), `Subheadline` (to `Title`) and `Abstract` (to `NarrativeText`). 2. New element metadata `category_depth`: `None` unless the element is a `Headline` (`category_depth=1`) or a `Subheadline` (`category_depth=2`). The update of `category_depth` happens during the transform `normalize_layout_element`. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com> Co-authored-by: Benjamin Torres <benjamin@unstructured.io>	2023-09-15 14:43:17 +00:00
ryannikolaidis	ad69d93d53	ci: add new release version alert (#1413 )	2023-09-15 07:05:00 +00:00
rvztz	3be9f089b3	feat: adds data source properties to fsspec-based connectors (#1279 )	2023-09-15 05:56:44 +00:00
Yao You	a5ca628f22	[CORE-1741] use forked pytesseract to reduce calls to tesseract (#1298 ) This PR resolves [CORE-1741](https://unstructured-ai.atlassian.net/browse/CORE-1741) by using a new function `pytesseract.run_and_get_multiple_output`, see forked repo for more details: https://github.com/Unstructured-IO/unstructured.pytesseract/releases/tag/0.3.11-dev1 This reduces the number of calls to `tesseract` by half per page of PDF/image during partition, roughly reducing the runtime by 48%. The new function is in forked `unstructured.pytesseract`. A PR has been made to the upstream repo and once that is merged we should switch to the upstream version. For now we add a new dependency: `unstructured.pytesseract`. ## testing Existing unit tests should serve as tests to the new function. To demonstrate the changes in performance: - checkout main - run `./scripts/performance/profile.sh` and select `ocr_only` strategy, using the 10th document (16 page layout paper in pdf format) - examine the speedscope profile or time profile in flamegraph -> you should see that the two dominant time spenders are `pytesseract.image_to_text` and `pytesseract.image_to_boxes`, with both about the same total time (see attached first image) - checkout this branch - run the same `profile.sh` with the same options - examine the profile again and this time you should notice 1) total runtime is reduced by more than 40%; 2) only `unstructured_pytesseract.run_and_get_multiple_output` is the top time spender and its total time is about the same as either the `pytesseract.image_to_text` or `pytesseract.image_to_boxes` time (see second image below) ![Screenshot 2023-09-06 at 9 45 10 AM](https://github.com/Unstructured-IO/unstructured/assets/647930/fed6118b-a0dc-493d-bef8-85d73027c968) ![Screenshot 2023-09-06 at 9 46 37 AM](https://github.com/Unstructured-IO/unstructured/assets/647930/dd1d6369-cfba-43d4-b1c6-87a8a98b2e16) [CORE-1741]: https://unstructured-ai.atlassian.net/browse/CORE-1741?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-09-14 23:27:18 +00:00
cragwolfe	8f60784178	chore: update CHANGELOG.md (#1420 ) Move to a new CHANGELOG.md convention to more fully describe changes. Bullets should address: what was broken? what was fixed? why does it matter? To assist with scanning changes, the first sentence in each bullet is in bold. Note: it's also worth looking at the rendered markdown in the branch: https://github.com/Unstructured-IO/unstructured/blob/crag/changelog-tweak/CHANGELOG.md rather than just the git diff. --------- Co-authored-by: Klaijan <klaijan@unstructured.io>	2023-09-14 12:53:30 -07:00
qued	3655a752bc	docs: clearer message for sentence count skip (#1410 ) Related to #744 . In the `sentence_count` function, there is a parameter that sets a threshold for a minimum word count for a sentence to be "counted" as a sentence. When a sentence is skipped in the count because it doesn't meet the minimum word count, the log message is potentially misleading as it mentions "skipping" the sentence. Without the above context, it could be interpreted that the sentence is being skipped in the partitioning process, which is not the case. This PR is to reword the log message to make the situation clearer. Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-14 12:33:10 -05:00
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268 ) ## Summary By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... 
``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
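The "How it works" description in #1268 can be sketched as follows. This is a simplified illustration only: `Item` and `assign_parents` are hypothetical names, and the commit message states that the real algorithm is more complex and handles several edge cases not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for an element; not the library's class."""
    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parents(items: List[Item], ruleset: Dict[str, List[str]]) -> List[Item]:
    """Read items in order; link each to the latest element allowed to be its parent."""
    last_seen: Dict[str, str] = {}  # category -> id of the most recent element
    for item in items:
        if item.parent_id is None:  # only fill in parents that are not already set
            for parent_category, child_categories in ruleset.items():
                if item.category in child_categories and parent_category in last_seen:
                    item.parent_id = last_seen[parent_category]
                    break
        last_seen[item.category] = item.id
    return items


items = [Item("1", "title", "This is the Title"), Item("2", "text", "this is the text")]
assign_parents(items, {"title": ["text"]})
print(items[1].parent_id)
```

With the two elements and the ruleset from the commit message, the `parent_id` of element 2 becomes the id of element 1, as described.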
Ahmet Melek	bd33a52ee0	fix: coordinates metadata hinders chunking (#1374 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1373 This PR: - drops the `coordinates` metadata field in `chunk_by_title` to fix https://github.com/Unstructured-IO/unstructured/issues/1373 (read issue for the details) - adds relevant test that checks the particular case	2023-09-14 10:10:03 +00:00
Ronny H	f1364594ad	Docs models (#1412 ) This PR adds documentation of models supported by the `Unstructured` tool. The changes reflect the tool's capabilities, usage examples, and the process for integrating custom models. Sections: - Detailed the basic usage of the `Unstructured` partition with the model name. - Provided a list of available models in the `Unstructured` partition. - Added instructions on using non-default models via three distinct methods. - Explained leveraging models from the LayoutParser's model zoo with `UnstructuredDetectronModel`. - Guided users in integrating their custom object detection models using the `UnstructuredObjectDetectionModel` class. Tested the docs build with: > cd docs > pip install -r requirements.txt > make html	2023-09-13 23:37:31 -07:00
Yao You	12d7628b10	update constraints to pin weaviate during ci (#1408 ) This PR ensures the version for `weaviate` is consistent in CI testing. The latest version (3.24.1) is not compatible with our test needs, and the last version that ran successfully in CI is 3.23.2.	2023-09-13 23:19:20 +00:00
Klaijan	00181b88df	feat: pdf auto strategy groups broken numbered and bullet list items (#1393 ) Summary Adds logic to combine broken numbered list items for the pdf fast strategy. Details Previously the numbered list items in the `layout-parser-paper-fast.pdf` file were read as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character' 'recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that' 'underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model' 'tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` Now they are read as: ``` '1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)' '2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage' '3. Comprehensive tools for eﬃcient document image data annotation and model tuning to support diﬀerent levels of customization' '4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)' ``` The added logic leverages `ElementType` and `coordinates` to determine whether the following line is part of the previously detected `ListItem` or not. Test Adds a test that checks that the number of elements is lower than in the original version with the broken numbered list. The test also checks whether the first detected numbered list item ends with the previously broken line. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-13 21:30:06 +00:00
shreyanid	d87c83d7b6	chore: refactor languages parameter for text_type functions (#1399 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. To identify element types such as NarrativeText and Title, continue the refactor into functions that use language checks to determine those potential classifications. ### Details Replaces `language` with `languages` (a list of strings) as a parameter to `is_possible_narrative_text` and `is_possible_title`. ### Test Call `is_possible_narrative_text` and `is_possible_title` with text in a variety of languages and different inputs for `languages`. The resulting element classifications should be no different from the current outputs. ex: see `test_text_type_handles_multi_language_examples` in `test_unstructured/partition/test_text_type.py`.	2023-09-13 19:46:36 +00:00
shreyanid	1b7c99d878	chore: refactor languages parameter for auto partition (#1400 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Follows the pattern established with PDFs in #1334. Adds languages (a list of strings) as a parameter to partition in auto.py. Marks ocr_languages for deprecation. ### Test Call partition with a variety of filetypes (especially pdfs/images), strategies, languages, or ocr_languages. - inclusion of ocr_languages as a parameter should display a deprecation warning and may proceed with partitioning if no other conflicts - the other valid call outputs should be no different from the current outputs	2023-09-13 13:07:28 -04:00
shreyanid	2b571eb9a3	chore: refactor languages parameter for image partition functions (#1395 ) Adds languages (a list of strings) as a parameter to `partition_image`. Marks ocr_languages for deprecation.	2023-09-13 04:11:58 +00:00
Amanda Cameron	7fd81dc7df	Table processing test for RTF (#1388 ) This PR does two things: 1. Adds test case (and alters sample doc) for rtf and epub files with table 2. Adds `xls/x` file extension to `skip_infer_table_types` default list --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>	2023-09-12 18:27:05 -07:00
shreyanid	791adf459d	stop printing all commands in version-sync script (#1390 ) ### Summary Remove -x in version-sync script to stop printing all commands and arguments and improve readability. ### Test `make check` and `make check-version` no longer print all the commands and arguments. (unstructured) shreyanid@Shreyas-MBP-2 unstructured % make check-version scripts/version-sync.sh -c \ -f "unstructured/__version__.py" semver From github.com:Unstructured-IO/unstructured * branch main -> FETCH_HEAD version sync would make no changes to unstructured/__version__.py.	2023-09-12 15:05:26 -07:00
qued	6595632a57	enhancement: backup text categorization (#1322 ) Currently there are some cases when `partition_pdf` is run using the `hi_res` strategy, in which elements can come back with category `UncategorizedText`. This happens when the detection model fails to detect an element, but we're able to find it anyway either because it was embedded in the PDF, or we found it using OCR. This commit is to allow for attempting to categorize these uncategorized elements using our text-based classification function, `element_from_text`.	2023-09-12 20:32:48 +00:00
shreyanid	c2853e4ac3	refactor languages parameter for pdf partition functions (#1334 ) ### Summary In order to support language functionality other than Tesseract OCR, we want to represent languages provided for either partitioning accuracy or OCR as a standard list of langcodes as strings. ### Details Adds `languages` (a list of strings) as a parameter to pdf partitioning functions. Marks `ocr_languages` for deprecation. Adds a new file `lang.py` for language-related helper functions. Coming up: langcode standardization, language detection ### Test Call `partition_pdf` or `partition_pdf_or_image` with a variety of strategies, languages, or `ocr_languages`. - inclusion of `ocr_languages` as a parameter should display a deprecation warning - the other valid call outputs should be no different from the current outputs. ex: ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"]) print("\n\n".join([str(el) for el in elements])) ```	2023-09-12 16:15:26 +00:00
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element is over `new_after_n_chars` characters, which could occur with a long NarrativeText element. Combines sections under these conditions: - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
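Two of the sectioning rules listed in #1304 (break on a Title, break once the section is longer than `new_after_n_chars`) can be sketched as below. This is not the library's implementation: `split_sections` and the `(category, text)` tuples are hypothetical, and the metadata-change rule and the `combine_under_n_chars` merging step are deliberately left out.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (category, text); a hypothetical stand-in for real elements


def split_sections(elements: List[Element], new_after_n_chars: int = 1500) -> List[List[Element]]:
    """Group elements into sections without ever splitting a single element."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    length = 0  # total characters of text in the current section
    for category, text in elements:
        # Close the open section when a Title arrives or the section is already too long.
        if current and (category == "Title" or length > new_after_n_chars):
            sections.append(current)
            current, length = [], 0
        current.append((category, text))
        length += len(text)
    if current:
        sections.append(current)
    return sections


doc = [
    ("Title", "Intro"),
    ("NarrativeText", "x" * 20),
    ("NarrativeText", "more text"),
    ("Title", "Next"),
    ("NarrativeText", "tail"),
]
print([len(section) for section in split_sections(doc, new_after_n_chars=10)])
```

Because whole elements are appended before the length check applies to the next one, a section can exceed the threshold when a single element is long, which matches the caveat in the commit message.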
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324 ) Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning. ``` # checkout this branch, go into ipython shell In [1]: from unstructured.partition.html import partition_html In [2]: path_to_html = "{html sample file with table}" In [3]: elements = partition_html(path_to_html) ``` you should see the table in the elements list!	2023-09-11 11:14:11 -07:00
Roman Isecke	59e850bbd9	Roman/downstream connector cli subcommand (#1302 ) ### Description Update all other connectors to use the new downstream architecture that was recently introduced for the s3 connector. Closes #1313 and #1311 0.10.14	2023-09-11 11:40:56 -04:00
cragwolfe	d0749d181f	fix: avoid PDF sorting error on negative coords (#1361 ) The default sorting algorithm for PDF's, "xycut," would cause an error when partitioning a document if Y coordinate points were negative. This change checks for that condition (or more broadly, any negative coordinates) and falls back to the "basic" sort if that is the case. This PR does not address the underlying issue of "bad points" which still should be investigated. However, the sorting code should be less brittle to unexpected bounding boxes in the first case. Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296 0.10.13	2023-09-10 19:29:49 -07:00
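The fallback described in #1361 amounts to a guard in front of the sort. A minimal sketch, assuming bounding boxes are plain `(x1, y1, x2, y2)` tuples and using a hypothetical helper name `choose_sort_mode`; only the mode names "xycut" and "basic" come from the commit message.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout


def choose_sort_mode(boxes: Iterable[Box], preferred: str = "xycut") -> str:
    """Return "basic" when any coordinate is negative, otherwise the preferred mode."""
    has_negative = any(coord < 0 for box in boxes for coord in box)
    return "basic" if has_negative else preferred


print(choose_sort_mode([(0, 0, 10, 10), (5, -3, 20, 8)]))
```

The check does not repair the bad points; it only keeps the sorting step from failing on them, which is the scope the commit message describes.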
Ronny H	edc45013dc	Add `strategy` documentation (#1353 )	2023-09-09 18:54:01 -07:00
Trevor Bossert	915e4adcbb	Updating deps from base image (#1360 ) Updated versions of: Tesseract Leptonica Pandoc Testing: `make docker-build` `make docker-test`	2023-09-09 10:47:16 -07:00
suraj chauhan	f4bf1fa270	Chore: Libmagic detection for "application/octet-stream" when it is not a zip file. (#1347 ) Addressed the issue #494 . Updated the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of file when it is not a zip file.	2023-09-08 18:49:00 +00:00
cragwolfe	87bfe7a1fe	build(deps): PDF images, unstructured-inference==0.5.23 (#1341 ) Bumps unstructured-inference==0.5.23 to pull in @christinestraub's fix: https://github.com/Unstructured-IO/unstructured-inference/pull/198 , so embedded Images in PDFs are now included in partition results ("hi_res"). From the perspective of elements with clean text, this is not a big win as a lot of the images have OCR garbage. However, it is important to preserve image elements for other downstream use cases, so overall this is a step forward.	2023-09-08 05:29:53 +00:00
pravin-unstructured	8641fe39dc	Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323 ) If a layout model is used from unstructured-inference, you get back class probabilities in the element metadata from partition. extra-pdf-image-in in requirements already has the newest version of unstructured-inference in there without a pinned version. Is there any place else that the unstructured-inference version needs to be updated to the required release version, 0.5.22?	2023-09-07 22:56:43 -04:00
Wahab Alshahin	e4e25c9feb	Add clean_ligatures to core cleaners (#1326 ) # Background [Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets)) can sometimes show up during the text extraction process when they should not. Very common examples of this are with the Latin `f` related ligatures which can be very subtle to spot by eye (see example below), but can wreak havoc later. ```python "ﬀ": "ff", "ﬁ": "fi", "ﬂ": "fl", "ﬃ": "ffi", "ﬄ": "ffl", ``` Several libraries already do something like this. Most recently, `pdfplumber` added this sort of capability as part of the text extraction process, see https://github.com/jsvine/pdfplumber/issues/598 Instead of incorporating any sort of breaking change to the PDF text processing in `unstructured`, it is best to add this as another cleaner and allow users to opt in. In turn, the `clean_ligatures` method has been added in this PR - with accompanying tests. # Example Here is an example PDF that causes the issue. For example: `Beneﬁts`, which should be `Benefits`. [example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf) ```bash curl -X 'POST' \ 'https://api.unstructured.io/general/v0/general' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \ -F 'files=@example.pdf' \ -s \| jq -C . ``` # Notes An initial list of mappings was added with the most common ligatures. There is some subjectivity to this, but this should be a relatively safe starting set. Can always be expanded as needed.	2023-09-07 21:30:18 +00:00
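The opt-in cleaner from #1326 is essentially a substitution table applied to the text. A minimal sketch using only the five mappings quoted in the commit message; `replace_ligatures` is a hypothetical name, not the library's `clean_ligatures`, whose full mapping list may be longer.

```python
# The five Latin "f" ligatures quoted in the commit message, written as escapes.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def replace_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter expansion."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


print(replace_ligatures("Bene\ufb01ts"))
```

Each ligature is a single distinct code point, so the order of replacement does not matter here.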
Matt Robinson	22974f61ce	fix: separate elements by `<br>` tag in `partition_html` (#1314 ) ### Summary Closes #1230. Updates `partition_html` to split on `<br>` tags that appear within text elements. ### Testing The following code previously produced one giant element on `main`. ```python from unstructured.partition.html import partition_html filename = "example-docs/ideas-page.html" elements = partition_html(filename=filename) len(elements) # Should be 4 print("\n\n".join([str(el) for el in elements])) ``` The output should be: ```python January 2023 (Someone fed my essays into GPT to make something that could answer questions based on them, then asked it where good ideas come from. The answer was ok, but not what I would have said. This is what I would have said.) The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge. Knowledge grows fractally. From a distance its edges look smooth, but when you learn enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps yields whole new fractal buds. ```	2023-09-07 13:16:31 +00:00


1447 Commits