unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-01 21:36:57 +00:00

Author	SHA1	Message	Date
Yao You	2dceac34b5	Feat/remove reference of PageLayout.elements (#3943) This PR removes usage of `PageLayout.elements` from the partition function, except when `analysis=True`. It updates the partition logic so that `PageLayout.elements_array` is used everywhere, to save memory and CPU cost. Since the analysis function is intended for investigation and not for general document processing, that part of the code is left for a future refactor. `PageLayout.elements` uses a list to store layout elements' data, while `elements_array` uses a `numpy` array, which has much lower memory requirements. Measured with `memory_profiler`, the difference in memory use is usually around 10x.	2025-03-12 15:21:21 +00:00
Pluto	3973a30b8c	Feat: Add pdfminer parameters configuration (#3918) This pull request adds the ability to configure multiple pdfminer parameters (and can easily be extended with additional parameters). One of the parameters overwrites the default from the `LAParams` config class. Example: ```python3 partition( filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"), pdfminer_line_margin=1.123, pdfminer_char_margin=None, pdfminer_line_overlap=0.0123, pdfminer_word_margin=3.21, ) assert pdfminer_mock.call_args.kwargs == { "line_margin": 1.123, "line_overlap": 0.0123, "word_margin": 3.21, } ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>	2025-02-17 11:41:20 +00:00
Yao You	8f2a719873	Feat/refactor layoutelement textregion to vectorized data structure (#3881) This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partitioning pdf/image files. - the new data structure replaces a list of objects with one object that uses `numpy` arrays to store data - this only affects internal partition steps and doesn't change the input or output signature of the `partition` function itself, i.e., `partition` still returns `list[Element]` - internally `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions` - the current refactor stops before the cleanup of pdfminer elements inside inferred layout elements -> the cleanup algorithm needs to be refactored before the data structure refactor can move forward. So the current refactor converts the array data structure into a list data structure with an `element_array.as_list()` call. This is the last step before turning `list[LayoutElement]` into `list[Element]` as the return value - a future PR will update this last step so that we build `list[Element]` from the `LayoutElements` data structure instead. The goal of this PR is to replace the data structure as much as possible without changing the underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations. Those are refactored to be vector based. As a result, some small improvements are observed in the ingest test. This is likely because the vector operations cleaned up some previous inconsistencies in data types and operations. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2025-01-23 17:11:38 +00:00
Christine Straub	0ed69a1ac3	refactor: pdfminer image cleanup (#3648) This PR removes the `clean_pdfminer_duplicate_image_elements()` function, as its functionality has already been integrated into the `remove_duplicate_elements()` function in [PR #3630](https://github.com/Unstructured-IO/unstructured/pull/3630).	2024-09-19 18:57:02 +00:00
Christine Straub	be88eef06f	perf: optimize pdfminer image cleanup process for improved performance (#3630) This PR improves the `pdfminer` image cleanup process by repositioning the duplicate image removal step: duplicated pdfminer images are now removed before merging elements, rather than after. This reduces execution time and speeds up the overall processing of PDF documents. --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2024-09-19 14:05:05 +00:00
Christine Straub	87a88a3c87	feat: improve pdfminer element processing (#3618 ) This PR implements splitting of `pdfminer` elements (`groups of text chunks`) into smaller bounding boxes (`text lines`). This implementation prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text. This PR also addresses #3430. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-12 21:17:27 +00:00
Yao You	d51fb134e6	Feat/improve iou speed (#3582) This PR vectorizes the computation of element overlap to speed up the deduplication process of extracted elements. ## test This PR adds unit tests for the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements like this slide: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf) shows a reduction of runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the time cost of computation, and now most of the time is spent on getting the coordinates from a list of bboxes. ![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)	2024-09-03 00:06:18 +00:00
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061) This PR adds the ability to fill the text of inferred elements from embedded text (`pdfminer`) without depending on the `unstructured-inference` library. This PR is the second part of moving embedded-text-related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Christine Straub	ac5048bf30	enhancement: remove duplicate embedded images (#2897) This PR aims to remove duplicate embedded images taken by `PDFminer`. ### Summary - add `clean_pdfminer_duplicate_image_elements()` to remove embedded images with similar `bboxes` and the same `text` - add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the bounding boxes of two embedded images as the same region - refactor: reorganize `clean_pdfminer_inner_elements()`	2024-04-18 23:07:47 +00:00

9 Commits
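
For readers unfamiliar with the vectorized overlap computation mentioned in #3582, the following is a minimal sketch of the general technique, not the code from that PR. It assumes boxes are given as `(x1, y1, x2, y2)` rows, and the name `pairwise_iou` is hypothetical; it only illustrates how `numpy` broadcasting can replace a nested Python loop over box pairs.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """Return an (N, M) matrix of IoU values for two sets of boxes.

    Hypothetical helper for illustration; boxes are (x1, y1, x2, y2) rows.
    """
    a = np.asarray(boxes_a, dtype=float).reshape(-1, 4)[:, None, :]  # (N, 1, 4)
    b = np.asarray(boxes_b, dtype=float).reshape(-1, 4)[None, :, :]  # (1, M, 4)

    # Width and height of the intersection, clipped at 0 for disjoint boxes.
    inter_w = np.clip(
        np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None
    )
    inter_h = np.clip(
        np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None
    )
    inter = inter_w * inter_h  # (N, M)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])  # (N, 1)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])  # (1, M)
    union = area_a + area_b - inter  # (N, M)

    # Leave IoU at 0 where the union is empty (degenerate boxes).
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


iou = pairwise_iou([[0, 0, 2, 2]], [[1, 1, 3, 3], [0, 0, 2, 2], [5, 5, 6, 6]])
```

Here `iou` has shape `(1, 3)`: one row for the single box in the first set and one column per box in the second. All pairs are evaluated in a handful of array operations, which is the kind of change that removes per-pair Python overhead during deduplication.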