unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-31 21:10:43 +00:00

Author	SHA1	Message	Date
jiajun-unstructured	b0dbd71aff	Parallelize tests (#4024 )	2025-06-16 23:29:35 +00:00
Austin Walker	e3417d7e98	fix: Fix for Pillow error when extracting PNG images (#3998 ) When I tried to partition a PNG file and extract images, I got an error from Pillow: ``` WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image Traceback (most recent call last): File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save rawmode = RAWMODE[im.mode] KeyError: 'RGBA' ``` The issue is that a PNG has an additional layer that cannot be saved off in jpeg format. We can fix this with a quick conversion. I added a png test case that is now passing with this fix.	2025-05-08 21:57:05 +00:00
Philippe PRADOS	d570f4624b	Fix sort_page_element. ensures that sorting is stable and not random. (#3978 ) The sort_page_element() use the element id to sort the elements. Two executions of the same code, on the same file, produce different results. The order of the elements is random. This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.	2025-04-07 15:57:20 +00:00
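The ordering problem described in the commit above can be illustrated with a small sketch. This is not the library's `sort_page_elements()` code; `Box`, `unstable_sort_boxes`, and `stable_sort_boxes` are hypothetical names. It assumes the nondeterminism comes from tie-breaking on a randomly generated id, and relies on the fact that Python's `sorted()` is stable.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Box:
    # Hypothetical stand-in for a page element: a position plus a random id.
    x: float
    y: float
    text: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def unstable_sort_boxes(boxes):
    # Tie-breaking on a random id: boxes at the same position can swap between runs.
    return sorted(boxes, key=lambda b: (b.y, b.x, b.id))


def stable_sort_boxes(boxes):
    # Coordinates only: sorted() is stable, so ties keep their input order.
    return sorted(boxes, key=lambda b: (b.y, b.x))


boxes = [Box(0, 10, "b"), Box(0, 0, "first"), Box(0, 0, "second")]
ordered = [b.text for b in stable_sort_boxes(boxes)]
```

With the stable variant, the two boxes at the same position always come out in input order, which is what makes reproducible unit tests possible.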
cragwolfe	dfa17bd3a0	fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975 )	2025-04-04 14:38:23 -07:00
Yao You	4e424efd22	feat: use lxml instead of bs4 to parse hOCR data (#3960 ) - `lxml` is a much faster library than `bs4` when the input data is regular - since the hOCR data is guaranteed to be regular (programmatically generated) we don't need `bs4` here to parse the data - `lxml` improves parsing speed by about 10x Example runtime profiling locally using the same `hocr` data from 1 page pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main and `agent.hocr_to_dataframe` is the PR's method. ![Screenshot 2025-03-17 at 12 14 59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)	2025-03-18 00:36:19 +00:00
Yao You	2dceac34b5	Feat/remove reference of PageLayout.elements (#3943 ) This PR removes usage of `PageLayout.elements` from partition function, except for when `analysis=True`. This PR updates the partition logic so that `PageLayout.elements_array` is used everywhere to save memory and cpu cost. Since the analysis function is intended for investigation and not for general document processing purposes, this part of the code is left for a future refactor. `PageLayout.elements` uses a list to store layout elements' data while `elements_array` uses `numpy` array to store the data, which has much lower memory requirements. Using `memory_profiler` to test the differences is usually around 10x.	2025-03-12 15:21:21 +00:00
Yao You	8759b0aac9	feat: allow passing down of ocr agent and table agent (#3954 ) This PR allows passing down both `ocr_agent` and `table_ocr_agent` as parameters to specify the `OCRAgent` class for the page and tables, if any, respectively. Both are default to using `tesseract`, consistent with the present default behavior. We used to rely on env variables to specify the agents but os env can be changed during runtime outside of the caller's control. This method of passing down the variables ensures that specification is independent of env changes. ## testing Using `example-docs/img/layout-parser-paper-with-table.jpg` and run partition with two different settings. Note that this test requires `paddleocr` extra. ```python from unstructured.partition.auto import partition from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE) elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT) ``` we should see both finish and slight differences in the table element's text attribute.	2025-03-11 16:36:31 +00:00
Yao You	43b682ad3f	feat: allow extraction of camel cased element type names (#3938 ) This PR allows element types with CamelCase names to be extracted using the `extract_image_block_types` variable. Before: specifying `extract_image_block_types=["NarrativeText"]` (or any casing of `NarrativeText`) raised a warning that it doesn't match any available types, and no image was extracted for this element type. Now: specifying `extract_image_block_types=["NarrativeText"]` extracts images for this element type. ## testing ```python from unstructured.partition.auto import partition f = "example-docs/pdf/embedded-images-tables.pdf" elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"]) ``` Without this PR no figures would be extracted. With this PR a local folder would be created to contain images of the narrative text elements, in a path like `./figures/figure-1-1.jpg` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2025-03-04 01:33:05 +00:00
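The casing behavior described in the commit above can be sketched as a case-insensitive lookup. This is a minimal illustration, not the code from the PR; `match_element_types` and the `available` list are hypothetical.

```python
def match_element_types(requested, available):
    """Map user-supplied type names to canonical names, ignoring case."""
    # Build a lowercase -> canonical lookup once, then resolve each request.
    lookup = {name.lower(): name for name in available}
    return [lookup[r.lower()] for r in requested if r.lower() in lookup]


# Hypothetical list of canonical element type names.
available = ["NarrativeText", "Image", "Table"]
matched = match_element_types(["narrativetext", "IMAGE", "Bogus"], available)
```

Names that do not match any canonical type are dropped here; a real implementation would presumably warn about them.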
Pluto	3973a30b8c	Feat: Add pdfminer parameters configuration (#3918 ) This pull request adds the ability to configure multiple pdfminer parameters (with the simple possibility to extend for the additional parameters). One of the parameters overwrites the default from LA Params config class. Example: ```python3 partition( filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"), pdfminer_line_margin=1.123, pdfminer_char_margin=None, pdfminer_line_overlap=0.0123, pdfminer_word_margin=3.21, ) assert pdfminer_mock.call_args.kwargs == { "line_margin": 1.123, "line_overlap": 0.0123, "word_margin": 3.21, } ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>	2025-02-17 11:41:20 +00:00
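The assertion in the example above implies that `pdfminer_`-prefixed arguments are forwarded without the prefix and that `None` values are left out. A minimal sketch of that mapping, assuming nothing about the actual implementation (`collect_pdfminer_kwargs` is a hypothetical helper):

```python
def collect_pdfminer_kwargs(**kwargs):
    """Strip the `pdfminer_` prefix and drop parameters left as None."""
    prefix = "pdfminer_"
    return {
        key[len(prefix):]: value
        for key, value in kwargs.items()
        if key.startswith(prefix) and value is not None
    }


laparams = collect_pdfminer_kwargs(
    pdfminer_line_margin=1.123,
    pdfminer_char_margin=None,  # None means "keep the default", so it is not forwarded
    pdfminer_line_overlap=0.0123,
    pdfminer_word_margin=3.21,
    strategy="fast",  # unrelated arguments are ignored
)
```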
Philippe PRADOS	b521bce9c6	Add password with PDF files (#3721 ) Add password with PDF files Must be combined with [PR 392 in unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392) --------- Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>	2025-02-11 17:39:16 +00:00
Yao You	9d58b34ab4	Fix/fix table id checking logic (#3898 ) - there is a bug in deciding if a page has tables before performing table extraction. This logic checks if the id associated with Table type element is True - however, it should be checking if the id is `None` because sometimes the id can be 0 (the first type of element in the page) - the fix updates the logic - adds a unit test for this specific case	2025-01-31 10:19:14 -08:00
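The bug described in the commit above is the usual truthiness pitfall: an id of `0` is falsy. A short sketch with hypothetical function names (not the library's code) shows the difference between the two checks:

```python
from typing import Optional


def has_table_truthy(table_class_id: Optional[int]) -> bool:
    # Buggy variant: an id of 0 is falsy, so a valid table id is treated as missing.
    return bool(table_class_id)


def has_table(table_class_id: Optional[int]) -> bool:
    # Fixed variant: only a missing id (None) means "no table".
    return table_class_id is not None
```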
Yao You	a9ff1e70b2	Fix/fix ocr region to elements bug (#3891 ) This PR fixes a bug in `build_layout_elements_from_ocr_regions` where texts are joined in the wrong order. The bug is due to incorrect masking of the `ocr_regions` after some are already selected as one of the final groups. The fix uses a simpler method: it masks regions using the same indices that add them to the final groups, so they are not considered again. ## Testing This PR adds a unit test specifically aimed at this bug. Without the fix the test would fail. Additionally, any PDF file with repeated text can trigger this bug. E.g., create a simple PDF using the test text ```python "LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning" ``` and partition it with `ocr_only` mode on the main branch; this hits the bug and outputs text where the position of the second "LayoutParser" is incorrect. ```python [ 'LayoutParser:', 'A Unified Toolkit for Deep Learning Based Document Image', 'for Deep Learning LayoutParser', ] ```	2025-01-29 12:11:17 +00:00
David Huggins-Daines	9e5ff225f6	fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822 ) Fixes: #3815 Verified on my very large documents that it doesn't unnecessarily and unsuccessfully "repair" them. You may or may not wish to keep the version check in `patch_psparser`. Since ~you're pinning the version of pdfminer.six and since it isn't guaranteed that the bug in question will be fixed in the next pdfminer.six release (but it is rather serious, so I should hope so), then perhaps you just want to unconditionally patch it.~ it seems like pinning of versions is only operative when running from Docker (good!) so never mind! Keep that version check! Also corrected an import so that if you do feel like using a newer version of pdfminer.six, it won't break on you. --------- Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>	2025-01-24 14:27:25 -06:00
Yao You	8f2a719873	Feat/refactor layoutelement textregion to vectorized data structure (#3881 ) This PR refactors the data structure for `list[LayoutElement]` and `list[TextRegion]` used in partition pdf/image files. - new data structure replaces a list of objects with one object with `numpy` array to store data - this only affects partition internal steps and it doesn't change input or output signature of `partition` function itself, i.e., `partition` still returns `list[Element]` - internally `list[LayoutElement]` -> `LayoutElements`; `list[TextRegion]` -> `TextRegions` - current refactor stops before clean up pdfminer elements inside inferred layout elements -> the algorithm of clean up needs to be refactored before the data structure refactor can move forward. So current refactor converts the array data structure into list data structure with `element_array.as_list()` call. This is the last step before turning `list[LayoutElement]` into `list[Element]` as return - a future PR will update this last step so that we build `list[Element]` from `LayoutElements` data structure instead. The goal of this PR is to replace the data structure as much as possible without changing underlying logic. There are a few places where the slicing or filtering logic was simple enough to be converted into vector data structure operations. Those are refactored to be vector based. As a result there is some small improvements observed in ingest test. This is likely because the vector operations cleaned up some previous inconsistency in data types and operations. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2025-01-23 17:11:38 +00:00
Pluto	8685905bd1	Character confidence threshold (#3860 ) This change adds the ability to filter out characters predicted by Tesseract with low confidence scores. Some notes: - I intentionally disabled it by default; I think some low score(like 0.9-0.95 for Tesseract) could be a safe choice though - I wanted to use character bboxes and combine them into word bbox later. However, a bug in Tesseract in some specific scenarios returns incorrect character bboxes (unit tests caught it 🥳 ). More in comment in the code	2025-01-13 13:12:46 +00:00
Steve Canny	10f0d54ac2	build: remove ruff version upper bound (#3829 ) Summary Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.	2024-12-16 23:01:22 +00:00
Christine Straub	df156ebe5a	feat: support pdf link extraction in hi_res strategy (#3753 ) This PR aims to add support for link extraction in pdf `hi_res` strategy. The `partition_pdf()` function now supports link extraction when using the `hi_res` strategy, allowing users to extract hyperlinks from PDF documents. ### Summary - Added functionalities to support link extraction in hi_res flow - Enhanced word extraction functionality used for link extraction in both `fast` and `hi_res` flows, resulted in more correct `start_index` and `text` in `links` metadata. - Updated ingest fixture update workflow to not skip Astra DB source test ### Testing ``` elements = partition_pdf( filename="example-docs/pdf/embedded-link.pdf", strategy="hi_res" ) assert len(elements[0].metadata.links) == 3 ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2024-10-31 16:52:27 +00:00
Yao You	a11ad22609	bump `unstructured-inference` (#3711 ) This PR bumps `unstructured-inference` to `0.8.0`, which introduces vectorized data structure for layout elements and text regions. This PR also cleans up a few places in CI that has repeated definition of env variables or missing installation of testing dependencies in cache. A few document ingest results are changed: - two places for `biomed-api` (actually processed locally on runner) are due to very small changes in numerical results of the bounding box areas: one results in a duplicated page number/header and another results in a deduplication of a word of a sentence that starts in a new line. (yes, two cases goes in opposite directions) - the layout parser paper now outputs the code lines with page number inside the code box as list items --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-10-21 21:55:08 +00:00
Nathan Van Gheem	b092d45816	Remove unsupported chipper model (#3728 ) The chipper model is no longer supported.	2024-10-17 17:40:45 +00:00
David Blore	ecf0267b85	fix: add `language` to `OCRAgentGoogleVision` constructor (#3696 ) This PR addresses issue #3659 by adding an optional `language` parameter to the `OCRAgentGoogleVision` class constructor. This parameter serves as a "language hint" for the `document_text_detection` method in the `ImageAnnotatorClient`. For more information on language hints, refer to the [Google Cloud Vision documentation](https://cloud.google.com/vision/docs/languages). Default Behavior: The language parameter defaults to None, allowing Google Cloud Vision to auto-detect the language, as recommended in their documentation. Purpose: This change is necessary because the `OCRAgent`'s `get_instance` method expects all `OCRAgent`s to include a language parameter in their constructors. Context on Issue: When trying to parse a PDF with `OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`, an error occurs in the `get_instance` method. The method expects a `language` parameter, which the current `OCRAgentGoogleVision` constructor does not support, leading to a positional argument error. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-10-14 05:35:05 +00:00
Steve Canny	44bad216f3	rfctr(part): prepare for pluggable auto-partitioners 3 (#3661 ) Summary Remove unused `include_metadata` parameter. Additional Context - The `include_metadata` parameter was originally added circa v0.7.12 as a mechanism for avoiding the "double-decorating" problem on delegating partitioners. - It turns out it doesn't fully address that problem, is now unused, and is unnecessary for the solution we'll be adding as part of pluggable partitioners. - Remove the unnecessary complexity introduced by this unused parameter.	2024-09-25 18:17:48 +00:00
Steve Canny	3bab9d93e6	rfctr(part): prepare for pluggable auto-partitioners 1 (#3655 ) Summary In preparation for pluggable auto-partitioners simplify metadata as discussed. Additional Context - Pluggable auto-partitioners require partitioners to have a consistent call signature. An arbitrary partitioner provided at runtime needs to have a call signature that is known and consistent. Basically `partition_x(filename, *, file, **kwargs)`. - The current `auto.partition()` is highly coupled to each distinct file-type partitioner, deciding which arguments to forward to each. - This is driven by the existence of "delegating" partitioners, those that convert their file-type and then call a second partitioner to do the actual partitioning. Both the delegating and proxy partitioners are decorated with metadata-post-processing decorators and those decorators are not idempotent. We call the situation where those decorators would run twice "double-decorating". For example, EPUB converts to HTML and calls `partition_html()` and both `partition_epub()` and `partition_html()` are decorated. - The way double-decorating has been avoided in the past is to avoid sending the arguments the metadata decorators are sensitive to to the proxy partitioner. This is very obscure, complex to reason about, error-prone, and just overall not a viable strategy. The better solution is to not decorate delegating partitioners and let the proxy partitioner handle all the metadata. - This first step in preparation for that is part of simplifying the metadata processing by removing unused or unwanted legacy parameters. - `date_from_file_object` is a misnomer because a file-object never contains last-modified data. - It can never produce useful results in the API where last-modified information must be provided by `metadata_last_modified`. - It is an undocumented parameter so not in use. - Using it can produce incorrect metadata.	2024-09-23 22:23:10 +00:00
Christine Straub	0ed69a1ac3	refactor: pdfminer image cleanup (#3648 ) This PR aims to remove `clean_pdfminer_duplicate_image_elements()` function, as its functionality has already been integrated into the `remove_duplicate_elements()` function in [PR #3630](https://github.com/Unstructured-IO/unstructured/pull/3630).	2024-09-19 18:57:02 +00:00
Christine Straub	be88eef06f	perf: optimize pdfminer image cleanup process for improved performance (#3630 ) This PR enhances `pdfminer` image cleanup process by repositioning the duplicate image removal step. It optimizes the removal of duplicated pdfminer images by performing the cleanup before merging elements, rather than after. This improvement reduces execution time and enhances the overall processing speed of PDF documents. --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2024-09-19 14:05:05 +00:00
Christine Straub	87a88a3c87	feat: improve pdfminer element processing (#3618 ) This PR implements splitting of `pdfminer` elements (`groups of text chunks`) into smaller bounding boxes (`text lines`). This implementation prevents loss of information from the object detection model and facilitates more effective removal of duplicated `pdfminer` text. This PR also addresses #3430. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-12 21:17:27 +00:00
Yao You	d51fb134e6	Feat/improve iou speed (#3582 ) This PR vectorizes the computation of element overlap to speed up deduplication process of extracted elements. ## test This PR adds unit test to the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements like this slide: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf) shows a reduction of runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the time cost of computation and now most of the time is spend on getting the coordinates from a list of bboxes. ![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)	2024-09-03 00:06:18 +00:00
Pawel Kmiecik	404f780bbb	feat: make analysis drawing more flexible (#3574 ) This PR changes the way the analysis tools can be used: - by default if `analysis` is set to `True` in `partition_pdf` and the strategy is resolved to `hi_res`: - for each file 4 layout dumps are produced and saved as JSON files (`object_detection`, `extracted`, `ocr`, `final`) - similar way to the current `object_detection` dump - the drawing functions/classes now accept these dumps accordingly instead of the internal classes instances (like `TextRegion`, `DocumentLayout`) - it makes it possible to use the lightweight JSON files to render the bboxes of a given file after the partition is done - `_partition_pdf_or_image_local` has been refactored and most of the analysis code is now encapsulated in `save_analysis_artifiacts` function - to do this, helper function `render_bboxes_for_file` is added ![Screenshot 2024-08-28 at 14 37 56](https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0)	2024-09-02 11:06:11 +00:00
Christine Straub	fc26426310	feat: replace `pytesseract` with `unstructured.pytesseract` fork (#3528 ) This PR reverts `pytesseract` dependency to `unstructured.pytesseract` fork due to the unavailability of some recent release versions of `pytesseract` on PyPI. This PR also addresses an issue encountered during the publication of `unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI does not allow direct dependencies from Version Control System URLs like GitHub in the `install_requires` or `extras_require` sections of the `setup.py` file.	2024-08-16 10:34:22 -04:00
Jake Zerrer	051be5aead	Remove unstructured.pytesseract fork (#3454 ) A second attempt at https://github.com/Unstructured-IO/unstructured/pull/3360, this PR removes unstructured's dependency on its own fork of `pytesseract`. (The original reason for the fork, the addition of `run_and_get_multiple_output`, was removed [here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).) --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-08-09 04:28:48 +00:00
Maciej Kurzawa	b749b891a7	fix: disabled checking max pages for images (#3473 ) Added fix related to https://github.com/Unstructured-IO/unstructured/pull/3431, which disables checking max pages for images	2024-08-02 14:25:08 +00:00
Maciej Kurzawa	8fd216cc9f	feat/pdf-page-limit-in-hi-res (#3431 ) # Description: Passing the `max_pages` argument allows rejecting PDF files that exceed this page limit when the `hi_res` strategy is chosen. By default, PDF files with any number of pages are parsed. # Testing: ```python from unstructured.partition.auto import partition elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError ```	2024-07-30 16:52:17 +00:00
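The limit check described above can be sketched in a few lines. This is an illustration only: the exception class below is a local stand-in that borrows the name from the example, and `check_page_limit` is a hypothetical helper, not the library's function.

```python
from typing import Optional


class PdfMaxPagesExceededError(ValueError):
    """Local stand-in named after the error mentioned in the example above."""


def check_page_limit(num_pages: int, max_pages: Optional[int] = None) -> None:
    # max_pages=None means "no limit", matching the default described above.
    if max_pages is not None and num_pages > max_pages:
        raise PdfMaxPagesExceededError(
            f"document has {num_pages} pages, limit is {max_pages}"
        )
```

Running the check before any page is rendered keeps the rejection cheap, since the expensive `hi_res` work never starts for an oversized file.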
Christine Straub	0eb461acc2	refactor: restructure PDF/Image example document organization (#3410 ) This PR aims to improve the organization and readability of our example documents used in unit tests, specifically focusing on PDF and image files. ### Summary - Created two new subdirectories in the `example-docs` folder: - `pdf/`: for all PDF example files - `img/`: for all image example files - Moved relevant PDF files from `example-docs/` to `example-docs/pdf/` - Moved relevant image files from `example-docs/` to `example-docs/img/` - Updated file paths in affected unit & ingest tests to reflect the new directory structure ### Testing All unit & ingest tests should be updated and verified to work with the new file structure. ## Notes Other file types (e.g., office documents, HTML files) remain in the root of `example-docs/` for now. ## Next Steps Consider similar reorganization for other file types if this structure proves to be beneficial. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-07-18 22:21:32 +00:00
Christine Straub	48bdf94656	feat: `partition_pdf()` support language specification for PaddleOCR (#3400 ) Closes #3159. This PR extends language specification capability to `PaddleOCR` in addition to `TesseractOCR`. Users can now specify OCR languages for both OCR engines when using `partition_pdf()`. ### Testing ``` os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" elements = partition_pdf( filename=<file_path>, strategy=strategy, languages=["chi_sim"], # chinese - simplified infer_table_structure=True, ) ```	2024-07-16 22:19:25 +00:00
Steve Canny	d48fa3b163	rfctr(auto): improve typing and organize auto tests (#3355 ) Summary In preparation for further work on auto-partitioning (`partition()`), improve typing and organize `test_auto.py` by introducing categories.	2024-07-08 21:25:17 +00:00
Pawel Kmiecik	575957b2d2	feat: enhance analysis options with od model dump and better vis (#3234 ) This PR adds new capabilities for drawing bboxes for each layout (extracted, inferred, ocr and final) + OD model output dump as a json file for better analysis. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>	2024-06-26 13:14:55 +00:00
Austin Walker	0b73978b92	fix: fix `IndexError` when partitioning a pdf with `starting_page_number` (#3246 ) The Issue: When extracting images from pdfs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via `starting_page_number`. To get the true page index, we need to subtract this value. Testing: Run this snippet in a python shell. Before the fix, this throws an IndexError. On this branch, it will return the elements. ``` from unstructured.partition.auto import partition filename = "example-docs/layout-parser-paper-with-table.pdf" partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20) ``` --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-06-19 18:20:54 +00:00
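The index arithmetic behind this fix can be sketched as follows. `image_index_for_page` and the `page_images` list are hypothetical; the sketch only shows why subtracting `starting_page_number` gives a valid list index.

```python
def image_index_for_page(page_number: int, starting_page_number: int = 1) -> int:
    # Metadata page numbers begin at starting_page_number; list indices begin at 0.
    return page_number - starting_page_number


# Two rendered pages of a document partitioned with starting_page_number=20.
page_images = ["image-of-first-page", "image-of-second-page"]
second = page_images[image_index_for_page(21, starting_page_number=20)]
```

Indexing `page_images` with the raw metadata page number (21) would raise `IndexError`, which is the failure the commit describes.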
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
Christine Straub	f4457249a7	fix: `partition_pdf()` removes spaces from the text (#3106 ) Closes #2896. This PR aims to fix `partition_pdf()` to keep spaces in text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements. ### Testing PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf) ``` elements = partition_pdf( filename="rok_20230930_1-1.pdf", strategy="hi_res", ) print(str(elements[20])) ``` Results: - PR ``` Name of each exchange on which registered New York Stock Exchange ``` - main branch ``` Nameofeachexchangeonwhichregistered NewYorkStockExchange ```	2024-05-29 04:53:17 +00:00
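The difference between removing and replacing the tab character can be reproduced on a plain string. This is a sketch with hypothetical helper names, assuming the embedded text uses `\t` as its word separator as in the result above:

```python
def strip_tabs(text: str) -> str:
    # Old behavior: removing the control character glues words together.
    return text.replace("\t", "")


def tabs_to_spaces(text: str) -> str:
    # New behavior: replacing it with a space keeps word boundaries.
    return text.replace("\t", " ")


raw = "New\tYork\tStock\tExchange"
before = strip_tabs(raw)
after = tabs_to_spaces(raw)
```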
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061 ) This PR adds the ability to fill inferred elements text from embedded text (`pdfminer`) without depending on `unstructured-inference` library. This PR is the second part of moving embedded text related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass	2024-05-17 20:55:11 +00:00
amadeusz-ds	1c8b2b23eb	feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014 ) This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir. #### Edit: Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_ #### Edit 2: Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_	2024-05-17 19:16:10 +00:00
Christine Straub	1fb0fe5cf5	enhancement: `partition_pdf()` skip unnecessary element sorting (#3030 ) This PR aims to skip element sorting when determining whether embedded text can be extracted. The extracted elements in this step are returned as final elements only for the `fast` strategy pipeline and are never used for other strategy pipelines (`hi_res`, `ocr`). Removing element sorting in this step and adding it to the `fast` strategy pipeline later will improve performance and reduce execution time. ### Summary - skip element sorting when determining whether embedded text can be extracted. - add `_partition_pdf_with_pdfparser()` function for `fast` strategy pipeline ### Testing CI should pass.	2024-05-16 06:02:56 +00:00
John	d829b669e6	Add starting_page_num param to partition_image (#2987 ) Add missing starting_page_num param to partition_image Closes #2985	2024-05-09 21:31:35 +00:00
Christine Straub	0cd07d78f9	feat: `partition_pdf()` add ability to get `cid` ratio (#2970 ) This PR adds the ability to get the ratio of `cid` characters in embedded text extracted by `pdfminer`. This PR is the second part of moving `cid` related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/342.	2024-05-04 05:21:27 +00:00
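As a rough illustration of what a `cid` ratio measures: `pdfminer` emits placeholders of the form `(cid:123)` for glyphs it cannot map to Unicode. The sketch below counts each placeholder as one character; that counting rule and the `cid_ratio` helper are assumptions, not the definition used in the PR.

```python
import re

# pdfminer-style placeholder for an unmapped glyph, e.g. "(cid:42)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Share of characters that are cid placeholders (each counted once)."""
    if not text:
        return 0.0
    cid_count = len(CID_PATTERN.findall(text))
    readable_count = len(CID_PATTERN.sub("", text))
    return cid_count / (cid_count + readable_count)


ratio = cid_ratio("ab(cid:1)(cid:22)")
```

A high ratio suggests the embedded text layer is unusable and OCR may give better results.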
Michał Martyniak	7720e72424	Fix: avoid elements sharing the same memory address (#2940 ) This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated. When `assign_and_map_hash_ids()` is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2024-04-28 19:15:17 -07:00
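The aliasing problem described above is easy to reproduce outside the library. In this sketch `Meta` and `Item` are hypothetical stand-ins for element metadata and elements; the point is only that two objects holding the same metadata instance see each other's updates, while a copy does not.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meta:
    parent_id: Optional[str] = None


@dataclass
class Item:
    text: str
    metadata: Meta


shared = Meta()
a = Item("a", shared)
b = Item("b", shared)  # same Meta object as a: this is the bug pattern

a.metadata.parent_id = "p1"
leaked = b.metadata.parent_id  # b was never assigned, yet it changed too

c = Item("c", copy.deepcopy(shared))  # independent copy
c.metadata.parent_id = "p2"
```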
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
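A deterministic id built from the four inputs listed above can be sketched with `hashlib`. The hash function, the separator, and the 32-character truncation are assumptions made for this example, and `element_id` is a hypothetical name.

```python
import hashlib


def element_id(text: str, sequence_number: int, page_number: int, filename: str) -> str:
    # Join the inputs with a separator so ("ab", 1) and ("a", "b1") cannot collide.
    data = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


first = element_id("Hello", 0, 1, "report.pdf")
again = element_id("Hello", 0, 1, "report.pdf")
next_on_page = element_id("Hello", 1, 1, "report.pdf")
```

Because the sequence number is part of the input, two elements with identical text on the same page still get distinct ids, and repeated runs on the same file produce the same ids.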
Dimitri Lozeve	abb0174181	Integration with the Google Cloud Vision API (#2902 ) This PR adds a third OCR provider, alongside Tesseract and Paddle: the [Google Cloud Vision API](https://cloud.google.com/vision). It can be used similarly to other OCR methods: set the `OCR_AGENT` environment variable to the path to the OCR module (`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`). You also need to set the credentials to use Google APIs, for instance by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-04-23 21:11:39 +00:00
Christine Straub	ac5048bf30	enhancement: remove duplicate embedded images (#2897 ) This PR aims to remove duplicate embedded images taken by `PDFminer`. ### Summary - add `clean_pdfminer_duplicate_image_elements()` to remove embedded images with similar `bboxes` and the same `text` - add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the bounding boxes of two embedded images as the same region - refactor: reorganize `clean_pdfminer_inner_elements()`	2024-04-18 23:07:47 +00:00
Michał Martyniak	cb1e91058e	Introduce `start_page` argument to partitioning functions that assign `element.metadata.page_number` (#2884 ) This small change will be useful for users who partition only fragments of their PDF documents. It's a small step towards addressing this issue: https://github.com/Unstructured-IO/unstructured/issues/2461 Related PRs: * https://github.com/Unstructured-IO/unstructured/pull/2842 * https://github.com/Unstructured-IO/unstructured/pull/2673	2024-04-15 21:03:42 +00:00
Christine Straub	887e6c9094	refactor: use env_config instead of `SUBREGION_THRESHOLD_FOR_OCR` constant (#2697 ) The purpose of this PR is to introduce a new env_config for the subregion threshold for OCR. ### Testing CI should pass.	2024-03-28 20:28:35 +00:00

100 Commits