unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-30 00:59:52 +00:00

Author	SHA1	Message	Date
luke-kucing	b01d35b237	fix: sanitize MSG attachment filenames to prevent path traversal (GHS… (#4117 ) Summary Fixes path traversal vulnerability in email and MSG attachment filename handling (GHSA-gm8q-m8mv-jj5m). Changes Security Fix Sanitizes attachment filenames in _AttachmentPartitioner for both email.py and msg.py Uses os.path.basename() to strip path components from filenames Normalizes backslashes to forward slashes to handle Windows paths on Unix systems Removes null bytes and other control characters Handles edge cases (empty strings, ".", "..") Defaults to "unknown" for invalid or dangerous filenames Test Coverage Added 17 comprehensive tests covering: Path traversal attempts (../../../etc/passwd) Absolute Unix paths (/etc/passwd) Absolute Windows paths (C:\Windows\System32\config\sam) Null byte injection (file\x00.txt) Dot and dotdot filenames (. and ..) Missing/empty filenames Complex mixed path separators Valid filenames (ensuring they pass through unchanged) Test Results ✅ All 17 new security tests pass ✅ All 129 existing tests pass ✅ No regressions Security Impact Prevents attackers from using malicious attachment filenames to write files outside the intended directory, which could lead to arbitrary file write vulnerabilities. Changes include comprehensive test coverage for various attack vectors and a version bump to 0.18.18. --------- Co-authored-by: Claude <noreply@anthropic.com>	2025-11-06 23:14:56 +00:00
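The sanitization steps listed in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from `email.py` or `msg.py`: the function name `sanitize_attachment_filename` is hypothetical, and the exact set of characters treated as control characters is a guess.

```python
import os


def sanitize_attachment_filename(filename):
    """Hypothetical helper: reduce an attachment filename to a safe base name."""
    if not filename:
        return "unknown"
    # Normalize Windows separators so basename() also strips them on Unix.
    name = filename.replace("\\", "/")
    # Drop every directory component, including "../" sequences.
    name = os.path.basename(name)
    # Remove null bytes and other ASCII control characters (assumed range).
    name = "".join(ch for ch in name if ord(ch) >= 32 and ord(ch) != 127)
    name = name.strip()
    # Reject names that are empty or that still refer to a directory.
    if name in ("", ".", ".."):
        return "unknown"
    return name
```

With this sketch, a traversal attempt such as `../../../etc/passwd` is reduced to its last component, and an ordinary filename passes through unchanged.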
David Potter	0d20f6a9b1	email date format flexibility (#4072 ) we are seeing some .eml files come through the VLM partitioner. Which then downgrades to hi-res i believe. For some reason they have a date format that is not standard email format. But it is still legitimate. This uses a more robust date package to parse the date. This package is already installed. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: potter-potter <potter-potter@users.noreply.github.com>	2025-08-13 18:55:24 +00:00
Nick Franck	b8c14a7a4f	fix: replace UnicodeDecodeError to prevent large payload logging (#4071 ) Replace UnicodeDecodeError with UnprocessableEntityError in encoding detection to avoid logging entire file contents. UnicodeDecodeError.object automatically stores complete input data, causing memory issues with large files in logging and error reporting systems.	2025-07-25 21:01:37 +00:00
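To illustrate the problem described in the commit above: a `UnicodeDecodeError` keeps a reference to the full input in its `object` attribute, so anything that logs or serializes the exception can end up carrying the whole file. The sketch below shows one way to replace it with a small error that holds only a short message. The `UnprocessableEntityError` class here is a local stand-in and `decode_or_reject` is a hypothetical name; the library's actual implementation may differ.

```python
class UnprocessableEntityError(Exception):
    """Local stand-in for the library's error type (assumption)."""


def decode_or_reject(data, encoding="utf-8"):
    """Decode bytes, raising a payload-free error if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Copy only small scalar fields; exc.object references the full input.
        message = (
            f"cannot decode input as {exc.encoding}: "
            f"{exc.reason} at position {exc.start}"
        )
    # Raising after the except block has ended means the new error does not
    # pick up the UnicodeDecodeError as its __context__.
    raise UnprocessableEntityError(message)
```

The new error is raised outside the `except` block on purpose: `raise ... from None` would hide the original exception from tracebacks but would still keep it reachable through `__context__`.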
Maksymilian Operlejn	536819719c	feat: map <input> tags by `type` + add coverage (#4068 ) Implements type-aware classification of `<input>` elements in `extract_tag_and_ontology_class_from_tag` (checkbox → `Checkbox`, radio → `RadioButton`, else → `FormFieldValue`) and updates/extends the HTML-to-ontology test suite to validate the new behaviour.	2025-07-22 16:13:07 +00:00
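The mapping described in the commit above can be written as a small lookup. This is only a sketch: the function name `ontology_class_name_for_input` is hypothetical, it returns class names as strings rather than the ontology classes themselves, and the case-insensitive comparison is an assumption.

```python
def ontology_class_name_for_input(input_type):
    """Hypothetical helper: map an <input> type attribute to a class name."""
    # Treat a missing attribute like an unrecognized type (assumption).
    normalized = (input_type or "").strip().lower()
    if normalized == "checkbox":
        return "Checkbox"
    if normalized == "radio":
        return "RadioButton"
    # Every other type falls back to a generic form field value.
    return "FormFieldValue"
```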
jiajun-unstructured	d24dec5e04	add '\|' as a delimiter in csv files (#4059 ) This PR fixes the error “Failure to process CSV: Expected 2 fields in line 2, saw 4” when '\|' is used as a delimiter in the csv file	2025-07-18 17:56:24 +00:00
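As a rough illustration of why the candidate delimiter set matters, the standard library's `csv.Sniffer` only considers the characters it is given. The sketch below uses the standard library alone; whether the library detects delimiters this way is an assumption, and `sniff_delimiter` is a hypothetical name.

```python
import csv


def sniff_delimiter(sample):
    """Guess the delimiter of a CSV sample from a fixed candidate set."""
    # Including "|" in the candidates lets pipe-separated files be recognized.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter
```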
Yao You	909716f310	feat: keep input tag's class attr in table (#4064 ) This change affects partition html. Previously when there is a table in the html, we clean any tags inside the table of their class and id attributes except for the class attribute for `img` tags. This change also preserves the class attribute for `input` tags inside a table. This change is reflected in a table element's metadata.text_as_html attribute.	2025-07-16 21:46:58 +00:00
shreyanid	446826885b	fix: add empty string case for language metadata (#4062 ) Add an empty string edge case for when the element text field is None or not a string. most of the diff is `make tidy`	2025-07-16 21:35:00 +00:00
qued	c7c3e3c082	feat: convert elements to markdown (#4055 ) Creates a staging function `elements_to_md` to convert lists of `Elements` to markdown strings (or a markdown file). Includes unit tests as well as ingest tests and expected output fixtures.	2025-07-16 14:34:29 +00:00
Filip Knefel	f66562b1cb	fix: properly handle password protected xlsx (#4057 ) ### Issue An attempt to partition a password-protected XLSX file results in an obscure exception > Can't find workbook in OLE2 compound document ### Solution Utilize [msoffcrypto-tool](https://pypi.org/project/msoffcrypto-tool/) package (MIT License) to load the XLSX file and check whether it's encrypted; if it is, throw an `UnprocessableEntityError` exception detailing the reason for rejecting the file. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2025-07-16 13:19:14 +00:00
shreyanid	344202fa6d	feat: detect language for PDFs (#4051 ) The `@apply_metadata` decorator already contains logic to detect the language of the element text (on either a document or element level). Update pdfs, and later images, to use this decorator to get accurate element language results outputted. Test ``` from unstructured.partition.auto import partition def test_partition_pdf(): pdf_path = "example-docs/language-docs/fr_olap.pdf" elements = partition(pdf_path) # optionally set `detect_language_per_element=True)` print(f"Number of elements partitioned: {len(elements)}") # Check if elements are returned assert len(elements) > 0, "No elements were partitioned from the PDF." # check language outputted for each element for element in elements: print(element) print(element.metadata.languages) print("-------------------------------") test_partition_pdf() ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>	2025-07-15 18:53:28 +00:00
mateuszkuprowski	37800c3523	feat: added new exception type to epub conversions (#4052 ) Added UnprocessableEpubError to better handle the case when an incoming epub file is actually damaged, which makes the pandoc lib crash with exit code 64.	2025-07-15 10:56:22 +00:00
Yao You	73d239fb28	feat: keep img tag's class attr (#4050 ) This change affects partition html. Previously when there is a table in the html, we clean any tags inside the table of their `class` and `id` attributes. However, sometimes there are images, `img` tags, present in a table and its `class` attribute identifies some important information about the image. This change preserves the `class` attribute for `img` tags inside a table. This change is reflected in a table element's `metadata.text_as_html` attribute.	2025-07-10 20:46:28 +00:00
qued	7764fb6fd4	build: drop remaining Python 3.9 refs (#4049 ) Dropped variables that said we support Python 3.9 in `setup.py`, as well as any remaining references to Python 3.9. I also checked the pins and removed several that don't seem necessary any more.	2025-07-10 16:43:15 +00:00
jiajun-unstructured	92965fb286	add fenced-code extension to the md parser (#4044 ) https://github.com/Unstructured-IO/unstructured/issues/3578 --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: Alan Bertl <alan@unstructured.io>	2025-07-07 21:05:54 +00:00
Yao You	aa332101ab	fix: fix header and footer not parsed as Header/Footer types (#4041 ) ## Summary This PR fixes an issue where header/footer content in html are not partitioned as `unstructured` `Header` or `Footer` element types. Rather they are either `UncategorizedText` or taking on the type of the nested structure inside the header/footer. E.g., `<header class="Header"><h1 class="Title">Header Title</h1></header>` would be partitioned as a `Title` instead of `Header`. ## Bug description This behavior is because we treat header and footer as layout, i.e., containers, in the ontology definition. As a result, during parsing we [unwrap](`ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378)`) the container and parse the contents as if they are from the main text even though they are still part of header/footer. The fix is to treat header/footer as text instead of layout in ontology so that all content inside of them are properly gathered under `Header`/`Footer` element types.	2025-07-01 21:58:43 +00:00
Klaijan	56e739b34c	fix: update md to reads umlauts on non-utf-8 files (#4037 ) This PR updates `partition_md` to read files with non-UTF-8 encodings without failing. Closes issue https://github.com/Unstructured-IO/unstructured-api/issues/489	2025-07-01 16:38:30 +00:00
jiajun-unstructured	66640f26fe	fix: xml processing not escaped (#4034 ) `<?xml version="1.0"?>` does not get escaped when converting to html, in a code block like this in the markdown file ```` <?xml version="1.0"?> <sparql xmlns="http://www.w3.org/2005/sparql-results#"> <head></head> <boolean>true</boolean> </sparql> ```` which causes the parser to throw error like > AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'. This PR processes the original md file and add indentation to `<?xml version="1.0"?>` to force the xml code to be escaped when being converted to html https://github.com/Unstructured-IO/unstructured/issues/3935	2025-06-30 20:15:38 +00:00
Klaijan	dab79b0c83	fix: add try/except wrap over row.cells to failproof tc grid_offset (#4033 ) This PR fixes the issue with `docx` with complex/recursive/merged/malformed tables by skipping cells that could not trace back to a valid `<w:tc>` element used by the `python-docx` due to missing or improperly merged rows. Accessing row.cells in such cases can raise a `ValueError` when `python-docx` fails to resolve the full logical table layout. This PR wraps those calls in `try/except` to skip problematic rows while continuing to extract usable content from the rest of the document.	2025-06-30 14:20:18 +00:00
Yuming Long	c04235c168	fix [NEX-49] : Fix TypeError for empty HTML content (#4032 ) ### Summary Addressed a TypeError that occurred when partitioning empty or whitespace-only HTML content. ## Test * unit test `test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error` reproduces the TypeError before the fix * the test now passes	2025-06-25 18:13:20 +00:00
jiajun-unstructured	b0dbd71aff	Parallelize tests (#4024 )	2025-06-16 23:29:35 +00:00
Yuming Long	55ad5fd637	fix chucking text None type has no attribute stripe (#4018 ) ### Summary To fix error `Error in chunk: 512: {"detail":"'NoneType' object has no attribute 'strip'"}` I found the logs under same org (could assume this is the same job) screenshot: ![Screenshot 2025-06-11 at 10 15 57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb) stack trace from the `utic-api` ES log doc: ![Screenshot 2025-06-11 at 2 01 01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa) ### Notes longer term we should make partitioner (vlm + utic-api) not return text with Null --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2025-06-12 18:28:46 +00:00
Pluto	ec209c6b5f	Remove IDs from HTML code (#4012 ) In this pull request parent-child relationship for elements generated with v2 parser is based on actual element IDs instead of IDs baked somewhere in the HTML script. With some extra bug fixing it allowed for significantly simplifying json -> HTML script	2025-06-11 11:55:02 +00:00
luke-kucing	a7e90f7990	resolve CVEs and HF issue (#4009 ) update reqs to resolve CVEs and add the HF ENV to stop it from reaching out updated the Dockerfile with ENV HF_HUB_OFFLINE=1 to stop it from pinging HF. This was an issue for a gov customer. and updated requirements to resolve some open CVEs --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>	2025-06-04 18:52:58 +00:00
jordan-homan	570ee078a4	fix: throw validation error when json is passed with invalid unstructured json (#4002 ) ### Notes Adds validation if `json` / `ndjson` are not valid unstructured schema. ### Testing Manually tested serverless API with example json: ``` test_length = [] = 200 test_invalid = [{"invalid": "schema"}] = 422 test_invalid_ndjson ={"hi": "there"} = 422 test_chunk = [{"type":"Header","element_id":"a23fdadef9277f217563e217ebd074d5" ... = 200 ```	2025-05-19 18:24:44 +00:00
Austin Walker	e3417d7e98	fix: Fix for Pillow error when extracting PNG images (#3998 ) When I tried to partition a PNG file and extract images, I got an error from Pillow: ``` WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image Traceback (most recent call last): File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save rawmode = RAWMODE[im.mode] KeyError: 'RGBA' ``` The issue is that a PNG has an additional layer that cannot be saved off in jpeg format. We can fix this with a quick conversion. I added a png test case that is now passing with this fix.	2025-05-08 21:57:05 +00:00
Yao You	b814ece39f	fix: properly handle the case when an element's text is None (#3995 ) Some elements, like `Image`, can have `None` as its `text` attribute's value. In that case current chunking logic fails because it expects the field to always have a length or can be split. The fix is to update the logic as `element.text or ""` for checking length and add flow control to early exit to avoid calling split on `None`.	2025-05-05 18:08:11 +00:00
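The `element.text or ""` pattern from the commit above can be shown in isolation. This is a simplified sketch and not the library's chunking logic: both function names are hypothetical, and the fixed-width split stands in for whatever splitting the chunker actually performs.

```python
def text_length(text):
    """Length of an element's text, treating None as empty."""
    return len(text or "")


def split_text(text, max_chars):
    """Split text into pieces of at most max_chars characters."""
    # Exit early so that no string method is ever called on None.
    if not text:
        return []
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```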
Philippe PRADOS	d570f4624b	Fix sort_page_element. ensures that sorting is stable and not random. (#3978 ) sort_page_element() uses the element id to sort the elements. Two executions of the same code on the same file produce different results; the order of the elements is random. This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.	2025-04-07 15:57:20 +00:00
cragwolfe	dfa17bd3a0	fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975 )	2025-04-04 14:38:23 -07:00
qued	3f07840b80	chore: deprecate stage_for_label_studio (#3968 ) This PR is to address [a CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in a recent scan. The CVE has to do with the package `label_studio_sdk`. This relates to the tool Label Studio, a data labeling platform. We built a staging function that takes a list of elements and converts it to a format suitable for passing to the LabelStudio platform. We don't use the package with the vulnerability in the actual function, we only use it to test the output of the function against the Label Studio API schema. Even the test where we use it is sort of questionable in value, since it's really testing the schema against an old version of the LabelStudio API (we are testing against a recording of the Label Studio API's responses stored using `vcrpy`). Label Studio has fixed the vulnerability as of version 1.0.10 of their SDK, but we're stuck on 1.0.5 because 1.0.6 and above require `numpy<2.0.0`. This leaves us with several choices of resolution, some of which are: 1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to resolve the CVE 2. Drop `label_studio_sdk` by either removing or rewriting the test. 3. Drop test and dev dependencies from the `unstructured` image. We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a follow-on PR. Here we add a deprecation notice to `stage_for_label_studio` and remove the offending test. Normally good practice would be to add a warning of future deprecation to the function for a reasonable amount of time, but in order to address the CVE immediately, we're deprecating it right away. ### Testing Install the dependencies (`make install`) into a fresh environment, and `pip list \| grep label` should have no results. The scan artifact in CI should contain no "high" or "critical" CVEs.	2025-03-26 23:37:03 +00:00
Sri Sudarsan	349728162e	Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names (#3959 ) Instead of looking for the presence of `word/document.xml`, `ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX, PPTX and XLSX files, we look for the prefixes `word/document.xml`, `ppt/presentation.xml` and `xl/workbook*.xml`, as certain archives generated by Office 365 contain files with different names. Fixes https://github.com/Unstructured-IO/unstructured/issues/3937 --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2025-03-21 16:27:13 +00:00
Antonio Jose Jimeno Yepes	0fa5174bd7	Image within div or span with no text is annotated as Image (#3962 ) Ticket: https://unstructured-ai.atlassian.net/browse/ML-942 The following uncompressed HTML document can be used to test the transformation using the `partition_html` function from the VLM partitioner. [recalibrating-risk-report.pdf.json.html.zip](https://github.com/user-attachments/files/19330528/recalibrating-risk-report.pdf.json.html.zip)	2025-03-20 04:09:02 +00:00
Yao You	7de630e45e	Feat/bump numpy to 2 (#3961 ) This PR updates a few dependencies so that they are compatible with `numpy>=2`.	2025-03-18 21:33:48 +00:00
Yao You	4e424efd22	feat: use lxml instead of bs4 to parse hOCR data (#3960 ) - `lxml` is a much faster library than `bs4` when the input data is regular - since the hOCR data is guaranteed to be regular (programmatically generated) we don't need `bs4` here to parse the data - `lxml` improves parsing speed by about 10x Example runtime profiling locally using the same `hocr` data from 1 page pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main and `agent.hocr_to_dataframe` is the PR's method. ![Screenshot 2025-03-17 at 12 14 59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)	2025-03-18 00:36:19 +00:00
ryannikolaidis	66bf4b0198	feat: support extracting image url in html (#3955 ) also removes mimetype when base64 is not included in image metadata --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-13 22:41:10 +00:00
Yao You	2dceac34b5	Feat/remove reference of PageLayout.elements (#3943 ) This PR removes usage of `PageLayout.elements` from the partition function, except for when `analysis=True`. This PR updates the partition logic so that `PageLayout.elements_array` is used everywhere to save memory and CPU cost. Since the analysis function is intended for investigation and not for general document processing purposes, this part of the code is left for a future refactor. `PageLayout.elements` uses a list to store layout elements' data while `elements_array` uses a `numpy` array to store the data, which has much lower memory requirements. Measured with `memory_profiler`, the difference is usually around 10x.	2025-03-12 15:21:21 +00:00
Yao You	8759b0aac9	feat: allow passing down of ocr agent and table agent (#3954 ) This PR allows passing down both `ocr_agent` and `table_ocr_agent` as parameters to specify the `OCRAgent` class for the page and tables, if any, respectively. Both default to `tesseract`, consistent with the present default behavior. We used to rely on env variables to specify the agents, but the OS environment can be changed during runtime outside of the caller's control. Passing the agents down as parameters ensures that the specification is independent of env changes. ## testing Use `example-docs/img/layout-parser-paper-with-table.jpg` and run partition with two different settings. Note that this test requires the `paddleocr` extra. ```python from unstructured.partition.auto import partition from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE f = "example-docs/img/layout-parser-paper-with-table.jpg" elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE) elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT) ``` We should see both finish, with slight differences in the table element's text attribute.	2025-03-11 16:36:31 +00:00
ryannikolaidis	0001a33dba	fix: pass extract image args to all partitioners (#3950 ) This is needed in order for the user to specify whether to extract the base64 for images, which are now parsed by the html partitioner. ## Testing Adds test that validates this by calling the auto-partitioner with appropriate arguments partitioning an html file with base64 embedded image.	2025-03-10 04:15:08 +00:00
ryannikolaidis	c0457c1cc3	feat: include images when partitioning html (#3945 ) Currently we [filter img tags](`2addb19473/unstructured/partition/html/partition.py (L226-L229)`) before tags are converted to Elements by the html partitioner. More importantly we also don’t currently have a defined “block” / mapping to support these. This adds these mappings and logic to process. It also respects `extract_image_block_types` and `extract_image_block_to_payload` (as we do with pdfs) to determine whether base64 is included in the metadata. The partitioned Image Elements sets the text to the img tag’s alt text if available. The partitioned Image Elements include the [url in the metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209) (rather than image_base64) if the img tag src is a url. ## Testing unit tests have been added for explicit coverage. existing integration tests and other unit test fixtures have been updated to account for `Image` elements now present --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-08 01:25:21 +00:00
Pluto	74b0647aa2	Fix json bytes content type detection (#3941 ) Fixes order of content type detection strategies for byte-encoded jsons. Before ``` json_bytes = json.dumps([{"example": "data"}]).encode("utf-8") file_buffer = io.BytesIO(json_bytes) detect_filetype(file=file_buffer, metadata_file_path="filename.pdf") ``` Before PDF Now JSON	2025-03-07 10:33:33 +00:00
Nathan Van Gheem	19373de5ff	Enable dynamic file type registration (#3946 ) The purpose of this PR is to enable registering new file types dynamically. The PR enables this through 2 primary functions: 1. `unstructured.file_utils.model.create_file_type` This registers the new `FileType` enum which enables the rest of unstructured to understand a new type of file 2. `unstructured.file_utils.model.register_partitioner` Decorator that enables registering a partitioner function to run for a file type. --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2025-03-06 22:09:42 +00:00
Marek Połom	f333d7fe7f	feat: Json elements to HTML converter (#3936 ) ## NOTE `test_unstructured_ingest/expected-structured-output-html` contains all test HTML fixtures. Original JSON files, from which these HTML fixtures are generated, were taken from `test_unstructured_ingest/expected-structured-output`	2025-03-04 13:57:35 +00:00
Yao You	43b682ad3f	feat: allow extraction of camel cased element type names (#3938 ) This PR allows element types with CamelCase names to be extractable using the `extract_image_block_types` variable. Before: specifying `extract_image_block_types=["NarrativeText"]` (or any casing for `NarrativeText`) would raise a warning that it doesn't match any available types, and no image would be extracted for this element type. Now: specifying `extract_image_block_types=["NarrativeText"]` extracts images for this element type. ## testing ```python from unstructured.partition.auto import partition f = "example-docs/pdf/embedded-images-tables.pdf" elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"]) ``` Without this PR no figures would be extracted. With this PR a local folder would be created to contain images of the narrative text elements in a path like `./figures/figure-1-1.jpg` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2025-03-04 01:33:05 +00:00
Pluto	0df50fe6e8	Fix file detection when spooled file is pased (#3932 ) This pull request fixes the scenario when a SpooledTemporaryFile is passed to detect_filetype. In such cases some weird number was assigned as 'name' (and it couldn't be overwritten as SpooledTemporaryFile can't have fields assigned 😩 ) so I added in our object factory just another scenario where we parse this type of file. For BytesIO the `name` attr is None as it should be, and some other metadata fields are leveraged for file type recognition	2025-02-20 13:00:25 +00:00
Pluto	3973a30b8c	Feat: Add pdfminer parameters configuration (#3918 ) This pull request adds the ability to configure multiple pdfminer parameters (with the simple possibility to extend for the additional parameters). One of the parameters overwrites the default from LA Params config class. Example: ```python3 partition( filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"), pdfminer_line_margin=1.123, pdfminer_char_margin=None, pdfminer_line_overlap=0.0123, pdfminer_word_margin=3.21, ) assert pdfminer_mock.call_args.kwargs == { "line_margin": 1.123, "line_overlap": 0.0123, "word_margin": 3.21, } ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>	2025-02-17 11:41:20 +00:00
Philippe PRADOS	b521bce9c6	Add password with PDF files (#3721 ) Add password with PDF files Must be combined with [PR 392 in unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392) --------- Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>	2025-02-11 17:39:16 +00:00
Roman Isecke	92be4eb2dd	bugfix/fix ndjson detection (#3905 ) ### Description NDJSON files were being detected as JSON due to having the same mime-type. This adds additional logic to skip mime-type based detection if extension is `.ndjson`	2025-02-11 14:21:28 +00:00
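The ordering described in the commit above, where the extension is checked before any MIME-type based guess, can be sketched like this. It is an illustration under assumptions, not the library's detection code: `detect_json_flavor` is a hypothetical name, and the string return values stand in for the library's file type enum.

```python
import os


def detect_json_flavor(path, mime_type):
    """Hypothetical helper: tell NDJSON from JSON despite a shared MIME type."""
    extension = os.path.splitext(path)[1].lower()
    # Check the extension first: both formats report the same MIME type,
    # so the MIME type alone cannot separate them.
    if extension == ".ndjson":
        return "ndjson"
    if mime_type == "application/json":
        return "json"
    return None
```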
qued	b10379c14c	Fix: plug security issue partition system files via include (#3908 ) #### Summary A recent security review showed that it was possible to partition arbitrary local files in cases where the filetype supports an "include" functionality that brings in the content of files external to the partitioned file. This affects `rst` and `org` files. #### Fix This PR fixes the above issue by passing the parameter `sandbox=True` in all cases where `pypandoc.convert_file` is called. Note I also added the parameter to a call to this method in the ODT code. I haven't investigated whether there was a security issue with ODT files, but it seems better to use pandoc in sandbox mode given the security issues we know about. #### Testing To verify that the tests that are added with this PR find the relevant issue: - Remove the `sandbox=True` text from `unstructured/file_utils/file_conversion.py` line 17. - Run the tests `test_unstructured.partition.test_rst.test_rst_wont_include_external_files` and `test_unstructured.partition.test_org.test_org_wont_include_external_files`. Both should fail due to the partitioning containing the word "wombat", which only appears in a file external to the partitioned file. - Add the parameter back in, and the tests pass.	2025-02-06 03:27:18 +00:00
Pluto	5bb95b5841	Fix parsing table cells (#3904 ) This PR: - Fixes removing HTML tags that exist in <td> cells - stripping function was in general problematic to implement in easy and straightforward way (you can't modify `descendants` in-place). So I decided instead of patching something in table cell I added stripping everywhere in the same consistent way. This is why some tests needed small edits with removing one white-space in each tag. I believe this won't cause any problems for downstream tasks. Tested HTML: ```html <table class="Table"> <tbody> <tr> <td colspan="2"> Some text </td> <td> <input checked="" class="Checkbox" type="checkbox"/> </td> </tr> </tbody> </table> ``` Before & After ```html '<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>' '<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'' ```	2025-02-05 15:28:49 +00:00
Yao You	9d58b34ab4	Fix/fix table id checking logic (#3898 ) - there is a bug in deciding if a page has tables before performing table extraction. This logic checks if the id associated with Table type element is True - however, it should be checking if the id is `None` because sometimes the id can be 0 (the first type of element in the page) - the fix updates the logic - adds a unit test for this specific case	2025-01-31 10:19:14 -08:00
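The bug described in the commit above comes from testing an integer id for truthiness. A minimal sketch of the difference, using hypothetical function names and assuming the id is either an integer or `None`:

```python
def page_has_tables_buggy(table_class_id):
    """Truthiness check: wrongly reports no tables when the id is 0."""
    return bool(table_class_id)


def page_has_tables(table_class_id):
    """Identity check: only a missing id (None) means there are no tables."""
    return table_class_id is not None
```

When `Table` happens to be the first element type on the page, its id is 0, which is falsy, so only the `is not None` check gives the right answer.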
Yao You	a9ff1e70b2	Fix/fix ocr region to elements bug (#3891 ) This PR fixes a bug in `build_layout_elements_from_ocr_regions` where texts are joined in the incorrect order. The bug is due to incorrect masking of the `ocr_regions` after some are already selected as one of the final groups. The fix uses a simpler method: the same indices that add the regions to the final groups are used to mask them, so they are not considered again. ## Testing This PR adds a unit test specifically aimed at this bug. Without the fix the test would fail. Additionally, any PDF file with repeated texts has the potential to trigger this bug. E.g., create a simple PDF using the test text ```python "LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning" ``` and partition with `ocr_only` mode. On the main branch this would hit the bug and output text where the position of the second "LayoutParser" is incorrect. ```python [ 'LayoutParser:', 'A Unified Toolkit for Deep Learning Based Document Image', 'for Deep Learning LayoutParser', ] ```	2025-01-29 12:11:17 +00:00


744 Commits