fix: set resolve_entities=False in partition_xml (#3088)

### Summary

Closes #3078. Sets `resolve_entities=False` when parsing XML with `lxml`
in `partition_xml` so that entity references are not expanded, which
prevents text from being dynamically injected into the document.
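
As a rough illustration of the behavior this flag controls, the sketch below streams a small document with `etree.iterparse` twice, once with entity resolution on and once with it off. The sample document and the `body_texts` helper are hypothetical and are not part of `partition_xml`; only the `resolve_entities` keyword mirrors the change in this commit.

```python
from io import BytesIO
from typing import List

from lxml import etree

# Hypothetical sample document: the internal DTD declares an entity whose
# replacement text is not literally present in the <body> element.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY injected "text pulled in from the DTD">
]>
<note><body>Hello &injected;</body></note>
"""


def body_texts(data: bytes, resolve_entities: bool) -> List[str]:
    """Collect the .text of every <body> element seen while streaming the document."""
    texts = []
    for _event, element in etree.iterparse(
        BytesIO(data), events=("end",), resolve_entities=resolve_entities
    ):
        if element.tag == "body":
            texts.append(element.text or "")
    return texts


if __name__ == "__main__":
    print("resolve_entities=True: ", body_texts(SAMPLE, True))
    print("resolve_entities=False:", body_texts(SAMPLE, False))
```

With resolution on, the body text should include the entity's replacement text. With it off, the reference is left unexpanded, so that text should not appear in `element.text`.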

### Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with
the update.
Matt Robinson 2024-05-23 14:38:11 -04:00 committed by GitHub
parent 9b83330b5a
commit 171b5df09f
3 changed files with 5 additions and 3 deletions

@@ -1,4 +1,4 @@
-## 0.14.3-dev1
+## 0.14.3-dev2
 ### Enhancements
@@ -8,6 +8,8 @@
 ### Fixes
+**Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+to avoid text being dynamically injected into the XML document.
 * Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.
 ## 0.14.2

@@ -1 +1 @@
-__version__ = "0.14.3-dev1"  # pragma: no cover
+__version__ = "0.14.3-dev2"  # pragma: no cover

@@ -51,7 +51,7 @@ def _get_leaf_elements(
     """Parse the XML tree in a memory efficient manner if possible."""
     element_stack = []
-    element_iterator = etree.iterparse(file, events=("start", "end"))
+    element_iterator = etree.iterparse(file, events=("start", "end"), resolve_entities=False)
     # NOTE(alan) If xml_path is used for filtering, I've yet to find a good way to stream
     # elements through in a memory efficient way, so we bite the bullet and load it all into
     # memory.