mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2026-01-06 04:11:08 +00:00
fix: set resolve_entities=False in partition_xml (#3088)
### Summary

Closes #3078. Sets `resolve_entities=False` for parsing XML with `lxml` in `partition_xml` to avoid text being dynamically injected into the document.

### Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with the update.
parent 9b83330b5a
commit 171b5df09f
@@ -1,4 +1,4 @@
-## 0.14.3-dev1
+## 0.14.3-dev2
 
 ### Enhancements
 
@@ -8,6 +8,8 @@
 
 ### Fixes
 
+**Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+to avoid text being dynamically injected into the XML document.
 * Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.
 
 ## 0.14.2
@@ -1 +1 @@
-__version__ = "0.14.3-dev1" # pragma: no cover
+__version__ = "0.14.3-dev2" # pragma: no cover
@@ -51,7 +51,7 @@ def _get_leaf_elements(
     """Parse the XML tree in a memory efficient manner if possible."""
     element_stack = []
 
-    element_iterator = etree.iterparse(file, events=("start", "end"))
+    element_iterator = etree.iterparse(file, events=("start", "end"), resolve_entities=False)
     # NOTE(alan) If xml_path is used for filtering, I've yet to find a good way to stream
     # elements through in a memory efficient way, so we bite the bullet and load it all into
     # memory.
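To illustrate what the `resolve_entities=False` change guards against, here is a minimal sketch that is not part of the commit. It assumes `lxml` is installed; the sample document, the `injected` entity, and the `leaf_texts` helper are all hypothetical names made up for this example. It parses the same bytes twice with `etree.iterparse` and compares the leading text of the `<item>` element with entity resolution on and off.

```python
import io

from lxml import etree

# Hypothetical sample document: the DTD declares an internal entity whose
# replacement text would be spliced into <item> if entities are resolved.
XML_WITH_ENTITY = b"""<!DOCTYPE root [
  <!ENTITY injected "INJECTED TEXT">
]>
<root><item>before &injected; after</item></root>
"""


def leaf_texts(xml_bytes, resolve_entities):
    """Return the .text of each <item> element seen by iterparse.

    Hypothetical helper for illustration only; it is not the
    _get_leaf_elements function touched by the diff.
    """
    texts = []
    iterator = etree.iterparse(
        io.BytesIO(xml_bytes),
        events=("start", "end"),
        resolve_entities=resolve_entities,
    )
    for event, element in iterator:
        # Read text on "end", once the element has been fully parsed.
        if event == "end" and element.tag == "item":
            texts.append(element.text or "")
    return texts


if __name__ == "__main__":
    # With resolution on, the entity's replacement text becomes part of the element text.
    print(leaf_texts(XML_WITH_ENTITY, resolve_entities=True))
    # With resolution off, the reference stays an unexpanded entity node.
    print(leaf_texts(XML_WITH_ENTITY, resolve_entities=False))
```

With resolution disabled, the replacement text declared in the DTD never reaches the extracted element text, which appears to be the behavior the changelog entry describes as avoiding text being "dynamically injected" into the document.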