-c ./deps/constraints.txt
chardet
filetype
python-magic
lxml
nltk
requests
beautifulsoup4
emoji
dataclasses-json
python-iso639
langdetect
numpy
rapidfuzz
backoff
# typing-extensions is listed explicitly because unstructured.documents.elements
# imports from it directly; it was previously only an implicit dependency via
# dataclasses-json (#1835, 2023-10-24).
typing-extensions
unstructured-client
wrapt
tqdm
psutil
python-oxmsg
# html5lib: line last changed by "Add parsing HTML to unstructured elements"
# (#3732, 2024-10-23).
html5lib