* add encoding to elements_to_json and elements_from_json
* version and changelog
* add new test
* fix version
* revert test file
* blank line to test
* no blank line
* feat: if no params show help
* Remove comments
* feat: update checking params
* updated main script and changelog
* version bump
---------
Co-authored-by: yuming <305248291@qq.com>
* feat: nb elasticsearch unstructured sentiment
* chore: refactor readme for elasticsearch nb
* fix: update es-credentials.ini
* chore: update es-credentials.ini
* fix: typo in nb load-into-es.ipynb
exist --> exists
* fix: typo 2 in nb load-into-es.ipynb
obtaing --> obtain
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.
Change auto.py to have a None default for encoding
Remove the unused parameter encoding from partition_pdf
Add functionality to the read_txt_file utility function to handle file-like object from URL
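A minimal sketch of the encoding fallback described above, not the code from this change. The helper name decode_with_fallback and the candidate list COMMON_ENCODINGS are hypothetical. The idea is that fallback only happens when the caller passed no encoding; an explicit encoding is used as given and any error propagates.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the encodings actually tried by the library may differ.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes and return (encoding_used, text)."""
    if encoding is not None:
        # The user chose an encoding: respect it and let decode errors surface.
        return encoding, data.decode(encoding)

    last_error: Optional[UnicodeDecodeError] = None
    for candidate in COMMON_ENCODINGS:
        try:
            return candidate, data.decode(candidate)
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    raise ValueError("could not decode with any candidate encoding") from last_error
```

With encoding=None as the default (as in the auto.py change), callers that do not care about the encoding get the fallback behavior automatically.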
tabulate is used by functions that extract tables from Microsoft documents, but there is nothing explicitly requiring the library. This was not caught by tests, because for some reason, tabulate is in base.txt.
This PR adds the dependency to base.in (which also puts it in setup.py), and recompiles the dependencies.
* build from Rocky linux unstructured base image
* add qemu for arm
* comment out push while testing
* remove quotes
* Add arch
* bump login action
* add ARCH env var to the push step
* run only subset of tests on arm image
Tests on emulated arm are extremely slow. The likelihood of something breaking only in the arm image is minimal. I say that knowing I likely just jinxed us.
* re-enable push from main
* add a dnf cleanup
* version bump
* move from dev to minor version bump
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the
encoding is raised and the user has not specified an encoding.
* docker works
* more epub tests
* changelog version
* support epub + odt + rtf
* update dockerfile
* revert..
* install pandoc on ci env
* pandoc docker grab based on arch
* move arch into image
* move back to base image
Add support for older .XLS files from the partition function in unstructured.partition.auto.
Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).
Addresses #631.
* Uses constraints to keep dependency versions more consistent.
* Moves all dependencies to .in files which are then ingested by setup.py.
* Adds script to check consistency of all extras.
* Adds consistency check to CI.
I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras), it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target), but the ones I could think of seemed a little overwrought and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt.
Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly.
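The ordering rule above amounts to a topological sort over the "is constrained by" relation. The sketch below is illustrative only and is not part of this PR. The function compile_order and its input mapping are hypothetical, and it assumes Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def compile_order(constrained_by: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every .in file comes after the files that constrain it.

    constrained_by maps each .in file to the set of .in files whose compiled
    .txt output it uses as a constraint (hypothetical input format).
    """
    # TopologicalSorter expects node -> predecessors, which matches the mapping:
    # a constraining file must be compiled before the file it constrains.
    return list(TopologicalSorter(constrained_by).static_order())


# fileA.in is constrained by fileB.txt, so fileB.in must be compiled first.
order = compile_order({"fileA.in": {"fileB.in"}, "fileB.in": set()})
print(order)
```

A cycle in the mapping raises graphlib.CycleError, which would indicate that the constraint setup itself needs to be untangled.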
Makes the bounding box coordinates available when using fast strategy.
* Refactored partition_text to make the workflow of categorizing an element purely from the text available without running the entirety of partition_text.
* Transformed the coordinates from pdf space into pixel space to be consistent with hi_res. We will probably want to revisit the coordinate system soon.
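As a rough illustration of the coordinate change, the sketch below converts a point from PDF space to pixel space. It assumes the common conventions (PDF space has its origin at the bottom-left with units of 1/72 inch; pixel space has its origin at the top-left with y growing downward). The function name pdf_to_pixel and the dpi parameter are hypothetical, and the exact convention used in this change is not stated here.

```python
from typing import Tuple


def pdf_to_pixel(
    x_pts: float,
    y_pts: float,
    page_height_pts: float,
    dpi: float = 72.0,
) -> Tuple[float, float]:
    """Convert a point from PDF space to pixel space (assumed conventions)."""
    # PDF units are points (1/72 inch), so this factor gives pixels per point.
    scale = dpi / 72.0
    # Flip the y axis: PDF y is measured from the bottom, pixel y from the top.
    return x_pts * scale, (page_height_pts - y_pts) * scale


# A point one inch from the left and one inch from the top of a US Letter page,
# rendered at a hypothetical 144 dpi.
print(pdf_to_pixel(72.0, 720.0, page_height_pts=792.0, dpi=144.0))
```

A bounding box converts the same way, one corner at a time; note that the flip swaps which corner is the "top" one.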
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesn't need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
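For readers unfamiliar with what a keep-tags option might mean in practice, here is a small standalone sketch using only the standard library. It is not the partition_xml implementation. The function xml_to_text_lines is hypothetical and only borrows the xml_keep_tags parameter name from the rename above.

```python
import xml.etree.ElementTree as ET
from typing import List


def xml_to_text_lines(xml_string: str, xml_keep_tags: bool = False) -> List[str]:
    """Split an XML document into text chunks, optionally keeping the markup."""
    if xml_keep_tags:
        # Keep the raw markup: return the non-empty source lines unchanged.
        return [line.strip() for line in xml_string.splitlines() if line.strip()]
    # Drop the markup: collect the text content of every element in document order.
    root = ET.fromstring(xml_string)
    return [text.strip() for text in root.itertext() if text.strip()]


doc = "<a><b>hello</b><c>world</c></a>"
print(xml_to_text_lines(doc))
print(xml_to_text_lines(doc, xml_keep_tags=True))
```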
* Update detect_filetype() to use hashmap for mime type return
* fix: text mime type and linting
* fix: declare docx and xlsx mime types locally and also fix linting
* Update CHANGELOG.md
* tweaks for failing tests
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
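A hedged sketch of the hashmap idea: replace a chain of if/elif branches with a single dictionary lookup. The FileType enum, the FILETYPE_TO_MIMETYPE mapping, and the mime_type_for helper below are hypothetical and do not mirror the library's actual definitions. The MIME strings themselves are the standard registered types.

```python
from enum import Enum
from typing import Dict, Optional


class FileType(Enum):
    # Hypothetical subset of file types, for illustration only.
    PDF = "pdf"
    TXT = "txt"
    DOCX = "docx"
    XLSX = "xlsx"


# One lookup table instead of a branch per file type.
FILETYPE_TO_MIMETYPE: Dict[FileType, str] = {
    FileType.PDF: "application/pdf",
    FileType.TXT: "text/plain",
    FileType.DOCX: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    FileType.XLSX: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def mime_type_for(filetype: FileType) -> Optional[str]:
    """Return the MIME type for a file type, or None if it is not in the table."""
    return FILETYPE_TO_MIMETYPE.get(filetype)


print(mime_type_for(FileType.DOCX))
```

Adding a new file type then means adding one dictionary entry rather than another branch.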