4 Commits

Author SHA1 Message Date
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
John
e9fdbb0943
feat: add include_metadata across all partition functions (#853)
* add include_metadata kwarg and tests to parsers

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse

* WIP add include_metadata to partition_pdf

* add include_metadata tests to partition_pdf

* make tidy/check

* update changelog and version

* change test asserts and move docstring logic to process_metadata

* make tidy

* fix tests asserts

* linting, linting, linting

* sync versions

* skip api call test not on main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-30 10:44:46 -04:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesnt need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00