unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Steve Canny	9ece0b5ad2	fix: improve false-positive Title elements on Chinese text (#3836 ) Summary Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements. Fixes #3084 --------- Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2024-12-18 01:16:42 +00:00
Steve Canny	bba60260b2	rfctr(part): remove double-decoration 1 (#3685) Summary Install new `@apply_metadata()` on CSV and DOCX and remove decoration from DOC and ODT. Additional Context - Working in alphabetical order and keeping PR size manageable, replace use of `@process_metadata()` and `@add_metadata_with_filetype()` decorators with `@apply_metadata()` on principal partitioners (those that do not delegate to other partitioners). - Remove all decorators from delegating partitioners (DOC and ODT in this case); this removes the "double-decorating".	2024-10-01 22:40:58 +00:00
Steve Canny	44bad216f3	rfctr(part): prepare for pluggable auto-partitioners 3 (#3661 ) Summary Remove unused `include_metadata` parameter. Additional Context - The `include_metadata` parameter was originally added circa v0.7.12 as a mechanism for avoiding the "double-decorating" problem on delegating partitioners. - It turns out it doesn't fully address that problem, is now unused, and is unnecessary for the solution we'll be adding as part of pluggable partitioners. - Remove the unnecessary complexity introduced by this unused parameter.	2024-09-25 18:17:48 +00:00
Steve Canny	3bab9d93e6	rfctr(part): prepare for pluggable auto-partitioners 1 (#3655) Summary In preparation for pluggable auto-partitioners simplify metadata as discussed. Additional Context - Pluggable auto-partitioners require partitioners to have a consistent call signature. An arbitrary partitioner provided at runtime needs to have a call signature that is known and consistent. Basically `partition_x(filename, *, file, **kwargs)`. - The current `auto.partition()` is highly coupled to each distinct file-type partitioner, deciding which arguments to forward to each. - This is driven by the existence of "delegating" partitioners, those that convert their file-type and then call a second partitioner to do the actual partitioning. Both the delegating and proxy partitioners are decorated with metadata-post-processing decorators and those decorators are not idempotent. We call the situation where those decorators would run twice "double-decorating". For example, EPUB converts to HTML and calls `partition_html()` and both `partition_epub()` and `partition_html()` are decorated. - The way double-decorating has been avoided in the past is to avoid sending the arguments the metadata decorators are sensitive to to the proxy partitioner. This is very obscure, complex to reason about, error-prone, and just overall not a viable strategy. The better solution is to not decorate delegating partitioners and let the proxy partitioner handle all the metadata. - This first step in preparation for that is part of simplifying the metadata processing by removing unused or unwanted legacy parameters. - `date_from_file_object` is a misnomer because a file-object never contains last-modified data. - It can never produce useful results in the API where last-modified information must be provided by `metadata_last_modified`. - It is an undocumented parameter so not in use. - Using it can produce incorrect metadata.	2024-09-23 22:23:10 +00:00
Steve Canny	77a9e1b54d	rfctr(html): drop convert_and_partition_html() (#3215 ) Summary Remove `unstructured.partition.html.convert_and_partition_html()`. Move file-type conversion (to HTML) responsibility to each brokering partitioner that uses that strategy and let them call `partition_html()` for themselves with the result. Additional Context Rationale: - `partition_html()` does not want or need to know which partitioners might broker partitioning to it. - Different brokering partitioners have their own methods to convert their format to HTML and quirks that may be involved for their format. Avoid coupling them so they can evolve independently. - The core of the conversion work is already encapsulated in `unstructured.partition.common.convert_file_to_html_text_using_pandoc()`. - `convert_and_partition_html()` represents an additional brokering layer with the entailed complexities of an additional site for default parameter values to be (mis-)applied and/or dropped and is an additional location for new parameters to be added.	2024-06-17 19:43:18 +00:00
Matt Robinson	08383a27de	build: pull from wolfi base image (#3213 ) ### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.	2024-06-14 20:41:27 +00:00
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Steve Canny	b4ee019170	rfctr: flatten test_unstructured/partition (#3073) Summary Some partitioner test modules are placed in directories by themselves or with one other test module. This unnecessarily obscures where to find the test module corresponding to a partitioner. Move partitioner test modules to mirror the directory structure of `unstructured/partition`.	2024-05-23 00:51:08 +00:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00
kravetsmic	bef93aef6e	fix: email addresses shouldn't be flagged as titles (#957) * feat: add func for checking on EmailAddress type * feat: add EmailAddress type * feat: add check for email type * feat: add test for checking EmailAddress type * feat: update existing example files with email * feat: add new example files with email in the text * fix: apply linter * feat: update changelog file * feat: add test for is_email_address function * don't push * fix: clean up code * apply linter * fix: clean up * fix: remove file changes * fix: remove not used files for email address test * fix: remove not necessary tests * clean up * fix: apply linter * fix: update CHANGELOG * fix: change version * fix: fix msg test * fix: apply linter for tests * fix: remove spaces * fix: apply linter with longer line * feat: update documentation * fix: remove duplicates * Update getting_started.rst --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-08-04 11:28:36 -04:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * constraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-compile * make tidy * make tidy	2023-08-01 11:31:13 -04:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
Matt Robinson	d9aed66b65	feat: add document date for remaining file types (#930) (#969) * feat: add document date for remaining file types (#930) * feat: add functions for getting modification date * feat: add date field to metadata from csv file * feat: add tests for csv partition * feat: add date field to metadata from html file * feat: add tests for html partition * fix: return file name only if possible * feat: add csv tests * fix: renaming * feat: add field metadata_date as date of last mod * feat: add tests for partition_docx * feat: add field metadata_date to .doc file * feat: add tests for partition_doc * feat: add metadata_date to .epub file * feat: add tests for partition_epub * fix: fix test mocking * feat: add metadata_date for image partition * feat: add test for image partition * feat: add coordinate system argument * feat: add date to element metadata * feat: add metadata_date for JSON partition * feat: add test for JSON partition * fix: rename variable * feat: add metadata_date for md partition * feat: add test for md partition * feat: update doc string * feat: add metadata_date for .odt partition * feat: update .odt string * feat: add metadata_date for .org partition * feat: add tests for .org partition * feat: add metadata_date for .pdf partition * feat: add tests for .pdf partition * feat: add metadata_date for .pptx partition * feat: add metadata_date for .ppt partition * feat: add tests for .ppt partition * feat: add tests for .pptx partition * feat: add metadata_date for .rst partition * feat: add tests for .rst partition * fix: get modification date after file checking * feat: add tests for .rtf partition * feat: add tests for .rtf partition * feat: add metadata_date for .txt partition * fix: rename argument * feat: add tests for .txt partition * feat: update doc string rst partition function * feat: add metadata_date for .tsv partition * feat: add tests for .tsv partition * feat: add metadata_date for .xlsx partition * feat: add tests for .xlsx partition * fix: clean up * feat: add tests for .xml partition * feat: add tests for .xml partition * fix: use `or ` instead of `if` * fix: fix epub tests * fix: remove not used code * fix: add try block for getting file name * fix: applying linter changes * fix: fix test_partition_file * feat: add metadata_date for email * feat: add test for email partition * feat: add metadata_date for msg * feat: add tests for msg partition * feat: update CHANGELOG file * fix: update partitions doc string * don't push * fix: clean up code * linting, linting, linting * remove unnecessary example doc * update version and changelog * ingest-test-fixtures-update * set metadata date in test --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> * ingest-test-fixtures-update * Update ingest test fixtures (#970) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * Revert "Update ingest test fixtures (#970)" This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2. * remove date from metadata in outputs * update docstring ordering * remove print * remove print * remove print * linting, linting, linting * fix version and test * fix changelog * fix changelog * update version --------- Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-07-26 15:10:14 -04:00
Matt Robinson	0d332743eb	fix: enable passing filters to `partition_doc` for libreoffice conversion (#934 ) * add optional filter to docx conversion * add filters to tests * changelog and version * update filter for power point	2023-07-17 13:54:44 -04:00
John	dc6d7d7268	feat: add metadata_filename parameter across all partition functions (#811 ) * fix conflicts * add tests and clean metadata_filename in partitions * fix test_email and remove comments * make tidy/check * update changelog and version * fix tests * make tidy again	2023-07-05 16:02:22 -04:00
John	e9fdbb0943	feat: add include_metadata across all partition functions (#853 ) * add include_metadata kwarg and tests to parsers add exclude_metadata to docx add test for doc to exclude metadata add include_metadata kwarg to email add include_metadata kwarg to epub add include_metadata kwarg to json add exclude_metadata tests to md add include_metadata kwarg and tests for msg parse add include_metadata kwarg and tests for odt parse add include_metadata kwarg and tests for org parse add include_metadata kwarg and tests for ppt and pptx parse add include_metadata kwarg and tests for rst parse add include_metadata kwarg and tests for rtf parse add include_metadata tests for text parse add include_metadata tests for tsv parse add include_metadata tests for xlsx parse add include_metadata tests for xml parse * WIP add include_metadata to partition_pdf * add include_metadata tests to partition_pdf * make tidy/check * update changelog and version * change test asserts and move docstring logic to process_metadata * make tidy * fix tests asserts * linting, linting, linting * sync versions * skip api call test not on main --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-06-30 10:44:46 -04:00
Matt Robinson	c1ba090c34	fix: suppress file conversion warnings in `convert_office_doc` (#703 ) * test that output is suppressed * add test for error output * changelog and version	2023-06-08 12:33:06 -04:00
Matt Robinson	bd6a8a3a40	enhancement: add `file_directory` to element metadata (#585 ) * enhancement: add `file_directory` to element metadata * update msg test * exclude file_directory * update slack output * added file directory tests on partition_x paths	2023-05-15 18:25:39 -04:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00

20 Commits
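
The "double-decorating" problem described in commit 3bab9d93e6 can be illustrated with a minimal sketch. Everything here is hypothetical and written only for illustration: `Element`, `apply_metadata_sketch`, the `passes` counter, and the `*_sketch`/`*_double`/`*_single` functions are not the library's real classes, decorators, or partitioners. The sketch assumes only what the commit message states: metadata decorators are not idempotent, and the proposed fix is to leave delegating partitioners undecorated so the proxy partitioner handles all metadata.

```python
import functools
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""
    text: str
    metadata: dict = field(default_factory=dict)


def apply_metadata_sketch(filetype):
    """Hypothetical metadata post-processing decorator.

    It is deliberately NOT idempotent: every pass bumps a counter, so
    running it twice on the same elements is observable.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(text, *, metadata_file_type=None, **kwargs):
            elements = func(text, **kwargs)
            for element in elements:
                element.metadata["passes"] = element.metadata.get("passes", 0) + 1
                # An explicit override wins over this decorator's default.
                element.metadata["filetype"] = metadata_file_type or filetype
            return elements
        return wrapper
    return decorator


@apply_metadata_sketch("text/html")
def partition_html_sketch(text):
    """Principal partitioner: one element per non-empty line."""
    return [Element(line.strip()) for line in text.splitlines() if line.strip()]


@apply_metadata_sketch("application/epub+zip")
def partition_epub_double(text):
    """Delegating partitioner that is ALSO decorated (the problem case)."""
    return partition_html_sketch(text)


def partition_epub_single(text):
    """Delegating partitioner left undecorated; the proxy does all metadata."""
    return partition_html_sketch(text, metadata_file_type="application/epub+zip")


sample = "Chapter 1\nIt was a dark night."
double = partition_epub_double(sample)
single = partition_epub_single(sample)
print([e.metadata["passes"] for e in double])  # metadata processed twice
print([e.metadata["passes"] for e in single])  # metadata processed once
```

In the double-decorated variant each element passes through metadata processing twice, while the undecorated delegating variant forwards the file type it wants recorded and lets the proxy apply metadata exactly once.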