11 Commits

Author SHA1 Message Date
Christine Straub
d26ab1deac
fix: etree parser error (#1077)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version
2023-08-10 23:28:57 +00:00
Matt Robinson
15618e8346
fix: handling for empty tables in word docs and powerpoints (#982)
* fix table index error

* changelog and version
2023-07-27 11:07:27 -04:00
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827)
2023-07-05 11:25:11 -07:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
Matt Robinson
c1ba090c34
fix: suppress file conversion warnings in convert_office_doc (#703)
* test that output is suppressed

* add test for error output

* changelog and version
2023-06-08 12:33:06 -04:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414)
2023-04-05 02:59:06 +00:00
Matt Robinson
69661788cf
fix: track narrative text and figure captions in HTML documents (#309)
* fix for missing narrative text in partition_html

* fixes so existing tests pass

* tests for figure caption and narrative text

* bump version; changelog
2023-02-28 15:36:08 +00:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match 0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
74e6b84b41
feat: add metadata tracking to document elements (#225)
* add metadata field to elements

* metadata tracking for pdf/image

* metadata for html

* update expected outputs

* metadata for the rest of the document types

* take out file metadata for now

* add url to tables

* added metadata to test_auto

* bump version

* added coordinates to __init__

* fix coordinates in tests
2023-02-15 18:26:20 +00:00
Matt Robinson
8d3e616846
feat: add ability to parse LayoutElement lists (#165)
* added ability to split list items

* changelog and version bump

* retrigger ci
2023-01-20 08:55:11 -05:00
Matt Robinson
c1822911a5
chore: return Element objects in partition_pdf and partition_image (#164)
* helper function to convert to element

* test for element types

* fix for healthcheck url

* version bump

* note on coordinates

* mention FigureCaption

* test_shared -> test_common

* add check boxes for checkbox template

* update changelog
2023-01-19 14:29:28 +00:00