* Many command line options added. The sample ingest project is now an easy to use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.
* add metadata field to elements
* metadata tracking for pdf/image
* metadata for html
* update expected outputs
* metadata for the rest of the document types
* take out file metadata for now
* add url to tables
* added metadata to test_auto
* bump version
* added coordinates to __init__
* fix coordinates in tests
* Update xml.py
remove comments while parsing
* change logged in CHANGLOG and editted version
* make tidy
* editted version
* new version 0.4.8-dev1
* editted version
* Update CHANGELOG.md
Co-authored-by: cragwolfe <crag@unstructuredai.io>
---------
Co-authored-by: cragwolfe <crag@unstructuredai.io>
* added to_dict to elements
* first training notebook
* bump changelog, rerun notebook
* remove coordinates and id
* rerun notebook
* has -> have
* partitioning -> partition
* various and sundry typos
* switch to using convert_to_isd
* added function for exploring a list of files
* file info from file contents
* added tests for file info from contents
* bump version and add tests
* add dev to version
* page breaks for pptx
* added page breaks for image/pdf
* tests for images with page breaks
* page breaks for html documents
* linting, linting, linting
* changelog and bump version
* update docs
* fix typo
* refactor reusable code to common.py
* add type back in
* feature: adding a feature for customizing color theme of sphinx docs
* fix: adding changelog and comments
* Adding css for changing colors of sidebar
* fix: removing changelog description
* order the shapes top to bottom and left to right
* added tests for ordering
* update change log and bump version
* more tests
* don't need enumerate
* n -> on
* add a bigger list of english words
* update thresholds and add tests
* update docs; bump version
* fix version
* add additional english words back in
* linting, linting, linting
* add slashes
* work -> word
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
* helper function to convert to element
* test for element types
* fix for healthcheck url
* version bump
* note on coordinates
* mention FigureCaption
* test_shared -> test_common
* add check boxes for checkbox template
* update changelog
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.