* added method for extracting datetime
* change filename metadata to the base filename
* fix filename metadata for msg
* changelog and bump version
* fix expected structured output
* newline back in file
* reset outpout file
* update filename output
* update test fixtures
* update fixture
* pip-compile new reqs
* bump inference version
* add language to pdf and image calls
* tests for passing in language
* version bump and changelog
* update docs
* pass ocr_languages in auto
* updated test fixtures
* typo in doc string
* group broken paragraphs with fast strategy
* changelog and version
* fix broken tests for text.py
* formatting for paragraph pattern re
* fix test
* fix whitespace substitution
* one more test tweak
* blurb to account for short lines
* fix for shorter paragraphs
* update changelog
* remove extra line break from auto
* retrigger ci
* trying skipping azure
* skip azure (test)
* updated github and azure fixtures
* update slack fixture
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, the fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 .
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
* fix: ensure all text is maintained in html pages
* add back in replace unicode quotes
* changelog and version bump
* apt-get update in ci
* white space differences in output
* added type to text element map
* add element_id and coordinates
* added test for serialization
* added serialization for check boxes
* add dict_to_elements and covert_to_dict aliases
* helpers for serializing and deserializing elements
* bump version; changelog
* add Text to tests
* aliases for isd functions
* remove test elements json
* changelog updates
* make indent a kwarg
* update expected structured output
* docs update
* use new function in ingest code
* pop coordinates due to floating point differences
* pop coordinates
* Many command line options added. The sample ingest project is now an easy to use CLI (no code editing
necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where text fixtures had been truncated
* Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
to make room for future connector outputs.
* add metadata field to elements
* metadata tracking for pdf/image
* metadata for html
* update expected outputs
* metadata for the rest of the document types
* take out file metadata for now
* add url to tables
* added metadata to test_auto
* bump version
* added coordinates to __init__
* fix coordinates in tests