6 Commits

Author SHA1 Message Date
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars (#697) 2023-06-08 04:48:49 +00:00
Trevor Bossert
830d67f653
Feat: Discord connector (#515)
* Initial commit of discord connector

based off of initial work by @tnachen with modifications

https://github.com/tnachen/unstructured/tree/tnachen/discord_connector

* Add test file

change format of imports

* working version of the connector

More work to be done to tidy it up and add any additional options

* add to test fixtures update

* fix spacing

* tests working, switching to bot testing channel

* add additional channel

add reprocess to tests

* add try clause to allow for exit on error

Update changelog and bump version

* add updated expected output filtes

* add logic to check if —discord-period is an integer

Add more to option description

* fix lint error

* Update discord reqs

* PR feedback

* add newline

* another newline

---------

Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>
2023-05-16 11:46:30 -07:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500) 2023-04-19 18:16:44 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, the fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 .
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00