Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh
Updated ingest tests procedure:
* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
* change strategy arg defalut to auto in partition
* passing --partition-strategy down
* add strategy="hi_res" to test (default changed)
* made an error on param name, added note
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
Add GitLab data connector for ingest.
Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.
Prevent code duplication for functionality between GitHub and GitLab ingest connectors.
Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.
These work for GitHub and GitLab.