9 Commits

Author SHA1 Message Date
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827) 2023-07-05 11:25:11 -07:00
ryannikolaidis
62e20442df
chore: refactor ingest tests (#814)
- Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth
- Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory
- Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository
- Construct paths as reusable variables declared at top of scripts
- Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance)
- OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind
- Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true
- Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)
2023-06-29 23:13:41 +00:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
ryannikolaidis
2094b976cf
feat: adds data_source metadata to ElementMetadata (#690) 2023-06-07 21:22:18 -07:00
Matt Robinson
bd6a8a3a40
enhancement: add file_directory to element metadata (#585)
* enhancement: add `file_directory` to element metadata

* update msg test

* exclude file_directory

* update slack output

* added file directory tests on partition_x paths
2023-05-15 18:25:39 -04:00
Yuming Long
5b6f11bb88
Chore(ingest): Add --partition-strategy parameter in CLI (#582)
* change strategy arg defalut to auto in partition

* passing --partition-strategy down

* add strategy="hi_res" to test (default changed)

* made an error on param name, added note
2023-05-15 19:26:53 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
natygyoon
c16862e7b3
feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368)
* added metadata in/exclude params

* updated process_file

* existing tests

* remove default behavior

* changelog and ci

* line length

* import

* import

* import sorted

* import

* type

* line length

* main

* ci

* json

* dict

* type ignore

* lint

* unit tests for process_file

* lint

* type changed to Optional(str)

* ci

* line length

* added mutex check

* nit
2023-03-22 03:30:53 +09:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.

Prevent code duplication for functionality between GitHub and GitLab ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

These work for GitHub and GitLab.
2023-03-08 00:15:21 -08:00