fix: local connector with input path to single file (#2116)

When passed an absolute file path for the input document path, the local
connector incorrectly writes the output file to the wrong directory.
In addition, in the single-file input case we currently include the
parent path as part of the destination when writing; instead, when a
single file is specified as input, the output file should be written
directly to the specified output directory. This change fixes the
behavior so that the output is written to `output-dir/input-filename.json`.
Note: this meant we needed to update the paths of some expected-results files.
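
The intended mapping can be sketched as follows (a minimal illustration of the rule described above, not the connector's actual code; `output_path_for` is a hypothetical helper):

```python
from pathlib import Path

def output_path_for(input_path: str, output_dir: str) -> Path:
    # Hypothetical helper: a single-file input maps to
    # output-dir/input-filename.json, with no parent directories
    # from the input path carried over into the destination.
    return Path(output_dir) / (Path(input_path).name + ".json")

print(output_path_for("/abs/path/example-docs/UDHR_first_article_all.txt", "output-dir"))
# output-dir/UDHR_first_article_all.txt.json
```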

## Changes
- Fix the incorrect output path of files partitioned via the local
connector when the input path is a file path (rather than a directory)
- Updated the local-single-file test to validate the flow where an
absolute file path is specified (since this case was particularly broken)

## Testing
Note: running the updated `local-single-file` test without the changes
to the local connector results in a final output copy of:

```
Copying /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/workdir/local-single-file/partitioned/a48c2abec07a9a31860429f94e5a6ade.json -> /Users/ryannikolaidis/Development/unstructured/unstructured/test_unstructured_ingest/../example-docs/language-docs/UDHR_first_article_all.txt.json
```

where the output is written alongside the input file rather than to the
expected `output-dir/input-filename.json`.

With this change, the output file is written to the expected location.
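
To see why the old path handling misbehaves, compare the pre-fix `str.replace` approach with the relative-path approach from this change (a standalone sketch using a temporary file, not the connector code itself):

```python
import tempfile
from pathlib import Path

# Create a real file so Path.is_file() behaves as it does in the connector.
root = Path(tempfile.mkdtemp()).resolve()
doc = root / "report.txt"
doc.write_text("hello")

# Pre-fix approach: strip the resolved input path with str.replace.
download_path = str(doc.resolve())
full_path = str(doc)
old_base = full_path.replace(download_path, "")
print(repr(old_base))  # '' -- the whole path is stripped, leaving no base filename

# Fixed approach: fall back to the parent directory for a single-file
# input, then compute a proper relative path.
dp = doc.resolve()
if dp.is_file():
    dp = dp.parent
new_base = str(doc.resolve().relative_to(dp))
print(new_base)  # report.txt
```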

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
ryannikolaidis 2023-11-19 10:21:31 -08:00 committed by GitHub
parent d623d75d3c
commit 13a23deba6
11 changed files with 545 additions and 540 deletions

View File

@@ -1,4 +1,4 @@
-## 0.11.0-dev6
+## 0.11.0-dev7
### Enhancements
@@ -22,6 +22,7 @@
* **Fix some pdfs returning `KeyError: 'N'`** Certain pdfs were throwing this error when being opened by pdfminer. Added a wrapper function for pdfminer that allows these documents to be partitioned.
* **Fix mis-splits on `Table` chunks.** Remedies repeated appearance of full `.text_as_html` on metadata of each `TableChunk` split from a `Table` element too large to fit in the chunking window.
* **Import tables_agent from inference** so that we don't have to initialize a global table agent in unstructured OCR again
+* **Fix local connector with absolute input path** When passed an absolute filepath for the input document path, the local connector incorrectly writes the output file to the input file directory. This fixes such that the output in this case is written to `output-dir/input-filename.json`
## 0.10.30

View File

@@ -6,9 +6,9 @@ SRC_PATH=$(dirname "$(realpath "$0")")
SCRIPT_DIR=$(dirname "$SRC_PATH")
cd "$SCRIPT_DIR"/.. || exit 1
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
-OUTPUT_FOLDER_NAME=azure-cog-search-dest
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
+OUTPUT_FOLDER_NAME=azure-cog-search-dest
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
DESTINATION_INDEX="utic-test-ingest-fixtures-output-$(uuidgen)"
@@ -90,19 +90,19 @@ while [ "$docs_count_remote" -eq 0 ] && [ "$attempt" -lt 6 ]; do
  --header "api-key: $AZURE_SEARCH_API_KEY" \
  --header 'content-type: application/json' | jq)
-  echo "docs count pulled from Azure: $docs_count_remote"
+  echo "docs count pulled from Azure Cognitive Search: $docs_count_remote"
  attempt=$((attempt+1))
done
docs_count_local=0
-for i in $(jq length "$OUTPUT_DIR"/**/*.json); do
+for i in $(jq length "$OUTPUT_DIR"/*.json); do
  docs_count_local=$((docs_count_local+i));
done
if [ "$docs_count_remote" -ne "$docs_count_local" ];then
-  echo "Number of docs in Azure $docs_count_remote doesn't match the expected docs: $docs_count_local"
+  echo "Number of docs in Azure Cognitive Search $docs_count_remote doesn't match the expected docs: $docs_count_local"
  exit 1
fi

View File

@@ -73,7 +73,7 @@ expected_num_files=1
num_files_in_dropbox=$(curl -X POST https://api.dropboxapi.com/2/files/list_folder \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer $DROPBOX_ACCESS_TOKEN" \
-  --data "{\"path\":\"$DESTINATION_DROPBOX/example-docs/\"}" | jq '.entries | length')
+  --data "{\"path\":\"$DESTINATION_DROPBOX/\"}" | jq '.entries | length')
if [ "$num_files_in_dropbox" -ne "$expected_num_files" ]; then
  echo "Expected $expected_num_files files to be uploaded to dropbox, but found $num_files_in_dropbox files."
  exit 1

View File

@@ -66,4 +66,4 @@ python "$SCRIPT_DIR"/python/test-ingest-mongodb.py \
  --database "$MONGODB_DATABASE_NAME" \
  --collection "$DESTINATION_MONGO_COLLECTION" \
  check-vector \
-  --output-json "$OUTPUT_ROOT"/structured-output/$OUTPUT_FOLDER_NAME/example-docs/fake-memo.pdf.json
+  --output-json "$OUTPUT_ROOT"/structured-output/$OUTPUT_FOLDER_NAME/fake-memo.pdf.json

View File

@@ -43,7 +43,7 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
# Simply check the number of files uploaded
expected_num_files=1
-num_files_in_s3=$(aws s3 ls "${DESTINATION_S3}example-docs/" --region us-east-2 | grep -c "\.json$")
+num_files_in_s3=$(aws s3 ls "${DESTINATION_S3}" --region us-east-2 | grep -c "\.json$")
if [ "$num_files_in_s3" -ne "$expected_num_files" ]; then
  echo "Expected $expected_num_files files to be uploaded to s3, but found $num_files_in_s3 files."
  exit 1

View File

@@ -9,6 +9,8 @@ OUTPUT_FOLDER_NAME=local-single-file
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
+# assigning an absolute path to the input file so that we explicitly test passing an absolute path
+ABS_INPUT_PATH="$SCRIPT_DIR/../example-docs/language-docs/UDHR_first_article_all.txt"
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
# shellcheck disable=SC1091
@@ -29,7 +31,7 @@ PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
  --additional-partition-args '{"strategy":"ocr_only", "languages":["ind", "est"]}' \
  --verbose \
  --reprocess \
-  --input-path example-docs/language-docs/UDHR_first_article_all.txt \
+  --input-path "$ABS_INPUT_PATH" \
  --work-dir "$WORK_DIR"
set +e

View File

@@ -1 +1 @@
-__version__ = "0.11.0-dev6" # pragma: no cover
+__version__ = "0.11.0-dev7" # pragma: no cover

View File

@@ -41,10 +41,12 @@ class LocalIngestDoc(BaseIngestDoc):
    @property
    def base_filename(self) -> t.Optional[str]:
-        download_path = str(Path(self.connector_config.input_path).resolve())
-        full_path = str(self.filename)
-        base_path = full_path.replace(download_path, "")
-        return base_path
+        download_path = Path(self.connector_config.input_path).resolve()
+        full_path = Path(self.filename).resolve()
+        if download_path.is_file():
+            download_path = download_path.parent
+        relative_path = full_path.relative_to(download_path)
+        return str(relative_path)

    @property
    def filename(self):
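
The fixed `base_filename` logic shown in the diff above can be exercised in isolation (a sketch using a free function that mirrors the property; the directory layout and names are made up for the example):

```python
import tempfile
from pathlib import Path

def base_filename(input_path: str, filename: str) -> str:
    # Mirrors the fixed property: resolve both paths, and for a
    # single-file input compare against the file's parent directory.
    download_path = Path(input_path).resolve()
    full_path = Path(filename).resolve()
    if download_path.is_file():
        download_path = download_path.parent
    return str(full_path.relative_to(download_path))

root = Path(tempfile.mkdtemp()).resolve()
(root / "nested").mkdir()
doc = root / "nested" / "memo.txt"
doc.write_text("x")

# Directory input: the path relative to the input root is preserved.
print(base_filename(str(root), str(doc)))  # nested/memo.txt
# Single-file input: only the filename remains, so the output lands
# directly in the output directory as memo.txt.json.
print(base_filename(str(doc), str(doc)))   # memo.txt
```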