1418 Commits

Author SHA1 Message Date
Klaijan
5ba3b9c2c6
chore: get eval metrics from ingest in (#2097)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-17 18:22:36 +00:00
Steve Canny
ee62ed7b54
rfctr(html): clean types and docs in prep for HTML table parsing fixes (#2104)
There are a cluster of bugs in the HTML parsing code, particularly
surrounding table behaviors but also inclusion of style elements, etc.

Clean up typing and docstrings in that neighborhood as a way to
familiarize myself with that part of the code-base.
2023-11-17 17:56:38 +00:00
Yuming Long
ef8ac7257d
Chore: Import tables_agent from inference (#2087)
Related to https://github.com/Unstructured-IO/unstructured/issues/2028

Import `tables_agent` from inference and don't init in unst again

copy the same logic from
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L89)
(currently deprecated).
2023-11-16 22:14:24 -08:00
Yuming Long
97a25b0094
Chore: move hi res initialization initialize.py file out of ingest (#2096)
Move Hi_res model initialization file out of ingest to `partition` dir

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-16 21:53:25 -08:00
Christine Straub
9c66eab8a9
Fix: handle pdf text extraction errors (#2101)
Closes #2084.
### Summary
Certain pdfs throw unexpected errors when being opened by `pdfminer`,
causing `partition_pdf()` to fail. We expect to be able to partition
smoothly using an alternative strategy if text extraction doesn't work.
Added exception handling to handle unexpected errors when extracting pdf
text and to help determine pdf strategy.
### Testing
PDF:
[NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13383215/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf)

```
elements = partition_pdf(
    filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf",
)
```
2023-11-16 21:42:36 -08:00
Steve Canny
a589a494f6
docx: improve page break fidelity (#1631)
Page breaks can and often do occur within a paragraph. The full text of
the paragraph is attributed to the page (number) the paragraph starts
on.

Improve page-break fidelity such that a paragraph containing a
page-break is split into two elements, one containing the text before
the page-break and the other the text after. Emit the `PageBreak`
element between these two and assign the correct page-number (n and n+1
respectively) to the two textual elements.

This functionality is largely provided upstream by the new `python-docx`
v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support).
That version also makes obsolete the "include hyperlink text in
`Paragraph.text` monkey patch that we had maintained up to now. Remove
that monkey-patch.
2023-11-17 00:09:14 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Roman Isecke
ead2a7f1eb
drop cloud cli deps (#2088)
### Description
To not require additional dependencies on cloud-related CLIs (i.e.
gcloud and az), using python and the existing dependencies already used
to run out code to interact with those providers for overhead work
associated with destination ingest tests.
2023-11-16 20:13:46 +00:00
Steve Canny
7a741c9ae6
fix(chunk): #1985 mis-splits of Table chunks (#2076)
Closes #1985 

**Summary.** Due to an interaction of coding errors, HTML text in
`TableChunk` splits of a `Table` element were repeating the entire HTML
for the table in each chunk.

**Technical Summary.** This behavior was fixed but not published in the
last chunking PR of a series. Finish up that PR and submit it all here.
This PR extracts chunking to the particular Section type (each has their
own distinct chunking behavior).
2023-11-16 16:22:50 +00:00
Steve Canny
41fc55bc12
fix(docx): tabulate output is non-deterministic (#2090)
The test for nested tables added a few PRs ago indirectly relies on the
padding added to table-HTML by `tabulate`. The length of that padding
turns out to be non-deterministic, perhaps related to M1 vs. Intel
hardware.

Remove padding from tabulate output in the test so only actual content
is compared.
2023-11-16 07:52:16 +00:00
cragwolfe
5fa40850f4
feat: convenience script to post files to the API (#2083)
Usage: ./unstructured-get-json.sh [options] <file>"
                                                                                                                                                       
Options:                                                                                                                                                             
  --api-key KEY   Specify the API key for authentication. Set the env var $UNST_API_KEY to skip providing this option.                                               
  --hi-res        hi_res strategy: Enable high-resolution processing, with layout segmentation and OCR                                                               
  --fast          fast strategy: No OCR, just extract embedded text                                                                                                  
  --ocr-only      ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation.                                                       
  --tables        Enable table extraction: tables are represented as html in metadata                                                                                
  --coordinates   Include coordinates in the output                                                                                                                  
  --trace         Enable trace logging for debugging, useful to cut and paste the executed curl call                                                                 
  --verbose       Enable verbose logging including printing first 8 elements to stdout                                                                               
  --s3            Write the resulting output to s3 (like a pastebin)                                                                                                 
  --help          Display this help and exit.                                                                                                                        
                                                                                                                                                                     
Arguments:                                                                                                                                                           
  <file>          File to send to the API.                                                                                                                           
                                                                                                                                                                     
The script requires a <file>, the document to post to the Unstructured API.                                                                                          
The .json result is written to ~/tmp/unst-outputs/ -- this path is echoed and copied to your clipboard.
2023-11-15 22:58:28 -08:00
cragwolfe
abe4e8191a
chore: ingest-script cleanup, better skip condition (#2094)
When testing ingest tests, one often wants to keep the .json output or
generated metrics files around for inspection after the fact. This
updates the bash condition to actually honor the comment that mentions

    # export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1
    
** Test Instructions **

Run:

    export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1
    ./test_unstructured_ingest/src/s3.sh
    ./test_unstructured_ingest/evaluation-metrics.sh text-extraction
    
and witness test directories/files do not get cleaned up. E.g.,
`test_unstructured_ingest/metrics-tmp/`. One can also add a `set -x` at
the top of test_unstructured_ingest/cleanup.sh to see what is getting
skipped (it's a lot!).
2023-11-15 22:28:04 -08:00
Christine Straub
e114e5c418
Refactor: partition pdf (#2074)
### Summary
- add constants for strategies
- add `_process_uncategorized_text_elements()` to remove code block
duplication
### Testing
CI should pass.
2023-11-15 21:41:02 -08:00
Klaijan
777a428071
chore: for ingest-test metrics, also check subdirs (#2079)
- Copy script only went through one layer of subdirectory so it did not
found the match between manifest file and structured output. Now edited
to search all subdirectories.
- `set -e` causes the script to exit at any exit rather than `exit 0`,
fix all scripts that needs to run the copy script to be `set +e` right
before the check diff, then back to `set -e` after
- Edit the default evaluation metrics output from `metrics` to
`metrics-tmp` to account for diff check
- Add a script that checks the differences between old eval metric
output (metrics) and new eval metrics output (metrics-tmp)
2023-11-15 21:02:43 -08:00
Yao You
f1ad901f57
chore: add more parametrization to ingestion test (#2086)
- allow the overwrite destination to be set to the `OUTPUT_ROOT` instead
of default to script dir.

## test
run

```bash
OVERWRITE_FIXTURES=true OUTPUT_ROOT=/tmp ./test_unstructured_ingest/src/s3.sh
```

with this change we should find new files generated under
`/tmp/expected-structured-output/s3`.
Without this change there will be no such new files.
2023-11-15 22:32:41 +00:00
Steve Canny
252405c780
Dynamic ElementMetadata implementation (#2043)
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.

### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.

An ad-hoc field can only be added by _assignment_ on an already
constructed instance.

### End-user ad-hoc metadata field behaviors

An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient  # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```

This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
``` 

### Implementation Notes

- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
2023-11-15 13:22:15 -08:00
cragwolfe
d7a280402f
build: larger images for docker publish (#2082)
Build and publish docker images on larger runner to work around the
space issue here:
https://github.com/Unstructured-IO/unstructured/actions/runs/6871101034/job/18689403845
.
2023-11-15 14:46:53 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
2023-11-15 00:52:58 -08:00
Austin Walker
2931cb38e8
fix: handle KeyError: 'N' for certain pdfs (#2072)
Closes #2059.

We've found some pdfs that throw an error in pdfminer. These files use a
ICCBased color profile but do not include an expected value `N`. As a
workaround, we can wrap pdfminer and drop any colorspace info, since we
don't need to render the document.

To verify, try to partition the document in the linked issue.

```
elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-15 01:59:05 +00:00
Trevor Bossert
f8528a0e2c
Update base image to include CUDA 11.8 (#2053)
This adds Nvidia GPU support with CUDA to container images.
2023-11-14 16:14:01 -08:00
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```
elements = partition(filename=filename, strategy="fast")
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
Ahmet Melek
68686e292e
fix: check existence of variable res before iteration (#2063)
Fixes a bug where `TypeError: 'NoneType' object is not iterable` raises
due to variable `res` returning as None

Checks the existence of `res` before iteration
2023-11-14 16:07:54 +00:00
Yuming Long
6c9990b013
Chore: specify default language parameter to paddle with DEFAULT_PADDLE_LANG (#2065)
Close: https://github.com/Unstructured-IO/unstructured-api/issues/247

### Summary
User now can specify default [paddle lang
code](https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations)
with env `DEFAULT_PADDLE_LANG` before we have the language mapping for
paddle

### Test
* in your unstructured API env, cd to unstructured repo and install it
locally with `pip install -e .`
* check out to this branch
* run paddle on intel chip:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export OCR_AGENT=paddle
export DEFAULT_PADDLE_LANG=ch
make run-web-app
```
* curl:
```
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   | jq -C . | less -R
```
* expected to see `INFO Loading paddle with CPU on language=ch...` in
log info
2023-11-13 22:05:37 +00:00
Trevor Bossert
22aedc4d6f
Remove ssh-keyscan and files (#2057)
This was legacy and is no longer needed. It also has the effect of
incorrect owner for known_hosts of notebook-user

Relates to: #2056
2023-11-13 18:50:06 +00:00
Yao You
36c4441e2b
ci: parametrize ingest test checking scripts (#2062)
- parametrize the output folder paths and expected output folder paths
in comparison scripts
- now allow user to use env `OUTPUT_ROOT` to control where the output
and expected output is
- currently assumes output from test and expected output are in the same
directory; this may need separation later

## test
run
```bash
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
and it should show files changed but not able to show diff since there
is no expected output content at `OUTPUT_ROOT`.

Then run
```bash
cp -R test_unstructured_ingest/expected-* /tmp/
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
we can see (due to CI and local instance producing different results)
actual line by line diff
2023-11-13 18:42:19 +00:00
John
1ead5a27df
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011 

`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.


### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of
strategy

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 16:47:05 +00:00
Christine Straub
b11c546757
Fix: partition pdf overflow error (#2054)
Closes #2050.
### Summary
- set zoom to `1` if zoom is less than `0` when parsing Tesseract OCR
data
- update `determine_pdf_auto_strategy` to return the `hi_res` strategy
if either `infer_table_structure` or `extract_images_in_pdf` is true
### Testing
PDF:
[getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)

Run the following code in both the `main` branch and the `current`
branch.

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="getty_62-62.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```
0.10.30
2023-11-10 11:01:46 -08:00
John
f8c180a59e
Jj/2027 float no attr strip (#2048)
Closes #2027 

Tables or pages that contain only numbers are returned as floats in a
pandas.DataFrame when the image or page is converted from
`.image_to_data()`. An AttributeError was raised downstream when trying
to `.strip()` the floats. This update converts those floats if needed
and otherwise strips the text.

Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-10 05:14:06 +00:00
cragwolfe
fa27408c4f
chore: fix Makefile ingest targets (#2051)
Fixes the Makefile `ingest-` targets were broken in
https://github.com/Unstructured-IO/unstructured/pull/1799/files.

**Test Instructions**

for maketarget in $(grep .PHONY Makefile | grep install-ingest | perl -p
-e 's/.PHONY://' | tr -d '\n'); do
      echo $maketarget; make $maketarget
    done
2023-11-09 21:55:27 -08:00
cragwolfe
69952f66ed
fix(build): update ingest script loc in Dockerfile (#2052)
Fixes docker-smoke-test.sh to reference the new location for the
wikipedia ingest script, which was moved in
https://github.com/Unstructured-IO/unstructured/pull/1951 . This fix
should allow the docker image build to complete on merges to main.

Reference to recent failed job:

https://github.com/Unstructured-IO/unstructured/actions/runs/6819416096/job/18546724401
2023-11-09 21:55:07 -08:00
Klaijan
049b0f3fa8
chore: update metrics-json-manifest (#2047)
Update `metrics-json-manigest.txt` master file for ingest evaluation.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-10 00:24:59 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author and when present are redundant
to a `w:lastRenderedPageBreak` element so cause over-counting if used.

Use rendered page-breaks only.
2023-11-09 20:34:30 +00:00
Christine Straub
3fe480799a
Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)
Closes #1875.

### Summary
- add functionality to do a second OCR on cropped table images
- use `IMAGE_CROP_PAD` env for `individual_blocks` mode
### Testing
The test function
[`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425)
in `test_pdf.py` should pass.

### NOTE: 
I've tried to experiment with values for scaling ENVs on the following
PRs but found that changes to the values for scaling ENVs affect the
entire page OCR output(OCR regression) so switched to doing a second OCR
for tables.
- https://github.com/Unstructured-IO/unstructured/pull/1998/files 
- https://github.com/Unstructured-IO/unstructured/pull/2004/files
- https://github.com/Unstructured-IO/unstructured/pull/2016/files
- https://github.com/Unstructured-IO/unstructured/pull/2029/files

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-09 18:29:55 +00:00
Christine Straub
bb58c1bb0b
Refactor: element type (#2035)
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants

### Testing
CI should pass.
2023-11-08 21:52:55 -08:00
Steve Canny
c688216b38
fix: remove .max_characters from ElementMetadata (#2032)
This metadata field is assumedly vestigial and is unused by any code in
the repo. `max_characters` is an optional argument to `chunk_by_title()`
and has meaning in that context, but is not written to the metadata.

Remove this unused field.
2023-11-08 19:56:31 +00:00
Steve Canny
0e2c21e5a2
fix: handle sectionless-docx in the general case (#1829)
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.

Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.

This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
2023-11-08 19:05:19 +00:00
shreyanid
67fa7ad867
feat: rework aggregate metrics by doctype calculation (#1982)
### Summary
Previously, the holistic evaluation script was a copy of the ingest
evaluation function with some modifications to aggregate the data by
doctype. This refactor instead takes the result of the
`measure_text_edit_distance` function (used by ingest) and aggregates
the results by doctype. This pattern can also be followed by future
aggregations we may want to perform.

### Test
Confirm the doctype aggregation functionality of the
`aggregate_cct_data_by_doctype` function by calling it on the ingest
metrics result sheet:
(from the top level unstructured folder)
```
python -c 'from unstructured.metrics.doctype_aggregation import *; aggregate_cct_data_by_doctype("./test_unstructured_ingest/metrics")'
```
The aggregated result will be written to the same metrics folder.
<img width="680" alt="Screenshot 2023-11-03 at 2 56 20 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/7250191b-bdf7-4e9f-99ca-ddbe7ee74ac5">
2023-11-08 01:00:02 -08:00
ryannikolaidis
d5fd21f0fd
fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023)
Closes #1064 

When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.

## Changes

* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: --pdf-infer-table-structure is now a flag not an arg (it
defaults false anyways, this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.

## Testing

* update against_api.sh source test script to specify a partition
argument and validates that the response from the api respected the
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the ci tests) (validated that the response leveraged the chipper model):

```
PYTHONPATH=. ./unstructured/ingest/main.py \
    local \
    --output-dir /tmp/ingest-requests/chipper \
    --verbose \
    --reprocess \
    --strategy hi_res \
    --partition-by-api \
    --hi-res-model-name chipper \
    --api-key "$API_KEY" \
    --input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
2023-11-08 04:47:02 +00:00
Roman Isecke
03f62faf9b
feat: add connection check method to all source and destination connectors (#2000)
### Description
Add a `check_connection` method to each connector to easily be able to
check it without running the full ingest process. As part of this PR,
some refactoring done to allow clients to be shared and populated across
the `check_connection` method and the `initialize` method, allowing for
the `check_connection` method to be called without having to rely on the
`initialize` one to be called first.

* bonus: fix the changelog

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-11-08 03:11:39 +00:00
qued
92ddf3a337
feat: enable request timeout (#2013)
Courtesy @cdpierse.

Adds a test to PR #1529 in accordance with feedback.

Description from original PR:

In python the default behaviour of `requests.get` without a `timeout`
being set is to hang indefinitely. We have a production use case where
the desired behaviour would be to raise a timeout error rather than have
the application just hang.

This PR adds a new optional keyword parameter `request_timeout` to
`partition` which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL. This is then passed to `requests.get`

---------

Co-authored-by: Charles Pierse <charlespierse@gmail.com>
2023-11-08 00:44:58 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently the various destination test will fail with:

```
{noformat}--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
  

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.{noformat}
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
2023-11-07 23:14:01 +00:00
qued
04fcdb91fe
chore: Update readme slack links (#2030)
Updated slack links in the README that were using an old shortened URL.
2023-11-07 13:02:43 -08:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
 ```
curl -X 'POST'  'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json'   \
-H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
``` 

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command return output and see warning in log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.10.29
2023-11-06 19:30:12 -06:00
Yao You
38ab35dcb6
fix: make pip compile (#2015)
- add missing make file in ingest folder
2023-11-06 16:26:12 -06:00
qued
ad09a869b5
fix: update slack link to link shortener (#2010)
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.

#### Testing:

Links should work. No more links should exist in the documentation
except this one.
2023-11-06 15:47:18 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Trevor Bossert
24d5877bd6
Bump base image with latest security fixes (#2009)
This includes latest version and security updates available from
upstream
2023-11-05 19:29:29 +00:00
Matt Robinson
e5bcd36475
docs: update slack links (#1990)
### Summary

A user in the [Community
Slack](https://unstructuredw-kbe4326.slack.com/archives/C043YA29U0J/p1698933003702919)
reported having difficulty signing up for Slack using the links from the
documentation. Updated the links to the use the invite link that worked
from him, which came from [this blog
post](https://medium.com/unstructured-io/setting-up-a-private-retrieval-augmented-generation-rag-system-with-local-vector-database-d42f34692ca7).
2023-11-05 11:26:34 -08:00