1109 Commits

Author SHA1 Message Date
Steve Canny
41fc55bc12
fix(docx): tabulate output is non-deterministic (#2090)
The test for nested tables added a few PRs ago indirectly relies on the
padding added to table-HTML by `tabulate`. The length of that padding
turns out to be non-deterministic, perhaps related to M1 vs. Intel
hardware.

Remove padding from tabulate output in the test so only actual content
is compared.
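
One way to make such a comparison insensitive to padding is sketched below. This is only an illustration: `strip_cell_padding` is a hypothetical helper, not the code used in the test, and it assumes the padding is plain whitespace next to tags.

```python
import re


def strip_cell_padding(html: str) -> str:
    """Remove whitespace that sits directly after or before an HTML tag.

    Hypothetical helper for illustration; the actual test may normalize differently.
    """
    # Drop whitespace immediately after a '>' ...
    html = re.sub(r">\s+", ">", html)
    # ... and immediately before a '<'.
    html = re.sub(r"\s+<", "<", html)
    return html


padded = "<table>\n<tr><td>a   </td><td> b</td></tr>\n</table>"
compact = "<table><tr><td>a</td><td>b</td></tr></table>"
assert strip_cell_padding(padded) == compact
```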
2023-11-16 07:52:16 +00:00
cragwolfe
5fa40850f4
feat: convenience script to post files to the API (#2083)
Usage: ./unstructured-get-json.sh [options] <file>

Options:
  --api-key KEY   Specify the API key for authentication. Set the env var $UNST_API_KEY to skip providing this option.
  --hi-res        hi_res strategy: Enable high-resolution processing, with layout segmentation and OCR
  --fast          fast strategy: No OCR, just extract embedded text
  --ocr-only      ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation.
  --tables        Enable table extraction: tables are represented as html in metadata
  --coordinates   Include coordinates in the output
  --trace         Enable trace logging for debugging, useful to cut and paste the executed curl call
  --verbose       Enable verbose logging including printing first 8 elements to stdout
  --s3            Write the resulting output to s3 (like a pastebin)
  --help          Display this help and exit.

Arguments:
  <file>          File to send to the API.

The script requires a <file>, the document to post to the Unstructured API.
The .json result is written to ~/tmp/unst-outputs/ -- this path is echoed and copied to your clipboard.
2023-11-15 22:58:28 -08:00
cragwolfe
abe4e8191a
chore: ingest-script cleanup, better skip condition (#2094)
When testing ingest tests, one often wants to keep the .json output or
generated metrics files around for inspection after the fact. This
updates the bash condition to actually honor the comment that mentions

    # export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1
    
** Test Instructions **

Run:

    export UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1
    ./test_unstructured_ingest/src/s3.sh
    ./test_unstructured_ingest/evaluation-metrics.sh text-extraction
    
and witness test directories/files do not get cleaned up. E.g.,
`test_unstructured_ingest/metrics-tmp/`. One can also add a `set -x` at
the top of test_unstructured_ingest/cleanup.sh to see what is getting
skipped (it's a lot!).
2023-11-15 22:28:04 -08:00
Christine Straub
e114e5c418
Refactor: partition pdf (#2074)
### Summary
- add constants for strategies
- add `_process_uncategorized_text_elements()` to remove code block
duplication
### Testing
CI should pass.
2023-11-15 21:41:02 -08:00
Klaijan
777a428071
chore: for ingest-test metrics, also check subdirs (#2079)
- The copy script only descended one level of subdirectories, so it did
not find the match between the manifest file and the structured output.
It now searches all subdirectories.
- `set -e` causes a script to exit on any non-zero exit status, so every
script that needs to run the copy script now switches to `set +e` right
before the diff check, then back to `set -e` afterwards
- Edit the default evaluation metrics output from `metrics` to
`metrics-tmp` to account for diff check
- Add a script that checks the differences between old eval metric
output (metrics) and new eval metrics output (metrics-tmp)
2023-11-15 21:02:43 -08:00
Yao You
f1ad901f57
chore: add more parametrization to ingestion test (#2086)
- allow the overwrite destination to be set to `OUTPUT_ROOT` instead
of defaulting to the script dir.

## test
run

```bash
OVERWRITE_FIXTURES=true OUTPUT_ROOT=/tmp ./test_unstructured_ingest/src/s3.sh
```

with this change we should find new files generated under
`/tmp/expected-structured-output/s3`.
Without this change there will be no such new files.
2023-11-15 22:32:41 +00:00
Steve Canny
252405c780
Dynamic ElementMetadata implementation (#2043)
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.

### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.

An ad-hoc field can only be added by _assignment_ on an already
constructed instance.

### End-user ad-hoc metadata field behaviors

An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient  # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```

This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
``` 
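
The built-in `getattr()` with a default is an equivalent, slightly shorter spelling, since it returns the default whenever the attribute lookup raises `AttributeError`. A minimal sketch, using `types.SimpleNamespace` as a stand-in for `ElementMetadata` (an assumption made only to keep the example self-contained):

```python
from types import SimpleNamespace

# SimpleNamespace stands in for ElementMetadata here (illustration only):
# as described above, reading an unassigned attribute raises AttributeError.
tagged = SimpleNamespace(coefficient=0.536)
untagged = SimpleNamespace()

# getattr() with a default yields None where the field was never assigned.
values = [getattr(m, "coefficient", None) for m in (tagged, untagged)]
assert values == [0.536, None]
```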

### Implementation Notes

- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
2023-11-15 13:22:15 -08:00
cragwolfe
d7a280402f
build: larger images for docker publish (#2082)
Build and publish docker images on larger runner to work around the
space issue here:
https://github.com/Unstructured-IO/unstructured/actions/runs/6871101034/job/18689403845
.
2023-11-15 14:46:53 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize the JSON produced for ingest tests so that incidental changes
in the _form_ of a JSON object (keys moving around) that do not change
the _content_ of that JSON object do not trigger an ingest-test
failure.
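
As a rough illustration of the idea (not necessarily how the ingest code does it), serializing with sorted keys makes two objects with the same content but different key order compare equal. `canonical_json` is a hypothetical name:

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed indentation so key order cannot cause a diff."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False)


# Same content, different key order at both nesting levels.
a = {"type": "Title", "metadata": {"page_number": 1, "filename": "a.pdf"}}
b = {"metadata": {"filename": "a.pdf", "page_number": 1}, "type": "Title"}
assert canonical_json(a) == canonical_json(b)
```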
2023-11-15 00:52:58 -08:00
Austin Walker
2931cb38e8
fix: handle KeyError: 'N' for certain pdfs (#2072)
Closes #2059.

We've found some pdfs that throw an error in pdfminer. These files use an
ICCBased color profile but do not include an expected value `N`. As a
workaround, we can wrap pdfminer and drop any colorspace info, since we
don't need to render the document.

To verify, try to partition the document in the linked issue.

```
elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-15 01:59:05 +00:00
Trevor Bossert
f8528a0e2c
Update base image to include CUDA 11.8 (#2053)
This adds Nvidia GPU support with CUDA to container images.
2023-11-14 16:14:01 -08:00
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```
elements = partition(filename=filename, strategy="fast")
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
Ahmet Melek
68686e292e
fix: check existence of variable res before iteration (#2063)
Fixes a bug where `TypeError: 'NoneType' object is not iterable` is raised
because the variable `res` is `None`.

Checks that `res` exists before iterating over it.
2023-11-14 16:07:54 +00:00
Yuming Long
6c9990b013
Chore: specify default language parameter to paddle with DEFAULT_PADDLE_LANG (#2065)
Close: https://github.com/Unstructured-IO/unstructured-api/issues/247

### Summary
Users can now specify a default [paddle lang
code](https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations)
with the env var `DEFAULT_PADDLE_LANG` until we have the language mapping for
paddle

### Test
* in your unstructured API env, cd to unstructured repo and install it
locally with `pip install -e .`
* check out to this branch
* run paddle on intel chip:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export OCR_AGENT=paddle
export DEFAULT_PADDLE_LANG=ch
make run-web-app
```
* curl:
```
curl  -X 'POST'  'http://localhost:8000/general/v0/general'   -H 'accept: application/json'  -F 'files=@sample-docs/english-and-korean.png'   | jq -C . | less -R
```
* expected to see `INFO Loading paddle with CPU on language=ch...` in
log info
2023-11-13 22:05:37 +00:00
Trevor Bossert
22aedc4d6f
Remove ssh-keyscan and files (#2057)
This was legacy and is no longer needed. It also caused the known_hosts
file of notebook-user to have the wrong owner.

Relates to: #2056
2023-11-13 18:50:06 +00:00
Yao You
36c4441e2b
ci: parametrize ingest test checking scripts (#2062)
- parametrize the output folder paths and expected output folder paths
in comparison scripts
- now allow user to use env `OUTPUT_ROOT` to control where the output
and expected output is
- currently assumes output from test and expected output are in the same
directory; this may need separation later

## test
run
```bash
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
and it should show files changed but not able to show diff since there
is no expected output content at `OUTPUT_ROOT`.

Then run
```bash
cp -R test_unstructured_ingest/expected-* /tmp/
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
we can see (due to CI and local instance producing different results)
actual line by line diff
2023-11-13 18:42:19 +00:00
John
1ead5a27df
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011 

`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.


### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of
strategy

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 16:47:05 +00:00
Christine Straub
b11c546757
Fix: partition pdf overflow error (#2054)
Closes #2050.
### Summary
- set zoom to `1` if zoom is less than `0` when parsing Tesseract OCR
data
- update `determine_pdf_auto_strategy` to return the `hi_res` strategy
if either `infer_table_structure` or `extract_images_in_pdf` is true
### Testing
PDF:
[getty_62-62.pdf](https://github.com/Unstructured-IO/unstructured/files/13322169/getty_62-62.pdf)

Run the following code in both the `main` branch and the `current`
branch.

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="getty_62-62.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
```
0.10.30
2023-11-10 11:01:46 -08:00
John
f8c180a59e
Jj/2027 float no attr strip (#2048)
Closes #2027 

Tables or pages that contain only numbers are returned as floats in a
pandas.DataFrame when the image or page is converted from
`.image_to_data()`. An AttributeError was raised downstream when trying
to `.strip()` the floats. This update converts those floats if needed
and otherwise strips the text.
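
The kind of guard described above might look like the following. This is a sketch under assumptions: `clean_ocr_text` is a hypothetical helper, and the repo's actual conversion may differ.

```python
def clean_ocr_text(value) -> str:
    """Return stripped text for a cell that may be a str or a number.

    Hypothetical helper; the repo's actual conversion may differ.
    """
    if isinstance(value, str):
        return value.strip()
    # Numeric cells (e.g. floats from a DataFrame column) have no .strip().
    return str(value).strip()


assert clean_ocr_text("  42 ") == "42"
assert clean_ocr_text(3.5) == "3.5"
```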

Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-10 05:14:06 +00:00
cragwolfe
fa27408c4f
chore: fix Makefile ingest targets (#2051)
Fixes the Makefile `ingest-` targets that were broken in
https://github.com/Unstructured-IO/unstructured/pull/1799/files.

**Test Instructions**

    for maketarget in $(grep .PHONY Makefile | grep install-ingest | perl -p -e 's/.PHONY://' | tr -d '\n'); do
      echo $maketarget; make $maketarget
    done
2023-11-09 21:55:27 -08:00
cragwolfe
69952f66ed
fix(build): update ingest script loc in Dockerfile (#2052)
Fixes docker-smoke-test.sh to reference the new location for the
wikipedia ingest script, which was moved in
https://github.com/Unstructured-IO/unstructured/pull/1951 . This fix
should allow the docker image build to complete on merges to main.

Reference to recent failed job:

https://github.com/Unstructured-IO/unstructured/actions/runs/6819416096/job/18546724401
2023-11-09 21:55:07 -08:00
Klaijan
049b0f3fa8
chore: update metrics-json-manifest (#2047)
Update `metrics-json-manifest.txt` master file for ingest evaluation.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-10 00:24:59 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author; when present, these are redundant
with a `w:lastRenderedPageBreak` element and so cause over-counting if used.

Use rendered page-breaks only.
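
To illustrate the over-counting, here is a small sketch that counts both kinds of break in a hand-written WordprocessingML fragment. The XML is a simplified assumption, not a real document, and the counting code is not the partitioner's implementation.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Tiny hand-written fragment (illustrative): one hard page break followed by
# the rendered page break that Word records for the same page boundary.
DOCUMENT_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Page one</w:t></w:r><w:r><w:br w:type="page"/></w:r></w:p>
    <w:p><w:r><w:lastRenderedPageBreak/><w:t>Page two</w:t></w:r></w:p>
  </w:body>
</w:document>
"""

root = ET.fromstring(DOCUMENT_XML)

# Rendered breaks: one per page boundary.
rendered_breaks = root.findall(f".//{{{W_NS}}}lastRenderedPageBreak")

# Hard breaks: <w:br w:type="page"/> inserted by the author.
hard_breaks = [
    br for br in root.iter(f"{{{W_NS}}}br") if br.get(f"{{{W_NS}}}type") == "page"
]

# Counting rendered breaks only gives two pages; adding hard breaks would give three.
page_count = len(rendered_breaks) + 1
assert page_count == 2
assert len(rendered_breaks) + len(hard_breaks) + 1 == 3
```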
2023-11-09 20:34:30 +00:00
Christine Straub
3fe480799a
Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)
Closes #1875.

### Summary
- add functionality to do a second OCR on cropped table images
- use `IMAGE_CROP_PAD` env for `individual_blocks` mode
### Testing
The test function
[`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425)
in `test_pdf.py` should pass.

### NOTE: 
I experimented with values for the scaling ENVs in the following
PRs but found that changing them affects the OCR output for the
entire page (an OCR regression), so I switched to doing a second OCR
for tables.
- https://github.com/Unstructured-IO/unstructured/pull/1998/files 
- https://github.com/Unstructured-IO/unstructured/pull/2004/files
- https://github.com/Unstructured-IO/unstructured/pull/2016/files
- https://github.com/Unstructured-IO/unstructured/pull/2029/files

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-09 18:29:55 +00:00
Christine Straub
bb58c1bb0b
Refactor: element type (#2035)
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants

### Testing
CI should pass.
2023-11-08 21:52:55 -08:00
Steve Canny
c688216b38
fix: remove .max_characters from ElementMetadata (#2032)
This metadata field is presumably vestigial and is unused by any code in
the repo. `max_characters` is an optional argument to `chunk_by_title()`
and has meaning in that context, but is not written to the metadata.

Remove this unused field.
2023-11-08 19:56:31 +00:00
Steve Canny
0e2c21e5a2
fix: handle sectionless-docx in the general case (#1829)
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.

Use the `Document.iter_inner_content()` method newly added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.

This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
2023-11-08 19:05:19 +00:00
shreyanid
67fa7ad867
feat: rework aggregate metrics by doctype calculation (#1982)
### Summary
Previously, the holistic evaluation script was a copy of the ingest
evaluation function with some modifications to aggregate the data by
doctype. This refactor instead takes the result of the
`measure_text_edit_distance` function (used by ingest) and aggregates
the results by doctype. This pattern can also be followed by future
aggregations we may want to perform.

### Test
Confirm the doctype aggregation functionality of the
`aggregate_cct_data_by_doctype` function by calling it on the ingest
metrics result sheet:
(from the top level unstructured folder)
```
python -c 'from unstructured.metrics.doctype_aggregation import *; aggregate_cct_data_by_doctype("./test_unstructured_ingest/metrics")'
```
The aggregated result will be written to the same metrics folder.
<img width="680" alt="Screenshot 2023-11-03 at 2 56 20 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/7250191b-bdf7-4e9f-99ca-ddbe7ee74ac5">
2023-11-08 01:00:02 -08:00
ryannikolaidis
d5fd21f0fd
fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023)
Closes #1064 

When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.

## Changes

* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: --pdf-infer-table-structure is now a flag not an arg (it
defaults false anyways, this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.

## Testing

* update the against_api.sh source test script to specify a partition
argument and validate that the response from the api respected the
argument
* manually ran a request and validated that the response leveraged the
chipper model as specified (not sure if we want to bake a chipper request
into the ci tests):

```
PYTHONPATH=. ./unstructured/ingest/main.py \
    local \
    --output-dir /tmp/ingest-requests/chipper \
    --verbose \
    --reprocess \
    --strategy hi_res \
    --partition-by-api \
    --hi-res-model-name chipper \
    --api-key "$API_KEY" \
    --input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
2023-11-08 04:47:02 +00:00
Roman Isecke
03f62faf9b
feat: add connection check method to all source and destination connectors (#2000)
### Description
Add a `check_connection` method to each connector to easily be able to
check it without running the full ingest process. As part of this PR,
some refactoring was done to allow clients to be shared and populated across
the `check_connection` method and the `initialize` method, allowing for
the `check_connection` method to be called without having to rely on the
`initialize` one to be called first.

* bonus: fix the changelog

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-11-08 03:11:39 +00:00
qued
92ddf3a337
feat: enable request timeout (#2013)
Courtesy @cdpierse.

Adds a test to PR #1529 in accordance with feedback.

Description from original PR:

In python the default behaviour of `requests.get` without a `timeout`
being set is to hang indefinitely. We have a production use case where
the desired behaviour would be to raise a timeout error rather than have
the application just hang.

This PR adds a new optional keyword parameter `request_timeout` to
`partition` which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL. This is then passed to `requests.get`

---------

Co-authored-by: Charles Pierse <charlespierse@gmail.com>
2023-11-08 00:44:58 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table that captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently the various destination tests will fail with:

```
--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we’re assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood is actually pretty high given that we run these tests against
a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
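
The actual change is in the shell scripts (`uuidgen`). The Python sketch below only illustrates why a one-second timestamp collides across parallel jobs while a random UUID does not; the function names and bucket are hypothetical.

```python
import uuid


def timestamp_destination(bucket: str, epoch_seconds: int) -> str:
    # Old scheme: two jobs starting within the same second get the same path.
    return f"{bucket}/{epoch_seconds}"


def uuid_destination(bucket: str) -> str:
    # New scheme: a random UUID per job, so parallel jobs get distinct paths.
    return f"{bucket}/{uuid.uuid4()}"


bucket = "gs://example-bucket"  # hypothetical bucket name
same_second = 1699377964

# Two matrix jobs in the same second collide under the timestamp scheme ...
assert timestamp_destination(bucket, same_second) == timestamp_destination(bucket, same_second)
# ... but not under the UUID scheme.
assert uuid_destination(bucket) != uuid_destination(bucket)
```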
2023-11-07 23:14:01 +00:00
qued
04fcdb91fe
chore: Update readme slack links (#2030)
Updated slack links in the README that were using an old shortened URL.
2023-11-07 13:02:43 -08:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
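
The filtering described in the bullets above can be sketched roughly as follows. Both function names are hypothetical and the ISO-code conversion step is omitted. The only external fact relied on is that Tesseract's `-l` option accepts several codes joined with `+`.

```python
def clean_language_codes(codes, default="eng"):
    """Drop empty entries and duplicates (keeping first-seen order); fall back to a default.

    Hypothetical helper, not the repo's function; ISO-code conversion is omitted.
    """
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    # An empty result would produce an empty -l argument, so use the default instead.
    return seen or [default]


def tesseract_lang_arg(codes):
    # Tesseract's -l option takes codes joined by '+', e.g. "eng+kor".
    return "+".join(clean_language_codes(codes))


assert tesseract_lang_arg(["", "eng", "kor", "eng"]) == "eng+kor"
assert tesseract_lang_arg([""]) == "eng"
```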
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
  -F 'languages=""' \
  -F 'strategy=hi_res' \
  -F 'pdf_infer_table_structure=True' \
  | jq -C . | less -R
```

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command returns output, with a warning in the log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.10.29
2023-11-06 19:30:12 -06:00
Yao You
38ab35dcb6
fix: make pip compile (#2015)
- add missing make file in ingest folder
2023-11-06 16:26:12 -06:00
qued
ad09a869b5
fix: update slack link to link shortener (#2010)
Per @tabossert we're now using a link shortener behind which we can
rotate the link to keep it current. That way we (🤞 ) never have to
update this here again.

#### Testing:

Links should work. No more links should exist in the documentation
except this one.
2023-11-06 15:47:18 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model makes sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Trevor Bossert
24d5877bd6
Bump base image with latest security fixes (#2009)
This includes latest version and security updates available from
upstream
2023-11-05 19:29:29 +00:00
Matt Robinson
e5bcd36475
docs: update slack links (#1990)
### Summary

A user in the [Community
Slack](https://unstructuredw-kbe4326.slack.com/archives/C043YA29U0J/p1698933003702919)
reported having difficulty signing up for Slack using the links from the
documentation. Updated the links to use the invite link that worked
for him, which came from [this blog
post](https://medium.com/unstructured-io/setting-up-a-private-retrieval-augmented-generation-rag-system-with-local-vector-database-d42f34692ca7).
2023-11-05 11:26:34 -08:00
Trevor Bossert
d63bb215d4
Add back environment to unblock (#2008)
Azure federated branch subject doesn’t work with merge queues.
2023-11-05 05:05:24 +00:00
Klaijan
c471ea3cc7
chore: remove copy line from non-matrix connectors (#1976)
2023-11-04 10:58:56 -07:00
Trevor Bossert
4db04b7a22
ci(test): remove environment identifier (#2003)
Moved Azure OIDC to use only the Pull Request subject; this gets rid of
the noise it creates in PRs.
2023-11-03 16:10:22 -07:00
ryannikolaidis
9fd77a5232
ci: only trigger ingest fixtures workflow on workflow dispatch (#2002)
We currently have a method to trigger the ingest fixture workflow by
commit message in addition to workflow dispatch (the trigger in gha
gui). The former requires that the workflow run on every push. Because
nobody uses the former, let's scrap it and save the time in CI.
2023-11-03 18:19:15 +00:00
Roman Isecke
ba4477ac20
feat: support table conversion for tabular destination connectors (#1917)
### Description
* A full schema was introduced to map the type of all output content
from the json partition output, and that content is mapped to a flattened
table structure to leverage table-based destination connectors. So far,
the delta table destination connector has been updated to take advantage
of this.
* Existing method to convert to a dataframe was updated because it had a
bug in it. Object content in the metadata would have the key name
changed when flattened but then this would be omitted since it didn't
exist in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)
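
For readers unfamiliar with the flattening step, the sketch below shows why nested metadata keys change name when flattened, which is why any list of expected column names has to use the flattened names. `flatten_dict` and the underscore key scheme are assumptions for illustration, not the library's implementation.

```python
def flatten_dict(d, parent_key="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`.

    Illustrative only; the key-naming scheme is an assumption.
    """
    flat = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that nested keys carry their parents' names as a prefix.
            flat.update(flatten_dict(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


element = {
    "type": "Title",
    "text": "Hello",
    "metadata": {"page_number": 1, "data_source": {"url": "s3://bucket/a.pdf"}},
}
row = flatten_dict(element)
assert row == {
    "type": "Title",
    "text": "Hello",
    "metadata_page_number": 1,
    "metadata_data_source_url": "s3://bucket/a.pdf",
}
```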

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-03 16:47:21 +00:00
Christine Straub
9f7ff4fd98
rfctr: Clean up test functions in test_pdf.py (#1999)
### Summary:
- use the test utility function `example_doc_path()`
- clean up test functions related to `metadata_date` and
`exclude_metadata`
2023-11-03 10:02:43 -05:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add to dest tests being run (wasn't
being run originally)
2023-11-03 12:46:56 +00:00
cragwolfe
668bd2967e
chore: update CHANGELOG.md (#1997)
Remove bullets not related to end-user consumption of the unstructured
library.

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-11-02 23:31:04 +00:00
Yao You
db766402a4
test: parametrize ingest test scripts (#1979)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):

- parametrize the output folder so that ingest output files can be
saved somewhere other than where the scripts are; this is set by
env `OUTPUT_ROOT`
- parametrize the python path `PYTHONPATH` to first check for an existing
definition before defaulting to `.`, the current folder
- parametrize the run script that carries out ingest using `RUN_SCRIPT`;
the default is still `./unstructured/ingest/main.py`

These changes allow us to run ingest tests with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh`: this should raise an
error because the system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh`: this should run as
normal
- do the following

```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`

[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-11-02 21:41:56 +00:00