321 Commits

Author SHA1 Message Date
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```python
from unstructured.partition.auto import partition

filename = "9493801-p17.pdf"
elements = partition(filename=filename, strategy="fast")
assert elements == []
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
Yao You
36c4441e2b
ci: parametrize ingest test checking scripts (#2062)
- parametrize the output folder paths and expected output folder paths
in comparison scripts
- now allow the user to use the env var `OUTPUT_ROOT` to control where the
output and expected output live
- currently assumes the test output and expected output are in the same
directory; these may need to be separated later

## Test
Run
```bash
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
It should report the files as changed but show no diff, since there is
no expected output content at `OUTPUT_ROOT`.

Then run
```bash
cp -R test_unstructured_ingest/expected-* /tmp/
OUTPUT_ROOT=/tmp ./test_unstructured_ingest/test-ingest-src.sh
```
Now the actual line-by-line diff appears (because CI and a local instance
produce different results).
2023-11-13 18:42:19 +00:00
John
1ead5a27df
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011 

`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.


### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```python
from unstructured.partition.image import partition_image
filename = "example-docs/english-and-korean.png"

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```python
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/korean-text-with-tables.pdf"

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of
strategy.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 16:47:05 +00:00
Klaijan
049b0f3fa8
chore: update metrics-json-manifest (#2047)
Update the `metrics-json-manifest.txt` master file for ingest evaluation.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-10 00:24:59 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page breaks inserted by the author; when present, a hard break is
redundant with a `w:lastRenderedPageBreak` element, so counting both
leads to over-counting.

Use rendered page-breaks only.
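For illustration only, a minimal sketch of counting rendered page breaks
straight from the document XML (this is not the partitioner's
implementation; it just shows what `w:lastRenderedPageBreak` looks like to
code, assuming `lxml` is installed):

```python
from zipfile import ZipFile

from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def count_rendered_page_breaks(docx_path: str) -> int:
    # w:lastRenderedPageBreak marks where Word last rendered a page boundary,
    # so the page count is roughly this count plus one.
    with ZipFile(docx_path) as docx:
        document_xml = docx.read("word/document.xml")
    root = etree.fromstring(document_xml)
    return len(root.findall(f".//{{{W_NS}}}lastRenderedPageBreak"))
```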
2023-11-09 20:34:30 +00:00
Christine Straub
3fe480799a
Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)
Closes #1875.

### Summary
- add functionality to do a second OCR on cropped table images
- use `IMAGE_CROP_PAD` env for `individual_blocks` mode
### Testing
The test function
[`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425)
in `test_pdf.py` should pass.

### NOTE: 
I've tried experimenting with values for the scaling ENVs on the
following PRs, but found that changing those values affects the OCR
output for the entire page (an OCR regression), so I switched to doing a
second OCR for tables instead.
- https://github.com/Unstructured-IO/unstructured/pull/1998/files 
- https://github.com/Unstructured-IO/unstructured/pull/2004/files
- https://github.com/Unstructured-IO/unstructured/pull/2016/files
- https://github.com/Unstructured-IO/unstructured/pull/2029/files

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-09 18:29:55 +00:00
ryannikolaidis
d5fd21f0fd
fix: pass partition arguments to api when partitioning with unstructured-ingest and --partition-by-api (#2023)
Closes #1064 

When using the `--partition-by-api` flag via unstructured-ingest, none
of the partition arguments are forwarded, meaning that these options are
disregarded. With this change, we now pass through all of the relevant
partition arguments to the api.

## Changes

* parse and pass relevant partition arguments to the api in
unstructured-ingest
* bonus: leverage an existing `partition.api` function to call out to
the api rather than including duplicative request logic in unstructured
ingest
* bonus: `--pdf-infer-table-structure` is now a flag rather than an arg (it
defaults to false anyway; this is more succinct and consistent with
similar parameters)
* bonus: adds `hi_res_model_name` so a user can specify the model to
leverage when using a hi_res strategy.

## Testing

* updated the against_api.sh source test script to specify a partition
argument and validated that the response from the api respected the
argument
* manually ran a request and validated that it was processed with
chipper as specified (not sure if we want to bake a chipper request into
the ci tests) (validated that the response leveraged the chipper model):

```bash
PYTHONPATH=. ./unstructured/ingest/main.py \
    local \
    --output-dir /tmp/ingest-requests/chipper \
    --verbose \
    --reprocess \
    --strategy hi_res \
    --partition-by-api \
    --hi-res-model-name chipper \
    --api-key "$API_KEY" \
    --input-path 'example-docs/layout-parser-paper-with-table.pdf'
```
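
For reference, the equivalent call from Python via `partition_via_api`
might look roughly like the sketch below; treat the kwargs as illustrative
of what gets forwarded, not as the exact code path used by ingest.

```python
from unstructured.partition.api import partition_via_api

# Illustrative sketch: extra kwargs are forwarded to the hosted API as request
# parameters, so the partition options chosen on the CLI are respected.
elements = partition_via_api(
    filename="example-docs/layout-parser-paper-with-table.pdf",
    api_key="<API_KEY>",
    strategy="hi_res",
    hi_res_model_name="chipper",
)
```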
2023-11-08 04:47:02 +00:00
ryannikolaidis
0e94dd5d65
fix: ingest destination test failure with missing output (#2031)
Intermittently, the various destination tests will fail with:

```
--- Cleanup done ---
gs://utic-test-ingest-fixtures-output/1699377964/example-docs/
deleting gs://utic-test-ingest-fixtures-output/1699377964
Removing objects:
  

ERROR: (gcloud.storage.rm) The following URLs matched no objects or files:
-gs://utic-test-ingest-fixtures-output/1699377964
Last ran script: gcs.sh
Error: Process completed with exit code 1.
```

Reference trace
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/6787927424/job/18452240764?pr=2020)

After some investigation it looks like this error is due to collisions
that occur because we're assuming 1s date accuracy is sufficient when
generating (and deleting) "unique" test destination location names. The
likelihood of a collision is actually pretty high given that we run these
tests against a test matrix.

Instead we should just use a uuid for these unique destinations.

## Changes

- Use uuidgen instead of `date +%s` for unique destinations
2023-11-07 23:14:01 +00:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click-decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup functionality
in `measure_text_edit_distance` or `measure_element_type_accuracy`. This
PR removes the click decoration from the core functions and moves it into
thin wrapper functions that exist purely to execute the commands.
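
A minimal sketch of that separation (the option names here are
illustrative):

```python
import click

def measure_text_edit_distance(output_dir: str, source_dir: str) -> None:
    """Core function: importable and callable directly from other code or tests."""
    ...

@click.command()
@click.option("--output_dir", required=True)
@click.option("--source_dir", required=True)
def measure_text_edit_distance_command(output_dir: str, source_dir: str) -> None:
    """Thin click wrapper that only parses CLI args and delegates."""
    measure_text_edit_distance(output_dir=output_dir, source_dir=source_dir)
```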

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since with
multiple providers supported a deterministic provider can be chosen and
the embedding vectors become reproducible

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero cost

For all these reasons, testing embedding modules with the HuggingFace
model makes sense.
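
As a small illustration of why a deterministic, local provider enables a
diff test (the default model noted in the comment is an assumption about
the LangChain version in use):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Runs locally; defaults to a sentence-transformers model (assumed to be
# sentence-transformers/all-mpnet-base-v2). Deterministic output means the
# same text always yields the same vector, so expected fixtures can be diffed.
embedder = HuggingFaceEmbeddings()
first = embedder.embed_query("Unstructured turns documents into structured data.")
second = embedder.embed_query("Unstructured turns documents into structured data.")
assert first == second
```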

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Klaijan
c471ea3cc7
chore: remove copy line from non-matrix connectors (#1976) 2023-11-04 10:58:56 -07:00
Roman Isecke
ba4477ac20
feat: support table conversion for tabular destination connectors (#1917)
### Description
* A full schema was introduced to map the type of all output content
from the json partition output and mapped to a flattened table structure
to leverage table-based destination connectors. The delta table
destination connector was updated at the moment to take advantage of
this.
* The existing method to convert to a dataframe was updated because it
had a bug: object content in the metadata would have its key name changed
when flattened, but would then be omitted since the new key didn't exist
in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)
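
Roughly, the flattening maps nested element JSON onto one flat row per
element; a simplified sketch of the idea (field names are illustrative,
not the actual schema):

```python
def flatten_element(element: dict) -> dict:
    # Keep top-level fields and prefix metadata keys so nested content becomes
    # flat columns a tabular destination (e.g. a delta table) can store.
    row = {key: value for key, value in element.items() if key != "metadata"}
    for meta_key, meta_value in element.get("metadata", {}).items():
        row[f"metadata_{meta_key}"] = meta_value
    return row

row = flatten_element(
    {"type": "Title", "text": "Lorem Ipsum", "metadata": {"filename": "a.pdf", "page_number": 1}}
)
# {'type': 'Title', 'text': 'Lorem Ipsum', 'metadata_filename': 'a.pdf', 'metadata_page_number': 1}
```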

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-03 16:47:21 +00:00
Roman Isecke
d09c8c0cab
test: update ingest dest tests to follow set pattern (#1991)
### Description
Update all destination tests to match pattern:
* Don't omit any metadata to check full schema
* Move azure cognitive dest test from src to dest
* Split delta table test into separate src and dest tests
* Fix azure cognitive search and add it to the dest tests being run (it
wasn't being run originally)
2023-11-03 12:46:56 +00:00
Yao You
db766402a4
test: parametrize ingest test scripts (#1979)
This PR resolves
[CORE-2453](https://unstructured-ai.atlassian.net/browse/CORE-2453):

- parametrizes the output folder so that ingest output files can be
saved somewhere other than where the scripts are; this is set by the env
`OUTPUT_ROOT`
- parametrizes the python path `PYTHONPATH` to first check for an
existing definition before defaulting to `.`, the current folder
- parametrizes the run script that carries out ingest using `RUN_SCRIPT`;
the default is still `./unstructured/ingest/main.py`

These changes allow us to run ingest tests with more control. To test:
- run `OUTPUT_ROOT=/tmp
./test_unstructured_ingest/src/local-single-file.sh`: the output now
should be in `/tmp` instead of in the ingest test folder
- run `RUN_SCRIPT=/hope/you/do/not/have/this/folder
./test_unstructured_ingest/src/local-single-file.sh`, which should raise
an error because the system can't find `/hope/you/do/not/have/this/folder`
- run `RUN_SCRIPT=./unstructured/ingest/main.py
./test_unstructured_ingest/src/local-single-file.sh` should run as
normal
- do the following

```bash
cp ./unstructured/ingest/main.py /tmp/main.py
OUTPUT_ROOT=/tmp PYTHONPATH=$(pwd) RUN_SCRIPT=./unstructured/ingest/main.py ./test_unstructured_ingest/src/local-single-file.sh
```
This will run and generate output at `/tmp`

[CORE-2453]:
https://unstructured-ai.atlassian.net/browse/CORE-2453?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-11-02 21:41:56 +00:00
Roman Isecke
6700a7d8c4
feat: support generic inputs for partition kwargs from ingest CLI (#1923)
### Description
To always support the latest changes to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid json string representing those values, giving the user more
flexibility in controlling the partition method.
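
A sketch of the idea (the CLI flag name is omitted here; the point is
that a JSON string is parsed and splatted into `partition`):

```python
import json

from unstructured.partition.auto import partition

# Illustrative: a JSON string supplied on the command line...
raw_kwargs = '{"strategy": "hi_res", "skip_infer_table_types": ["pdf"]}'
partition_kwargs = json.loads(raw_kwargs)

# ...is forwarded as-is to partition(), so newly added partition kwargs work
# without any further CLI changes.
elements = partition(
    filename="example-docs/layout-parser-paper-fast.pdf", **partition_kwargs
)
```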
2023-11-02 21:19:29 +00:00
shreyanid
c24e6e056c
chore: add doctype to ingest evaluation functions (#1977)
### Summary
To combine ingest and holistic metrics efforts, add the `doctype` field
to the results from the functions in evaluate.py for use in subsequent
aggregation functions.

### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction` and there will be a new doctype column with the file's
doctype extension.
<img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf">

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
2023-11-02 19:15:53 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they share a different pattern:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destination tests leverage the local connector to produce results to
push to a destination, and carry the overhead of creating temporary
locations at those destinations to write to and deleting them when done.

Only the src tests create partitioned content that needs to be checked,
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00
Klaijan
a06b151897
refactor: ci workflow refactor (#1907)
Refactor the evaluation scripts including
`unstructured/ingest/evaluation.py`
`test_unstructured_ingest/evaluation-metrics.sh` for more structured
code and usage.
- The script now makes only one python script call with params
- Adds a function to build the string for output_args (`--output_dir
--output_list`) and source_args (`--source_dir --source_args`)
- Now accepts the evaluation to call as a param; currently only
`text-extraction` and `element-type` are accepted

Example to call the function:
```bash
sh evaluation-metrics.sh text-extraction
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-01 15:58:23 +00:00
Roman Isecke
123ad20f4c
support passing credentials from memory for google connectors (#1888)
### Description

### Google Drive
The existing service account parameter was expanded to support either a
file path or a json value to generate the credentials when instantiating
the google drive client.

### GCS
Google Cloud Storage already supports the value being passed in, from
their docstring:
> you may supply a token generated by the
> [gcloud](https://cloud.google.com/sdk/docs/) utility; this is either a
> python dictionary, the name of a file containing the JSON returned by
> logging in with the gcloud CLI tool, or a Credentials object.


I tested this locally:

```python
from gcsfs import GCSFileSystem
import json

with open("/Users/romanisecke/.ssh/google-cloud-unstructured-ingest-test-d4fc30286d9d.json") as json_file:
    json_data = json.load(json_file)
    print(json_data)

    fs = GCSFileSystem(token=json_data)
    print(fs.ls(path="gs://utic-test-ingest-fixtures/"))
```
`['utic-test-ingest-fixtures/ideas-page.html',
'utic-test-ingest-fixtures/nested-1',
'utic-test-ingest-fixtures/nested-2']`
2023-10-31 17:12:04 +00:00
Ahmet Melek
a9a3efd85c
bugfix: SharePoint permissions fetching should be opt-in (#1894)
Closes: #1891 (check the issue for more info)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-31 15:55:07 +00:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00
Roman Isecke
680cfbabd4
expand fsspec downstream connectors (#1777)
### Description
Replacing PR
[1383](https://github.com/Unstructured-IO/unstructured/pull/1383)

---------

Co-authored-by: Trevor Bossert <alanboss@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-30 20:09:49 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements nested inside elements from other detection origins
such as YoloX or detectron.
This function returns the cleaned document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is to check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
Ahmet Melek
c249d02fa8
bugfix: ingest pipeline with chunking and embedding does not persist data to the embedding step (#1893)
Closes: #1892 (check the issue for more info)
2023-10-27 13:07:00 +00:00
Klaijan
466255eec3
build: element type frequency evaluation metrics workflow in ci (#1862)
**Executive Summary**
Measures element type frequency accuracy of the current version of the
code against the expected output. The performance is reported as a tsv
file under `metrics`.

**Technical Details**
- The evaluation measures element type frequencies from
`structured-output-eval` against `expected-structured-output`
- `evaluation.py` has been edited to support function calling using
`click.group()` and `command()`
- `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh`
scripts

**Outputs**
Two tsv files are saved

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a)

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4)
and the aggregated score is displayed.

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 04:36:36 +00:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node; this was fixed.
* Added fix: the chunking and embedding order was reversed so that
chunking runs before embedding
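
A toy sketch of the multiprocessing-safe dict pattern being relied on
here (class and field names are illustrative, not the actual pipeline
objects):

```python
import json
from multiprocessing import Manager, Pool

def get_file(args):
    doc_hash, docs_map = args
    doc = json.loads(docs_map[doc_hash])
    doc["local_path"] = f"/tmp/{doc_hash}.pdf"  # update made while downloading
    docs_map[doc_hash] = json.dumps(doc)        # write back so the update persists
    return doc_hash

if __name__ == "__main__":
    with Manager() as manager:
        docs_map = manager.dict({"abc123": json.dumps({"url": "https://example.com/a.pdf"})})
        with Pool(processes=2) as pool:
            pool.map(get_file, [(doc_hash, docs_map) for doc_hash in docs_map.keys()])
        print(dict(docs_map))  # reflects the update made inside the worker
```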

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
Yuming Long
01a0e003d9
Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850)
### Summary

A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass extract_tables to inference, and this also
notes the table regression if we only do one OCR for the entire doc.

**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV when comparing bounding boxes of
`ocr_region` with `table_element`
* added more tests for tables and noted the table regression in test for
partition pdf

### Test
* to verify we stop passing the `extract_tables` parameter to inference,
run the test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction`
before this branch and you will see a warning like `Table OCR from
get_tokens method will be deprecated....`, which means it called the
table OCR in the inference repo. This branch removes the warning.
2023-10-24 17:13:28 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carrying `skip_infer_table_types` to `infer_table_structure` in the
partition flow. Now PPT/X, DOC/X, etc. table elements should not have a
`text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through the html flow. I think if we've already got the html it doesn't
make sense to carry the infer variable along, since we're not 'infer-ing'
the html table in these cases.


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
Klaijan
6707cab250
build: text extraction evaluation metrics workflow added (#1757)
**Executive Summary**
This PR adds the evaluation metrics to our current workflow. It verifies
the flow: when code is pushed, it gets evaluated against our gold
standard and the results are output to a `.tsv` file.

**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Makes use of the `structured-output` from `test-ingest` and compares it
to the gold standard uploaded to s3, which is downloaded locally when the
comparison is made. The current folder in use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is used for comparison.

**Misc**
- There is not much overlap between the test-ingest files and the gold
standard. More files will be added.

**Outputs**
2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`.


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae)


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-10-23 21:39:22 +00:00
Roman Isecke
a2af72bb79
local connector metadata and deserialization fix (#1800)
### Description
* Priority of this was to fix deserialization of ingest docs. Previously
the source metadata wasn't being persisted
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* As part of serialization, docs were forced to fetch source metadata,
if they hadn't already, to add it to the generated dict/json. This
shouldn't happen when the underlying variable `_source_metadata` is
`None`; this way the doc can be serialized without any calls being made.
* Serialization was moved to the `to_dict` method to make it more
universal.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-23 15:51:52 +00:00
Yuming Long
ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: move the OCR part for the table model from the
inference repo to the unst repo.
* Before this PR, the table model extracts OCR tokens with texts and
bounding boxes and fills the tokens into the table structure in the
inference repo. This means we need to do an additional OCR for tables.
* After this PR, we use the OCR data from the entire page OCR and pass
the OCR tokens to the inference repo, which means we only do one OCR for
the entire document.

**Tech details:**
* Combined the envs `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`;
this means we use the same OCR agent for the entire page and tables,
since we only do one OCR.
* Bump the inference repo to `0.7.9`, which allows the table model in
inference to use pre-computed OCR data from the unst repo. Please check
the
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint fixes were made by `make tidy`
* This PR also fixes
[issue #1564](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for the issue in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Add the same scaling logic to the image [similar to the previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to the entire image

### Test
* Not much to manually test except that table extraction still works
* But due to the change in scaling and the use of pre-computed OCR data
from the entire page, there are some slight (better) changes in table
output; here is a comparison of test outputs I found from the same test
`test_partition_image_with_table_extraction`:

Screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) There is still some cleanup to do in the inference
repo since the unst repo now has duplicate logic, but we can keep it as a
fallback plan. If we want to remove anything OCR-related in inference,
here are items that are deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR` 

### Note
if we want to fall back to an additional table OCR (we may need this
when using paddle for tables), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00
Klaijan
4a9d48859c
chore: add test_unstructured_ingest/metrics dummy dir (#1827)
Update path for metrics output to `test_unstructured_ingest/metrics` and
create a dummy folder `test_unstructured_ingest/metrics` to prevent ci
error.
2023-10-20 14:54:30 -07:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.
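
Roughly, the check amounts to something like this sketch (the helper name
is made up for illustration):

```python
import click

def add_options_checked(cmd: click.Command, options: list) -> None:
    # Options are generated dynamically from several configs, so guard against
    # one config silently overwriting another's option of the same name.
    existing = {param.name for param in cmd.params}
    for option in options:
        if option.name in existing:
            raise ValueError(f"duplicate click option: --{option.name}")
        cmd.params.append(option)
        existing.add(option.name)
```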

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
2023-10-20 14:00:19 +00:00
Steve Canny
d9c2516364
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.

1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.

2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.


**Technical Summary**

The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.

Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.

* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.


**Over-chunking Behavior**

The over-chunking looked like this:

Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

```

The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.


**Dropping regex-metadata Behavior**

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

..should produce this regex_metadata on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 22:16:02 -05:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFaceEmbeddings.
2023-10-19 11:51:36 -05:00
ryannikolaidis
80c3c24ca5
ingest retry strategy refactor <- Ingest test fixtures update (#1780)
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: benjats07 <benjats07@users.noreply.github.com>
2023-10-18 04:33:57 +00:00
Roman Isecke
775bfb7588
ingest retry strategy refactor (#1708)
### Description
Pivot from using the retry logic as a decorator, as this posed too many
limitations on what can be passed in as a parameter at runtime. Moved
this to a class approach that can now be instantiated with appropriate
loggers, leveraging the `--verbose` flag to set the log level. This also
mitigates how much new code is being forked from the backoff library. The
existing notion client that was using the previous decorator has been
refactored to use the new class approach, and the airtable connector was
updated to support retry logic as well. Default log handlers were
introduced which apply to all instances of the retry handler when it
starts, backs off, and gives up.

A generic approach was added to configuring the retry parameters in the
CLI and was added to the running number of common configs across all CLI
commands.

Omitted CHANGELOG entry as this is mostly just a refactor of the retry
code. All other connectors will be updated to support retry in another
PR but this helps limit the number of changes to review in this one.
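
A rough sketch of what the class-based approach could look like, leaning
on `backoff` rather than forking it (class and parameter names are
illustrative, not the actual connector API):

```python
import logging

import backoff

class RetryHandler:
    """Wraps a callable with exponential-backoff retries configured at runtime."""

    def __init__(self, exceptions, max_tries=3, max_time=60, logger=None):
        self.exceptions = exceptions
        self.max_tries = max_tries
        self.max_time = max_time
        self.logger = logger or logging.getLogger(__name__)

    def __call__(self, fn, *args, **kwargs):
        wrapped = backoff.on_exception(
            backoff.expo,
            self.exceptions,
            max_tries=self.max_tries,
            max_time=self.max_time,
            logger=self.logger,  # logs backoff and give-up events
        )(fn)
        return wrapped(*args, **kwargs)

# e.g. RetryHandler(ConnectionError, logger=my_logger)(client.fetch_page, page_id)
```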

### Extra fixes
* Updated local and salesforce source connector to set `ingest_doc_cls`
in a `__post_init__` method since this variable can't be serialized.

### Testing
Both the airtable and notion ingest tests can be run locally. While they
might not pass due to text changes (to be expected when running
locally), the process can be viewed in the logs to validate.

Associated issue: #1488
2023-10-17 16:15:08 +00:00
Christine Straub
237d04c896
feat: improve natural reading order by filtering OCR results (#1768)
### Summary
Some `OCR` elements whose text is only spaces have a full-page-width
bounding box, which causes the `xycut` sorting to not work as expected.
Now the logic that parses OCR results removes any elements containing
only spaces (more than one space).
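
One way to express that filter as a rough sketch (the region objects here
are hypothetical stand-ins for the parsed OCR output):

```python
def remove_whitespace_only_regions(ocr_regions):
    # Drop OCR regions whose text is nothing but spaces; their full-page-width
    # bounding boxes otherwise confuse the xy-cut sort.
    return [region for region in ocr_regions if region.text.strip()]
```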

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-16 23:05:55 +00:00
Léa
89fa88f076
fix: stop csv and tsv dropping the first line of the file (#1530)
The current code assumes the first line of csv and tsv files are a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.

Here is a snippet of code that demonstrates the current behavior and the
proposed fix

```python
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
    Stanley Cups,,
    Team,Location,Stanley Cups
    Blues,STL,1
    Flyers,PHI,2
    Maple Leafs,TOR,13
    """

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
  
print("Suggested Improvement Keep First Line") 
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Looses First Line") 
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-16 17:59:35 -05:00
Roman Isecke
9c7ee8921a
roman/fsspec compression support (#1730)
### Description 
Opened to replace original PR:
[1443](https://github.com/Unstructured-IO/unstructured/pull/1443)
2023-10-16 14:26:30 +00:00
John
6d7fe3ab02
fix: default to None for the languages metadata field (#1743)
### Summary
Closes #1714
Changes the default value for `languages` to `None` for elements that
don't have text or the language can't be detected.

### Testing
```
from unstructured.partition.auto import partition
filename = "example-docs/handbook-1p.docx"
elements = partition(filename=filename, detect_language_per_element=True)

# PageBreak elements don't have text and will be collected here
none_langs = [element for element in elements if element.metadata.languages is None]
none_langs[0].text
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-14 22:46:24 +00:00
qued
8100f1e7e2
chore: process chipper hierarchy (#1634)
PR to support schema changes introduced from [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-10-13 01:28:46 +00:00
Ahmet Melek
94836cfad4
feat: add file-based access permissions for SharePoint ingest (#1628)
This PR:

- defines rbac_data as a SourceMetadata field,
- manages connections to an external api for obtaining rbac data with
ConnectorRBAC class,
- serializes rbac data and saves it to the disk,
- matches the rbac_data in the disk to each IngestDoc, using a common
field,
- forwards rbac data to Elements, via the partition() function

To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the
relevant rbac & connector credentials

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-10-13 00:38:08 +00:00
Roman Isecke
ebf0722dcc
roman/ingest continue on error (#1736)
### Description
Add flag to raise an error on failure but default to only log it and
continue with other docs
2023-10-12 21:33:10 +00:00
ryannikolaidis
d22044a44c
fix: unstructured-ingest embedding KeyError (#1727)
Currently adding the embedding flag to any unstructured-ingest call
results in this failure:

```
2023-10-11 22:42:14,177 MainProcess ERROR    'b8a98c5d963a9dd75847a8f110cbf7c9'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run
    ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash]
  File "<string>", line 2, in __getitem__
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9'
"""
```

This is because the run method for the embedding node is not adding the
IngestDoc to the context map. This PR adds that logic and adds a test to
validate that the embeddings option works as expected.

NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719
goes in, the expected results include the duplicate element bug; however,
this does at least prove that embeddings are generated and the function
doesn't error.
2023-10-12 20:27:30 +00:00
Yuming Long
cb247d8cc4
doc: update comment for ingest test pdf-fast-reprocess (#1733) 2023-10-12 17:33:25 +00:00
Roman Isecke
22e568cf64
roman/bugfix fix default language ingest option (#1729)
### Description
Set language to None by default. Update ingest test to use local file
used in language unit tests to validate.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-12 17:31:23 +00:00
Roman Isecke
9b5d5e0f9e
roman/cli infer table arg (#1685)
### Description
Add new parameter to map to `skip_infer_table_types` partition arg.
Applies to partition config which is set on all connectors.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-12 16:14:53 +00:00
Inscore
8ab40c20c1
fix: correct PDF list item parsing (#1693)
The current implementation removes elements from the beginning of the
element list and duplicates the list items

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-10-11 20:38:36 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using the `langdetect` package.
Creates new kwargs for the user to set the document language
(`languages`) or to detect the language at the element level instead of
the default document level (`detect_language_per_element`).
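
A minimal usage sketch of the new kwargs (the file path is borrowed from
another example in this log):

```python
from unstructured.partition.auto import partition

# Either set the document languages explicitly...
elements = partition(filename="example-docs/handbook-1p.docx", languages=["eng"])

# ...or detect the language per element instead of once per document.
elements = partition(
    filename="example-docs/handbook-1p.docx",
    detect_language_per_element=True,
)
print(elements[0].metadata.languages)
```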

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00