1393 Commits

Author SHA1 Message Date
Klaijan
a06b151897
refactor: ci workflow refactor (#1907)
Refactor the evaluation scripts including
`unstructured/ingest/evaluation.py`
`test_unstructured_ingest/evaluation-metrics.sh` for more structured
code and usage.
- The script now makes only one python script call with params
- Adds functions to build strings for output_args (`--output_dir
--output_list`) and source_args (`--source_dir --source_args`)
- Now accepts the evaluation to call as a param; currently only accepts
`text-extraction` and `element-type`

Example to call the function:
```sh
evaluation-metrics.sh text-extraction
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-01 15:58:23 +00:00
qued
b08562ba1a
tests: separate chipper tests (#1939)
Separates chipper tests to speed up testing and CI.
2023-10-31 21:02:00 +00:00
Roman Isecke
123ad20f4c
support passing credentials from memory for google connectors (#1888)
### Description

### Google Drive
The existing service account parameter was expanded to support either a
file path or a json value to generate the credentials when instantiating
the google drive client.
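A minimal sketch of the path-or-JSON handling described above (the function and parameter names here are illustrative, not the connector's actual API):

```python
import json
import os


def resolve_service_account_info(service_account_key: str) -> dict:
    """Accept either a file path to a service-account JSON file or a raw
    JSON string, and return the parsed credentials dict (illustrative)."""
    if os.path.isfile(service_account_key):
        with open(service_account_key) as f:
            return json.load(f)
    return json.loads(service_account_key)
```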

### GCS
Google Cloud Storage already supports the value being passed in, from
their docstring:
> - you may supply a token generated by the
>   [gcloud](https://cloud.google.com/sdk/docs/) utility; this is either
>   a python dictionary, the name of a file containing the JSON returned
>   by logging in with the gcloud CLI tool, or a Credentials object.


I tested this locally:

```python
from gcsfs import GCSFileSystem
import json

with open("/Users/romanisecke/.ssh/google-cloud-unstructured-ingest-test-d4fc30286d9d.json") as json_file:
    json_data = json.load(json_file)
    print(json_data)

    fs = GCSFileSystem(token=json_data)
    print(fs.ls(path="gs://utic-test-ingest-fixtures/"))
```
`['utic-test-ingest-fixtures/ideas-page.html',
'utic-test-ingest-fixtures/nested-1',
'utic-test-ingest-fixtures/nested-2']`
2023-10-31 17:12:04 +00:00
Roman Isecke
922bc84cee
Update fsspec-specific source connector docs (#1898)
### Description
Add in the fsspec configs needed for the fsspec-based connectors

To match the behavior of the original CLI, the default used by the click
option was mirrored in the base config for the api endpoint.
2023-10-31 16:09:46 +00:00
Ahmet Melek
a9a3efd85c
bugfix: SharePoint permissions fetching should be opt-in (#1894)
Closes: #1891 (check the issue for more info)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-31 15:55:07 +00:00
qued
b83057ac66
build: don't save cache for cache existence check (#1953)
Using `actions/cache@v3` instead of `actions/cache/restore@v3` for the
cache lookup in `setup_ingest` is causing CI to save the cache twice
when there's a cache miss. This is unnecessary, but I'm also a little
concerned it's causing some sort of race condition since I've seen
instances of CI failing to save due to the cache already existing (which
shouldn't be the case on a cache miss).

This PR switches the lookup to a `restore` action to avoid duplicate
ingest cache saving.

#### Testing:

There should only be one "Post Run actions/cache@v3" step in each
`setup_ingest` job when there's a cache miss.


[Here](https://github.com/Unstructured-IO/unstructured/actions/runs/6707740917/job/18227300507)
is an example of a cache miss running with this PR.
2023-10-31 15:49:12 +00:00
Matt Robinson
21b45ae8b0
docs: update to new logo (#1937)
### Summary

Updates the docs and README to use the new Unstructured logo. The README
links to the raw GitHub user content, so the change isn't reflected in
the README on the branch, but it will update after the image is merged to
main.

### Testing

Here's what the updated docs look like locally:

<img width="237" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/f13d8b4b-3098-4823-bd16-a6c8dfcffe67">

<img width="1509" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3b8aae5e-34aa-48c0-90f9-f5f3f0f1e26d">

<img width="1490" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/e82a876f-b19a-4573-b6bb-1c0215d2d7a9">
2023-10-31 15:39:19 +00:00
Roman Isecke
4f8cb04663
ingest download-only fix (#1943)
### Description
move check for download only after source node run
2023-10-31 14:05:37 +00:00
Roman Isecke
857195b6e6
expand retry logic in source connectors (#1889)
### Description
All http calls being made by the ingest source connectors have been
isolated and wrapped by the `SourceConnectionNetworkError` custom error,
which triggers the retry logic, if enabled, in the ingest pipeline.
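The wrapping pattern described above can be sketched as follows (the decorator and `fetch_document` below are simplified stand-ins, not the ingest codebase's actual implementation):

```python
import functools


class SourceConnectionNetworkError(Exception):
    """Raised when an HTTP call made by a source connector fails (sketch)."""


def wrap_error(fn):
    """Re-raise any exception from an HTTP call as the custom error so the
    ingest pipeline's retry logic, if enabled, can catch and retry it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            raise SourceConnectionNetworkError(str(e)) from e
    return wrapper


@wrap_error
def fetch_document(url: str) -> bytes:
    # Simulate a failed HTTP call for illustration.
    raise ConnectionError("network down")
```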
2023-10-31 14:02:28 +00:00
Roman Isecke
963ac35b9c
bugfix/correctly share session handler across ingest docs (#1806)
### Description
Fix session handler
2023-10-31 12:21:23 +00:00
Denis Lusson
f585d489c1
feat: Add include_header argument for partition_csv and partition_tsv (#1764)
This PR adds an `include_header` argument for `partition_csv` and
`partition_tsv`. This is related to feature request
https://github.com/Unstructured-IO/unstructured/issues/1751.

`include_header` is already part of partition_xlsx. The work here is in
line with the current usage and testing of the `include_header` argument
in partition_xlsx.
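What the `include_header` toggle controls can be sketched with only the stdlib (the real partitioners build richer table representations; this hypothetical helper just shows the header-row switch):

```python
import csv
import io


def table_text(csv_content: str, include_header: bool = False) -> str:
    """Return a plain-text rendering of the table, optionally keeping the
    header row, mirroring the include_header toggle (sketch)."""
    rows = list(csv.reader(io.StringIO(csv_content)))
    if not include_header:
        rows = rows[1:]  # drop the header row
    return "\n".join(" ".join(cells) for cells in rows)
```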

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 08:16:36 +00:00
Ronny H
f78d4d505a
Updated "join Slack" link (#1948)
Updated the "join Slack" links on the README page.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 00:02:21 -07:00
cragwolfe
ecbc4546e3
build: release commit for unstructured==0.10.28 (#1949) 0.10.28 2023-10-30 23:01:09 -07:00
cragwolfe
841a521790
build(ci): use larger CI runners for setup (#1946)
Hopefully avoid incomplete cache issues. Though to be fair, there is no
solid evidence pointing to runner size as the source of the issue.
2023-10-31 02:09:59 +00:00
Klaijan
a11d4634f1
fix: type error string indices bug (#1940)
Fixes `TypeError: string indices must be integers`. The `annotation_dict`
variable is now set to `None` when the value is not a dict, and logic is
added to skip the attempt when the value is `None`.
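The guard described above amounts to something like this (the function and field names are illustrative, not the repo's actual code):

```python
def get_annotation_dict(annotation):
    """Return the annotation only when it is actually a dict; otherwise
    None, so later key lookups are skipped instead of raising TypeError
    when the value is a string (sketch)."""
    return annotation if isinstance(annotation, dict) else None


def extract_uris(annotations):
    """Collect 'uri' values, skipping any non-dict entries."""
    uris = []
    for annotation in annotations:
        annotation_dict = get_annotation_dict(annotation)
        if annotation_dict is None:
            continue  # skip instead of indexing into a string
        if "uri" in annotation_dict:
            uris.append(annotation_dict["uri"])
    return uris
```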
2023-10-30 17:38:57 -07:00
Trevor Bossert
c3e42e9ffc
remove test login (#1945)
This was only used for debugging on a branch, not needed here. It was
failing because it didn't have the environment var set to "ci".
2023-10-30 15:47:43 -07:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00
Roman Isecke
680cfbabd4
expand fsspec downstream connectors (#1777)
### Description
Replacing PR
[1383](https://github.com/Unstructured-IO/unstructured/pull/1383)

---------

Co-authored-by: Trevor Bossert <alanboss@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-30 20:09:49 +00:00
qued
645a0fb765
fix: md tables (#1924)
Courtesy @phowat, created a branch in the repo to make some changes and
merge quickly.

Closes #1486.

* **Fixes issue where tables from markdown documents were being treated
as text** Problem: Tables from markdown documents were being treated as
text, and not being extracted as tables. Solution: Enable the `tables`
extension when instantiating the `python-markdown` object. Importance:
This will allow users to extract structured data from tables in markdown
documents.
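The underlying fix is essentially passing `extensions=["tables"]` when constructing the `python-markdown` converter. As a stdlib-only illustration of why tables need dedicated handling, here is a hypothetical heuristic (not code from this repo) for recognizing a markdown pipe table:

```python
def looks_like_md_table(block: str) -> bool:
    """Heuristic: a markdown pipe table has a header row, a delimiter row
    made of dashes/colons, and at least one body row, all containing
    pipes (illustrative sketch only)."""
    lines = [ln.strip() for ln in block.strip().splitlines()]
    if len(lines) < 3 or not all("|" in ln for ln in lines):
        return False
    delimiter = lines[1].strip("|")
    # Every delimiter cell must be dashes with optional alignment colons.
    return all(set(cell.strip()) <= set("-: ") and "-" in cell
               for cell in delimiter.split("|"))
```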

#### Testing:

On `main` run the following (run `git checkout fix/md-tables --
example-docs/simple-table.md` first to grab the example table from this
branch)
```python
from unstructured.partition.md import partition_md
elements = partition_md("example-docs/simple-table.md")
print(elements[0].category)

```
Output should be `UncategorizedText`. Then run the same code on this
branch and observe the output is `Table`.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-30 14:09:46 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements nested inside elements from other detection origins
such as YoloX or detectron.
This function returns the cleaned document.
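A hedged sketch of the idea (the real implementation is `clean_pdfminer_inner_elements`; the containment check and region representation here are simplified stand-ins):

```python
def is_inside(inner, outer, tol: float = 0.0) -> bool:
    """True if bbox `inner` (x1, y1, x2, y2) is contained in `outer`."""
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol
            and inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)


def drop_pdfminer_inner(regions):
    """Drop pdfminer-sourced regions nested inside regions from another
    detection origin (e.g. yolox), keeping everything else (sketch).
    Each region is a (bbox, source) pair."""
    non_pdfminer = [r for r in regions if r[1] != "pdfminer"]
    return [
        r for r in regions
        if r[1] != "pdfminer"
        or not any(is_inside(r[0], o[0]) for o in non_pdfminer)
    ]
```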

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is to check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
cragwolfe
2e1419867f
build(ci): bring back medium runners where appropriate (#1936)
Now that "medium" runners are available in GitHub Actions again,
re-enable them.

Intentionally not reverting
76213ecba7
the change to the setup-ingest job since additional CPU shouldn't make a
difference (e.g. it ran in 5 minutes here:
76213ecba7
).
2023-10-30 05:07:17 +00:00
Steve Canny
7373391aa4
fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be, and by default is, configured to combine sequential small
chunks that will together fit within the full chunk window (default 500
chars).

Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.

Example:
```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
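The filter shape described above, `(Iterable[_Section]) -> Iterator[_Section]`, can be sketched with plain strings standing in for text-sections (a simplified sketch, not the PR's actual classes):

```python
from typing import Iterable, Iterator


def combine_sections(sections: Iterable[str], maxlen: int = 500,
                     separator: str = "\n\n") -> Iterator[str]:
    """Accept a stream of sections and emit the same type, combining
    successive sections while their joined text fits the window (sketch)."""
    accumulated = ""
    for section in sections:
        candidate = accumulated + separator + section if accumulated else section
        if len(candidate) <= maxlen:
            accumulated = candidate  # still fits: keep accumulating
        else:
            if accumulated:
                yield accumulated  # emit the combined section so far
            accumulated = section
    if accumulated:
        yield accumulated
```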
2023-10-30 04:20:27 +00:00
cragwolfe
76213ecba7
build(fixtures-update): all CI jobs on smaller worker (#1934) 2023-10-28 21:19:51 -07:00
cragwolfe
4e669d419f
build(fixtures-update): run on smaller worker (#1932)
Run fixtures-update workflow on smaller github runner until larger one
is available again.
2023-10-27 20:36:05 -07:00
cragwolfe
22b3edb226
build: re-enable ingest on normal CI workers (#1931)
temporarily, until large workers are working again.
2023-10-27 19:46:40 -07:00
cragwolfe
56b1c063a2
chore: moar changelog repair (#1930)
Per
a2af72bb79
, these changes were part of 0.10.26.
2023-10-27 18:05:56 -07:00
Benjamin Torres
25e7a68d4b
chore: changelog repair (#1929)
Removes duplicated entries in changelog
2023-10-27 17:46:50 -07:00
Yao You
f87731e085
feat: use yolox as default to table extraction for pdf/image (#1919)
- yolox has better recall than yolox_quantized, the current default
model, for table detection
- update logic so that when `infer_table_structure=True` the default
model is `yolox` instead of `yolox_quantized`
- user can still override the default by passing in a `model_name` or
set the env variable `UNSTRUCTURED_HI_RES_MODEL_NAME`
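The selection logic above can be sketched as follows (the function name and default-model constants are illustrative; the env variable is the one named in this PR):

```python
import os

DEFAULT_MODEL = "yolox_quantized"
DEFAULT_TABLE_MODEL = "yolox"


def pick_hi_res_model(model_name=None, infer_table_structure=False) -> str:
    """Resolve the hi_res model: an explicit argument wins, then the
    UNSTRUCTURED_HI_RES_MODEL_NAME env var, then a table-aware default."""
    if model_name:
        return model_name
    env_model = os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME")
    if env_model:
        return env_model
    return DEFAULT_TABLE_MODEL if infer_table_structure else DEFAULT_MODEL
```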

## Test:

Partition the attached file with 

```python
from unstructured.partition.pdf import partition_pdf

yolox_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True)
yolox_quantized_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True, model_name="yolox_quantized")
```

Compare the table elements between those two; the yolox (default)
elements should have more complete tables.


[AK_AK-PERS_CAFR_2008_3.pdf](https://github.com/Unstructured-IO/unstructured/files/13191198/AK_AK-PERS_CAFR_2008_3.pdf)
2023-10-27 15:37:45 -05:00
cragwolfe
ff752e88df
chore: exit evaluation script if nothing to do (#1910)
Relates to CI ingest-tests. The last step of test-ingest.sh is to
calculate evaluation metrics (comparing gold set standard outputs with
actual output files). If no output files were created, as *should* be
the case right now in CI for all python versions other than 3.10 (that
only test a limited number of
files/connectors),`unstructured/ingest/evaluate.py` would fail.
2023-10-27 13:29:05 -05:00
John
670687bb67
update .pre-commit-config to match linting used by CI (#1906)
Closes #1905 
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.

**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 13:24:55 -05:00
Yao You
42f8cf1997
chore: add metric helper for table structure eval (#1877)
- add helper to run inference over an image or pdf of a table and compare
it against a ground truth csv file
- this metric generates a similarity score between 1 and 0, where 1 is
perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area in an image/pdf file;
we are not evaluating table element detection in this metric
2023-10-27 13:23:44 -05:00
Yuming Long
b1534af55c
Fix: replace wrong logger for paddle info (#1916)
### Summary

The logger from `paddle_ocr.py` is wrong; it should be `from
unstructured.logger` since the module is in the unstructured repo

### Test
* install this branch of unst to an unst-api env with `pip install -e .`
* in unst-api repo, run `OCR_AGENT=paddle make run-web-app`
* curl with:
```
curl -X 'POST'   'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json'  \
-H 'Content-Type: multipart/form-data'  \
-F 'files=@sample-docs/layout-parser-paper.pdf'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
```
you should be able to see logs like
```
2023-10-27 10:31:48,691 unstructured INFO Processing OCR with paddle...
2023-10-27 10:31:48,969 unstructured INFO Loading paddle with CPU on language=en...
```
not 
```
2023-10-27 10:16:08,654 unstructured_inference INFO Loading paddle with CPU on language=en...
```
even though paddle is not installed
2023-10-27 16:06:30 +00:00
cragwolfe
8aceda97dd
test: print slowest unittests (#1911)
Show which tests are slowing things down when running `make test`:

E.g., from the CI run in this PR:

```
2023-10-27T05:51:05.6256039Z 105.12s setup    test_unstructured/partition/pdf_image/test_pdf.py::test_chipper_has_hierarchy
2023-10-27T05:51:05.6257784Z 93.47s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[entire_page]
2023-10-27T05:51:05.6259866Z 93.09s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6261818Z 31.70s call     test_unstructured/partition/epub/test_epub.py::test_add_chunking_strategy_on_partition_epub_non_default
2023-10-27T05:51:05.6263774Z 17.22s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-filename]
2023-10-27T05:51:05.6265658Z 17.13s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-spool]
2023-10-27T05:51:05.6273195Z 16.95s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image_hi_res
2023-10-27T05:51:05.6275118Z 16.77s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-rb]
2023-10-27T05:51:05.6276759Z 14.64s call     test_unstructured/partition/test_text.py::test_partition_text_detects_more_than_3_languages
2023-10-27T05:51:05.6278381Z 13.86s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_with_multipage_tiff
2023-10-27T05:51:05.6280137Z 13.51s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_filename[False-None]
2023-10-27T05:51:05.6281995Z 13.41s call     test_unstructured/partition/test_html_partition.py::test_add_chunking_strategy_on_partition_html
2023-10-27T05:51:05.6283640Z 12.80s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_copy_protection
2023-10-27T05:51:05.6285305Z 12.46s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image
2023-10-27T05:51:05.6287250Z 12.39s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6289347Z 12.14s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6291329Z 12.12s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6293388Z 12.12s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_file[True-application/pdf]
2023-10-27T05:51:05.6294869Z 12.08s call     test_unstructured/partition/test_auto.py::test_auto_with_page_breaks
2023-10-27T05:51:05.6296396Z 12.02s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_metadata_date
2023-10-27T05:51:05.6298278Z 11.99s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_metadata_date
```
2023-10-27 11:40:55 -05:00
Ahmet Melek
c249d02fa8
bugfix: ingest pipeline with chunking and embedding does not persist data to the embedding step (#1893)
Closes: #1892 (check the issue for more info)
2023-10-27 13:07:00 +00:00
qued
450e7f0614
build: streamline ci (#1909)
Updated CI to shave time off in some conditions with no real downside.

- When the base cache already exists, we don't download it during setup,
and we skip all other steps as well.
- During ingest setup, we check if the ingest cache exists before
downloading the base cache, and if the ingest cache already exists, we
skip everything else.
- `check-deps` doesn't have to wait on `setup` or download a cache, as
the dependencies aren't needed, only `pip`.
2023-10-27 02:15:22 -05:00
Klaijan
466255eec3
build: element type frequency evaluation metrics workflow in ci (#1862)
**Executive Summary**
Measures element type frequency accuracy of the current version of the
code against the expected output. The performance is reported as a tsv file
under `metrics`.

**Technical Details**
- The evaluation measures element type frequencies from
`structured-output-eval` against `expected-structured-output`
- `evaluation.py` has been edited to support function calling using
`click.group()` and `command()`
- `evaluation-ingest-cp.sh` is now added to all the `test-ingest-xx.sh`
scripts

**Outputs**
Two tsv files are saved

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/b4458094-a9fc-48f9-a0bd-2ccd6985440a)

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/6d785736-bcaf-4275-bf2d-ab511cdfb3f4)
and the aggregated score is displayed.

![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/9d42bd0c-a0dd-41c2-a2e5-b675a40f35cc)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 04:36:36 +00:00
Steve Canny
f273a7cb83
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length

**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.

When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.

Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.

**Example**

  ```python
    elements: List[Element] = [
        Title("Chunking Priorities"),  # 19 chars
        ListItem("Divide text into manageable chunks"),  # 34 chars
        ListItem("Preserve semantic boundaries"),  # 28 chars
        ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
    ]  # 114 chars total but 120 chars with separators

    chunks = chunk_by_title(elements, max_characters=115)
  ```

  Want:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
  ```

  Got:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"
    ),
    CompositeElement("tting"),
]
  ```

### Technical Summary

Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section and when the chunker concatenates the element text (each
separated by the separator) the text exceeds the window length and the
chunk must be split mid-text, even though there was an even element
boundary it could have been split on.

### Fix

Consider separator length in the space-remaining computation.
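In code, the corrected computation is roughly the following (a minimal sketch, not the PR's `_TextSectionBuilder`; it counts one separator per already-accumulated element, including the one that will precede the next element's text):

```python
def space_remaining(section_texts, maxlen=500, separator="\n\n"):
    """Chars left in the chunk window for the next element's text,
    accounting for the separators the chunker will insert (sketch)."""
    text_length = sum(len(t) for t in section_texts)
    separators_length = len(separator) * len(section_texts)
    return maxlen - text_length - separators_length
```

With the example above, after the title and first two list items (81 chars plus three separators), only 28 chars remain of the 115-char window, so the 33-char third list item correctly starts a new section.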

The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(List[Element], running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.

This solution may seem like overkill at the moment and indeed it would
be except it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
2023-10-26 21:34:15 +00:00
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
0.10.27
2023-10-26 12:22:40 -05:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings
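The multiprocessing-safe dict swap described in the first bullet looks roughly like this (function names and doc fields are illustrative; on manager dicts, nested values must be reassigned for the proxy to propagate the change):

```python
import multiprocessing


def mark_downloaded(shared_docs, key, filename):
    """Worker-side update: with a manager dict this mutation is visible to
    the parent process; a plain dict copied into the child would not be."""
    doc = dict(shared_docs[key])      # copy the nested plain dict
    doc["filename"] = filename
    doc["downloaded"] = True
    shared_docs[key] = doc            # reassign so the proxy propagates it


def build_shared_docs(keys):
    """Create the multiprocessing-safe dict that replaces the json string."""
    manager = multiprocessing.Manager()
    shared_docs = manager.dict({k: {"downloaded": False} for k in keys})
    return manager, shared_docs
```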

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
qued
2e9f4c4849
chore: bump version for release (#1867) 0.10.26 2023-10-25 10:11:09 -05:00
Sebastian Laverde Alfonso
c11a2ff478
feat: method to catch and classify overlapping bounding boxes (#1803)
We have established that overlapping bounding boxes do not have a
one-size-fits-all solution, so different cases need to be handled
differently to avoid information loss. We have manually identified the
cases/categories of overlapping. Now we need a method to
programmatically classify overlapping-bboxes cases within detected
elements in a document, and return a report about it (list of cases with
metadata). This fits two purposes:

- **Evaluation**: We can have a pipeline using the DVC data registry
that assess the performance of a detection model against a set of
documents (PDF/Images), by analysing the overlapping-bboxes cases it
has. The metadata in the output can be used for generating metrics for
this.
- **Scope overlapping cases**: Manual inspection gives us a clue about
currently present cases of overlapping bboxes. We need to propose
solutions to fix those in code. This method generates a report by
analysing several aspects of two overlapping regions. This data can be
used to profile and specify the necessary changes that will fix each
case.
- **Fix overlapping cases**: We could introduce this functionality in
the flow of a partition method (such as `partition_pdf`) to handle the
calls to post-processing methods that fix overlapping. Tested on ~331
documents, the worst time per page is around 5ms. For a document such as
`layout-parser-paper.pdf` it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which
contain bounding boxes) and identify pairs of bounding boxes which
overlap and which case is pertinent to the pairing. This PR includes the
following methods in `utils.py`:

- **`ngrams(s, n)`**: Generate n-grams from a string
- **`calculate_shared_ngram_percentage(string_A, string_B, n)`**:
Calculate the percentage of `common_ngrams` between `string_A` and
`string_B` with reference to the total number of ngrams in `string_A`.
- **`calculate_largest_ngram_percentage(string_A, string_B)`**:
Iteratively call `calculate_shared_ngram_percentage` starting from the
biggest ngram possible until the shared percentage is >0.0%
- **`is_parent_box(parent_target, child_target, add=0)`**: True if the
`child_target` bounding box is nested in the `parent_target` Box format:
[`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The
parameter 'add' is the pixel error tolerance for extra pixels outside
the parent region
- **`calculate_overlap_percentage(box1, box2,
intersection_ratio_method="total")`**: Box format: [`x_bottom_left`,
`y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the
percentage of overlapped region with reference to biggest element-region
(`intersection_ratio_method="parent"`), the smallest element-region
(`intersection_ratio_method="partial"`), or to the disjunctive union
region (`intersection_ratio_method="total"`).
- **`identify_overlapping_or_nesting_case`**: Identify whether there are
nested or overlapping elements. If overlapping is present,
it identifies the case by calling the method `identify_overlapping_case`.
- **`identify_overlapping_case`**: Classifies the overlapping case for
an element_pair input in one of 5 categories of overlapping.
- **`catch_overlapping_and_nested_bboxes`**: Catch overlapping and
nested bounding box cases across a list of elements. The params
`nested_error_tolerance_px` and `sm_overlap_threshold` help control
the separation of the cases.
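As a rough illustration, the geometric and n-gram helpers above could be sketched as below. This is a minimal sketch using the stated box format; the names mirror the PR description, but the bodies are assumptions, not the shipped code in `utils.py`:

```python
# Minimal sketches of the helpers described above, using the stated box format
# [x_bottom_left, y_bottom_left, x_top_right, y_top_right]. The names mirror
# the PR description, but these bodies are assumptions, not the shipped code.


def ngrams(s, n):
    """Generate word n-grams from a string."""
    tokens = s.split()
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def calculate_shared_ngram_percentage(string_a, string_b, n):
    """Percentage of string_a's n-grams that also appear in string_b."""
    ngrams_a = ngrams(string_a, n)
    if not ngrams_a:
        return 0.0
    shared = set(ngrams_a) & set(ngrams(string_b, n))
    return 100 * sum(1 for gram in ngrams_a if gram in shared) / len(ngrams_a)


def is_parent_box(parent_target, child_target, add=0):
    """True if child_target is nested inside parent_target, within `add` px."""
    px1, py1, px2, py2 = parent_target
    cx1, cy1, cx2, cy2 = child_target
    return px1 - add <= cx1 and py1 - add <= cy1 and cx2 <= px2 + add and cy2 <= py2 + add


def calculate_overlap_percentage(box1, box2, intersection_ratio_method="total"):
    """Overlap of box1 and box2 relative to the chosen reference region."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    if intersection_ratio_method == "parent":
        reference = max(area1, area2)  # biggest element-region
    elif intersection_ratio_method == "partial":
        reference = min(area1, area2)  # smallest element-region
    else:  # "total": disjunctive union, i.e. area covered by either box
        reference = area1 + area2 - intersection
    return 100 * intersection / reference if reference else 0.0
```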

The overlapping/nested elements cases that are being caught are:
1. **Nested elements**
2. **Small partial overlap**
3. **Partial overlap with empty content**
4. **Partial overlap with duplicate text (sharing 100% of the text)**
5. **Partial overlap without sharing text**
6. **Partial overlap sharing**
{`calculate_largest_ngram_percentage(...)`}% **of the text**

Here is a snippet to test it:
```python
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
elements = partition(filename=target, strategy="hi_res", model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```
Here is a screenshot of a json built with the output list
`overlapping_cases`:
<img width="377" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">
2023-10-25 12:17:34 +00:00
qued
d8241cbcfc
fix: filename missing from image metadata (#1863)
Closes
[#1859](https://github.com/Unstructured-IO/unstructured/issues/1859).

* **Fixes elements partitioned from an image file missing certain
metadata.** Metadata for image files, like file type, was being handled
differently from other file types. This caused a bug where other
metadata, like the file name, was missed. This change brings metadata
handling for image files in line with the handling for other file types
so that the file name and other metadata fields are captured.

Additionally:
* Added test to verify filename is being captured in metadata
* Cleaned up `CHANGELOG.md` formatting

#### Testing:
The following produces output `None` on `main`, but outputs the filename
`layout-parser-paper-fast.jpg` on this branch:
```python
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper-fast.jpg")
print(elements[0].metadata.filename)
```
2023-10-25 05:19:51 +00:00
Roman Isecke
2d5ffa4581
Fix ingest test CI job (#1864)
Fixed some syntax errors in the update-ingest CI job that caused it to
always fail.
2023-10-25 02:23:00 +00:00
Steve Canny
40a265d027
fix: chunk_by_title() interface is rude (#1844)
### `chunk_by_title()` interface is "rude"

**Executive Summary.** Perhaps the most commonly specified option for
`chunk_by_title()` is `max_characters` (default: 500), which specifies
the chunk window size.

When a user specifies this value, they get an error message:
```python
  >>> chunks = chunk_by_title(elements, max_characters=100)
  ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```
A few of the things that might reasonably pass through a user's mind at
such a moment are:
* "Is `110` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or
`new_after_n_chars`, in fact I don't know what they are because I
haven't studied the documentation and would prefer not to; I just want
smaller chunks! How could I supply an invalid value when I haven't
supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure
that out for myself? I'm sure the code knows which one is not valid, why
doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that
`combine_text_under_n_chars` (defaults to 500) is greater than
`max_characters`, which means it would never take effect (which is
actually not a problem in itself).

To fix this, once figuring out that was the problem, probably after
opening an issue and maybe reading the source code, the user would need
to specify:
  ```python
  >>> chunks = chunk_by_title(
  ...     elements, max_characters=100, combine_text_under_n_chars=100
  ... )
  ```

This and other stressful user scenarios can be remedied by:
* Using "active" defaults for the `combine_text_under_n_chars` and
`new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be
violated, such that direction to remedy the problem is immediately clear
to the user.

An *active default* is for example:
* Make the default for `combine_text_under_n_chars: int | None = None`
such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its
current (static) default.
This particular change would avoid the behavior in the motivating
example above.
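The active-default pattern could look something like the following sketch. The helper name `resolve_chunking_options` is hypothetical and the real signature may differ; it only illustrates the two remedies described (active defaults plus constraint-specific errors):

```python
# Hypothetical sketch of the "active default" pattern from this PR; the helper
# name is illustrative only, not the actual chunk_by_title() implementation.
from typing import Optional, Tuple


def resolve_chunking_options(
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> Tuple[int, int, int]:
    # Active defaults: unspecified options follow max_characters.
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    # Constraint-specific error messages, one per way a constraint can fail.
    if max_characters <= 0:
        raise ValueError(f"'max_characters' must be > 0, got {max_characters}")
    if combine_text_under_n_chars > max_characters:
        raise ValueError(
            "'combine_text_under_n_chars' must not exceed 'max_characters': "
            f"{combine_text_under_n_chars} > {max_characters}"
        )
    if new_after_n_chars > max_characters:
        raise ValueError(
            "'new_after_n_chars' must not exceed 'max_characters': "
            f"{new_after_n_chars} > {max_characters}"
        )
    return max_characters, combine_text_under_n_chars, new_after_n_chars
```

With this, specifying only `max_characters=100` resolves the other two options to 100 instead of raising, which addresses the motivating example.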

Another alternative for this argument is simply:
```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```

### Fix

1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_ chars` and
`new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other
argument behaviors, in particular identifying suppression options like
`combine_text_under_n_chars = 0` to disable chunk combining.
2023-10-24 23:22:38 +00:00
John
8080f9480d
fix strategy test for api and linting (#1840)
### Summary 
Closes unstructured-api issue
[188](https://github.com/Unstructured-IO/unstructured-api/issues/188)
The test and gist were using different versions of the same file
(jpg/pdf), creating what looked like a bug when there wasn't one. The
api is correctly using the `strategy` kwarg.

### Testing
#### Checkout to `main`
- Comment out the `@pytest.mark.skip` decorators for the
`test_partition_via_api_with_no_strategy` test
- Add an API key to your env:
- Add `from dotenv import load_dotenv; load_dotenv()` to the top of the
file and have `UNS_API_KEY` defined in `.env`

- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
^the test will fail

#### Checkout to this branch 
- (make the same changes as above)
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`

### Other
`make tidy` and `make check` made linting changes to additional files
2023-10-24 22:17:54 +00:00
qued
d79f633ada
build(deps): add typing extensions dep (#1835)
Closes #1330.

Added `typing-extensions` as an explicit dependency (it was previously
an implicit dependency via `dataclasses-json`).

This dependency should be explicit, since we import from it directly in
`unstructured.documents.elements`. This has the added benefit that
`TypedDict` will be available for Python 3.7 users.
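For context, the import pattern that benefits from this is roughly the following (the `Point` class is just an illustration, not a type from the library):

```python
try:
    # TypedDict was added to the stdlib typing module in Python 3.8.
    from typing import TypedDict
except ImportError:
    # On Python 3.7 it comes from the typing-extensions backport.
    from typing_extensions import TypedDict


class Point(TypedDict):  # hypothetical example type, for illustration only
    x: float
    y: float


point: Point = {"x": 1.0, "y": 2.0}
```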

Other changes:
* Ran `pip-compile`
* Fixed a bug in `version-sync.sh` that caused an error when syncing to
a dev version from a release version.

#### Testing:

To test the Python 3.7 functionality, in a Python 3.7 environment
install the base requirements and run
```python
from unstructured.documents.elements import Element
```
This also works on `main` as `typing_extensions` is a requirement.
However if you `pip uninstall typing-extensions`, and run the above
code, it should fail. So this update makes sure `typing-extensions`
doesn't get lost if the other dependencies move around.

To reproduce the `version-sync.sh` bug that was fixed, in `main`,
increment the most recent version in `CHANGELOG.md` while leaving the
version in `__version__.py`. Then add the following lines to
`version-sync.sh` to simulate a particular set of circumstances,
starting on line 114:

```sh
MAIN_IS_RELEASE=true
CURRENT_BRANCH="something-not-main"
```

Then run `make version-sync`.

The expected behavior is that the version in `__version__.py` is changed
to the new version to match `CHANGELOG.md`, but instead it exits with an
error.

The fix was to only do the version incrementation check when the script
is running in `-c` or "check" mode.
2023-10-24 19:19:09 +00:00
Yuming Long
01a0e003d9
Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850)
### Summary

A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and noted the
table regression if we only do one OCR pass for the entire doc.

**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable used when comparing bounding
boxes of `ocr_region` with `table_element`
* added more tests for tables and noted the table regression in test for
partition pdf
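A sketch of how such an env-driven margin might be applied when matching boxes. The variable name `OCR_BBOX_ERROR_MARGIN` and the helper are assumptions for illustration, not the actual names used in the PR:

```python
import os

# Tolerance in pixels, configurable via the environment (name is illustrative).
ERROR_MARGIN = float(os.environ.get("OCR_BBOX_ERROR_MARGIN", "0.0"))


def boxes_match(ocr_region, table_element, margin=ERROR_MARGIN):
    """True if each coordinate of the two boxes differs by at most `margin` px."""
    return all(abs(a - b) <= margin for a, b in zip(ocr_region, table_element))
```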

### Test
* To verify we stop passing the `extract_tables` parameter to inference, run
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this
branch and you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means it called the table OCR in the
inference repo. This branch removes the warning.
2023-10-24 17:13:28 +00:00
qued
44cef80c82
test: Add test to ensure languages trickle down to ocr (#1857)
Closes
[#93](https://github.com/Unstructured-IO/unstructured-inference/issues/93).

Adds a test to ensure language parameters are passed all the way from
`partition_pdf` down to the OCR calls.

#### Testing:

CI should pass.
2023-10-24 16:54:19 +00:00
Yao You
b530e0a2be
fix: partition docx from teams output (#1825)
This PR resolves #1816 
- current docx partition assumes all contents are in sections
- this is not true for an MS Teams chat transcript exported to docx
- now the code checks whether there are sections; if not, it iterates
through the paragraphs and partitions the contents in the paragraphs
2023-10-24 15:17:02 +00:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
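The `find`-based install described above can be sketched like this. The paths mirror the PR's layout; the demo runs against a throwaway directory, and in the real Dockerfile the `-exec` action would be `pip install -r '{}' ';'`:

```shell
set -e
# Simulate the requirements layout described in this PR.
tmp=$(mktemp -d)
mkdir -p "$tmp/requirements/ingest"
touch "$tmp/requirements/ingest/s3.txt" "$tmp/requirements/ingest/gcs.txt"
# Discover every ingest requirements file; the Dockerfile would hand each one
# to pip: find requirements/ingest -type f -name "*.txt" -exec pip install -r '{}' ';'
find "$tmp/requirements/ingest" -type f -name "*.txt" | sort
rm -rf "$tmp"
```

Because new `*.txt` files are discovered at build time, adding a connector's dependencies no longer requires touching the Dockerfile.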
2023-10-24 14:54:00 +00:00