566 Commits

Author SHA1 Message Date
Steve Canny
21bc67f52f
rfctr: improve element typing (#2247)
In preparation for work on generalized chunking including
`chunk_by_character()` and overlap, get `elements` module and tests
passing strict type-checking.
2023-12-12 23:12:23 +00:00
Christine Straub
da7ac625b1
Feat: save tables in PDF's as images (#2229)
closes #2222.

### Summary
The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This
filename is presented in the `image_path` metadata field for the Table
element. The default would be to not do this.

### Testing
PDF: [124_PDFsam_Basel III - Finalising post-crisis
reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf)

```
elements = partition_pdf(
    filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_element_types=['Table'],
)
```
2023-12-11 19:14:41 +00:00
Christine Straub
ed76b11b1a
Refactor: support image extraction (#2201)
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.

### Testing

`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`

```
elements = partition_pdf(
        filename="example-docs/embedded-images.pdf",
        strategy="hi_res",
        extract_images_in_pdf=True,
    )

print("\n\n".join([str(el) for el in elements]))
```
2023-12-05 18:22:29 +00:00
John
8fa5cbf036
build(ci): rm unneeded call to get_api_key in test (#2199)
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).

Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to only be used for tests decorated by
@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside
of CI") (which are skipped because those tests are partitioning pdf/jpg
files).
These tests are partitioning emails and rely on the MockResponse at the
top of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
2023-12-03 21:28:05 -08:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Christine Straub
69d0ee1aea
Refactor: support merging extracted layout with inferred layout (#2158)
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
  * if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
2023-12-01 20:56:31 +00:00
John
e5bdf7fb43
chore: unstructured python client (#2195)
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.

Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-01 18:49:59 +00:00
pravin-unstructured
341f0f428c
Add coco staging brick to unstructured base (#2180) 2023-11-29 20:55:23 +00:00
Yuming Long
92dae8cd1a
Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137)
### Summary

Add a procedure to repair PDF when the PDF structure is invalid for
`PDFminer` to process.

This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer open entire document and create pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter process a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)

**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDF.
* Add new denpendenct `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page from entire document by reading the
PDF file again (can't find a way to split pdf in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`, since the
root cause is that `get_uris` passed a None `annots` to
`get_uris_from_annots`, so the Null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`

### Test
Added three test files (both are larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also seems like splitting the document into smaller pages could fix
this problem, but not sure why. For example, I saw error from reprocess
in the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when i split the pdf by error page....
* tested if i can repair the entire doc again in this case, saw other
error which means repairing is not helping imo
* PDFminer can process the whole doc after pikepdf only repaired the
entire doc in the first place, but we can't repair by pages in this way

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-29 19:00:15 +00:00
Klaijan
0aae1faa54
feat: add visualize param to command and add test (#2178)
- Add `visualize` parameter to the click command -- now callable using
`--visualize` flag to show the progress bar.
- Refactor the name.
2023-11-29 01:05:55 +00:00
Yuming Long
6c08c136ae
ci: fix broken API unit test for using unsupported fast strategy for images (#2144)
### Summary

This should fix the broken unit test on main CI

* change the strategy in
`test_partition_multiple_via_api_valid_request_data_kwargs` from `fast`
to `auto`, since the test was using `fast` for images, and we don't
support it.
2023-11-22 17:35:04 -08:00
Steve Canny
02e8c962aa
fix(docx): tables in header/footer dropped (#2135)
A DOCX header or footer is a so-called "story part" meaning like the
document body (which is also a story part) it can contain both
paragraphs and tables. The implementation of `Header.text` and
`Footer.text` gather only the paragraphs.

Add a new method to extract all content from a header or footer,
including table content, suitable for use as the `.text` attribute of
that element.

Fixes #2126.
2023-11-22 15:39:25 -08:00
Klaijan
2c2d5b65ca
refactor: measure_text_edit_distance function for aggregation (#2108)
- Refactor `metrics/evaluation.py` to accepts `grouping` as parameter. 
- Switch to `DataFrame` for easier analysis and aggregation.
2023-11-22 13:30:16 -08:00
Steve Canny
e6637592d1
fix(docx): Table.text duplicates merged cell text (#2134)
**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skip cells spanned by a
merge.

In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").

Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.

The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.

Fixes #2106.
2023-11-21 22:22:40 +00:00
Steve Canny
19f00b9fa4
fix(html): style (CSS) content appears in HTML text (#2132)
Fixes #1958.

`<style>` is invalid where it appears in the HTML of thw WSJ page
mentioned by that issue but invalid has little meaning in the HTML world
if Chrome accepts it.

In any case, we have no use for the contents of a `<style>` tag wherever
it appears so safe enough for us to just strip all those tags. Note we
do not want to also strip the *tail text* which can contain text we're
interested in.
2023-11-21 04:22:13 +00:00
Steve Canny
ee9be2a3b2
fix: assorted partition_html() bugs (#2113)
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding

Each of these is addressed in a separate commit below.

Fixes #1928.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-11-20 16:29:32 +00:00
Christine Straub
d623d75d3c
Fix: incorrect figure mapping (#2111)
Closes #2098.
2023-11-18 00:11:11 +00:00
Steve Canny
ee62ed7b54
rfctr(html): clean types and docs in prep for HTML table parsing fixes (#2104)
There are a cluster of bugs in the HTML parsing code, particularly
surrounding table behaviors but also inclusion of style elements, etc.

Clean up typing and docstrings in that neighborhood as a way to
familiarize myself with that part of the code-base.
2023-11-17 17:56:38 +00:00
Steve Canny
a589a494f6
docx: improve page break fidelity (#1631)
Page breaks can and often do occur within a paragraph. The full text of
the paragraph is attributed to the page (number) the paragraph starts
on.

Improve page-break fidelity such that a paragraph containing a
page-break is split into two elements, one containing the text before
the page-break and the other the text after. Emit the `PageBreak`
element between these two and assign the correct page-number (n and n+1
respectively) to the two textual elements.

This functionality is largely provided upstream by the new `python-docx`
v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support).
That version also makes obsolete the "include hyperlink text in
`Paragraph.text` monkey patch that we had maintained up to now. Remove
that monkey-patch.
2023-11-17 00:09:14 +00:00
Steve Canny
7a741c9ae6
fix(chunk): #1985 mis-splits of Table chunks (#2076)
Closes #1985 

**Summary.** Due to an interaction of coding errors, HTML text in
`TableChunk` splits of a `Table` element were repeating the entire HTML
for the table in each chunk.

**Technical Summary.** This behavior was fixed but not published in the
last chunking PR of a series. Finish up that PR and submit it all here.
This PR extracts chunking to the particular Section type (each has their
own distinct chunking behavior).
2023-11-16 16:22:50 +00:00
Steve Canny
41fc55bc12
fix(docx): tabulate output is non-deterministic (#2090)
The test for nested tables added a few PRs ago indirectly relies on the
padding added to table-HTML by `tabulate`. The length of that padding
turns out to be non-deterministic, perhaps related to M1 vs. Intel
hardware.

Remove padding from tabulate output in the test so only actual content
is compared.
2023-11-16 07:52:16 +00:00
Christine Straub
e114e5c418
Refactor: partition pdf (#2074)
### Summary
- add constants for strategies
- add `_process_uncategorized_text_elements()` to remove code block
duplication
### Testing
CI should pass.
2023-11-15 21:41:02 -08:00
Steve Canny
252405c780
Dynamic ElementMetadata implementation (#2043)
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.

### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.

An ad-hoc field can only be added by _assignment_ on an already
constructed instance.

### End-user ad-hoc metadata field behaviors

An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient  # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```

This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
``` 

### Implementation Notes

- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
2023-11-15 13:22:15 -08:00
Austin Walker
2931cb38e8
fix: handle KeyError: 'N' for certain pdfs (#2072)
Closes #2059.

We've found some pdfs that throw an error in pdfminer. These files use a
ICCBased color profile but do not include an expected value `N`. As a
workaround, we can wrap pdfminer and drop any colorspace info, since we
don't need to render the document.

To verify, try to partition the document in the linked issue.

```
elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-15 01:59:05 +00:00
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```
elements = partition(filename=filename, strategy="fast")
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
John
1ead5a27df
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011 

`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.


### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of
strategy

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 16:47:05 +00:00
John
f8c180a59e
Jj/2027 float no attr strip (#2048)
Closes #2027 

Tables or pages that contain only numbers are returned as floats in a
pandas.DataFrame when the image or page is converted from
`.image_to_data()`. An AttributeError was raised downstream when trying
to `.strip()` the floats. This update converts those floats if needed
and otherwise strips the text.

Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-11-10 05:14:06 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author and when present are redundant
to a `w:lastRenderedPageBreak` element so cause over-counting if used.

Use rendered page-breaks only.
2023-11-09 20:34:30 +00:00
Christine Straub
3fe480799a
Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961)
Closes #1875.

### Summary
- add functionality to do a second OCR on cropped table images
- use `IMAGE_CROP_PAD` env for `individual_blocks` mode
### Testing
The test function
[`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425)
in `test_pdf.py` should pass.

### NOTE: 
I've tried to experiment with values for scaling ENVs on the following
PRs but found that changes to the values for scaling ENVs affect the
entire page OCR output(OCR regression) so switched to doing a second OCR
for tables.
- https://github.com/Unstructured-IO/unstructured/pull/1998/files 
- https://github.com/Unstructured-IO/unstructured/pull/2004/files
- https://github.com/Unstructured-IO/unstructured/pull/2016/files
- https://github.com/Unstructured-IO/unstructured/pull/2029/files

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-09 18:29:55 +00:00
Christine Straub
bb58c1bb0b
Refactor: element type (#2035)
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants

### Testing
CI should pass.
2023-11-08 21:52:55 -08:00
Steve Canny
c688216b38
fix: remove .max_characters from ElementMetadata (#2032)
This metadata field is assumedly vestigial and is unused by any code in
the repo. `max_characters` is an optional argument to `chunk_by_title()`
and has meaning in that context, but is not written to the metadata.

Remove this unused field.
2023-11-08 19:56:31 +00:00
Steve Canny
0e2c21e5a2
fix: handle sectionless-docx in the general case (#1829)
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.

Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.

This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
2023-11-08 19:05:19 +00:00
qued
92ddf3a337
feat: enable request timeout (#2013)
Courtesy @cdpierse.

Adds a test to PR #1529 in accordance with feedback.

Description from original PR:

In python the default behaviour of `requests.get` without a `timeout`
being set is to hang indefinitely. We have a production use case where
the desired behaviour would be to raise a timeout error rather than have
the application just hang.

This PR adds a new optional keyword parameter `request_timeout` to
`partition` which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL. This is then passed to `requests.get`

---------

Co-authored-by: Charles Pierse <charlespierse@gmail.com>
2023-11-08 00:44:58 +00:00
Steve Canny
80fe07b89f
fix: #1952 support nested docx tables (#2020)
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.

When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table with captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.

This implements the solution described and spiked in PR #1952.

---------

Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
2023-11-08 00:37:21 +00:00
shreyanid
6db663e7bb
refactor: separate click wrappers from core evaluation functionality (#1981)
### Summary
Click decorated functions cannot (properly) be called outside of the
click interface. This makes it difficult to reuse the setup
functionality in measure_text_edit_distance or
measure_element_type_accuracy. This PR removes the click decoration and
separates it into a wrapper function purely to execute the command.

### Technical Details
- Changed as suggested in [this StackOverflow
post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command)
response
- The locations of these now distinct functions are separate: the
`_command` click-decorated functions stay in ingest/evaluate.py, and the
core functions measure_text_edit_distance and
measure_element_type_accuracy are moved into the unstructured/metrics/
folder (which is a more logical location for them).
- Initial test added for measure_text_edit_distance

### Test
`sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction`
functionality is unchanged.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
2023-11-07 19:54:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
 ```
curl -X 'POST'  'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json'   \
-H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
``` 

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command return output and see warning in log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-06 19:30:12 -06:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Roman Isecke
ba4477ac20
feat: support table conversion for tabular destination connectors (#1917)
### Description
* A full schema was introduced to map the type of all output content
from the json partition output and mapped to a flattened table structure
to leverage table-based destination connectors. The delta table
destination connector was updated at the moment to take advantage of
this.
* Existing method to convert to a dataframe was updated because it had a
bug in it. Object content in the metadata would have the key name
changed when flattened but then this would be omitted since it didn't
exist in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-03 16:47:21 +00:00
Christine Straub
9f7ff4fd98
rfctr: Clean up test functions in test_pdf.py (#1999)
### Summary:
- use the test utility function `example_doc_path()`
- clean up test functions related to `metadata_date` and
`exclude_metadata`
2023-11-03 10:02:43 -05:00
Mallori Harrell
d07baed4a1
bug: empty-elements (#1252)
- This PR adds a function to check if a piece of text only contains a
bullet (no text) to prevent creating an empty element.
- Also fixed a test that had a typo.
2023-11-02 10:52:41 -05:00
Steve Canny
4e40999070
rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978)
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.

Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.

1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
2023-11-02 05:22:17 +00:00
Steve Canny
51d07b6434
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:

1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. Second, list metadata such as `emphasized_text_content`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.

### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need four distinct strategies,
selecting which to apply to each field based on that fields provenance
and other characteristics.

The four strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_content` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.

Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.

The proposed strategy assignments are these:

- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP,  # -- ? confirm --
- `detection_origin`: DROP,      # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP,            # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,

**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
  sent-from, sent-to, and subject will not change within a section.

### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"]
    "link_texts": [["here", "here"], ["and here"]]
    "parent_id": ["f273a7cb", "808b4ced"]
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).

### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element so already exactly
one metadata.)
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.

Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.

### Technical Risk: adding new `ElementMetadata` field breaks metadata

If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.

This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each
(Recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.

This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.

### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.

### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.

---

## Further discussion on consolidation strategy by field

### document-static
These fields are very likely to be the same for all elements in a single
document:

- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`

*Consolidation strategy:* `FIRST` - use first one found, if any.

### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:

- `section` - an EPUB document-section unconditionally starts new
section.

*Consolidation strategy:* `FIRST` - use first one found, if any.

### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:

- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.

*Consolidation strategy:* `LIST` - concatenate lists across elements.

### dynamic
These fields are likely to hold unique data for each element:

- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`

*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.

### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:

- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurence
- `page_name` - *strategy:* take first occurence
- `page_number` - *strategy:* take first occurence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".

### N/A
These field types do not figure in metadata-consolidation:

- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.

*Consolidation strategy:* `DROP` any that appear (several never will)
2023-11-02 01:49:20 +00:00
John
2f553333bd
refactor text.py (#1872)
### Summary
Closes #1520 
Partial solution to #1521 

- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
2023-11-01 17:44:55 -05:00
John
b92cab7fbd
fix languages 500 error with empty string for ocr_languages (#1968)
Closes #1870 
Defining both `languages` and `ocr_languages` raises a ValueError, but
the api defaults to `ocr_languages` being an empty string, so if users
define `languages` they are automatically hitting the ValueError.

This fix checks if `ocr_languages` is an empty string and converts it to
`None` to avoid this.

### Testing
On the main branch, the following will raise the ValueError, but it will
correctly partition on this branch
```
from unstructured.partition.auto import partition
filename = "example-docs/category-level.docx"
elements = partition(filename,languages=['spa'],ocr_languages="")

elements[0].metadata.languages
```

---------

Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
2023-11-01 22:02:00 +00:00
Klaijan
1893d5a669
fix: avoid loop through None (#1975)
Fix this issue https://unstructured-ai.atlassian.net/browse/CORE-2455.
Adding logical check if the variable is not None.
2023-11-01 20:50:34 +00:00
Christine Straub
210d53a7e0
Fix: missing columns on table ingest output after table OCR refactor (#1959)
Closes #1873.
### Summary
Table OCR refactoring changed the default padding value for table image
cropping from
[12](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L95)
to
[0](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/ocr.py#L260),
causing some columns in the table to be missing.
### Testing
```
filename = "example-docs/layout-parser-paper-with-table.pdf"
elements = pdf.partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
)
table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html]
assert "Large Model" in table[0]
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-01 18:34:27 +00:00
qued
b08562ba1a
tests: separate chipper tests (#1939)
Separates chipper tests to speed up testing and CI.
2023-10-31 21:02:00 +00:00
Denis Lusson
f585d489c1
feat: Add include_header argument for partition_csv and partition_tsv (#1764)
This PR add `include_header` argument for partition_csv and
partition_tsv. This is related to the following feature request
https://github.com/Unstructured-IO/unstructured/issues/1751.

`include_header` is already part of partition_xlsx. The work here is in
line with the current usage and testing of the `include_header` argument
in partition_xlsx.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 08:16:36 +00:00
Klaijan
a11d4634f1
fix: type error string indices bug (#1940)
Fix TypeError: string indices must be integers. The `annotation_dict`
variable is conditioned to be `None` if instance type is not dict. Then
we add logic to skip the attempt if the value is `None`.
2023-10-30 17:38:57 -07:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00