Fixes #2339
Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.
Lacking this prior uniformity produced the downstream problem reported in #2339.
On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.
Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and the row placed in `HTMLTable.text_as_html`.
HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
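The cell-gathering idea can be illustrated with a small `lxml` sketch (illustrative only, not the code in this PR): take only the `<td>`/`<th>` children of each `<tr>` and collapse the whitespace within each cell, so every row yields exactly one string per cell.
```python
from lxml import etree

def iter_row_cell_texts(tr: etree._Element) -> list[str]:
    """Return exactly one normalized text string per <td>/<th> in this row."""
    texts = []
    for cell in tr:
        if cell.tag not in ("td", "th"):
            continue  # skip comments and formatting nodes
        # itertext() walks every text node in the cell; split/join collapses
        # the whitespace used to format the source HTML
        texts.append(" ".join("".join(cell.itertext()).split()))
    return texts

html = "<table><tr>\n  <td> a </td>\n  <td>b\n  c</td>\n</tr></table>"
print(iter_row_cell_texts(etree.HTML(html).find(".//tr")))  # ['a', 'b c']
```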
Currently, we're using different kwarg names in `partition()` and
`partition_pdf()`, which has implications for the API since it goes
through `partition()`.
### Summary
- rename `extract_element_types` -> `extract_image_block_types`
- rename `image_output_dir_path` -> `extract_image_block_output_dir`
- rename `extract_to_payload` -> `extract_image_block_to_payload`
- rename `pdf_extract_images` -> `extract_images_in_pdf` in
`partition.auto`
- add unit tests to test element extraction for `pdf/image` via
`partition.auto`
### Testing
CI should pass.
Closes #2340
We need to make sure the custom url is passed to our client. The client
constructor takes the base url, so for compatibility we can continue to
take the full url and strip off the path.
To verify, run the api locally and confirm you can make calls to it.
```
# In unstructured-api
make run-web-app
# In ipython in this repo
from unstructured.partition.api import partition_via_api
filename = "example-docs/layout-parser-paper.pdf"
partition_via_api(filename=filename, api_url="http://localhost:8000")
```
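The path-stripping itself amounts to something like the following sketch (the helper name is illustrative, not the actual code):
```python
from urllib.parse import urlparse

def base_url_from_api_url(api_url: str) -> str:
    """Drop any path (e.g. /general/v0/general) so only scheme://host[:port] remains."""
    parsed = urlparse(api_url)
    return f"{parsed.scheme}://{parsed.netloc}"

assert base_url_from_api_url("http://localhost:8000/general/v0/general") == "http://localhost:8000"
assert base_url_from_api_url("http://localhost:8000") == "http://localhost:8000"
```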
Closes #2323.
### Summary
- update logic to return "hi_res" if either `extract_images_in_pdf` or
`extract_element_types` is set
- refactor: remove unused `file` parameter from
`determine_pdf_or_image_strategy()`
### Testing
```
from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    extract_element_types=["Image"],
    extract_to_payload=True,
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print(image_elements)
```
Closes #2302.
### Summary
- add functionality to get a Base64 encoded string from a PIL image
- store base64 encoded image data in two metadata fields: `image_base64`
and `image_mime_type`
- update the "image element filter" logic to keep all image elements in
the output if a user specifies image extraction
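The encoding step described in the first bullet above can be sketched roughly like this (names are illustrative, not the actual helper):
```python
import base64
import io

from PIL import Image

def image_to_base64(image: Image.Image, fmt: str = "JPEG") -> tuple[str, str]:
    """Return (base64 data, mime type) for a PIL image."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)  # serialize the PIL image into memory
    data = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return data, f"image/{fmt.lower()}"

image_base64, image_mime_type = image_to_base64(Image.new("RGB", (4, 4), "red"))
print(image_mime_type, image_base64[:16], "...")
```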
### Testing
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    extract_element_types=["Image", "Table"],
    extract_to_payload=True,
)
```
or
```
from unstructured.partition.auto import partition
elements = partition(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    pdf_extract_element_types=["Image", "Table"],
    pdf_extract_to_payload=True,
)
```
Closes #2160
Explicitly adds `hi_res_model_name` as kwarg to relevant functions and
notes that `model_name` is to be deprecated.
Testing:
```
from unstructured.partition.auto import partition
filename = "example-docs/DA-1p.pdf"
elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Steve Canny <stcanny@gmail.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
The text of an oversized chunk is split on an arbitrary character
boundary (mid-word). The `chunk_by_character()` strategy introduces the
idea of allowing the user to specify a separator to use for
chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " ";
blank-line, newline, or word boundaries respectively.
Even if the user is allowed to specify a separator, we must provide a
fall-back for when a chunk contains no such character. This can be done
incrementally: a blank-line is preferable to a newline, a newline is
preferable to a word boundary, and a word boundary is preferable to an
arbitrary character. Further, there is nothing particular to
`chunk_by_character()` in providing such a fall-back text-splitting
strategy. It would be preferable for all strategies to split oversized
chunks on even-word boundaries, for example.
Note that while a "blank-line" ("\n\n") may be common in plain text, it
is unlikely to appear in the text of an element because it would have
been interpreted as an element boundary during partitioning.
Add `_TextSplitter` with basic separator preferences and fall-back behavior
and apply it to chunk-splitting for all strategies. The `by_character`
chunking strategy may enhance this behavior by adding the option for a
user to specify a particular separator suited to their use case.
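A much-simplified sketch of the fall-back idea (not the actual `_TextSplitter`): try each separator in preference order and only split on an arbitrary character when none fits.
```python
def split_once(text: str, max_len: int, separators=("\n\n", "\n", " ")) -> tuple[str, str]:
    """Split `text` at the right-most preferred separator that fits within max_len."""
    if len(text) <= max_len:
        return text, ""
    for sep in separators:
        cut = text.rfind(sep, 0, max_len + 1)
        if cut > 0:
            return text[:cut], text[cut + len(sep):]
    # no separator available; fall back to an arbitrary character boundary
    return text[:max_len], text[max_len:]

print(split_once("aaaa bbbb cccc", max_len=10))  # ('aaaa bbbb', 'cccc')
print(split_once("aaaabbbbcccc", max_len=10))    # ('aaaabbbbcc', 'cc')
```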
closes #816
## Description
Adds functionality for `partition_email` to automatically decode base64
text before passing it to `partition_text` or `partition_html`.
Also adds base64-encoded email text test cases.
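A simplified sketch of the decoding behavior (illustrative, not the `partition_email` internals):
```python
import base64
from email.message import Message

def decoded_part_text(part: Message) -> str:
    payload = part.get_payload()
    if part.get("Content-Transfer-Encoding", "").lower() == "base64":
        payload = base64.b64decode(payload).decode("utf-8", errors="replace")
    return payload

msg = Message()
msg["Content-Transfer-Encoding"] = "base64"
msg.set_payload(base64.b64encode(b"Hello from a base64 email body").decode())
print(decoded_part_text(msg))  # Hello from a base64 email body
```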
This PR addresses
[CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969)
- pdfminer sometimes fails to decode text in a PDF file and returns CID
codes as text
- that text is now considered invalid and is replaced with OCR
results in `hi_res` mode
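The invalid-text check can be sketched roughly like this (illustrative, not the exact utility added here): treat text dominated by `(cid:NNN)` tokens as undecoded.
```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")

def is_cid_text(text: str, threshold: float = 0.5) -> bool:
    """Return True when most of the text is made up of (cid:NNN) tokens."""
    if not text:
        return False
    cid_chars = sum(len(m) for m in CID_PATTERN.findall(text))
    return cid_chars / len(text) > threshold

print(is_cid_text("(cid:72)(cid:101)(cid:108)(cid:108)(cid:111)"))  # True
print(is_cid_text("Hello world"))                                   # False
```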
## Test
This PR adds unit tests for the utility functions. In addition, the file
below returns elements with CID-code text on main but proper
ASCII text with this PR:
[005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf)
This change improves both cct accuracy and %missing scores:
**before:**
```
metric        average  sample_sd  population_sd  count
-------------------------------------------------------
cct-accuracy    0.681      0.267          0.266    105
cct-%missing    0.086      0.159          0.159    105
```
**after:**
```
metric        average  sample_sd  population_sd  count
-------------------------------------------------------
cct-accuracy    0.697      0.251          0.250    105
cct-%missing    0.071      0.123          0.122    105
```
[CORE-2969]:
https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Closes#2218. When a csv has commas in its content, and the delimiter is
something else, Pandas may throw an error. We can sniff the csv and get
the correct delimiter to pass to Pandas. To verify, try partitioning the
file in the linked bug.
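The sniffing step amounts to something like this sketch (the helper name is illustrative):
```python
import csv

import pandas as pd

def read_csv_with_sniffed_delimiter(filename: str) -> pd.DataFrame:
    with open(filename, newline="") as f:
        sample = f.read(64 * 1024)  # a bounded sample is enough to sniff
    delimiter = csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    return pd.read_csv(filename, sep=delimiter)
```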
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.
This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.
Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as `CheckBox` and `PageBreak`, which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.
This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.
Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
closes #2222.
### Summary
The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This
filename is presented in the `image_path` metadata field for the Table
element. By default, this is not done.
### Testing
PDF: [124_PDFsam_Basel III - Finalising post-crisis
reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf)
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_element_types=['Table'],
)
```
### Summary
This PR is the second part of the "image extraction" refactor to move it
from the `unstructured-inference` repo to the `unstructured` repo; the first
part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
```
git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../
```
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).
Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to only be used for tests decorated with
`@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI")`
(which are skipped because those tests are partitioning pdf/jpg files).
These tests are partitioning emails and rely on the MockResponse at the
top of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
### Summary
This PR is the second part of the `pdfminer` refactor to move it from the
`unstructured-inference` repo to the `unstructured` repo; the first part was
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (creating a temp file/dir as needed) and splits
the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout` (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document (data/filename) by `pdfminer`
    * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
module, which first opens the document (creating a temp file/dir as needed)
and splits the document by pages (converting PDF pages to image pages for a
PDF file)
### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Add a procedure to repair a PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer opens the entire document and creates a pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter processes a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing the PDF (a sketch of the repair step follows
this list).
* Add new dependency `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page in the entire document by reading the
PDF file again (couldn't find a way to split the PDF in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`, since the
root cause is that `get_uris` passed a `None` `annots` to
`get_uris_from_annots`; the null check should therefore happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as a `dict` (it could still be a `PDFObjRef`). This
should fix:
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`
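The repair step itself can be sketched roughly like this (illustrative only, not the exact code): re-save the document with `pikepdf` and hand the repaired copy to PDFminer.
```python
import tempfile

import pikepdf

def repaired_pdf_path(filename: str) -> str:
    """Re-save the PDF with pikepdf so PDFminer can parse the repaired copy."""
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    tmp.close()
    with pikepdf.open(filename) as pdf:
        pdf.save(tmp.name)  # rewriting the file repairs broken xref/dictionary structures
    return tmp.name
```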
### Test
Added three test files (each larger than 500 KB) for unit tests to
cover:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before the error page in this case).
* It also seems like splitting the document into smaller pages could fix
this problem, but I'm not sure why. For example, I saw an error from
reprocessing the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when I split the PDF at the error page.
* I tested repairing the entire doc again in that case and saw a different
error, which suggests repairing doesn't help there.
* PDFminer can process the whole doc when pikepdf has repaired the entire
doc in the first place, but we can't repair page-by-page that way.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Summary
This should fix the broken unit test on main CI
* change the strategy in
`test_partition_multiple_via_api_valid_request_data_kwargs` from `fast`
to `auto`, since the test was using `fast` for images, and we don't
support it.
A DOCX header or footer is a so-called "story part", meaning that, like the
document body (which is also a story part), it can contain both
paragraphs and tables. The implementation of `Header.text` and
`Footer.text` gathers only the paragraphs.
Add a new method to extract all content from a header or footer,
including table content, suitable for use as the `.text` attribute of
that element.
Fixes #2126.
**Summary.** The `python-docx` table API is designed for _uniform_
tables (no merged cells, no nested tables). Naive processing of DOCX
tables using this API produces duplicate text when the table has merged
cells. Add a more sophisticated parsing method that reads only "root"
cells (those with an actual `<tc>` element) and skips cells spanned by a
merge.
In the process, abandon use of the `tabulate` package for this job
(which is also designed for uniform tables) and remove the whitespace
padding it adds for visual alignment of columns. Separate the text for
each cell with a single newline ("\n").
Since it's little extra trouble, add support for nested tables such that
their text also contributes to the `Table.text` string.
The new `._iter_table_texts()` method will also be used for parsing
tables in headers and footers (where they are frequently used for layout
purposes) in a closely following PR.
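A very rough sketch of the root-cell idea (not the actual `._iter_table_texts()` implementation), using python-docx internals: visit only the `<w:tc>` elements that actually exist in the XML, so text in merge-spanned cells is never read twice.
```python
from docx import Document
from docx.oxml.ns import qn

def iter_table_texts(table):
    tbl = table._tbl  # underlying <w:tbl> element (python-docx internal)
    for tr in tbl.findall(qn("w:tr")):
        for tc in tr.findall(qn("w:tc")):  # cells removed by a merge have no <w:tc>
            # gather all text descendants, including any nested-table text
            text = " ".join(t.text or "" for t in tc.iter(qn("w:t")))
            if text.strip():
                yield text.strip()

document = Document("merged-cells.docx")  # hypothetical document
print("\n".join(iter_table_texts(document.tables[0])))
```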
Fixes #2106.
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding
Each of these is addressed in a separate commit below.
Fixes #1928.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Page breaks can and often do occur within a paragraph. The full text of
the paragraph is attributed to the page (number) the paragraph starts
on.
Improve page-break fidelity such that a paragraph containing a
page-break is split into two elements, one containing the text before
the page-break and the other the text after. Emit the `PageBreak`
element between these two and assign the correct page-number (n and n+1
respectively) to the two textual elements.
This functionality is largely provided upstream by the new `python-docx`
v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support).
That version also makes obsolete the "include hyperlink text in
`Paragraph.text`" monkey-patch that we had maintained up to now. Remove
that monkey-patch.
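For reference, the upstream rendered page-break API this builds on looks roughly like this (a usage sketch assuming `python-docx` >= 1.0.0; the file name is hypothetical):
```python
from docx import Document

document = Document("page-breaks.docx")  # hypothetical file with a mid-paragraph page break
page_number = 1
for paragraph in document.paragraphs:
    if not paragraph.contains_page_break:
        continue
    page_break = paragraph.rendered_page_breaks[0]
    before = page_break.preceding_paragraph_fragment  # Paragraph with text before the break (or None)
    after = page_break.following_paragraph_fragment   # Paragraph with text after the break (or None)
    page_number += 1
    print(page_number, before.text if before else "", "|", after.text if after else "")
```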
Closes #1985
**Summary.** Due to an interaction of coding errors, the HTML text in
`TableChunk` splits of a `Table` element repeated the entire HTML
for the table in each chunk.
**Technical Summary.** This behavior was fixed but not published in the
last chunking PR of a series. Finish up that PR and submit it all here.
This PR extracts chunking to the particular Section type (each has its
own distinct chunking behavior).
The test for nested tables added a few PRs ago indirectly relies on the
padding added to table-HTML by `tabulate`. The length of that padding
turns out to be non-deterministic, perhaps related to M1 vs. Intel
hardware.
Remove padding from tabulate output in the test so only actual content
is compared.
### Executive Summary
The structure of element metadata is currently static, meaning only
predefined fields can appear in the metadata. We would like the
flexibility for end-users, at their own discretion, to define and use
additional metadata fields that make sense for their particular
use-case.
### Concepts
A key concept for dynamic metadata is _known field_. A known-field is
one of those explicitly defined on `ElementMetadata`. Each of these has
a type and can be specified when _constructing_ a new `ElementMetadata`
instance. This is in contrast to an _end-user defined_ (or _ad-hoc_)
metadata field, one not known at "compile" time and added at the
discretion of an end-user to suit the purposes of their application.
An ad-hoc field can only be added by _assignment_ on an already
constructed instance.
### End-user ad-hoc metadata field behaviors
An ad-hoc field can be added to an `ElementMetadata` instance by
assignment:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
```
A field added in this way can be accessed by name:
```python
>>> metadata.coefficient
0.536
```
and that field will appear in the JSON/dict for that instance:
```python
>>> metadata = ElementMetadata()
>>> metadata.coefficient = 0.536
>>> metadata.to_dict()
{"coefficient": 0.536}
```
However, accessing a "user-defined" value that has _not_ been assigned
on that instance raises `AttributeError`:
```python
>>> metadata.coeffcient # -- misspelled "coefficient" --
AttributeError: 'ElementMetadata' object has no attribute 'coeffcient'
```
This makes "tagging" a metadata item with a value very convenient, but
entails the proviso that if an end-user wants to add a metadata field to
_some_ elements and not others (sparse population), AND they want to
access that field by name on ANY element and receive `None` where it has
not been assigned, they will need to use an expression like this:
```python
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None
```
### Implementation Notes
- **ad-hoc metadata fields** are discarded during consolidation (for
chunking) because we don't have a consolidation strategy defined for
those. We could consider using a default consolidation strategy like
`FIRST` or possibly allow a user to register a strategy (although that
gets hairy in non-private and multiple-memory-space situations.)
- ad-hoc metadata fields **cannot start with an underscore**.
- We have no way to distinguish an ad-hoc field from any "noise" fields
that might appear in a JSON/dict loaded using `.from_dict()`, so unlike
the original (which only loaded known-fields), we'll rehydrate anything
that we find there.
- No real type-safety is possible on ad-hoc fields but the type-checker
does not complain because the type of all ad-hoc fields is `Any` (which
is the best available behavior in my view).
- We may want to consider whether end-users should be able to add ad-hoc
fields to "sub" metadata objects too, like `DataSourceMetadata` and
conceivably `CoordinatesMetadata` (although I'm not immediately seeing a
use-case for the second one).
Closes #2059.
We've found some PDFs that throw an error in pdfminer. These files use
an ICCBased color profile but do not include the expected value `N`. As a
workaround, we can wrap pdfminer and drop any colorspace info, since we
don't need to render the document.
To verify, try to partition the document in the linked issue.
```
elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.
### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.
```
from unstructured.partition.auto import partition

elements = partition(filename=filename, strategy="fast")
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
### Summary
Closes #2011
`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.
### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image
elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages
elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```
For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages
elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```
On this branch, `languages` is included in the metadata regardless of
strategy.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Closes #2027
Tables or pages that contain only numbers are returned as floats in a
`pandas.DataFrame` when the image or page is converted with
`.image_to_data()`. An `AttributeError` was raised downstream when trying
to `.strip()` the floats. This update converts those floats to strings
when needed and otherwise strips the text.
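A minimal sketch of the coercion (illustrative): cast the text column to `str` before stripping so purely numeric cells don't raise.
```python
import pandas as pd

# e.g. a slice of the DataFrame returned by pytesseract.image_to_data(...)
ocr_df = pd.DataFrame({"text": [12345.0, " 42 ", "total"]})

# casting to str first avoids AttributeError: 'float' object has no attribute 'strip'
ocr_df["text"] = ocr_df["text"].astype(str).str.strip()
print(ocr_df["text"].tolist())  # ['12345.0', '42', 'total']
```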
Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author; when present, those are
redundant with a `w:lastRenderedPageBreak` element and so cause
over-counting if used. Use rendered page-breaks only.
### Summary
- add constants for element type
- replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the
`ElementType` constants
- replace element type strings using the constants
### Testing
CI should pass.
A DOCX document that has no sections can still contain one or more
tables. Such files are never created by Word but Word can open them just
fine. These can be and are generated by other applications.
Use the newly-added `Document.iter_inner_content()` method added
upstream in `python-docx` to capture both paragraphs and tables from a
section-less DOCX document.
This generalizes the fix for MS Teams chat-transcripts (an example of
sectionless-docx) implemented in #1825.
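Usage of that upstream method looks roughly like this (a sketch; the file name is hypothetical):
```python
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph

document = Document("sectionless.docx")  # hypothetical section-less file
for item in document.iter_inner_content():  # yields Paragraph and Table objects in document order
    if isinstance(item, Paragraph):
        print("paragraph:", item.text)
    elif isinstance(item, Table):
        print("table with", len(item.rows), "rows")
```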
Courtesy @cdpierse.
Adds a test to PR #1529 in accordance with feedback.
Description from original PR:
In Python, the default behaviour of `requests.get` without a `timeout`
being set is to hang indefinitely. We have a production use case where
the desired behaviour would be to raise a timeout error rather than have
the application just hang.
This PR adds a new optional keyword parameter `request_timeout` to
`partition` which is passed to `file_and_type_from_url` in the case
where we are fetching from a URL. This is then passed to `requests.get`.
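A simplified sketch of the plumbing (not the actual `file_and_type_from_url` signature):
```python
from typing import Optional

import requests

def fetch_document(url: str, request_timeout: Optional[float] = 10.0) -> bytes:
    # without timeout=, requests.get can hang indefinitely on an unresponsive host
    response = requests.get(url, timeout=request_timeout)
    response.raise_for_status()
    return response.content

# e.g. partition(url="https://example.com/paper.pdf", request_timeout=5)
```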
---------
Co-authored-by: Charles Pierse <charlespierse@gmail.com>
In DOCX, like HTML, a table cell can itself contain a table. This is not
uncommon and is typically used for formatting purposes.
When a DOCX table is nested, create nested HTML tables to reflect that
structure and create a plain-text table that captures all the text in
nested tables, formatting it as a reasonable facsimile of a table.
This implements the solution described and spiked in PR #1952.
---------
Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing an empty string from `languages` to tesseract, which would
result in passing an empty string to the language config `-l` for the
tesseract CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
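The cleanup amounts to something like this sketch (illustrative, not the actual helper):
```python
def tesseract_lang_param(languages: list[str]) -> str:
    """Build the tesseract `-l` value: drop empties, dedupe, fall back to eng."""
    seen: list[str] = []
    for lang in languages:
        code = lang.strip()
        if code and code not in seen:
            seen.append(code)
    return "+".join(seen) if seen else "eng"

print(tesseract_lang_param(["", "eng", "eng", "kor"]))  # eng+kor
print(tesseract_lang_param([""]))                       # eng
```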
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
* in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
* check out this branch
* run `make run-web-app` again in the api repo
* the curl command returns output and a warning appears in the log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.
Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.
1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
### Summary
Closes #1520
Partial solution to #1521
- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
Closes #1870
Defining both `languages` and `ocr_languages` raises a `ValueError`, but
the API defaults `ocr_languages` to an empty string, so users who define
`languages` automatically hit the `ValueError`.
This fix checks if `ocr_languages` is an empty string and converts it to
`None` to avoid this.
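The fix is essentially the following (a sketch, not the exact code):
```python
def reconcile_languages(languages=None, ocr_languages=None):
    # the API sends ocr_languages="" by default; treat that the same as "not provided"
    if ocr_languages == "":
        ocr_languages = None
    if languages is not None and ocr_languages is not None:
        raise ValueError("Only one of languages and ocr_languages should be specified.")
    return languages, ocr_languages

print(reconcile_languages(languages=["spa"], ocr_languages=""))  # (['spa'], None)
```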
### Testing
On the main branch, the following will raise the ValueError, but it will
correctly partition on this branch
```
from unstructured.partition.auto import partition
filename = "example-docs/category-level.docx"
elements = partition(filename, languages=['spa'], ocr_languages="")
elements[0].metadata.languages
```
---------
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
This PR adds an `include_header` argument for `partition_csv` and
`partition_tsv`. This is related to the following feature request:
https://github.com/Unstructured-IO/unstructured/issues/1751.
`include_header` is already part of `partition_xlsx`. The work here is in
line with the current usage and testing of the `include_header` argument
in `partition_xlsx`.
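Usage mirrors `partition_xlsx` (a sketch; the example file names are illustrative):
```python
from unstructured.partition.csv import partition_csv
from unstructured.partition.tsv import partition_tsv

csv_elements = partition_csv(filename="example-docs/stanley-cups.csv", include_header=True)
tsv_elements = partition_tsv(filename="example-docs/stanley-cups.tsv", include_header=False)
print(csv_elements[0].metadata.text_as_html)  # header row included in the HTML when include_header=True
```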
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Fix `TypeError: string indices must be integers`. The `annotation_dict`
variable is set to `None` if the instance type is not `dict`. Then
we add logic to skip the attempt if the value is `None`.
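In effect, the guard looks something like this sketch (the dict key is illustrative, not the exact pdfminer structure):
```python
from typing import Any, Optional

def uri_from_annotation(resolved: Any) -> Optional[str]:
    # the resolved annotation may be a str or PDFObjRef rather than a dict
    annotation_dict = resolved if isinstance(resolved, dict) else None
    if annotation_dict is None:
        return None  # skip the attempt entirely
    return annotation_dict.get("URI")

print(uri_from_annotation({"URI": "https://example.com"}))  # https://example.com
print(uri_from_annotation("not-a-dict"))                    # None
```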
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image
# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data

- **After update**
All elements have accurate coordinate data

---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>