**Summary**
Fixes: #2308
**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.
This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.
**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
The purpose of this PR is to support using the same type of parameters
as `partition_*()` when using `partition_via_api()`. This PR works
together with `unsturctured-api` [PR
#368](https://github.com/Unstructured-IO/unstructured-api/pull/368).
**Note:** This PR will support extracting image blocks("Image", "Table")
via partition_via_api().
### Summary
- update `partition_via_api()` to convert all list type parameters to
JSON formatted strings before passing them to the unstructured client
SDK
- add a unit test function to test extracting image blocks via
`parition_via_api()`
- add a unit test function to test list type parameters passed to API
via unstructured client sdk
### Testing
```
from unstructured.partition.api import partition_via_api
elements = partition_via_api(
filename="example-docs/embedded-images-tables.pdf",
api_key="YOUR-API-KEY",
strategy="hi_res",
extract_image_block_types=["image", "table"],
)
image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"]
print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements]))
print("\n\n".join([el.metadata.image_base64 for el in image_block_elements]))
```
### Summary
Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.
A separate PR because it is cohesive and relatively independent from the
prior PR.
**Summary**
For whatever reason, the `@add_chunking_strategy` decorator was not
present on `partition_json()`. This broke the only way to accomplish a
"chunking-only" workflow using the REST API. This PR remedies that
problem.
Reported bug: Text from docx shapes is not included in the `partition`
output.
Fix: Extend docx partition to search for text tags nested inside
structures responsible for creating the shape.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
This PR is similar to ocr module refactoring PR -
https://github.com/Unstructured-IO/unstructured/pull/2492.
### Summary
- refactor "embedded text extraction" related modules to use decorator -
`@requires_dependencies` on functions that require external libraries
and import those libraries inside those functions instead of on module
level.
- add missing test cases for `pdf_image_utils.py` module to improve
average test coverage
### Testing
CI should pass.
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> this subtable functions never get called, subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
**Reviewers:** It may be faster to review each of the three commits
separately since they are groomed to only make one type of change each
(typing, docstrings, test-cleanup).
**Summary**
There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).
These commits clean up typing, lint, and other non-behavior-changing
aspects of the code in preparation for installing a new algorithm that
correctly identifies and partitions contiguous sub-regions of an Excel
worksheet into distinct elements.
**Additional Context**
This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes:
```python
#
# -- ✅ CASE 1: There are no leading or trailing single-cell rows.
# -> this subtable functions never get called, subtable is emitted as the only element
#
# a b -> Table(a, b, c, d)
# c d
# -- ✅ CASE 2: There is exactly one leading single-cell row.
# -> Leading single-cell row emitted as `Title` element, core-table properly identified.
#
# a -> [ Title(a),
# b c Table(b, c, d, e) ]
# d e
# -- ❌ CASE 3: There are two-or-more leading single-cell rows.
# -> leading single-cell rows are included in subtable
#
# a -> [ Table(a, b, c, d, e, f) ]
# b
# c d
# e f
# -- ❌ CASE 4: There is exactly one trailing single-cell row.
# -> core table is dropped. trailing single-cell row is emitted as Title
# (this is the behavior in the reported bug)
#
# a b -> [ Title(e) ]
# c d
# e
# -- ❌ CASE 5: There are two-or-more trailing single-cell rows.
# -> core table is dropped. trailing single-cell rows are each emitted as a Title
#
# a b -> [ Title(e),
# c d Title(f) ]
# e
# f
# -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b c Table(b, c, d, e),
# d e Title(f) ]
# f
# -- ✅ CASE 7: There are two leading and one trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g) ]
# g
# -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows.
# -> core table is correctly identified, leading and trailing single-cell rows are each
# emitted as a Title.
#
# a -> [ Title(a),
# b Title(b),
# c d Table(c, d, e, f),
# e f Title(g),
# g Title(h) ]
# h
# -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below.
# -> First cell is mistakenly emitted as title, remaining cells are dropped.
#
# a b c -> [ Title(a) ]
# -- ❌ CASE 10: Single-row subtable with one leading single-cell row.
# -> Leading single-row cell is correctly identified as title, core-table is mis-identified
# as a `Title` and truncated.
#
# a -> [ Title(a),
# b c d Title(b) ]
```
### Summary
Closes#2489, which reported an inability to process `.p7s` files. PR
implements two changes:
- If the user selected content type for the email is not available and
there is another valid content type available, fall back to the other
valid content type.
- For signed message, extract the signature and add it to the metadata
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/eml/signed-doc.p7s"
elements = partition(filename=filename) # should get a message about fall back logic
print(elements[0]) # "This is a test"
elements[0].metadata.to_dict() # Will see the signature
```
This PR:
- Moves ingest dependencies into local scopes to be able to import
ingest connector classes without the need of installing imported
external dependencies. This allows lightweight use of the classes (not
the instances. to use the instances as intended you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Does pip-compile
- Does minor refactors in other files to pass `ruff 2.0` checks which
were introduced by pip-compile
The purpose of this PR is to refactor OCR-related modules to reduce
unnecessary module imports to avoid potential issues (most likely due to
a "circular import").
### Summary
- add `inference_utils` module
(unstructured/partition/pdf_image/inference_utils.py) to define
unstructured-inference library related utility functions, which will
reduce importing unstructured-inference library functions in other files
- add `conftest.py` in `test_unstructured/partition/pdf_image/`
directory to define fixtures that are available to all tests in the same
directory and its subdirectories
### Testing
CI should pass
This is nice to natively support both Tesseract and Paddle. However, one
might already use another OCR and might want to keep using it (for
quality reasons, for cost reasons etc...).
This PR adds the ability for the user to specify its own OCR agent
implementation that is then called by unstructured.
I am new to unstructured so don't hesitate to let me know if you would
prefer this being done differently and I will rework the PR.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
.heic files are an image filetype we have not supported.
#### Testing
```
from unstructured.partition.image import partition_image
png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"
png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")
for i in range(len(heic_elements)):
print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages` so we can address
incorrect input by users. See #2293
It is recommended to go though this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- they lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on the
`unstructured-inference` to provide that default when no explicit model
name is passed in
## test
```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```
```python
from unstructured.partition.auto import partition
# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
We have added a new version of chipper (Chipperv3), which needs to allow
unstructured to effective work with all the current Chipper versions.
This implies resizing images with the appropriate resolution and make
sure that Chipper elements are not sorted by unstructured.
In addition, it seems that PDFMiner is being called when calling
Chipper, which adds repeated elements from Chipper and PDFMiner.
To evaluate this PR, you can test the code below with the attached PDF.
The code writes a JSON file with the generated elements. The output can
be examined with `cat out.un.json | python -m json.tool`. There are
three things to check:
1. The size of the image passed to Chipper, which can be identiied in
the layout_height and layout_width attributes, which should have values
3301 and 2550 as shown in the example below:
```
[
{
"element_id": "c0493a7872f227e4172c4192c5f48a06",
"metadata": {
"coordinates": {
"layout_height": 3301,
"layout_width": 2550,
```
2. There should be no repeated elements.
3. Order should be closer to reading order.
The script to run Chipper from unstructured is:
```
from unstructured import __version__
print(__version__.__version__)
import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3")))
with open('out.un.json', 'w') as w:
json.dump(elements, w)
```
[Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf)
---------
Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>
This PR is one in a series of PRs for refactoring and fixing the
languages parameter so it can address incorrect input by users. #2293
This PR adds _clean_ocr_languages_arg. There are no calls to this
function yet, but it will be called in later PRs related to this series.
Closes#2320 .
### Summary
In certain circumstances, adjusting the image block crop padding can
improve image block extraction by preventing extracted image blocks from
being clipped.
### Testing
- PDF:
[LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf)
- Set two environment variables
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`
(e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`,
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`
```
elements = partition_pdf(
filename="LM339-D_2-2.pdf",
extract_image_block_types=["image"],
)
```
This refactor removes `_convert_to_standard_langcode` and replaces it
with calling `_get_iso639_language_object` with a string slice.
Use of TESSERACT_LANGUAGES_AND_CODES, which was added to
`_convert_to_standard_langcode` previously, is moved to the relevant
part where `_convert_to_standard_langcode` was previously called.
If/else statements replace the list comprehension for readability and
`langdetect_langs.append("zho")` replaces
`_convert_to_standard_langcode("zh")` since that always returned
`"zho"`.
### Summary
Adds support for bitmap images (`.bmp`) in both file detection and
partitioning. Bitmap images will be processed with `partition_image`
just like JPGs and PNGs.
### Testing
```python
from unstructured.file_utils.filetype import detect_filetype
from unstructured.partition.auto import partition
from PIL import Image
filename = "example-docs/layout-parser-paper-with-table.jpg"
bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp"
img = Image.open(filename)
img.save(bmp_filename)
detect_filetype(filename=bmp_filename) # Should be FileType.BMP
elements = partition(filename=bmp_filename)
```
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293
Refactor `_convert_language_code_to_pytesseract_lang_code` and extract
`_get_iso639_language_object` to its own function
```
from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert
convert("English") # this will raise an error on both main and this branch
convert("en") # this will return "eng" on both branches
```
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.
### Testing
- on `main` branch,
```
elements = partition_pdf(
filename="example-docs/embedded-images.pdf",
strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`.
- on this `feature` branch,
The same code will display `number of image elements: 3`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293
This PR adds a dictionary for helping map fully spelled out languages to
tesseract language codes
---------
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
This PR culminates the restructuring of chunking over my prior
dozen-or-so commits by adding the new options to the API and
documentation.
Separately I'll be adding a new ingest test to defend against
regression, although the integration test included in this PR will do a
pretty good job of that too.
Fixes#2339
Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.
Lacking this prior uniformity produced a downstream problem reported in
On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.
Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and that placed in HTML.text_as_html.
HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
Currently, we're using different kwarg names in partition() and
partition_pdf(), which has implications for the API since it goes
through partition().
### Summary
- rename `extract_element_types` -> `extract_image_block_types`
- rename `image_output_dir_path` to `extract_image_block_output_dir`
- rename `extract_to_payload` -> `extract_image_block_to_payload`
- rename `pdf_extract_images` -> `extract_images_in_pdf` in
`partition.auto`
- add unit tests to test element extraction for `pdf/image` via
`partition.auto`
### Testing
CI should pass.
Closes#2340
We need to make sure the custom url is passed to our client. The client
constructor takes the base url, so for compatibility we can continue to
take the full url and strip off the path.
To verify, run the api locally and confirm you can make calls to it.
```
# In unstructured-api
make run-web-app
# In ipython in this repo
from unstructured.partition.api import partition_via_api
filename = "example-docs/layout-parser-paper.pdf"
partition_via_api(filename=filename, api_url="http://localhost:8000")
```
Closes#2323.
### Summary
- update logic to return "hi_res" if either `extract_images_in_pdf` or
`extract_element_types` is set
- refactor: remove unused `file` parameter from
`determine_pdf_or_image_strategy()`
### Testing
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
filename="example-docs/embedded-images-tables.pdf",
extract_element_types=["Image"],
extract_to_payload=True,
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print(image_elements)
```
Closes#2302.
### Summary
- add functionality to get a Base64 encoded string from a PIL image
- store base64 encoded image data in two metadata fields: `image_base64`
and `image_mime_type`
- update the "image element filter" logic to keep all image elements in
the output if a user specifies image extraction
### Testing
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
filename="example-docs/embedded-images-tables.pdf",
strategy="hi_res",
extract_element_types=["Image", "Table"],
extract_to_payload=True,
)
```
or
```
from unstructured.partition.auto import partition
elements = partition(
filename="example-docs/embedded-images-tables.pdf",
strategy="hi_res",
pdf_extract_element_types=["Image", "Table"],
pdf_extract_to_payload=True,
)
```
Closes#2160
Explicitly adds `hi_res_model_name` as kwarg to relevant functions and
notes that `model_name` is to be deprecated.
Testing:
```
from unstructured.partition.auto import partition
filename = "example-docs/DA-1p.pdf"
elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox")
```
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Steve Canny <stcanny@gmail.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
The text of an oversized chunk is split on an arbitrary character
boundary (mid-word). The `chunk_by_character()` strategy introduces the
idea of allowing the user to specify a separator to use for
chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " ";
blank-line, newline, or word boundaries respectively.
Even if the user is allowed to specify a separator, we must provide
fall-back for when a chunk contains no such character. This can be done
incrementally, like blank-line is preferable to newline, newline is
preferable to word, and word is preferable to arbitrary character.
Further, there is nothing particular to `chunk_by_character()` in
providing such a fall-back text-splitting strategy. It would be
preferable for all strategies to split oversized chunks on even-word
boundaries for example.
Note that while a "blank-line" ("\n\n") may be common in plain text, it
is unlikely to appear in the text of an element because it would have
been interpreted as an element boundary during partitioning.
Add _TextSplitter with basic separator preferences and fall-back and
apply it to chunk-splitting for all strategies. The `by_character`
chunking strategy may enhance this behavior by adding the option for a
user to specify a particular separator suited to their use case.
closes#816
## Description
Added functionality for `partition_email` to automatically decode base64
text before passing it to `partition_text` or `partition_html`.
Also adds base64 encoded email text test cases.
This PR addresses
[CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969)
- pdfminer sometimes fail to decode text in an pdf file and returns cid
codes as text
- now those text will be considered invalid and be replaced with ocr
results in `hi_res` mode
## test
This PR adds unit test for the utility functions. In addition the file
below would return elements with text in cid code on main but proper
ascii text with this PR:
[005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf)
This change improves both cct accuracy and %missing scores:
**before:**
```
metric average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.681 0.267 0.266 105
cct-%missing 0.086 0.159 0.159 105
```
**after:**
```
metric average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.697 0.251 0.250 105
cct-%missing 0.071 0.123 0.122 105
```
[CORE-2969]:
https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Closes#2218. When a csv has commas in its content, and the delimiter is
something else, Pandas may throw an error. We can sniff the csv and get
the correct delimiter to pass to Pandas. To verify, try partitioning the
file in the linked bug.
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.
This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.
Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as CheckBox and PageBreak which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.
This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.
Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
closes#2222.
### Summary
The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This
filename is presented in the `image_path` metadata field for the Table
element. The default would be to not do this.
### Testing
PDF: [124_PDFsam_Basel III - Finalising post-crisis
reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf)
```
elements = partition_pdf(
filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf",
strategy="hi_res",
infer_table_structure=True,
extract_element_types=['Table'],
)
```
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`
```
elements = partition_pdf(
filename="example-docs/embedded-images.pdf",
strategy="hi_res",
extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).
Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to only be used for tests decorated by
@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside
of CI") (which are skipped because those tests are partitioning pdf/jpg
files).
These tests are partitioning emails and rely on the MockResponse at the
top of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
* if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)
### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes#2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Add a procedure to repair PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer open entire document and create pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue¬ification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter process a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack¬ification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDF.
* Add new denpendenct `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page from entire document by reading the
PDF file again (can't find a way to split pdf in PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`, since the
root cause is that `get_uris` passed a None `annots` to
`get_uris_from_annots`, so the Null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This
should fix :
* https://github.com/Unstructured-IO/unstructured/issues/1922 where
`annotation_dict` is a `PDFObjRef`
* https://github.com/Unstructured-IO/unstructured/issues/1921 where
`rect` is a `PDFObjRef`
### Test
Added three test files (both are larger than 500 KB) for unittests to
test:
* Repair entire doc
* Repair one page
* Reprocess failure after repairing one page (just return the elements
before error page in this case).
* Also seems like splitting the document into smaller pages could fix
this problem, but not sure why. For example, I saw error from reprocess
in the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when i split the pdf by error page....
* tested if i can repair the entire doc again in this case, saw other
error which means repairing is not helping imo
* PDFminer can process the whole doc after pikepdf only repaired the
entire doc in the first place, but we can't repair by pages in this way
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
### Summary
This should fix the broken unit test on main CI
* change the strategy in
`test_partition_multiple_via_api_valid_request_data_kwargs` from `fast`
to `auto`, since the test was using `fast` for images, and we don't
support it.