462 Commits

Author SHA1 Message Date
Christine Straub
29b9ea7ba6
refactor: ocr modules (#2492)
The purpose of this PR is to refactor OCR-related modules to reduce
unnecessary module imports to avoid potential issues (most likely due to
a "circular import").

### Summary
- add `inference_utils` module
(unstructured/partition/pdf_image/inference_utils.py) to define
unstructured-inference library related utility functions, which will
reduce importing unstructured-inference library functions in other files
- add `conftest.py` in `test_unstructured/partition/pdf_image/`
directory to define fixtures that are available to all tests in the same
directory and its subdirectories

### Testing
CI should pass
2024-02-06 17:11:55 +00:00
Christine Straub
94001a208d
feat: improve table cell data (#2457)
The purpose of this PR is to pass embedded text through the table processing
sub-pipeline for later use.
2024-02-01 05:29:19 +00:00
Christophe Jolif
ccc2302b33
feat: add the ability to specify a custom OCR besides the ones natively supported (#2462)
It is nice to natively support both Tesseract and Paddle. However, one
might already use another OCR and might want to keep using it (for
quality reasons, cost reasons, etc.).
This PR adds the ability for users to specify their own OCR agent
implementation, which is then called by unstructured.

I am new to unstructured so don't hesitate to let me know if you would
prefer this being done differently and I will rework the PR.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2024-01-31 16:38:14 -06:00
Christine Straub
8b1de4c2b8
fix: partition_pdf() not working when using chipper model with file (#2479)
Closes #2480.
 
### Summary
- fixed an error introduced by PR
[#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) -
https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373
- updated `test_partition_pdf_with_model_name()` to test more model
names

### Testing
The updated test function `test_partition_pdf_with_model_name()` should
work on this branch, but fails on the `main` branch.
2024-01-31 17:36:59 +00:00
John
db67805ec6
feat: add support for partitioning .heic files (#2454)
.heic files are an image filetype we have not supported.

#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
	print(heic_elements[i].text == png_elements[i].text)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-01-30 04:49:00 +00:00
John
9320311a19
fix: check languages args (#2435)
This PR is the last in a series of PRs for refactoring and fixing the
language parameters (`languages` and `ocr_languages`) so we can address
incorrect input by users. See #2293

It is recommended to go through this PR commit-by-commit and note the
commit message. The most significant commit is "update
check_languages..."
2024-01-29 20:12:08 +00:00
Yao You
97fb10db4a
fix: default hi_res model rely on inference setting (#2441)
- there are multiple places setting the default `hi_res_model_name` in
both `unstructured` and `unstructured-inference`
- they lead to inconsistency and unexpected behaviors
- this fix removes a helper in `unstructured` that tries to set the
default hi_res layout detection model; instead we rely on the
`unstructured-inference` to provide that default when no explicit model
name is passed in

## test

```bash
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython
```

```python
from unstructured.partition.auto import partition

# find a pdf file
elements = partition("foo.pdf", strategy="hi_res")
assert elements[0].metadata.detection_origin == "yolox"
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-01-29 16:44:41 +00:00
Antonio Jose Jimeno Yepes
d8b3bdb919
Check chipper version and prevent running pdfminer with chipper (#2347)
We have added a new version of Chipper (Chipperv3), so unstructured needs
to work effectively with all the current Chipper versions.
This implies resizing images to the appropriate resolution and making
sure that Chipper elements are not sorted by unstructured.

In addition, it seems that PDFMiner is being called when calling
Chipper, which adds repeated elements from Chipper and PDFMiner.

To evaluate this PR, you can test the code below with the attached PDF.
The code writes a JSON file with the generated elements. The output can
be examined with `cat out.un.json | python -m json.tool`. There are
three things to check:

1. The size of the image passed to Chipper, which can be identified in
the `layout_height` and `layout_width` attributes; these should have the
values 3301 and 2550, as shown in the example below:

```
[
    {
        "element_id": "c0493a7872f227e4172c4192c5f48a06",
        "metadata": {
            "coordinates": {
                "layout_height": 3301,
                "layout_width": 2550,

```

2. There should be no repeated elements. 
3. Order should be closer to reading order.

The script to run Chipper from unstructured is:

```
from unstructured import __version__
print(__version__.__version__)

import json
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3")))

with open('out.un.json', 'w') as w:
    json.dump(elements, w)

```



[Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf)

---------

Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>
2024-01-25 02:33:32 +00:00
Matt Robinson
4613e52e11
fix: treat yaml files as plain text (#2446)
### Summary

Closes #2412. Adds support for YAML MIME types and treats them as plain
text. In response to `500` errors that the API currently returns if the
MIME type is `text/yaml`.
2024-01-24 17:48:36 +00:00
David Potter
9fea85dc21
fix: remove none value keys from flattened dictionary (#2442)
When a partitioned or embedded document json has null values, those get
converted to a dictionary with None values.

This happens in the metadata. I have not seen it in other keys.

Chroma and Pinecone do not like those None values. 

`flatten_dict` has been modified with a `remove_none` arg to remove keys
with None values.
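
The behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `flatten_dict_sketch`, the `_` key separator, and the sample element are assumptions; only the `remove_none` flag comes from the PR description.

```python
from typing import Any, Dict


def flatten_dict_sketch(
    data: Dict[str, Any],
    parent_key: str = "",
    separator: str = "_",  # assumed separator, chosen for illustration
    remove_none: bool = False,
) -> Dict[str, Any]:
    """Flatten nested dicts into one level, optionally dropping None values."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten_dict_sketch(value, new_key, separator, remove_none))
        elif value is None and remove_none:
            # Skip keys whose value is None so vector stores never see them.
            continue
        else:
            flat[new_key] = value
    return flat


# Hypothetical element dictionary with a null metadata field.
element = {"text": "Hello", "metadata": {"page_number": 1, "links": None}}
flattened = flatten_dict_sketch(element, remove_none=True)
print(flattened)
```

With `remove_none=False` the `metadata_links` key would be kept with a `None` value, which is the case the PR says Chroma and Pinecone reject.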

Also, Pinecone has been pinned at 2.2.4 because at 3.0 and above it
breaks our code.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-23 21:52:11 +00:00
John
c34fac9c3a
enhancement: add _clean_ocr_languages_arg helper function (#2413)
This PR is one in a series of PRs for refactoring and fixing the
languages parameter so it can address incorrect input by users. #2293

This PR adds _clean_ocr_languages_arg. There are no calls to this
function yet, but it will be called in later PRs related to this series.
2024-01-19 19:59:08 +00:00
Christine Straub
7378a378f6
enhancement: allow setting image block crop padding parameter (#2415)
Closes #2320 .

### Summary
In certain circumstances, adjusting the image block crop padding can
improve image block extraction by preventing extracted image blocks from
being clipped.

### Testing
- PDF:
[LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf)
- Set two environment variables
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`
(e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`,
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`)

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="LM339-D_2-2.pdf",
    extract_image_block_types=["image"],
)
```
2024-01-19 06:28:32 +00:00
Ahmet Melek
a9ad8ac8d1
fix: update flatten dict to support flattening tuples (#2423)
This PR updates flatten_dict function to support flattening tuples. 

This is necessary for objects like Coordinates, when the object is not
written to the disk, therefore not being converted to a list before
getting flattened.
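
One way to picture tuple flattening is to treat each tuple (or list) position as a key. The sketch below is an assumption about the general shape of such a change, not the code from the PR: the function name, the `_` separator, the index-based key format, and the sample coordinates are all hypothetical.

```python
from typing import Any, Dict


def flatten_with_tuples(
    data: Dict[str, Any], parent_key: str = "", separator: str = "_"
) -> Dict[str, Any]:
    """Flatten nested dicts, lists and tuples into a single-level dict."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_with_tuples(value, new_key, separator))
        elif isinstance(value, (list, tuple)):
            # Turn positions into keys so sequences flatten like dictionaries.
            indexed = {str(index): item for index, item in enumerate(value)}
            flat.update(flatten_with_tuples(indexed, new_key, separator))
        else:
            flat[new_key] = value
    return flat


# In-memory coordinates hold tuples; after a round trip through JSON they would be lists.
coordinates = {"points": ((0.0, 1.5), (2.0, 3.5)), "system": "PixelSpace"}
flat_coordinates = flatten_with_tuples({"coordinates": coordinates})
print(flat_coordinates)
```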
2024-01-19 00:21:22 +00:00
John
fa9f6ccc17
refactor: use _get_iso639_language_object (#2424)
This refactor removes `_convert_to_standard_langcode` and replaces it
with calling `_get_iso639_language_object` with a string slice.

Use of TESSERACT_LANGUAGES_AND_CODES, which was added to
`_convert_to_standard_langcode` previously, is moved to the relevant
part where `_convert_to_standard_langcode` was previously called.

If/else statements replace the list comprehension for readability and
`langdetect_langs.append("zho")` replaces
`_convert_to_standard_langcode("zh")` since that always returned
`"zho"`.
2024-01-19 00:14:45 +00:00
Matt Robinson
4d5038d9fd
enhancement: add support from bitmap images (#2414)
### Summary

Adds support for bitmap images (`.bmp`) in both file detection and
partitioning. Bitmap images will be processed with `partition_image`
just like JPGs and PNGs.

### Testing

```python
import os

from unstructured.file_utils.filetype import detect_filetype
from unstructured.partition.auto import partition
from PIL import Image

filename = "example-docs/layout-parser-paper-with-table.jpg"
# Expand "~" explicitly; it is not expanded automatically when saving
bmp_filename = os.path.expanduser("~/tmp/layout-parser-paper-with-table.bmp")

img = Image.open(filename)
img.save(bmp_filename)

detect_filetype(filename=bmp_filename) # Should be FileType.BMP

elements = partition(filename=bmp_filename)
```
2024-01-17 22:50:36 +00:00
John
125b63cd7c
refactor: extract language helper functions (#2370)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

Refactor `_convert_language_code_to_pytesseract_lang_code` and extract
`_get_iso639_language_object` to its own function


```
from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert
convert("English") # this will raise an error on both main and this branch
convert("en") # this will return "eng" on both branches
```
2024-01-16 17:51:03 +00:00
Christine Straub
ee06260987
feat: keep all image elements when using hi_res strategy. (#2382)
### Summary
The goal of this PR is to keep all image elements when using "hi_res"
strategy. Previously, `Image` elements with small chunks of text were
ignored unless the image block extraction parameters
(`extract_images_in_pdf` or `extract_image_block_types`) were specified.
Now, all image elements are kept regardless of whether the image block
extraction parameters are specified.

### Testing
- on `main` branch,
```
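from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(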
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
)
image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print("number of image elements: ", len(image_elements))
```
The above code will display `number of image elements: 0`. 

- on this `feature` branch,

The same code will display `number of image elements: 3`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-01-15 23:19:17 +00:00
Matt Robinson
36faf677c0
enhancement: file detection for .wav files (#2387)
### Summary

Adds filetype detection for `.wav` audio files

### Testing

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "example-docs/CantinaBand3.wav"
detect_filetype(filename=filename) # Should be FileType.WAV
```
2024-01-15 16:50:49 +00:00
John
bfd0258ba5
chore: refactor _convert_to_standard_langcode (#2369)
This PR is one in a series of PRs for refactoring and fixing the
`languages` parameter so it can address incorrect input by users. #2293

This PR adds a dictionary for helping map fully spelled out languages to
tesseract language codes

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-11 00:34:13 +00:00
Steve Canny
23edf2e911
feature(chunking): add basic strategy and overlap (#2367)
This PR culminates the restructuring of chunking over my prior
dozen-or-so commits by adding the new options to the API and
documentation.

Separately I'll be adding a new ingest test to defend against
regression, although the integration test included in this PR will do a
pretty good job of that too.
2024-01-10 22:19:24 +00:00
Klaijan
e65a44eabb
feat: update cct eval for text dir (#2299)
This PR edits the `measure_text_extraction_accuracy` function so that it
accepts a directory of txt files as well as json. The function also takes an
`output_type` input, which must be either "json" or "txt", and checks whether
the files under the given directory/list contain only the specified file type.

To test this feature, run the following code:

```
PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt
```
2024-01-05 23:34:53 +00:00
Steve Canny
7a1e732aa1
feat(chunking): add inter-chunk overlap (#2309)
Reviewer: This PR probably reviews faster commit-by-commit. Each of the
commits is groomed and focuses on a separate clear aspect of this
implementation.

This PR adds inter-chunk overlap capability to chunking. It does not yet
expose it via the API.

Inter-chunk overlap is overlap between whole pre-chunks, prior to any
text-splitting required for oversized chunks. Contrast with intra-chunk
overlap implemented in the prior PR which implements overlap on these
latter text-splitting boundaries.

Inter-chunk overlap is disabled by default since a pre-chunk already has
a "clean" semantic boundary (composed of whole elements) and adding
overlap there introduces noise from the adjacent context. If the user
wants inter-chunk overlap they must specify `overlap_all=True` in the
options. Inter-chunk overlap uses the same `overlap` length value used
by intra-chunk overlap and does not overlap when that value is 0.
2024-01-05 01:24:12 +00:00
Steve Canny
22cbdce7ca
fix(html): unequal row lengths in HTMLTable.text_as_html (#2345)
Fixes #2339

Fixes to HTML partitioning introduced with v0.11.0 removed the use of
`tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This
had several benefits, but part of `tabulate`'s behavior was to make
row-length (cell-count) uniform across the rows of the table.

Lacking this prior uniformity produced a downstream problem reported in

On closer inspection, the method used to "harvest" cell-text was
producing more text-nodes than there were cells and was sensitive to
where whitespace was used to format the HTML. It also "moved" text to
different columns in certain rows.

Refine the cell-text gathering mechanism to get exactly one text string
for each row cell, eliminating whitespace formatting nodes and producing
strict correspondence between the number of cells in the original HTML
table row and that placed in `HTMLTable.text_as_html`.

HTML tables that are uniform (every row has the same number of cells)
will produce a uniform table in `.text_as_html`. Merged cells may still
produce a non-uniform table in `.text_as_html` (because the source table
is non-uniform).
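
The "exactly one text string per cell" idea can be illustrated with the standard library. This is a sketch under assumptions: it uses `xml.etree.ElementTree` on a well-formed snippet rather than whatever parser the library uses, and the helper name `rows_of_cell_text` and the sample table are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def rows_of_cell_text(table_html: str) -> List[List[str]]:
    """Return one normalized text string per cell, row by row."""
    table = ET.fromstring(table_html)
    rows: List[List[str]] = []
    for tr in table.iter("tr"):
        cells = [cell for cell in tr if cell.tag in ("td", "th")]
        # Join all text nodes inside a cell, then collapse formatting whitespace.
        rows.append([" ".join("".join(cell.itertext()).split()) for cell in cells])
    return rows


html = """
<table>
  <tr>
    <th>Name</th>
    <th>Role</th>
  </tr>
  <tr>
    <td>Ada <b>Lovelace</b></td>
    <td>
      Analyst
    </td>
  </tr>
</table>
"""
rows = rows_of_cell_text(html.strip())
print(rows)
```

Because text is gathered per cell element rather than per text node, indentation and line breaks in the source HTML do not create extra entries or shift text between columns.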
2024-01-04 21:53:19 +00:00
Christine Straub
5b0ae3fd8b
Refactor: rename image extraction kwargs (#2303)
Currently, we're using different kwarg names in partition() and
partition_pdf(), which has implications for the API since it goes
through partition().

### Summary
- rename `extract_element_types` -> `extract_image_block_types`
- rename `image_output_dir_path` to `extract_image_block_output_dir`
- rename `extract_to_payload` -> `extract_image_block_to_payload`
- rename `pdf_extract_images` -> `extract_images_in_pdf` in
`partition.auto`
- add unit tests to test element extraction for `pdf/image` via
`partition.auto`
### Testing
CI should pass.
2024-01-04 17:52:00 +00:00
Austin Walker
91b892c79d
fix: Fix api_url param to partition_via_api (#2342)
Closes #2340 

We need to make sure the custom url is passed to our client. The client
constructor takes the base url, so for compatibility we can continue to
take the full url and strip off the path.
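
Stripping the path from a full URL can be done with the standard library. The sketch below only illustrates that step; the helper name `base_url_from` and the example path `/general/v0/general` are assumptions, and the actual client code may do this differently.

```python
from urllib.parse import urlsplit, urlunsplit


def base_url_from(api_url: str) -> str:
    """Keep only the scheme and host (with port) of a full endpoint URL."""
    parts = urlsplit(api_url)
    # Drop path, query and fragment; the client is assumed to add its own path.
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


full_url = "http://localhost:8000/general/v0/general"  # illustrative endpoint path
print(base_url_from(full_url))
print(base_url_from("http://localhost:8000"))
```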

To verify, run the api locally and confirm you can make calls to it.

```
# In unstructured-api
make run-web-app

# In ipython in this repo
from unstructured.partition.api import partition_via_api
filename = "example-docs/layout-parser-paper.pdf"
partition_via_api(filename=filename, api_url="http://localhost:8000")
```
2024-01-03 20:08:48 +00:00
Christine Straub
9459af435d
Fix: element extraction not working when using "auto" strategy for pdf (#2324)
Closes #2323.

### Summary
- update logic to return "hi_res" if either `extract_images_in_pdf` or
`extract_element_types` is set
- refactor: remove unused `file` parameter from
`determine_pdf_or_image_strategy()`
### Testing
```
from unstructured.documents.elements import ElementType
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    extract_element_types=["Image"],
    extract_to_payload=True,
)

image_elements = [el for el in elements if el.category == ElementType.IMAGE]
print(image_elements)
```
2023-12-28 22:25:30 +00:00
Christine Straub
dd144456de
Feat: return base64 encoded images for PDF's (#2310)
Closes #2302.
### Summary
- add functionality to get a Base64 encoded string from a PIL image
- store base64 encoded image data in two metadata fields: `image_base64`
and `image_mime_type`
- update the "image element filter" logic to keep all image elements in
the output if a user specifies image extraction
### Testing
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    extract_element_types=["Image", "Table"],
    extract_to_payload=True,
)
```
or
```
from unstructured.partition.auto import partition

elements = partition(
    filename="example-docs/embedded-images-tables.pdf",
    strategy="hi_res",
    pdf_extract_element_types=["Image", "Table"],
    pdf_extract_to_payload=True,
)
```
2023-12-27 05:39:01 +00:00
Roman Isecke
8ba9fadf8a
feat: improve dataclass use for encoders (#2318)
### Description
Leverage a similar pattern to what is used for connectors, where there
is a nested config dataclass as a field, along with cached content for
things like the client and sample embedding for each. This required an
update on the embeddings config in ingest and I left a TODO in there
because the current approach breaks on other encoders such as bedrock
because the parameters in that config don't map to all encoders. But
this keeps the existing functionality working.

This update makes sure all variables associated with the dataclass exist
when it's instantiated rather than being added in the `__post_init__()`
method or the `initialize()`, allowing other libraries like pydantic to
appropriately generate schemas from it. It also now follows the pattern
of the connectors in that each class has a nested config class used to
instantiate the client itself as well as a field/property approach used
to cache the client.
2023-12-26 22:33:19 +00:00
Steve Canny
eb1b022ff8
feat(chunking): add overlap on chunk-splits (#2305)
There are two distinct overlap operations with completely different
implementations. This is "intra-chunk" overlap, applying overlap to
chunks resulting from text-splitting an oversized element.

So if an oversized element had text "abcd efgh ijkl mnop qrst" and was
split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl"
and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and
applied at the beginning of the string (before "abcd") is handled in a
separate operation in the next PR.
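
The worked example above can be reproduced with a small sketch. It is not the chunker's actual code: the function name `split_with_overlap` is hypothetical, and splitting on the last space within the window is an assumption made so the output matches the example.

```python
from typing import List


def split_with_overlap(text: str, maxlen: int, overlap: int) -> List[str]:
    """Split `text` into chunks of at most `maxlen` chars, repeating a tail as overlap."""
    if not 0 <= overlap < maxlen - 1:
        raise ValueError("overlap must be non-negative and smaller than maxlen - 1")
    chunks: List[str] = []
    remainder = text
    while len(remainder) > maxlen:
        # Prefer the last space that keeps the chunk within maxlen.
        cut = remainder.rfind(" ", 0, maxlen + 1)
        if cut <= overlap:
            # No usable word boundary: fall back to a hard cut.
            cut = maxlen
        chunk = remainder[:cut].rstrip()
        chunks.append(chunk)
        # The tail of this chunk becomes the prefix of the next one.
        tail = chunk[-overlap:].lstrip() if overlap else ""
        rest = remainder[cut:].lstrip()
        remainder = f"{tail} {rest}" if tail else rest
    if remainder:
        chunks.append(remainder)
    return chunks


chunks = split_with_overlap("abcd efgh ijkl mnop qrst", maxlen=15, overlap=5)
print(chunks)
```

Here the 5-character tail of the first chunk is `" ijkl"`, which is stripped to `"ijkl"` and prefixed to the remaining text.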
2023-12-22 20:35:18 +00:00
John
5c0043aa7d
chore: add hi_res_model_name kwarg (#2289)
Closes #2160 

Explicitly adds `hi_res_model_name` as kwarg to relevant functions and
notes that `model_name` is to be deprecated.

Testing:
```
from unstructured.partition.auto import partition
filename = "example-docs/DA-1p.pdf"
elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Steve Canny <stcanny@gmail.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-12-22 15:06:54 +00:00
Steve Canny
093a11d058
rfctr(chunking): split oversized chunks on word boundary (#2297)
The text of an oversized chunk is split on an arbitrary character
boundary (mid-word). The `chunk_by_character()` strategy introduces the
idea of allowing the user to specify a separator to use for
chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " ";
blank-line, newline, or word boundaries respectively.

Even if the user is allowed to specify a separator, we must provide
fall-back for when a chunk contains no such character. This can be done
incrementally, like blank-line is preferable to newline, newline is
preferable to word, and word is preferable to arbitrary character.

Further, there is nothing particular to `chunk_by_character()` in
providing such a fall-back text-splitting strategy. It would be
preferable for all strategies to split oversized chunks on even-word
boundaries for example.

Note that while a "blank-line" ("\n\n") may be common in plain text, it
is unlikely to appear in the text of an element because it would have
been interpreted as an element boundary during partitioning.

Add _TextSplitter with basic separator preferences and fall-back and
apply it to chunk-splitting for all strategies. The `by_character`
chunking strategy may enhance this behavior by adding the option for a
user to specify a particular separator suited to their use case.
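
The separator fall-back described above might look like the following sketch. It is an illustration only, not `_TextSplitter` itself: the function name `split_once` and the default separator order (newline, then space) are assumptions.

```python
from typing import Sequence, Tuple


def split_once(
    text: str, maxlen: int, separators: Sequence[str] = ("\n", " ")
) -> Tuple[str, str]:
    """Split `text` into (head, remainder) with `head` at most `maxlen` chars."""
    if len(text) <= maxlen:
        return text, ""
    for sep in separators:
        # Last occurrence of this separator that keeps the head within maxlen.
        cut = text.rfind(sep, 0, maxlen + len(sep))
        if cut > 0:
            return text[:cut].rstrip(), text[cut + len(sep):].lstrip()
    # No preferred separator in range: split on an arbitrary character boundary.
    return text[:maxlen], text[maxlen:]


print(split_once("alpha beta gamma", 12))
print(split_once("line one\nline two words", 12))
print(split_once("abcdefghijklmnop", 5))
```

Separators are tried in order of preference, so a newline inside the window wins over a later space, and the mid-word split is used only when nothing better exists.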
2023-12-21 05:45:36 +00:00
John
04f4c3ab16
create teardown fixture for tests (#2269)
Closes #2263 
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.

Testing
on the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch no files are created
2023-12-20 17:50:12 +00:00
Andy Li
4ae49419c9
feat: support base64-encoded text in partition_email (#2277)
closes #816 
## Description
Added functionality for `partition_email` to automatically decode base64
text before passing it to `partition_text` or `partition_html`.
Also adds base64 encoded email text test cases.
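
For context, decoding a base64 body with the standard `email` package can be sketched as below. This shows the general mechanism only and is not taken from `partition_email`; the message contents are made up for the example.

```python
from email import message_from_string
from email.message import EmailMessage

# Build a small message whose body is transferred as base64.
outgoing = EmailMessage()
outgoing["Subject"] = "Base64 body"
outgoing.set_content("Hello from a base64 body", cte="base64")
raw = outgoing.as_string()

# Parse it back and decode the payload according to its transfer encoding.
parsed = message_from_string(raw)
encoding = parsed.get("Content-Transfer-Encoding", "")
payload = parsed.get_payload(decode=True)  # bytes, with base64 already undone
text = payload.decode(parsed.get_content_charset() or "utf-8")
print(encoding, repr(text))
```

The decoded `text` is what would then be handed to a plain-text or HTML partitioner.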
2023-12-19 23:37:17 -08:00
Steve Canny
82714cad98
rfctr(chunking): extract BasePreChunker (#2294)
The `_split_elements_by_title_and_table()` function fulfills the
pre-chunker role for `chunk_by_title()`, but most of its operation is
not strategy-specific and can be reused by other chunking strategies.

Extract `BasePreChunker` and use it as the base class for
`_ByTitlePreChunker` which now only needs to provide the boundary
predicates specific to that strategy.
2023-12-20 06:30:21 +00:00
Steve Canny
4e2ba2c9b2
rfctr(chunking): extract boundary predicates (#2284)
`chunk_by_title()` respects certain semantic boundaries while chunking.
Those are sections introduced by a `Title` element, sections introduced
by a `metadata.section` value change, and optionally page-breaks.
"Respecting" in this context means that elements on opposite sides of a
semantic boundary never appear in the same chunk.

The `metadata_differs()` function used for this purpose is clumsy to use
requiring the caller to maintain state (prior element). It also combines
what are independent predicates such that they cannot be individually
reused.

Introduce the `BoundaryPredicate` type which takes an element and
returns bool, indicating whether the element introduces a new semantic
boundary. These can be reused by any chunking strategy that needs them
and allows the pre-chunking operation to be generalized for use by any
chunking strategy, which it will be in the following PR.
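
A boundary predicate of this kind could be sketched as follows. The `FakeElement` class, the predicate names, and the `pre_chunk` helper are hypothetical stand-ins for illustration; only the idea of a callable that takes an element and returns a bool comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class FakeElement:
    """Stand-in for a document element (hypothetical, for illustration only)."""

    text: str
    page_number: Optional[int] = None
    is_title: bool = False


BoundaryPredicate = Callable[[FakeElement], bool]


def is_title(element: FakeElement) -> bool:
    """Stateless predicate: a title starts a new section."""
    return element.is_title


def is_on_next_page() -> BoundaryPredicate:
    """Build a predicate that keeps the prior page number as its own state."""
    current_page: Optional[int] = None

    def predicate(element: FakeElement) -> bool:
        nonlocal current_page
        page = element.page_number
        if page is None or page == current_page:
            return False
        first_seen = current_page is None
        current_page = page
        return not first_seen

    return predicate


def pre_chunk(
    elements: Sequence[FakeElement], predicates: Sequence[BoundaryPredicate]
) -> List[List[FakeElement]]:
    groups: List[List[FakeElement]] = []
    for element in elements:
        # Evaluate every predicate so stateful ones see each element.
        results = [predicate(element) for predicate in predicates]
        if any(results) or not groups:
            groups.append([])
        groups[-1].append(element)
    return groups


elements = [
    FakeElement("Intro", 1, is_title=True),
    FakeElement("a", 1),
    FakeElement("b", 2),
    FakeElement("Next", 2, is_title=True),
    FakeElement("c", 2),
]
groups = pre_chunk(elements, [is_title, is_on_next_page()])
group_texts = [[e.text for e in group] for group in groups]
print(group_texts)
```

The state lives inside the predicate's closure, so the caller no longer tracks the prior element, and each predicate can be reused on its own.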
2023-12-19 18:20:05 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
Steve Canny
0c7f64ecaa
rfctr(chunking): generalize PreChunkBuilder (#2283)
To implement inter-pre-chunk overlap, we need a context that sees every
pre-chunk both before and after it is accumulated (from elements).

- We need access to the pre-chunk when it is completed so we can extract
the "tail" overlap to be applied to the next chunk.
- We need access to the as-yet-unpopulated pre-chunk so we can add the
prior tail to it as a prefix.

This "visibility" is split between `PreChunkBuilder` and the pre-chunker
itself, which handles `TablePreChunk`s without the builder.

Move `Table` element and `TablePreChunk` formation into `PreChunkBuilder`
such that _all_ element types (adding `Table` elements in particular)
pass through it. Then `PreChunkBuilder` becomes the context we require.

The actual overlap harvesting and application will come in a subsequent
commit.
2023-12-18 22:21:34 +00:00
Steve Canny
36e81c3367
rfctr(chunking): extract general-purpose objects to base (#2281)
Many of the classes defined in `unstructured.chunking.title` are
applicable to any chunking strategy and will shortly be used for the
"by-character" chunking strategy as well.

Move these and their tests to `unstructured.chunking.base`.

Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because
it will be generalized in a subsequent PR to also take `Table` elements
such that inter-pre-chunk overlap can be implemented.

Otherwise, no logic changes, just moves.
2023-12-16 17:28:15 +00:00
Christine Straub
a7c3f5f570
Refactor: importation consistency for partition_pdf() and partition_image() (#2282)
Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned
in issue #2280.
2023-12-15 22:29:58 +00:00
Steve Canny
70cf141036
rfctr: extract ChunkingOptions (#2266)
Chunking options for things like chunk-size are largely independent of
chunking strategy. Further, validating the args and applying defaults
based on call arguments is sophisticated to make its use easy for the
caller. These details distract from what the chunker is actually doing
and would need to be repeated for every chunking strategy if left where
they are.

Extract these settings and the rules governing chunking behavior based
on options into its own immutable object that can be passed to any
component that is subject to optional behavior (pretty much all of
them).
2023-12-15 19:51:02 +00:00
Yao You
5f5ff6319f
fix: consider text in cid code as invalid in hi_res (#2259)
This PR addresses
[CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969)
- pdfminer sometimes fails to decode text in a PDF file and returns cid
codes as text
- such text is now considered invalid and replaced with OCR
results in `hi_res` mode
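
Undecoded text from pdfminer typically shows up as sequences such as `(cid:23)`. A check for it might be sketched as below; the helper names and the 10% threshold are assumptions for illustration and are not the utility functions added by this PR.

```python
import re

CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of characters in `text` that belong to (cid:N) codes."""
    if not text:
        return 0.0
    cid_chars = sum(len(match.group(0)) for match in CID_PATTERN.finditer(text))
    return cid_chars / len(text)


def looks_undecoded(text: str, threshold: float = 0.1) -> bool:
    """Treat text as invalid when cid codes exceed the (assumed) threshold."""
    return cid_ratio(text) > threshold


print(looks_undecoded("(cid:23)(cid:45)(cid:12)"))
print(looks_undecoded("Strengthening Cybersecurity"))
```

Text flagged this way would be a candidate for replacement by OCR output.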

## test

This PR adds unit test for the utility functions. In addition the file
below would return elements with text in cid code on main but proper
ascii text with this PR:


[005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf)

This change improves both cct accuracy and %missing scores:

**before:**
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.681   0.267     0.266         105
cct-%missing 0.086   0.159     0.159         105
```

**after:**
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.697   0.251     0.250         105
cct-%missing 0.071   0.123     0.122         105
```

[CORE-2969]:
https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-12-14 06:49:23 +00:00
Austin Walker
d594c06a3e
fix: handle delimiter bug in partition_csv (#2224)
Closes #2218. When a csv has commas in its content, and the delimiter is
something else, Pandas may throw an error. We can sniff the csv and get
the correct delimiter to pass to Pandas. To verify, try partitioning the
file in the linked bug.
2023-12-13 23:57:46 +00:00
Steve Canny
cbeaed21ef
rfctr: rename pre chunk (#2261)
The original naming for the precursor to a chunk in `chunk_by_title()`
was conflated with the idea of how these element subsequences were
bounded (by document-section) for that strategy. I mistakenly picked
that up as a universal concept but in fact no notion of section arises
in the `by_character` or other chunking strategies.

Fix this misconception by using the name `pre-chunk` for this concept
throughout.
2023-12-13 23:13:57 +00:00
Steve Canny
74d089d942
rfctr: skip CheckBox elements during chunking (#2253)
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.

This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.

Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as CheckBox and PageBreak which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.

This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.

Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
2023-12-13 20:22:25 +00:00
Yao You
36e4639e05
fix: image may be scaled too large for tesseract (#2252)
This PR addresses
[CORE-2965](https://unstructured-ai.atlassian.net/browse/CORE-2965) by
limiting zoom factor so that the scaled image can still be processed by
tesseract.

- tesseract has a 2^31 byte limit on image data
- occasionally an image may be scaled too much and larger than that size
- fix limits the scaling factor so that we never scale an image larger
than what tesseract can handle
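
The arithmetic behind such a cap can be sketched as follows. The function name `cap_zoom` and the value of 3 bytes per pixel are assumptions for illustration; only the 2^31 byte limit comes from the description above.

```python
import math

TESSERACT_MAX_BYTES = 2**31  # limit quoted in the PR description


def cap_zoom(width: int, height: int, zoom: float, bytes_per_pixel: int = 3) -> float:
    """Limit `zoom` so the scaled image stays within the byte limit.

    `bytes_per_pixel=3` (RGB) is an assumption made for this sketch.
    """
    # Scaled size is (width * zoom) * (height * zoom) * bytes_per_pixel,
    # so the largest admissible zoom is the square root of the ratio.
    max_zoom = math.sqrt(TESSERACT_MAX_BYTES / (width * height * bytes_per_pixel))
    return min(zoom, max_zoom)


modest = cap_zoom(1000, 1000, 2.0)
extreme = cap_zoom(1000, 1000, 5000.0)
scaled_bytes = int(1000 * extreme) * int(1000 * extreme) * 3
print(modest, extreme, scaled_bytes)
```

A modest zoom passes through unchanged, while an extreme request is reduced until the scaled image fits.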

## test

A unit test is added in this PR to test an unlikely case where we'd scale
an image a few thousand times and massively exceed the limit without the
fix.

Unstructured reviewers can also use the document in the ticket to test.


[CORE-2965]:
https://unstructured-ai.atlassian.net/browse/CORE-2965?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-12-13 19:35:05 +00:00
John
d3a404cfb5
pdfminer bug (#2244)
Closes #2212.

### Summary
This PR implements logic to fall back to the "inferred_layout + OCR" if
pdfminer fails in the `hi_res` pipeline (discussed in [this slack
channel](https://unstructuredw-kbe4326.slack.com/archives/C057R3F8F7A/p1701807299018929)).

### Testing
PDF:
[NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13554149/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf)

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf",
    strategy="hi_res",
)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-12-13 00:51:38 +00:00
Steve Canny
21bc67f52f
rfctr: improve element typing (#2247)
In preparation for work on generalized chunking including
`chunk_by_character()` and overlap, get `elements` module and tests
passing strict type-checking.
2023-12-12 23:12:23 +00:00
Christine Straub
da7ac625b1
Feat: save tables in PDF's as images (#2229)
closes #2222.

### Summary
The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This
filename is presented in the `image_path` metadata field for the Table
element. The default would be to not do this.

### Testing
PDF: [124_PDFsam_Basel III - Finalising post-crisis
reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf)

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_element_types=['Table'],
)
```
2023-12-11 19:14:41 +00:00
Christine Straub
ed76b11b1a
Refactor: support image extraction (#2201)
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.

### Testing

`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
)

print("\n\n".join([str(el) for el in elements]))
```
2023-12-05 18:22:29 +00:00
John
8fa5cbf036
build(ci): rm unneeded call to get_api_key in test (#2199)
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).

Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to be used only for tests decorated with
`@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI")`
(which are skipped because those tests partition pdf/jpg files).
The tests changed here partition emails and rely on the MockResponse at the
top of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
2023-12-03 21:28:05 -08:00