368 Commits

Author SHA1 Message Date
Klaijan
a11d4634f1
fix: type error string indices bug (#1940)
Fix TypeError: string indices must be integers. The `annotation_dict`
variable is set to `None` if the instance type is not a dict. We then
add logic to skip the attempt when the value is `None`.
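For illustration, a minimal sketch of that guard (the function, variable, and key names here are illustrative, not the exact code from the PR):

```python
def get_annotation_uri(annotation):
    # set annotation_dict to None when the value is not actually a dict
    annotation_dict = annotation if isinstance(annotation, dict) else None
    if annotation_dict is None:
        return None  # skip the attempt entirely instead of indexing a string
    return annotation_dict.get("uri")

print(get_annotation_uri({"uri": "https://example.com"}))  # https://example.com
print(get_annotation_uri("not-a-dict"))                    # None
```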
2023-10-30 17:38:57 -07:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00
qued
645a0fb765
fix: md tables (#1924)
Courtesy @phowat, created a branch in the repo to make some changes and
merge quickly.

Closes #1486.

* **Fixes issue where tables from markdown documents were being treated
as text** Problem: Tables from markdown documents were being treated as
text, and not being extracted as tables. Solution: Enable the `tables`
extension when instantiating the `python-markdown` object. Importance:
This will allow users to extract structured data from tables in markdown
documents.
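For reference, a small sketch of what enabling the `tables` extension on the `python-markdown` object looks like (the table content here is just an example):

```python
import markdown

md_text = """\
| Team   | Location | Stanley Cups |
|--------|----------|--------------|
| Blues  | STL      | 1            |
| Flyers | PHI      | 2            |
"""

# Without the "tables" extension the rows come back as plain paragraphs;
# with it, python-markdown emits a real <table> element.
html = markdown.markdown(md_text, extensions=["tables"])
print(html)
```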

#### Testing:

On `main` run the following (run `git checkout fix/md-tables --
example-docs/simple-table.md` first to grab the example table from this
branch)
```python
from unstructured.partition.md import partition_md
elements = partition_md("example-docs/simple-table.md")
print(elements[0].category)

```
Output should be `UncategorizedText`. Then run the same code on this
branch and observe the output is `Table`.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-30 14:09:46 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements`, which deletes
pdfminer elements that fall inside elements from other detection origins
such as YoloX or detectron.
The function returns the cleaned document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is to check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`.
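For orientation, a hedged usage sketch (the module path is inferred from the test path above and may differ):

```python
# Module path inferred from the test path above; the real location may differ.
from unstructured_inference.inference.layout import DocumentLayout
from unstructured.partition.utils.processing_elements import clean_pdfminer_inner_elements

def drop_nested_pdfminer_elements(document: DocumentLayout) -> DocumentLayout:
    # Removes pdfminer-origin elements that sit inside regions detected by
    # YoloX/detectron and returns the cleaned document.
    return clean_pdfminer_inner_elements(document)
```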

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
Steve Canny
7373391aa4
fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be and by default is configured to combine sequential small
chunks that will together fit within the full chunk window (default 500
chars).

Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.

Example:
```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
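As a rough illustration of that filter shape, here is a schematic sketch only, using plain strings in place of elements; it is not the actual `_SectionCombiner`:

```python
from typing import Iterable, Iterator, List

Section = List[str]  # stand-in for _Section; real sections hold Element objects

def combine_small_sections(
    sections: Iterable[Section], maxlen: int = 500, separator: str = "\n\n"
) -> Iterator[Section]:
    """Accept a stream of sections, emit a stream of sometimes-combined sections."""
    accumulated: Section = []
    accumulated_len = 0
    for section in sections:
        section_len = sum(len(t) for t in section) + len(separator) * max(len(section) - 1, 0)
        join_cost = len(separator) if accumulated else 0
        if accumulated_len + join_cost + section_len <= maxlen:
            # the whole section still fits, so combine it with what we have so far
            accumulated.extend(section)
            accumulated_len += join_cost + section_len
        else:
            if accumulated:
                yield accumulated
            accumulated = list(section)
            accumulated_len = section_len
    if accumulated:
        yield accumulated


sections = [
    ["Lorem Ipsum", "Lorem ipsum dolor sit amet consectetur adipiscing elit."],
    ["Rhoncus", "In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."],
]
# Each title stays with its own section; sections are only combined when they fit whole.
for combined in combine_small_sections(sections, maxlen=80):
    print(combined)
```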
2023-10-30 04:20:27 +00:00
Yao You
f87731e085
feat: use yolox as default to table extraction for pdf/image (#1919)
- yolox has better recall than yolox_quantized, the current default
model, for table detection
- update logic so that when `infer_table_structure=True` the default
model is `yolox` instead of `yolox_quantized`
- user can still override the default by passing in a `model_name` or
setting the env variable `UNSTRUCTURED_HI_RES_MODEL_NAME`

## Test:

Partition the attached file with 

```python
from unstructured.partition.pdf import partition_pdf

yolox_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True)
yolox_quantized_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True, model_name="yolox_quantized")
```

Compare the table elements between those two; the yolox (default)
elements should have a more complete table.


[AK_AK-PERS_CAFR_2008_3.pdf](https://github.com/Unstructured-IO/unstructured/files/13191198/AK_AK-PERS_CAFR_2008_3.pdf)
2023-10-27 15:37:45 -05:00
Yao You
42f8cf1997
chore: add metric helper for table structure eval (#1877)
- add a helper to run inference over an image or pdf of a table and compare
it against a ground truth csv file
- this metric generates a similarity score between 1 and 0, where 1 is a
perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area of an image/pdf file;
we are not evaluating table element detection with this metric
2023-10-27 13:23:44 -05:00
Steve Canny
f273a7cb83
fix: sectioner does not consider separator length (#1858)
### sectioner-does-not-consider-separator-length

**Executive Summary.** A primary responsibility of the sectioner is to
minimize the number of chunks that need to be split mid-text. It does
this by computing text-length of the section being formed and
"finishing" the section when adding another element would extend its
text beyond the window size.

When element-text is consolidated into a chunk, the text of each element
is joined, separated by a "blank-line" (`"\n\n"`). The sectioner does
not currently consider the added length of separators (2-chars each) and
so forms sections that need to be split mid-text when chunked.

Chunk-splitting should only be necessary when the text of a single
element is longer than the chunking window.

**Example**

  ```python
    elements: List[Element] = [
        Title("Chunking Priorities"),  # 19 chars
        ListItem("Divide text into manageable chunks"),  # 34 chars
        ListItem("Preserve semantic boundaries"),  # 28 chars
        ListItem("Minimize mid-text chunk-splitting"),  # 33 chars
    ]  # 114 chars total but 120 chars with separators

    chunks = chunk_by_title(elements, max_characters=115)
  ```

  Want:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
    ),
    CompositeElement("Minimize mid-text chunk-splitting"),
]
  ```

  Got:

  ```python
[
    CompositeElement(
        "Chunking Priorities"
        "\n\nDivide text into manageable chunks"
        "\n\nPreserve semantic boundaries"
        "\n\nMinimize mid-text chunk-spli"),
    )
    CompositeElement("tting")
  ```

### Technical Summary

Because the sectioner does not consider separator (`"\n\n"`) length when
it computes the space remaining in the section, it over-populates the
section and when the chunker concatenates the element text (each
separated by the separator) the text exceeds the window length and the
chunk must be split mid-text, even though there was an even element
boundary it could have been split on.

### Fix

Consider separator length in the space-remaining computation.

The solution here extracts both the `section.text_length` and
`section.space_remaining` computations to a `_TextSectionBuilder` object
which removes the need for the sectioner
(`_split_elements_by_title_and_table()`) to deal with primitives
(List[Element], running text length, separator length, etc.) and allows
it to focus on the rules of when to start a new section.
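A simplified sketch of the separator-aware bookkeeping described above (the names and details are illustrative, not the actual `_TextSectionBuilder`):

```python
SEPARATOR = "\n\n"

class TextSectionBuilderSketch:
    """Tracks section text length including the separators the chunker will add."""

    def __init__(self, maxlen: int = 500) -> None:
        self._maxlen = maxlen
        self._texts = []

    @property
    def text_length(self) -> int:
        # the joined chunk contains one separator between each pair of texts
        separators_len = len(SEPARATOR) * max(len(self._texts) - 1, 0)
        return sum(len(t) for t in self._texts) + separators_len

    @property
    def space_remaining(self) -> int:
        # adding another element costs its text plus one more separator
        next_separator = len(SEPARATOR) if self._texts else 0
        return self._maxlen - self.text_length - next_separator

    def add_text(self, text: str) -> None:
        self._texts.append(text)


builder = TextSectionBuilderSketch(maxlen=115)
for text in [
    "Chunking Priorities",                 # 19 chars
    "Divide text into manageable chunks",  # 34 chars
    "Preserve semantic boundaries",        # 28 chars
]:
    builder.add_text(text)

print(builder.text_length)      # 85: 81 chars of text + two 2-char separators
print(builder.space_remaining)  # 28, so the 33-char list item starts a new section
```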

This solution may seem like overkill at the moment, and indeed it would
be, except that it forms the foundation for adding section-level chunk
combination (fix: dissociated title elements) in the next PR. The
objects introduced here will gain several additional responsibilities in
the next few chunking PRs in the pipeline and will earn their place.
2023-10-26 21:34:15 +00:00
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
2023-10-26 12:22:40 -05:00
Sebastian Laverde Alfonso
c11a2ff478
feat: method to catch and classify overlapping bounding boxes (#1803)
We have established that overlapping bounding boxes do not have a
one-size-fits-all solution, so different cases need to be handled differently
to avoid information loss. We have manually identified the
cases/categories of overlapping. Now we need a method to
programmatically classify overlapping-bbox cases within detected
elements in a document and return a report about them (a list of cases with
metadata). This serves several purposes:

- **Evaluation**: We can have a pipeline using the DVC data registry
that assesses the performance of a detection model against a set of
documents (PDFs/images) by analysing the overlapping-bbox cases it
has. The metadata in the output can be used for generating metrics for
this.
- **Scope overlapping cases**: Manual inspection gives us a clue about
currently present cases of overlapping bboxes. We need to propose
solutions to fix those in code. This method generates a report by
analysing several aspects of two overlapping regions. This data can be
used to profile and specify the necessary changes that will fix each
case.
- **Fix overlapping cases**: We could introduce this functionality in
the flow of a partition method (such as `partition_pdf`) to handle the
calls to post-processing methods that fix overlapping. Tested on ~331
documents, the worst time per page is around 5 ms; for a document such as
`layout-parser-paper.pdf` it takes 4.46 ms.

Introduces functionality to take a list of unstructured elements (which
contain bounding boxes) and identify pairs of bounding boxes which
overlap and which case is pertinent to the pairing. This PR includes the
following methods in `utils.py`:

- **`ngrams(s, n)`**: Generate n-grams from a string
- **`calculate_shared_ngram_percentage(string_A, string_B, n)`**:
Calculate the percentage of `common_ngrams` between `string_A` and
`string_B` with reference to the total number of ngrams in `string_A`.
- **`calculate_largest_ngram_percentage(string_A, string_B)`**:
Iteratively call `calculate_shared_ngram_percentage` starting from the
biggest ngram possible until the shared percentage is >0.0%
- **`is_parent_box(parent_target, child_target, add=0)`**: True if the
`child_target` bounding box is nested in the `parent_target`. Box format:
[`x_bottom_left`, `y_bottom_left`, `x_top_right`, `y_top_right`]. The
parameter `add` is the pixel error tolerance for extra pixels outside
the parent region.
- **`calculate_overlap_percentage(box1, box2,
intersection_ratio_method="total")`**: Box format: [`x_bottom_left`,
`y_bottom_left`, `x_top_right`, `y_top_right`]. Calculates the
percentage of overlapped region with reference to biggest element-region
(`intersection_ratio_method="parent"`), the smallest element-region
(`intersection_ratio_method="partial"`), or to the disjunctive union
region (`intersection_ratio_method="total"`).
- **`identify_overlapping_or_nesting_case`**: Identify if there are
nested or overlapping elements. If overlapping is present,
it identifies the case calling the method `identify_overlapping_case`.
- **`identify_overlapping_case`**: Classifies the overlapping case for
an element_pair input in one of 5 categories of overlapping.
- **`catch_overlapping_and_nested_bboxes`**: Catch overlapping and
nested bounding box cases across a list of elements. The params
`nested_error_tolerance_px` and `sm_overlap_threshold` help control
the separation of the cases.

The overlapping/nested elements cases that are being caught are:
1. **Nested elements**
2. **Small partial overlap**
3. **Partial overlap with empty content**
4. **Partial overlap with duplicate text (sharing 100% of the text)**
5. **Partial overlap without sharing text**
6. **Partial overlap sharing**
{`calculate_largest_ngram_percentage(...)`}% **of the text**

Here is a snippet to test it:
```python
from unstructured.partition.auto import partition
from unstructured.utils import catch_overlapping_and_nested_bboxes  # assuming these helpers live in utils.py per the list above

model_name = "yolox_quantized"
target = "sample-docs/layout-parser-paper-fast.pdf"
elements = partition(filename=target, strategy='hi_res', model_name=model_name)
overlapping_flag, overlapping_cases = catch_overlapping_and_nested_bboxes(elements)
for case in overlapping_cases:
    print(case, "\n")
```
Here is a screenshot of a json built with the output list
`overlapping_cases`:
<img width="377" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/38184042/a6fea64b-d40a-4e01-beda-27840f4f4b3a">
2023-10-25 12:17:34 +00:00
qued
d8241cbcfc
fix: filename missing from image metadata (#1863)
Closes
[#1859](https://github.com/Unstructured-IO/unstructured/issues/1859).

* **Fixes elements partitioned from an image file missing certain
metadata** Metadata for image files, like file type, was being handled
differently from other file types. This caused a bug where other
metadata, like the file name, was being missed. This change brought
metadata handling for image files to be more in line with the handling
for other file types so that file name and other metadata fields are
being captured.

Additionally:
* Added test to verify filename is being captured in metadata
* Cleaned up `CHANGELOG.md` formatting

#### Testing:
The following produces output `None` on `main`, but outputs the filename
`layout-parser-paper-fast.jpg` on this branch:
```python
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper-fast.jpg")
print(elements[0].metadata.filename)

```
2023-10-25 05:19:51 +00:00
Steve Canny
40a265d027
fix: chunk_by_title() interface is rude (#1844)
### `chunk_by_title()` interface is "rude"

**Executive Summary.** Perhaps the most commonly specified option for
`chunk_by_title()` is `max_characters` (default: 500), which specifies
the chunk window size.

When a user specifies this value, they get an error message:
```python
  >>> chunks = chunk_by_title(elements, max_characters=100)
  ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.
```
A few of the things that might reasonably pass through a user's mind at
such a moment are:
* "Is `110` not a valid value for `max_characters`? Why would that be?"
* "I didn't specify a value for `combine_text_under_n_chars` or
`new_after_n_chars`, in fact I don't know what they are because I
haven't studied the documentation and would prefer not to; I just want
smaller chunks! How could I supply an invalid value when I haven't
supplied any value at all for these?"
* "Which of these values is the problem? Why are you making me figure
that out for myself? I'm sure the code knows which one is not valid, why
doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that
`combine_text_under_n_chars` (defaults to 500) is greater than
`max_characters`, which means it would never take effect (which is
actually not a problem in itself).

To fix this, once they figured out what the problem was (probably after
opening an issue and maybe reading the source code), the user would need
to specify:
  ```python
  >>> chunks = chunk_by_title(
  ...     elements, max_characters=100, combine_text_under_n_chars=100
  ... )
  ```

This and other stressful user scenarios can be remedied by:
* Using "active" defaults for the `combine_text_under_n_chars` and
`new_after_n_chars` options.
* Providing a specific error message for each way a constraint may be
violated, such that direction to remedy the problem is immediately clear
to the user.

An *active default* is for example:
* Make the default for `combine_text_under_n_chars: int | None = None`
such that the code can detect when it has not been specified.
* When not specified, set its value to `max_characters`, the same as its
current (static) default.
This particular change would avoid the behavior in the motivating
example above.

Another alternative for this argument is simply:
```python
combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
```

### Fix

1. Add constraint-specific error messages.
2. Use "active" defaults for `combine_text_under_n_chars` and
`new_after_n_chars`.
3. Improve docstring to describe active defaults, and explain other
argument behaviors, in particular identifying suppression options like
`combine_text_under_n_chars = 0` to disable chunk combining.
2023-10-24 23:22:38 +00:00
John
8080f9480d
fix strategy test for api and linting (#1840)
### Summary 
Closes unstructured-api issue
[188](https://github.com/Unstructured-IO/unstructured-api/issues/188)
The test and gist were using different versions of the same file
(jpg/pdf), creating what looked like a bug when there wasn't one. The
api is correctly using the `strategy` kwarg.

### Testing
#### Checkout to `main`
- Comment out the `@pytest.mark.skip` decorators for the
`test_partition_via_api_with_no_strategy` test
- Add an API key to your env:
- Add `from dotenv import load_dotenv; load_dotenv()` to the top of the
file and have `UNS_API_KEY` defined in `.env`

- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`
^the test will fail

#### Checkout to this branch 
- (make the same changes as above)
- Run `pytest test_unstructured/partition/test_api.py -k
"test_partition_via_api_with_no_strategy"`

### Other
`make tidy` and `make check` made linting changes to additional files
2023-10-24 22:17:54 +00:00
Yuming Long
01a0e003d9
Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850)
### Summary

A follow-up ticket on
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and noted the
table regression if we only do one OCR for the entire doc.

**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable when comparing bounding boxes
of `ocr_region` with `table_element`
* added more tests for tables and noted the table regression in test for
partition pdf

### Test
* to verify we stopped passing the `extract_tables` parameter to inference, run the test
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this
branch and you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means it called the table OCR in the
inference repo. This branch removes the warning.
2023-10-24 17:13:28 +00:00
qued
44cef80c82
test: Add test to ensure languages trickle down to ocr (#1857)
Closes
[#93](https://github.com/Unstructured-IO/unstructured-inference/issues/93).

Adds a test to ensure language parameters are passed all the way from
`partition_pdf` down to the OCR calls.

#### Testing:

CI should pass.
2023-10-24 16:54:19 +00:00
Yao You
b530e0a2be
fix: partition docx from teams output (#1825)
This PR resolves #1816 
- current docx partition assumes all contents are in sections
- this is not true for MS Teams chat transcript exported to docx
- now the code checks whether there are sections; if not, it iterates
through the paragraphs and partitions their contents
2023-10-24 15:17:02 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carries `skip_infer_table_types` through to `infer_table_structure` in
the partition flow. Now Table elements from PPT/X, DOC/X, etc. should not
have a `text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through the html flow. I think if we've already got the html it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
html table in these cases.


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
qued
7fdddfbc1e
chore: improve kwarg handling (#1810)
Closes `unstructured-inference` issue
[#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265).

Cleaned up the kwarg handling, taking opportunities to turn instances of
handling kwargs as dicts to just using them as normal in function
signatures.

#### Testing:

Should just pass CI.
2023-10-23 04:48:28 +00:00
Steve Canny
82c8adba3f
fix: split-chunks appear out-of-order (#1824)
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

Should produce chunks:
```
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four")              # (400 chars)
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
```

but produced this instead:
```
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Four")              # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9 of this year in
commit f98d5e65 when chunk splitting was added.


**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically-related elements into _sections_ (`List[Element]`), in the
best case, that would be a title (heading) and the text that follows it
(until the next title). A heading and its text is often referred to as a
_section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:
1. first it _consolidates_ the elements of each section into a single
`ConsolidatedElement` object (a "chunk"). This includes both joining the
element text into a single string as well as consolidating the metadata
of the section elements.
2. then if necessary it _splits_ the chunk into two or more
`ConsolidatedElement` objects when the consolidated text is too long to
fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflects an even element
boundary.

`chunk_by_title()` was elaborated in commit f98d5e65 to add this
"chunk-splitting" behavior.

At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
returned value for `chunk_by_title()`. This was the list that
"split-from-the-end" chunks were inserted at the beginning of, and that
unfortunately produces the out-of-order behavior: the insertion was at
the beginning of this "all-chunks-in-document" list, not a sublist just
for the current chunk.

Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward
iterative algorithm that works both when a chunk must be split and when
it doesn't need to be. This algorithm is also easily extended to implement
split-chunk overlap, which is coming up in an immediately following PR.

```python
# -- split chunk into CompositeElements objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit 4ea71683 on
10/09/2023 in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-21 01:37:34 +00:00
Yuming Long
ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: move the OCR part of the table model from the inference repo
to the unst repo.
* Before this PR, the table model extracts OCR tokens with texts and
bounding boxes and fills the tokens into the table structure in the inference
repo. This means we need to do an additional OCR pass for tables.
* After this PR, we use the OCR data from the entire-page OCR and pass the
OCR tokens to the inference repo, which means we only do one OCR pass for the
entire document.

**Tech details:**
* Combined the env variables `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this
means we use the same OCR agent for the entire page and for tables since we only
do one OCR pass.
* Bump the inference repo to `0.7.9`, which allows the table model in inference
to use pre-computed OCR data from the unst repo. Please check the
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes were made by `make tidy`
* This PR also fixes this
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for the issue in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Add the same scaling logic to the image [similar to the previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but now scaling is applied to the entire image

### Test
* Not much to manually test except that table extraction still works
* But due to the change in scaling and the use of pre-computed OCR data from the
entire page, there are some slight (better) changes in table output; here is a
comparison of test outputs I found from the same test
`test_partition_image_with_table_extraction`:

screen shot for table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) There is still some cleanup to do in the inference repo
since the unst repo now has duplicate logic, but we can keep it as a
fallback plan. If we want to remove anything OCR related in inference, here
are the items that are deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR` 

### Note
If we want to fall back to an additional table OCR (we may need this when
using paddle for tables), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00
Yao You
3437a23c91
fix: partition html fail with table without tbody (#1817)
This PR resolves #1807 
- fix a bug where the code fails when a table's content contains a `thead`
tag for the rows but no `tbody` tag
- now when there is no `tbody` in a table section we try to look for
`thead` instead
- when neither is found, return an empty table
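A hedged reproduction sketch of the case described above, using an inline HTML table whose rows live under `thead` with no `tbody`:

```python
from unstructured.partition.html import partition_html

html = """
<html><body>
  <table>
    <thead>
      <tr><td>Team</td><td>Location</td></tr>
      <tr><td>Blues</td><td>STL</td></tr>
    </thead>
  </table>
</body></html>
"""

# Previously this failed because the rows are under <thead> with no <tbody>;
# with the fix the rows are read from <thead> instead.
elements = partition_html(text=html)
print([el.text for el in elements])
```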
2023-10-20 23:21:59 +00:00
Yao You
aa7b7c87d6
fix: model_name being None raises attribution error (#1822)
This PR resolves #1754 
- the function wrapper tries to use `cast` to convert kwargs into `str`, but
when a value is `None`, `cast(str, None)` still returns `None`
- the fix replaces the conversion with a simple `str()` function call
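For illustration, the core difference between the two conversions (plain Python behavior, not the wrapper code itself):

```python
from typing import cast

# cast() is purely a typing construct; at runtime the value passes through unchanged
print(cast(str, None) is None)   # True -- still None, so string operations on it fail
print(type(str(None)))           # <class 'str'> -- str() always produces a real string ('None')
```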
2023-10-20 21:08:17 +00:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
2023-10-20 14:00:19 +00:00
John
fb2a1d42ce
Jj/1798 languages warning (#1805)
### Summary
Closes #1798 
Fixes language detection of elements with empty strings: this resolves a
warning message that was raised by `langdetect` when language detection
was attempted on an empty string. Language detection is now skipped for
empty strings.

### Testing
on the main branch this will log the warning "No features in text", but
it will not log anything on this branch.
```
from unstructured.documents.elements import NarrativeText, PageBreak
from unstructured.partition.lang import apply_lang_metadata

elements = [NarrativeText("Sample text."), PageBreak("")]
elements = list(
        apply_lang_metadata(
            elements=elements,
            languages=["auto"],
            detect_language_per_element=True,
        ),
    )
```

### Other
Also changes imports in test_lang.py so imports are explicit

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-20 04:15:28 +00:00
Steve Canny
d9c2516364
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.

1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.

2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.


**Technical Summary**

The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.

Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.

* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.


**Over-chunking Behavior**

The over-chunking looked like this:

Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

```

The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail here #1790.


**Dropping regex-metadata Behavior**

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

..should produce this regex_metadata on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 22:16:02 -05:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFaceEmbeddings.
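For context, a hedged sketch of the underlying LangChain wrapper mentioned above (the model name is chosen for illustration; the brick's defaults may differ):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# model name chosen for illustration only
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector = embeddings.embed_query("Hello from unstructured!")
print(len(vector))  # dimensionality of the local embedding
```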
2023-10-19 11:51:36 -05:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
Léa
89fa88f076
fix: stop csv and tsv dropping the first line of the file (#1530)
The current code assumes the first line of csv and tsv files are a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.

Here is a snippet of code that demonstrates the current behavior and the
proposed fix

```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
    Stanley Cups,,
    Team,Location,Stanley Cups
    Blues,STL,1
    Flyers,PHI,2
    Maple Leafs,TOR,13
    """

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
  
print("Suggested Improvement Keep First Line") 
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Looses First Line") 
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-16 17:59:35 -05:00
Klaijan
ba4c649cf0
feat: calculate element type percent match (#1723)
**Executive Summary**
Adds a function to calculate the percent match between two element type
frequency outputs from the `get_element_type_frequency` function.

**Technical Detail**
- The function takes two `Dict` inputs, both of which should be output from
`get_element_type_frequency`
- Implementors can define the weight `category_depth_weight` they want to
give to items that match on `type` but differ in `category_depth`
- The function first loops through the output item list to find exact matches
and count the total, collecting the remaining values for both output and
source in new lists (of `dict` type). It then loops through the source items
that were not an exact match to find `type` matches, which are then weighted
by the `category_depth_weight` factor defined earlier (default 0.5)

**Output**
output
```
{
  ("Title", 0): 2,
  ("Title", 1): 1,
  ("NarrativeText", None): 3,
  ("UncategorizedText", None): 1,
}
```

source
```
{
  ("Title", 0): 1,
  ("Title", 1): 2,
  ("NarrativeText", None): 5,
}
```

With this output and source, and a weight of 0.5, the percent match will yield
5.5 / 8: 5 exact matches plus 1 partial match at 0.5 weight.
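A minimal sketch of that arithmetic (a hypothetical helper for illustration, not the library function; the real implementation may differ in detail):

```python
output = {
    ("Title", 0): 2,
    ("Title", 1): 1,
    ("NarrativeText", None): 3,
    ("UncategorizedText", None): 1,
}
source = {
    ("Title", 0): 1,
    ("Title", 1): 2,
    ("NarrativeText", None): 5,
}
category_depth_weight = 0.5

# exact matches: same (type, depth) key present in both frequency dicts
exact = sum(min(count, source[key]) for key, count in output.items() if key in source)

# leftover counts that were not consumed by an exact match
output_left = {k: v - min(v, source.get(k, 0)) for k, v in output.items()}
source_left = {k: v - min(v, output.get(k, 0)) for k, v in source.items()}

# partial matches: leftovers that agree on type but differ in depth
partial = 0
for (o_type, _), o_count in output_left.items():
    for (s_type, _), s_count in source_left.items():
        if o_type == s_type:
            partial += min(o_count, s_count)

total_source = sum(source.values())
percent_match = (exact + category_depth_weight * partial) / total_source
print(percent_match)  # 0.6875 == 5.5 / 8
```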

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-16 17:57:28 +00:00
John
6d7fe3ab02
fix: default to None for the languages metadata field (#1743)
### Summary
Closes #1714
Changes the default value for `languages` to `None` for elements that
don't have text or the language can't be detected.

### Testing
```
from unstructured.partition.auto import partition
filename = "example-docs/handbook-1p.docx"
elements = partition(filename=filename, detect_language_per_element=True)

# PageBreak elements don't have text and will be collected here
none_langs = [element for element in elements if element.metadata.languages is None]
none_langs[0].text
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-14 22:46:24 +00:00
qued
cf31c9a2c4
fix: use nx to avoid recursion limit (#1761)
Fixes recursion limit error that was being raised when partitioning
Excel documents of a certain size.

Previously we used a recursive method to find subtables within an excel
sheet. However this would run afoul of Python's recursion depth limit
when there was a contiguous block of more than 1000 cells within a
sheet. This function has been updated to use the NetworkX library which
avoids Python recursion issues.

* Updated `_get_connected_components` to use `networkx` graph methods
rather than implementing our own algorithm for finding contiguous groups
of cells within a sheet.
* Added a test and example doc that replicates the `RecursionError`
prior to the change.
*  Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d.

#### Testing:
The following run from a Python terminal should raise a `RecursionError`
on `main` and succeed on this branch:
```python
import sys
from unstructured.partition.xlsx import partition_xlsx
old_recursion_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(1000)
    filename = "example-docs/more-than-1k-cells.xlsx"
    partition_xlsx(filename=filename)
finally:
    sys.setrecursionlimit(old_recursion_limit)

```
Note: the recursion limit is different in different contexts. Checking
my own system, the default in a notebook seems to be 3000, but in a
terminal it's 1000. The documented Python default recursion limit is
1000.
2023-10-14 19:38:21 +00:00
qued
95728ead0f
fix: zero divide in under_non_alpha_ratio (#1753)
The function `under_non_alpha_ratio` in
`unstructured.partition.text_type` was producing a divide-by-zero error.
After investigation I found this was a possibility when the function was
passed a string of all spaces.
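A quick reproduction sketch; on `main` this raised `ZeroDivisionError`, and with the fix it should return without raising:

```python
from unstructured.partition.text_type import under_non_alpha_ratio

# an all-space string previously triggered the divide-by-zero
print(under_non_alpha_ratio("      "))
```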

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-13 21:20:01 +00:00
Steve Canny
4b84d596c2
docx: add hyperlink metadata (#1746)
2023-10-13 06:26:14 +00:00
qued
8100f1e7e2
chore: process chipper hierarchy (#1634)
PR to support schema changes introduced from [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-10-13 01:28:46 +00:00
ryannikolaidis
40523061ca
fix: _add_embeddings_to_elements bug resulting in duplicated elements (#1719)
Currently when the OpenAIEmbeddingEncoder adds embeddings to Elements in
`_add_embeddings_to_elements` it overwrites each Element's `to_dict`
method, mistakenly resulting in each Element having identical values
with the exception of the actual embedding value. This was due to the
way it leverages a nested `new_to_dict` method to overwrite. Instead,
this updates the original definition of Element itself to accommodate
the `embeddings` field when available. This also adds a test to validate
that values are not duplicated.
2023-10-12 21:47:32 +00:00
Roman Isecke
ebf0722dcc
roman/ingest continue on error (#1736)
### Description
Add flag to raise an error on failure but default to only log it and
continue with other docs
2023-10-12 21:33:10 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually turned up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
Inscore
8ab40c20c1
fix: correct PDF list item parsing (#1693)
The current implementation removes elements from the beginning of the
element list and duplicates the list items

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-10-11 20:38:36 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Klaijan
ee75ce25e2
feat: element type frequency (#1688)
**Executive Summary**

Add function that returns frequency of given element types and depth.

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-11 00:36:44 +00:00
shreyanid
9d228c7ecb
feat: calculate metric for percent of text missing (#1701)
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.

Add function to calculate the percent of text missing relative to the
source transcription. Function takes 2 text strings (output and source)
as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]
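A rough illustration of the bag-of-words comparison described above (the function name here is hypothetical; the library's metric may differ in details):

```python
from collections import Counter

def percent_text_missing(output: str, source: str) -> float:
    # compare bags of words; duplicated/extra output text is not penalized
    output_bag = Counter(output.lower().split())
    source_bag = Counter(source.lower().split())
    total = sum(source_bag.values())
    if total == 0:
        return 0.0
    missing = sum(
        max(count - output_bag.get(word, 0), 0) for word, count in source_bag.items()
    )
    return min(max(missing / total, 0.0), 1.0)

print(percent_text_missing("the quick fox", "the quick brown fox"))  # 0.25
```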

### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
2023-10-10 20:54:49 +00:00
Yuming Long
e597ec7a0f
Fix: skip empty annotation bbox (#1665)
Address: https://github.com/Unstructured-IO/unstructured/issues/1663
## Summary
To find the overlap between an element bbox and an annotation bbox, we take
the intersection of the two bboxes and divide it by the size of the annotation
bbox; this causes a zero-division error if the size of the annotation bbox is 0.

* this PR fixes the zero-division error in the function
`check_annotations_within_element`
* it also fixes the error `TypeError: unsupported operand type(s) for -: 'float'
and 'NoneType'` by no longer inserting empty words with a None bbox into the
list of words in the function `get_word_bounding_box_from_element`

## Test
Reproduce with the code and document the user mentioned; you should see no
error:
```
from unstructured.partition.auto import partition

elements = partition(
    filename="./IZSAM8.2_221012.pdf",
    strategy="fast",
)
```
2023-10-10 20:48:44 +00:00
Mallori Harrell
a5d7ae4611
Feat: Bag of words for testing metric (#1650)
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.

**Testing**
```Python
from unstructured.cleaners.core import bag_of_words
string = "The dog loved the cat, but the cat loved the cow."

print(bag_of_words(string))
```

---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-10 18:46:01 +00:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
        filename,
        chunking_strategy="by_title",
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )

for chunk in chunk_elements:
     print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
Klaijan
33edbf84f5
feat: add calculate edit distance feature (#1656)
**Executive Summary**

Adds a function to calculate the edit distance (Levenshtein distance) between
two strings. The function can return either: 1. score (similarity = 1 -
distance/source_len) or 2. distance (raw Levenshtein distance)

**Technical details**
- The `weights` param defaults to (2, 1, 1) for (insertion,
deletion, substitution), meaning that we penalize insertions we need to add
to the output (target) in comparison with the source (reference). In other
words, missing extractions are penalized more heavily.
- The function takes in 2 strings on the assumption that both strings are
already clean and concatenated text (CCT)

**Important Note!**
The test case needs to be updated to use CCT once the function is ready. For
now, only the "functionality" of edit distance is tested, not edit distance
on CCT as intended.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:21:14 +00:00
Yuming Long
dcd6d0ff67
Refactor: support entire page OCR with ocr_mode and ocr_languages (#1579)
## Summary
Second part of the OCR refactor to move it from the inference repo to the
unstructured repo; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/231. This
PR adds the OCR processing logic for entire-page OCR, and supports two OCR
modes, "entire_page" and "individual_blocks".

The updated workflow for `Hi_res` partition:
* pass the document as data/filename to inference repo to get
`inferred_layout` (DocumentLayout)
* pass the document as data/filename to OCR module, which first open the
document (create temp file/dir as needed), and split the document by
pages (convert PDF pages to image pages for PDF file)
* if ocr mode is `"entire_page"`
    *  OCR the entire image
    * merge the OCR layout with inferred page layout
 * if ocr mode is `"individual_blocks"`
* from the inferred page layout, find elements with no extracted text and crop
the entire image by the bboxes of those elements
* replace each empty-text element with the text obtained from OCR-ing the
cropped image
* return all merged PageLayouts and form a DocumentLayout object for
later processing

This PR also bumps `unstructured-inference` to `0.7.2` since the branch relies
on the OCR refactor from unstructured-inference.
  
## Test
```
from unstructured.partition.auto import partition

entrie_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entrie_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
latest output:
```
# entrie_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-06 22:54:49 +00:00
Christine Straub
5d14a2aea0
feat: shrink bboxes by top left (#1633)
Closes #1573.
### Summary
- update `shrink_bbox()` to keep top left rather than center
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-10-06 05:16:11 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for the `source` property from
`unstructured_inference`, allowing the user to see the origin
of the data under the `detection_origin` field when the environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is set.

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
This will print 'yolox' as the source for all the elements.
2023-10-05 20:26:47 +00:00
Sebastian Laverde Alfonso
e90a979f45
fix: Better logic for setting category_depth metadata for Title elements (#1517)
This PR promotes the `category_depth` metadata for `Title` elements from
`None` to 0, whenever `Headline` and/or `Subheadline` types (that are
also mapped to `Title` elements with depth 1 and 2) are present. An
additional test has been added to `test_common.py` to check the
improvement. More tests of how this logic fixes the behaviour can be
found in an adapted version of the colab
[here](https://colab.research.google.com/drive/1LoScFJBYUhkM6X7pMp8cDaJLC_VoxGci?usp=sharing).

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-10-05 17:51:06 +00:00
Newel H
e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items, and other
slide text are properly nested under the slide title. This will enable
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check if the paragraph item has a level parameter via the python pptx
paragraph. If so, use the paragraph level as the category_depth level.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element will be set as a Title with a depth
corresponding to the enumerated paragraph increment (e.g. 1st line of
title shape is depth 0, second is depth 1 etc.).
- If the shape is not a title shape but the paragraph is a title, the
increment will match the level + 1, so that all paragraph titles are at
least 1 to set them below the slide title element
2023-10-05 12:55:45 -04:00