585 Commits

Author SHA1 Message Date
qued
8100f1e7e2
chore: process chipper hierarchy (#1634)
PR to support schema changes introduced from [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-10-13 01:28:46 +00:00
ryannikolaidis
40523061ca
fix: _add_embeddings_to_elements bug resulting in duplicated elements (#1719)
Currently when the OpenAIEmbeddingEncoder adds embeddings to Elements in
`_add_embeddings_to_elements` it overwrites each Element's `to_dict`
method, mistakenly resulting in each Element having identical values
with the exception of the actual embedding value. This was due to the
way it leverages a nested `new_to_dict` method to overwrite. Instead,
this updates the original definition of Element itself to accommodate
the `embeddings` field when available. This also adds a test to validate
that values are not duplicated.
2023-10-12 21:47:32 +00:00
Roman Isecke
ebf0722dcc
roman/ingest continue on error (#1736)
### Description
Add flag to raise an error on failure but default to only log it and
continue with other docs
2023-10-12 21:33:10 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
Inscore
8ab40c20c1
fix: correct PDF list item parsing (#1693)
The current implementation removes elements from the beginning of the
element list and duplicates the list items

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-10-11 20:38:36 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Klaijan
ee75ce25e2
feat: element type frequency (#1688)
**Executive Summary**

Add function that returns frequency of given element types and depth.

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-11 00:36:44 +00:00
shreyanid
9d228c7ecb
feat: calculate metric for percent of text missing (#1701)
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.

Add function to calculate the percent of text missing relative to the
source transcription. Function takes 2 text strings (output and source)
as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]

### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
2023-10-10 20:54:49 +00:00
Yuming Long
e597ec7a0f
Fix: skip empty annotation bbox (#1665)
Address: https://github.com/Unstructured-IO/unstructured/issues/1663
## Summary
While trying to find how overlap between a element bbox and annotation
bbox, we find the intersection of two bboxes and divide it by the size
of annotation bbox, this will cause a zero division error if size of
annotation bbox is 0.

* this PR fix the zero division error for function
`check_annotations_within_element`
* also fix error: `TypeError: unsupported operand type(s) for -: 'float'
and 'NoneType'` by stop inserting empty word with None bbox into list of
words in function `get_word_bounding_box_from_element`

## Test
reproduce with code and document as the user mentioned and should see no
error:
```
from unstructured.partition.auto import partition

elements = partition(
    filename="./IZSAM8.2_221012.pdf",
    strategy="fast",
)
```
2023-10-10 20:48:44 +00:00
Mallori Harrell
a5d7ae4611
Feat: Bag of words for testing metric (#1650)
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.

**Testing**
```Python
from unstructured.cleaners.core import bag_of_words
string = "The dog loved the cat, but the cat loved the cow."

print(bag_of_words)

---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-10 18:46:01 +00:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
        filename,
        chunking_strategy="by_title",
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )

for chunk in chunk_elements:
     print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
Klaijan
33edbf84f5
feat: add calculate edit distance feature (#1656)
**Executive Summary**

Adds function to calculate edit distance (Levenshtein distance) between
two strings. The function can return as: 1. score (similarity = 1 -
distance/source_len) 2. distance (raw levenshtein distance)

**Technical details**
- The `weights` param is set to default at (2,1,1) for (insertion,
deletion, substitution), meaning that we will penalize the insertion we
need to add from output (target) in comparison with the source
(reference). In other word, the missing extraction will be penalized
higher.
- The function takes in 2 strings in an assumption that both string are
already clean and concatenated (CCT)

**Important Note!**
Test case needs to be updated to use CCT once the function is ready. It
is now only tested the "functionality" of edit distance, not the edit
distance with CCT as its intended to be.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:21:14 +00:00
Yuming Long
dcd6d0ff67
Refactor: support entire page OCR with ocr_mode and ocr_languages (#1579)
## Summary
Second part of OCR refactor to move it from inference repo to
unstructured repo, first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/231. This
PR adds OCR process logics to entire page OCR, and support two OCR
modes, "entire_page" or "individual_blocks".

The updated workflow for `Hi_res` partition:
* pass the document as data/filename to inference repo to get
`inferred_layout` (DocumentLayout)
* pass the document as data/filename to OCR module, which first open the
document (create temp file/dir as needed), and split the document by
pages (convert PDF pages to image pages for PDF file)
* if ocr mode is `"entire_page"`
    *  OCR the entire image
    * merge the OCR layout with inferred page layout
 * if ocr mode is `"individual_blocks"`
* from inferred page layout, find element with no extracted text, crop
the entire image by the bboxes of the element
* replace empty text element with the text obtained from OCR the cropped
image
* return all merged PageLayouts and form a DocumentLayout subject for
later on process

This PR also bump `unstructured-inference==0.7.2` since the branch relay
on OCR refactor from unstructured-inference.
  
## Test
```
from unstructured.partition.auto import partition

entrie_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entrie_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
latest output:
```
# entrie_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-06 22:54:49 +00:00
Christine Straub
5d14a2aea0
feat: shrink bboxes by top left (#1633)
Closes #1573.
### Summary
- update `shrink_bbox()` to keep top left rather than center
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-10-06 05:16:11 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for `source` property from
`unstructured_inference`, allowing the user to be able to see the origin
of the data under `detection_origin`field environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
And will print 'yolox' as source for all the elements
2023-10-05 20:26:47 +00:00
Sebastian Laverde Alfonso
e90a979f45
fix: Better logic for setting category_depth metadata for Title elements (#1517)
This PR promotes the `category_depth` metadata for `Title` elements from
`None` to 0, whenever `Headline` and/or `Subheadline` types (that are
also mapped to `Title` elements with depth 1 and 2) are present. An
additional test to `test_common.py` has been added to check on the
improvement. More test of how this logic fixes the behaviour can be
found in a adapted version on the colab
[here](https://colab.research.google.com/drive/1LoScFJBYUhkM6X7pMp8cDaJLC_VoxGci?usp=sharing).

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-10-05 17:51:06 +00:00
Newel H
e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items, and other
slide text are properly nested under the slide title. This will enable
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check if the paragraph item has a level parameter via the python pptx
paragraph. If so, use the paragraph level as the category_depth level.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element will be set as a Title with a depth
corresponding to the enumerated paragraph increment (e.g. 1st line of
title shape is depth 0, second is depth 1 etc.).
- If the shape is not a title shape but the paragraph is a title, the
increment will match the level + 1, so that all paragraph titles are at
least 1 to set them below the slide title element
2023-10-05 12:55:45 -04:00
Christine Straub
b30d6a601e
Fix/1209 tweak xycut ordering output (#1630)
Closes GH Issue #1209.

### Summary
- add swapped `xycut` sorting
- update `xycut` sorting evaluation script

PDFs:
-
[sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf)
-
[multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf)
-
[11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf)
### Testing
```
elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res")
print("\n\n".join([str(el) for el in elements]))
```
### Evaluation
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only
```
2023-10-05 07:41:38 +00:00
ryannikolaidis
9960ce5f00
fix: chunking fails with detection_class_prob in metadata (#1637) 2023-10-04 22:14:21 +00:00
Klaijan
0a65fc2134
feat: xlsx subtable extraction (#1585)
**Executive Summary**
Unstructured is now able to capture subtables, along with other text
element types within the `.xlsx` sheet.

**Technical Details**
- The function now reads the excel *without* header as default
- Leverages the connected components search to find subtables within the
sheet. This search is based on dfs search
- It also handle the overlapping table or text cases
- Row with only single cell of data is considered not a table, and
therefore passed on the determine the element type as text
- In connected elements, it is possible to have table title, header, or
footer. We run the count for the first non-single empty rows from top
and bottom to determine those text

**Result**
This table now reads as:
<img width="747" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2">

```
[
    {
        "type": "Title",
        "element_id": "3315afd97f7f2ebcd450e7c939878429",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Financial performance"
    },
    {
        "type": "Table",
        "element_id": "17f5d512705be6f8812e5dbb801ba727",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "8a9db7161a02b427f8fda883656036e1",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Operational metrics"
    },
    {
        "type": "Table",
        "element_id": "d5d16f7bf9c7950cd45fae06e12e5847",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "f97e9da0e3b879f0a9df979ae260a5f7",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Other"
    },
    {
        "type": "Table",
        "element_id": "080e1a745a2a3f2df22b6a08d33d59bb",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n"
    }
]
```
2023-10-04 13:30:23 -04:00
Yao You
19d8bff275
feat: change default hi_res model to yolox quantized (#1607) 2023-10-04 03:28:47 +00:00
Amanda Cameron
1fb464235a
chore: Table chunking (#1540)
This change is adding to our `add_chunking_strategy` logic so that we
are able to chunk Table elements' `text` and `text_as_html` params. In
order to keep the functionality under the same `by_title` chunking
strategy we have renamed the `combine_under_n_chars` to
`max_characters`. It functions the same way for the combining elements
under Title's, as well as specifying a chunk size (in chars) for
TableChunk elements.

*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in followup PRs


Additionally -> some lint changes snuck in when I ran `make tidy` hence
the minor changes in unrelated files :)

TODO:
 add unit tests
--> note: added where I could to unit tests! Some unit tests I just
clarified that the chunking strategy was now 'by_title' because we don't
have a file example that has Table elements to test the
'by_num_characters' chunking strategy
  update changelog

To manually test:
```
In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
ln [6]: for c in chunks:
                    print(c.text)
                    print(c.metadata.text_as_html)
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-03 09:40:34 -07:00
Newel H
bcd0eee753
Feat: Detect all text in HTML Heading tags as titles (#1556)
## Summary
This will increase the accuracy of hierarchies in HTML documents and
provide more accurate element categorization. If text is in an HTML
heading tag and is not a list item, address categorize it as a title.

## Testing
```
from unstructured.partition.html import partition_html
elements = partition_html(url="https://www.eda.gov/grants/2015")
```
Before, the date headers at the given url would not be correctly parsed
as titles, after this change they are now correctly identified.

A unit test to verify the functionality has been added:
`test_html_partition::test_html_heading_title_detection` that includes
values that were previously detected as narrative text and uncategorized
text
2023-10-03 11:54:36 -04:00
Klaijan
d6efd52b4b
fix: isalnum referenced before assignment (#1586)
**Executive Summary**
Fix bug on the `get_word_bounding_box_from_element` function that
prevent `partition_pdf` to run.

**Technical Details**
- The function originally first define `isalnum` on the first index. Now
switched to conditional on flag value.
2023-10-03 11:25:20 -04:00
unifyh
89bd2faaf7
fix: Fix various cases of HTML text missing after partition (#1587)
Fix 4 cases of text missing after partition:
1. Text immediately after `<body>`
```html
<body>
  missing1
  <div>hello</div>
</body>
```

2. Text inside container and immediately after `<br/>`
```html
<div>hello<br/>missing2</div>
```

3. Text immediately after a text opening tag, if said tag contains
`<br/>`
```html
<p>missing3<br/>hello</p>
```

4. Text inside `<body>` if it is the only content (different cause from
case 1)
```html
<body>missing4</body>
```

Also fix problem causing
`test_unstructured/documents/test_html.py::test_exclude_tag_types` to
not work as intended.

This will close GitHub Issue#1543
2023-10-03 04:17:51 +00:00
Yao You
ad59a879cc
chore: bump inference to 0.6.6 (#1563)
- bump `unstructured-inference` to `0.6.6`
- specify default model name for element detection to be
`detectron2_onnx` to keep current behavior
- NOTE: the updated inference package by default would use yolox as
element detection model; this will be evaluated and enabled in a
separated PR

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2023-09-29 19:09:57 +00:00
Christine Straub
94fbbed189
feat: bbox shrinking in xycut algo, better natural reading order (#1560)
Closes GH Issue #1233.

### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort

### Evaluation
Run the followin gcommand for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).

PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
2023-09-29 03:48:02 +00:00
qued
e5d08662d4
enhancement: memory efficient xml partitioning (#1547)
Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)

Memory usage is about 20% of usage on `main` when processing a 470MB XML
file. Time to process is 10s vs 900s.

Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.
2023-09-28 02:34:06 +00:00
Austin Walker
f34c277bca
fix: add backwards compatibility to ElementMetadata (#1526)
Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237

The problem:
The `ElementMetadata` class was not able to ignore fields that it didn't
know about. This surfaced in `partition_via_api`, when the hosted api
schema is newer than the local `unstructured` version. In
`ElementMetadata.from_json()` we get errors such as `TypeError:
__init__() got an unexpected keyword argument 'parent_id'`.

The fix:
The `from_json` methods for these dataclasses should drop any unexpected
fields before calling `__init__`.

To verify:
This shouldn't throw an error
```
from unstructured.staging.base import elements_from_json
import json

test_api_result = json.dumps([
    {
        "type": "Title",
        "element_id": "2f7cc75f6467bba468022c4c2875335e",
        "metadata": {
            "filename": "layout-parser-paper.pdf",
            "filetype": "application/pdf",
            "page_number": 1,
            "new_field": "foo",
        },
        "text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
    }
])

elements = elements_from_json(text=test_api_result)

print(elements)
```
2023-09-27 18:40:56 +00:00
Klaijan
d26d591d6a
feat: get embedded url, associate text and start index for pdf (#1539)
**Executive Summary**

Adds PDF functionality to capture hyperlink (external or internal) for
pdf fast strategy along with associate text.

**Technical Details**

- `pdfminer` associates `annotation` (links and uris) with bounding box
rather than text. Therefore, the link and text matching is not a perfect
pair but rather a logic-based and calculation matching from bounding box
overlapping.
- There is no word-level bounding box. Only character-level (access
using `LTChar`). Thus in order to get to word-level, there is a window
slicing through the text. The words are captured in alphanumeric and
non-alphanumeric separately, meaning it will split the word if contains
both, on the first encounter of non-alphanumeric.)
- The bounding box calculation is calculated using start and stop
coordinates for the corresponding word calculated from above. The
calculation is simply using distance between two dots.

The result now contains `links` in `metadata` as shown below:

```
            "links": [
                {
                    "text": "link",
                    "url": "https://github.com/Unstructured-IO/unstructured",
                    "start_index": 12
                },
                {
                    "text": "email",
                    "url": "mailto:unstructuredai@earlygrowth.com",
                    "start_index": 30
                },
                {
                    "text": "phone number",
                    "url": "tel:6505124019",
                    "start_index": 49
                }
            ]
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-27 13:43:32 -04:00
Newel H
55315cf645
Feat: Native hierarchies for docx element types (#1505)
Improves hierarchy from docx files by leveraging natural hierarchies
built into docx documents. Hierarchy can now be detected from an
indentation level for list bullets/numbers and by style name (e.g.
Heading 1, List Bullet 2, List Number).

Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullet/numbers. Return the indentation level
if it exists
2. Check the name of the paragraph style if it contains any category
depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List
Bullet 2). Return the category depth if found, else default to depth of
0.
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is yet to be
implemented, as the above methods cover most usecases but the
implementation is stubbed out.
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
2023-09-27 11:32:46 -04:00
Steve Canny
ab29de8dbd
Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured
This refactor solves a problem or two, the big one being recursing into
group-shapes to get all shapes on the slide, but mostly lays the
groundwork to allow us to refine further aspects such as list-item
detection, off-slide shape detection, and image-capture going forward.
2023-09-26 15:43:55 -04:00
shreyanid
32bfebccf7
feat: introduce language detection function for text partitioning function (#1453)
### Summary
Uses `langdetect` to detect all languages present in the input document.

### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization

### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']

---------

Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-26 18:09:27 +00:00
Benjamin Torres
5d193c8e5a
fix/bad formed formula (#1481)
@ron-unstructured reported that loading files with:

```
from unstructured.partition.pdf import partition_pdf

elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```

Throws an error. After debugging the execution I found that the issue is
that an object of class Formula is being created, however, this class
doesn't contain an __init__ method. This PR solves the issue of adding a
constructor method with an empty string for the element.

The file can be found at:

https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing

After this PR is merged this file is correctly processed
2023-09-23 02:36:22 +00:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Amanda Cameron
e359afafbe
fix: coordinates bug on pdf parsing (#1462)
Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460

We were raising an error with invalid coordinates, which prevented us
from continuing to return the element and continue parsing the pdf. Now
instead of raising the error we'll return early.

to test:
```
from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-09-19 19:25:31 -07:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
shreyanid
eb8ce89137
chore: function to map between standard and Tesseract language codes (#1421)
### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.

### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.

### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any lang codes in the languages parameter (Tesseract or standard).

for example,
```
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract
2023-09-18 08:42:02 -07:00
Amanda Cameron
a9f18eddb8
chore: adding test case for odt tables (#1434)
ODT table extraction is happening! Just added to an existing example-doc
and an accompanying test case.
2023-09-16 22:29:44 -07:00
Yao You
b534b2a6cd
Chore: bump inference package version to 0.5.28 and new release (#1355)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.

---------

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-09-15 18:26:15 -07:00
John
de4d496fcf
Fix bbox coordinates for ocr_only strategy (#1325)
### Summary
Duplicate PR of #1259 because of issues with checks
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` to new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes, but are not present in
`pytesseract.image_to_string`.

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
    print(i, element.metadata.coordinates)
    if element.metadata.coordinates:
        draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-09-15 15:11:16 -05:00
qued
0d61c98481
fix: Pass partition_image kwargs downstream (#1426)
`partition_pdf` allows for passing a `model_name` parameter. Given the
similarity between the image and PDF pipelines, the expected behavior is
that `partition_image` should support the same parameter, but
`partition_image` was unintentionally not passing along its `kwargs`.
This was corrected by adding the kwargs to the downstream call.

#### Testing:

```python
from unstructured.partition.image import partition_image

output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx")
output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox")

# These shouldn't be the same, since they were produced using different models.
assert output1 != output2

```
The assertion should fail on `main`, but pass on this branch.
2023-09-15 15:09:58 -05:00
Amanda Cameron
50db2abd9f
fix: updating element types (#1394)
This PR adds an arg to the html partition flow called `source_format` if
anything other than "html" we will return non-HTML elements to conform
with the file type we received.

addresses: https://github.com/Unstructured-IO/unstructured/issues/726
2023-09-15 11:51:22 -05:00
Sebastian Laverde Alfonso
40b1d0d092
feat: improved chipper elements mapping and new category_depth metadata (#1308)
Two changes: 
1. Improved mapping of `chipper` element types `Headline` (to `Title`),
`Subheadline`(to `Title`) and `Abstract`( to `NarrativeText`.
2. New element metadata `category_depth`: `None` unless is `Headline`
(`category_depth=1`), or `Subheadline` (`category_depth=2`). The update
of `category_depth` happens during the transform
`normalize_layout_element`.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjamin@unstructured.io>
2023-09-15 14:43:17 +00:00
Newel H
cd704e873b
Feat: Create a naive hierarchy for elements (#1268)
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.

### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset

### How it works

Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:

1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.

### Schema Changes

```
@dataclass
class ElementMetadata:
      coordinates: Optional[CoordinatesMetadata] = None
      data_source: Optional[DataSourceMetadata] = None
      filename: Optional[str] = None
      file_directory: Optional[str] = None
      last_modified: Optional[str] = None
      filetype: Optional[str] = None
      attached_to_filename: Optional[str] = None
+     parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+     category_depth: Optional[int] = None

...
```

### Testing
```
from unstructured.partition.auto import partition
from typing import List

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

for element in elements:
    print(
        f"Category:  {getattr(element, 'category', '')}\n"\
        f"Text:      {getattr(element, 'text', '')}\n"
        f"ID:        {element.id}\n" \
        f"Parent ID: {element.metadata.parent_id}\n"\
        f"Depth:     {element.metadata.category_depth}\n" \
    )
```

### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.

I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
Ahmet Melek
bd33a52ee0
fix: coordinates metadata hinders chunking (#1374)
Closes https://github.com/Unstructured-IO/unstructured/issues/1373

This PR: 
- drops the `coordinates` metadata field in `chunk_by_title` to fix
https://github.com/Unstructured-IO/unstructured/issues/1373 (read issue
for the details)
- adds relevant test that checks the particular case
2023-09-14 10:10:03 +00:00
Klaijan
00181b88df
feat: pdf auto strategy groups broken numbered and bullet list items(#1393)
**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.

**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

Now it reads:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model' tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.

**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-13 21:30:06 +00:00
shreyanid
d87c83d7b6
chore: refactor languages parameter for text_type functions (#1399)
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings. To identify element
types such as NarrativeText and Title, continue the refactor into
functions that use language checks to determine those potential
classifications.

### Details
Replaces `language` with `languages` (a list of strings) as a parameter
to `is_possible_narrative_text` and `is_possible_title`.


### Test
Call `is_possible_narrative_text` and `is_possible_title` with text in a
variety of languages and different inputs for `languages`. The resulting
element classifications should be no different from the current outputs.

ex: see `test_text_type_handles_multi_language_examples` in
`test_unstructured/partition/test_text_type.py`.
2023-09-13 19:46:36 +00:00
shreyanid
1b7c99d878
chore: refactor languages parameter for auto partition (#1400)
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.

### Details
Follows the pattern established with PDFs in #1334. Adds languages (a
list of strings) as a parameter to partition in auto.py. Marks
ocr_languages for deprecation.

### Test
Call partition with a variety of filetypes (especially pdfs/images),
strategies, languages, or ocr_languages.
- inclusion of ocr_languages as a parameter should display a deprecation
warning and may proceed with partitioning if no other conflicts
- the other valid call outputs should be no different from the current
outputs
2023-09-13 13:07:28 -04:00
shreyanid
2b571eb9a3
chore: refactor languages parameter for image partition functions (#1395)
Adds languages (a list of strings) as a parameter to `partition_image`. Marks ocr_languages for deprecation.
2023-09-13 04:11:58 +00:00