14 Commits

Author SHA1 Message Date
cragwolfe
4c13d12dc3
fix: prevent spammy ListItem's from images and PDF's (#1210)
The issue was that for blocks detected in an image such as:

![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a)
, where the full image is:

https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:


https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280

Test Instructions:

1. run the following snippet:

```
import json
import os
from datetime import datetime

from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)

print(f"unstructured version: {__version__}")
# for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
    d1 = datetime.now()
    elements = partition(filename=filename, strategy=strategy)
    elems_as_dicts = json.loads(elements_to_json(elements, indent=2))

    # strip out metadata for the sake of more readable results
    for element_dict in elems_as_dicts:
        del element_dict["metadata"]
    json_filename = f"{output_filename_part}-{strategy}.json"

    with open(json_filename, "w") as jsonf:
        jsonf.write(json.dumps(elems_as_dicts, indent=2))
    d2 = datetime.now()
    print(f"num elements for {strategy}: {len(elements)}")
    print(f"time elapsed     {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.

2. Open the JSON file that was written to your `output_dir`, named
`IRS-form-1987.png-hi_res.json`.

Witness the new element:
```
  {
    "type": "ListItem",
    "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
    "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes."
  },
```
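
To compare the number of `ListItem` elements before and after the fix, the element types in an output file can be tallied. The following is a minimal sketch, assuming the JSON file holds a list of objects that each carry a `type` key, as in the element shown above. `count_element_types` and `count_element_types_in_file` are hypothetical helpers, not part of `unstructured`, and the sample data is illustrative only.

```python
import json
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element "type" to how often it occurs."""
    return Counter(element["type"] for element in elements)


def count_element_types_in_file(json_path):
    """Load a list of element dicts from a JSON file and tally their types."""
    with open(json_path) as jsonf:
        return count_element_types(json.load(jsonf))


# Illustrative data only, not real partition output.
sample = [
    {"type": "Title", "text": "Form 1987"},
    {"type": "ListItem", "text": "Long-term contracts."},
    {"type": "ListItem", "text": "Other methods."},
]
print(count_element_types(sample)["ListItem"])
```

Running `count_element_types_in_file` on the old and new `IRS-form-1987.png-hi_res.json` files would give a quick view of how many `ListItem` elements each version produces.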
2023-08-26 21:01:07 -07:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Austin Walker
e7d189fcc8
chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
2023-08-22 16:05:02 -07:00
Austin Walker
dd243b4fd9
chore: pass ocr_mode in partition_pdf_or_image (#1154)
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).

I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
2023-08-18 20:59:08 +00:00
cragwolfe
dd0f582585
build(deps): bump unstructured-inference==0.5.13 (#1141)
Bump to unstructured-inference==0.5.13, which includes:

Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
2023-08-17 06:25:00 +00:00
Christine Straub
0a23139720
enhancement: implement full-page OCR (#1133)
* implements full-page OCR as supported in unstructured-inference==0.5.11.
2023-08-16 19:16:35 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes GitHub issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
2023-08-16 04:33:06 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes GitHub issue #1010.

Adds the `group_bullet_paragraph` function to group bullet items that are split across multiple lines.
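
The grouping idea can be sketched as follows: a line that starts with a bullet marker opens a new item, and any other non-empty line is appended to the current item. This is an illustrative sketch only, not the actual `group_bullet_paragraph` implementation. The helper name `group_bullet_lines`, the set of bullet markers, and the sample text are assumptions.

```python
import re

# Assumed bullet markers for illustration; real documents use many more.
BULLET_RE = re.compile(r"^\s*[\u2022\u00b7\-\*]\s+")


def group_bullet_lines(text):
    """Merge continuation lines into the bullet item they belong to."""
    items = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if BULLET_RE.match(line) or not items:
            items.append(stripped)  # a bullet marker starts a new item
        else:
            items[-1] += " " + stripped  # continuation of the previous item
    return items


ocr_text = (
    "\u2022 The big red fox\n"
    "is walking down the lane.\n"
    "\n"
    "\u2022 At the end of the lane\n"
    "the fox met a bear."
)
for item in group_bullet_lines(ocr_text):
    print(item)
```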
2023-08-15 09:35:54 -07:00
ryannikolaidis
cd1df5e8e6
fix: remove default encoding for ingest (#1036)
2023-08-05 16:57:45 +00:00
Matt Robinson
f4ddf53590
feat: track emphasized text in partition_html (#1034)
* Feat/965 track emphasized text html (#1021)

* feat: add functionality to track emphasized text (`<strong>`, `<em>`, `<span>`, `<b>`, `<i>` tags) in HTML

* feat: add `include_tail_text` parameter to `_construct_text`

* test: add test case for `_get_emphasized_texts_from_tag`

* test: add `emphasized_texts` to metadata

* chore: update changelog & version

* fix tests

* fix lint errors

* chore: update changelog

* chore: small comment updates

* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag

* chore: update changelog

* Update ingest test fixtures (#1026)

Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#1035)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
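
As a rough illustration of what tracking emphasized text involves, the sketch below collects the text found inside the tags named in this commit using only the standard library's `html.parser`. It is not the `partition_html` implementation. The class name `EmphasisCollector` and the output shape (a list of `text`/`tag` dicts) are assumptions, since the commit message does not show the format of the `emphasized_texts` metadata.

```python
from html.parser import HTMLParser

EMPHASIS_TAGS = {"strong", "em", "span", "b", "i"}


class EmphasisCollector(HTMLParser):
    """Collect text that appears inside emphasis tags."""

    def __init__(self):
        super().__init__()
        self._open = []        # stack of currently open emphasis tags
        self.emphasized = []   # collected {"text": ..., "tag": ...} dicts

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the innermost open tag with this name.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            # Attribute the text to the innermost open emphasis tag.
            self.emphasized.append({"text": text, "tag": self._open[-1]})


collector = EmphasisCollector()
collector.feed("<p>Hello <strong>bold</strong> and <i>italic</i> text</p>")
collector.close()
print(collector.emphasized)
```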
2023-08-03 16:24:25 +00:00
Matt Robinson
6e852cbe70
feat: track links from anchor tags in partition_html (#959)
* track tags in html

* pass through links as metadata

* add test for grabbing links

* one more link

* changelog and version

* update docs

* fix tests

* update empty link assertion

* ingest-test-fixtures-update

* Update ingest test fixtures (#961)
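
To illustrate the idea of passing links through as metadata, the sketch below gathers the text and `href` of each anchor tag with the standard library's `html.parser`. It is not the `partition_html` implementation. The class name `LinkCollector` and the `text`/`url` keys are assumptions, as the commit message does not show the metadata format.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the text and href of each <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self._href = None       # href of the anchor currently being read
        self._text_parts = []   # text fragments seen inside that anchor
        self.links = []         # collected {"text": ..., "url": ...} dicts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append({"text": text, "url": self._href})
            self._href = None


link_collector = LinkCollector()
link_collector.feed(
    '<p>See <a href="https://example.com">the docs</a> and <a>nothing</a>.</p>'
)
link_collector.close()
print(link_collector.links)
```

Anchors without an `href` are skipped in this sketch; whether the real implementation records them is not stated in the commit message.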
2023-07-24 18:28:56 +00:00
Jason Scheirer
196efa09b1
chore: Add encoding param to ingest (#955)
* Add encoding param to ingest
2023-07-24 10:06:13 -07:00
qued
350bb1dad5
enhancement: clean pdf elements (bump unstructured-inference) (#790)
More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructured.io>
2023-06-29 18:35:06 -07:00
ryannikolaidis
62e20442df
chore: refactor ingest tests (#814)
- Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth
- Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory
- Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository
- Construct paths as reusable variables declared at top of scripts
- Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance)
- OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind
- Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true
- Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)
2023-06-29 23:13:41 +00:00