1109 Commits

Author SHA1 Message Date
cragwolfe
4c13d12dc3
fix: prevent spammy ListItem's from images and PDF's (#1210)
The issue was that for blocks detected in an image such as:

![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a)
, where the full image is:

https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:


https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280

Test Instructions:

1. run the following snippet:

```
import json
import os
from datetime import datetime

from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
                                                                                                 
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)

print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):                                                                                                            
for strategy in ("hi_res",):
    d1 = datetime.now()
    elements = partition(filename=filename, strategy=strategy)
    elems_as_dicts = json.loads(elements_to_json(elements, indent=2))

    # strip out metadata for the sake of more readable results                                                                                          
    for element_dict in elems_as_dicts:
	del element_dict["metadata"]
    json_filename=f"{output_filename_part}-{strategy}.json"

    with open(json_filename, "w") as jsonf:
        jsonf.write(json.dumps(elems_as_dicts, indent=2))
    d2 = datetime.now()
    print(f"num elements for {strategy}: {len(elements)}")
    print(f"time elapsed     {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.

2. Open the json file that was writen to your `output_dir`, named
IRS-form-1987.png-hi_res.json

Witness the new element:
```
  {
    "type": "ListItem",
    "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
    "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87
-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation
 or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio
ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f
or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre
dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the 
services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil
e Form 3115 for these changes."
  },
```
0.10.7
2023-08-26 21:01:07 -07:00
cragwolfe
3f1c90eef2
build: bump unstructured-inference==0.5.17, cut release (#1207)
Pulls in @awalker4's tesseract enhancement:
https://github.com/Unstructured-IO/unstructured-inference/pull/185
0.10.6
2023-08-26 01:05:48 +00:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.

### Testing

```python
from unstructured.partition_email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition_msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
2023-08-25 17:09:25 -07:00
John
5872fa23c3
Extract coordinates from PDFs and images when using OCR only strategy (#1163)
### Summary
Closes #983 
Creates new function `add_pytesseract_bbox_to_elements`
Fixes typos in docstrings

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

png_filename="example-docs/english-and-korean.png"
png_elements = partition_image(filename=png_filename, strategy="ocr_only")
png_image = Image.open(png_filename)
draw = ImageDraw.Draw(png_image)
draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2)
output = "example-docs/english-and-korean-box.png"
png_image.save(output)
png_image.close()
```
2023-08-25 05:32:12 +00:00
Matt Robinson
c578b85699
fix: respect <pre> tag order in partition_html (#1197)
### Summary

Closes #1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.

### Testing

The elements in the following example should be in the correct order.

```python
    from unstructured.partition.html import partition_html

    html_text = """
    <pre>The Big Brown Bear</pre>
    <div>The big brown bear is growling.</div>
    <pre>The big brown bear is sleeping.</pre>
    <div>The Big Blue Bear</div>
    """
    elements = partition_html(text=html_text)
    print("\n\n".join([str(el) for el in elements]))
```
2023-08-25 04:14:48 +00:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Trevor Bossert
f267cef329
feat: Adds in threaded replies (#1188)
- Puts threaded replies into the same text field as parent message,
allowing for a full thread to be under a single element_id
- Output is now XML instead of TXT to allow for easier parsing of new
format.

https://github.com/Unstructured-IO/unstructured/issues/1186
2023-08-24 12:12:29 -07:00
ryannikolaidis
566e947d13
fix: ARM build with constraint for safetensors <=0.3.2 (#1196) 2023-08-24 18:00:25 +00:00
Klaijan
1524841cd9
feat: supports multipage tiff (#1131)
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and

- confirms that the function reads all the pages in the TIFF.

- page number is added to the metadata

This PR is branched from and developed on top of 6d6be99 commit.
2023-08-24 15:12:50 +00:00
Matt Robinson
cdae53cc29
chore: deprecation warning for file_filename (#1191)
### Summary

Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.

### Testing

```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
2023-08-24 07:02:47 +00:00
ryannikolaidis
835378aba6
ci: fix documentation build flow (#1181) 2023-08-24 00:24:03 -05:00
cragwolfe
df4bd459d5
build(deps): bump unstructured-inference==0.5.16 (#1182)
Pulls in @newelh's fix:
https://github.com/Unstructured-IO/unstructured-inference/pull/184
2023-08-23 05:28:45 +00:00
Charles
1ddf542e14
fix: Don't call extractable_elements if strategy is ocr_only (#1160)
- fixes #1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
2023-08-22 19:43:33 -07:00
cragwolfe
e9c649224e
chore: changelog repair (#1179)
chaos reigns in the changelog. whyyy

* there was no 0.10.3 release, so remove that from the CHANGELOG.
* fixup 0.10.5 with a couple that were added (in retrospect)
2023-08-22 16:48:18 -07:00
Austin Walker
e7d189fcc8
chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
0.10.5
2023-08-22 16:05:02 -07:00
Jack Retterer
05e311651a
doc: add delta tables connector reference (#1177)
Added delta tables to connectors page for users to discover
2023-08-22 12:50:27 -07:00
ryannikolaidis
ac2313a3fa
doc: fix get-api-key link (#1175) 2023-08-22 19:31:07 +00:00
ryannikolaidis
ab7fafcb41
doc: add pdf extra note (#1165) 2023-08-22 18:20:26 +00:00
Roman Isecke
4114022d9d
roman/ingest-custom-errors (#1152)
### Description
Adds three custom errors to ingest:
* `SourceConnectionError`
* `DestinationConnectionError`
* `PartitionError`

Included is a base custom error class that adds a wrapper. This wrapper
wraps any raised exception into the custom error.
2023-08-22 12:28:29 -04:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Matt Robinson
ad595d32f6
enhancement: tell users to install missing extras (#1167)
### Summary

Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.

### Testing

First `pip uninstall ebooklib`. Then run

```python
from unstructured.partition.auto import partition

partition(filename="example-docs/winter-sports.epub")
```

The error should look like

```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
2023-08-22 03:00:21 +00:00
Jack Retterer
f639d04695
Fixed some typos (#1162)
The Wikipedia data connector was labeled as Airtable.
2023-08-21 18:03:15 -07:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00
Jack Retterer
a35ff890e0
Update docs jack (#1157)
Documentation Overhaul

- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
2023-08-21 10:27:32 -07:00
ryannikolaidis
6330278839
chore: add ingest diagrams and explanation of flow to ingest README (#1158) 2023-08-21 16:24:03 +00:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00
John
69edffb0c0
bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134)
### Summary
Closes #1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.

### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email

filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")

# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified 
elements[-1].text
elements[-1].metadata.last_modified

email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")

email_elements[1].metadata.last_modified 
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
2023-08-18 23:21:11 +00:00
Austin Walker
dd243b4fd9
chore: pass ocr_mode in partition_pdf_or_image (#1154)
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).

I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
0.10.4
2023-08-18 20:59:08 +00:00
cragwolfe
1456f06b2d
chore: skip consistently failing test in main (#1150)
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .

This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
2023-08-18 10:06:17 -07:00
cragwolfe
6001e2fa62
chore: changelog repair (#1149)
0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3.

Bonus: remove accidental dupe line in 0.10.0.
2023-08-17 23:41:04 -07:00
Francisco Kurucz
d2a41f462d
doc: fix typo on partition_md function in bricks documentation (#1147) 2023-08-17 20:54:11 -07:00
ryannikolaidis
668d0f1b01
feat: per-process ingest connections (#1058)
* adds per process connections for Google Drive connector
2023-08-17 17:34:08 +00:00
cragwolfe
dd0f582585
build(deps): bump unstructured-inference==0.5.13 (#1141)
Bump to unstructured-inference==0.5.13, which includes:

Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
0.10.2
2023-08-17 06:25:00 +00:00
John
9f7bd6127b
enhancement: Add include_header kwarg for xlsx, default True(#1125)
Closes Github issue #1121

Adds include_header kwarg to partition_xlsx and change default behavior to True.
0.10.1
2023-08-17 04:16:23 +00:00
cragwolfe
22c12ef806
bump unstructured-inference (#1140)
Pulls in fix from unstructured-inference==0.5.12:

When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.
2023-08-16 22:29:37 +00:00
cragwolfe
6f1b8d5f28
build(deps): bump unstructured-inference to 0.5.11 (#1138)
* Bump unstructured-inference==0.5.11:
  - better defaults for DPI for hi_res and  Chipper
2023-08-16 20:52:40 +00:00
Christine Straub
0a23139720
enhancement: implement full-page OCR(#1133)
*implements full-page OCR as supported in unstructured-inference=0.5.11.
2023-08-16 19:16:35 +00:00
Newel H
be093d2e66
chore: Update dead links to correct pages (#1127)
Summary
Closes #1124

Updates dead links in repository README
- Quick Start > Install for local development
- Learn more > Batch Processing)

Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass)

Testing
All tests pass
2023-08-16 10:43:37 -04:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes Github Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
0.10.0
2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso
fe5048a834
feat: chipper local inference notebook (#1116)
Download chipper model for local use and demonstrate how to partition a .pdf document
through the unstructured and unstructured_inference libraries.
2023-08-15 20:43:23 -07:00
cragwolfe
d19183f442
build(lint): don't check version in main against self (#1123)
If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version.

This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.
2023-08-15 17:57:59 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes Github issue #1010

adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines
2023-08-15 09:35:54 -07:00
qued
cb923b96a2
build(deps): dependency cleanup (#1102)
Cleans up some pins that were prone to conflicts. All pins belong in constraints.in.
0.9.3
2023-08-15 05:15:44 +00:00
cragwolfe
d835fb1086
chore: bump pip version in published image (#1111)
for consistency with the development environment, i.e. the Makefile.
2023-08-14 21:59:31 +00:00
Mike Lay
79a1eb8683
Handle inline and lacking filename (#1109)
Handle Content-Disposition: inline and attachment without filename

* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
2023-08-14 18:38:53 +00:00
Christine Straub
80266460fd
fix: GH issue 1057 etree parser error (csv) (#1112)
Addresses #1057 for CSV. Related to PR #1077.

* update partition_csv to always use soupparser_fromstring to parse html text
2023-08-14 17:48:57 +00:00
Mark Risher
612f9da6e8
Update news-of-the-day.ipynb - typo (#1113)
Fixed typo
2023-08-14 16:48:49 +00:00
Mike Lay
2e0ab86c6a
Fix attachments with = in filename (#1110)
Fix attachments with = in filename

* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
2023-08-13 20:35:18 -07:00
Christine Straub
fc2699ff06
Fix/1057 etree parser error tsv (#1106)
* feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji
* chore: update changelog & version
2023-08-14 01:22:36 +00:00
cragwolfe
b4b8ac4d8a
chore: run make pip-compile on mac (#1107)
so cuda deps removed.
2023-08-13 20:42:12 +00:00