This PR resolves#1247 by using the matching elements and bbox for
coordinate computation.
This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )
Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.
This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
### Summary
Closes#1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_xml`. If
`xml_keep_tags=True`, the file is treated like a text file still and
partitioning is still delegated to `partition_text`.
Also adds the option to pass `text` as an input to `partition_xml`.
### Testing
Create a `parrots.xml` file that looks like:
```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.
Conures are feathery and like to dance.</description></parrot></xml>
```
Run:
```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict
elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```
One `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.
```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure A conure is a very friendly bird.',
'type': 'NarrativeText'},
{'element_id': '859ecb332da6961acd2fb6a0185d1549',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
One the feature branch, the output is the following, and the tags are
correctly separated.
```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure',
'type': 'Title'},
{'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'A conure is a very friendly bird.\n'
'\n'
'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
### Summary
An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element if over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Section under `combine_under_n_chars` characters are combined. The
default is `500`.
### Testing
```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
```
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
The issue was that for blocks detected in an image such as:

, where the full image is:
https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:
https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280
Test Instructions:
1. run the following snippet:
```
import json
import os
from datetime import datetime
from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)
print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
d1 = datetime.now()
elements = partition(filename=filename, strategy=strategy)
elems_as_dicts = json.loads(elements_to_json(elements, indent=2))
# strip out metadata for the sake of more readable results
for element_dict in elems_as_dicts:
del element_dict["metadata"]
json_filename=f"{output_filename_part}-{strategy}.json"
with open(json_filename, "w") as jsonf:
jsonf.write(json.dumps(elems_as_dicts, indent=2))
d2 = datetime.now()
print(f"num elements for {strategy}: {len(elements)}")
print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.
2. Open the json file that was writen to your `output_dir`, named
IRS-form-1987.png-hi_res.json
Witness the new element:
```
{
"type": "ListItem",
"element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
"text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87
-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation
or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio
ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f
or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre
dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the
services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil
e Form 3115 for these changes."
},
```
### Summary
Closes#1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.
### Testing
```python
from unstructured.partition_email import partition_email
filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```
```python
from unstructured.partition_msg import partition_msg
filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
### Summary
Closes#1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html
html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
- Puts threaded replies into the same text field as parent message,
allowing for a full thread to be under a single element_id
- Output is now XML instead of TXT to allow for easier parsing of new
format.
https://github.com/Unstructured-IO/unstructured/issues/1186
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and
- confirms that the function reads all the pages in the TIFF.
- page number is added to the metadata
This PR is branched from and developed on top of 6d6be99 commit.
### Summary
Closes#1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/winter-sports.epub"
# Should not emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should raise an error
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
- fixes#1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
chaos reigns in the changelog. whyyy
* there was no 0.10.3 release, so remove that from the CHANGELOG.
* fixup 0.10.5 with a couple that were added (in retrospect)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Description
Adds three custom errors to ingest:
* `SourceConnectionError`
* `DestinationConnectionError`
* `PartitionError`
Included is a base custom error class that adds a wrapper. This wrapper
wraps any raised exception into the custom error.
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.
I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.
### Testing
First `pip uninstall ebooklib`. Then run
```python
from unstructured.partition.auto import partition
partition(filename="example-docs/winter-sports.epub")
```
The error should look like
```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Documentation Overhaul
- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
**Summary**
Closes#747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
### Summary
Closes#1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.
### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email
filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified
email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .
This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3.
Bonus: remove accidental dupe line in 0.10.0.