83 Commits

Author SHA1 Message Date
Christine Straub
2d951722df
Feat/1332 save embedded images in pdf (#1371)
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing


from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
2023-09-22 09:16:03 +00:00
Amanda Cameron
e359afafbe
fix: coordinates bug on pdf parsing (#1462)
Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460

We were raising an error with invalid coordinates, which prevented us
from continuing to return the element and continue parsing the pdf. Now
instead of raising the error we'll return early.

to test:
```
from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-09-19 19:25:31 -07:00
shreyanid
eb8ce89137
chore: function to map between standard and Tesseract language codes (#1421)
### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.

### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.

### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any lang codes in the languages parameter (Tesseract or standard).

for example,
```
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract
2023-09-18 08:42:02 -07:00
Amanda Cameron
a9f18eddb8
chore: adding test case for odt tables (#1434)
ODT table extraction is happening! Just added to an existing example-doc
and an accompanying test case.
2023-09-16 22:29:44 -07:00
Newel H
cd704e873b
Feat: Create a naive hierarchy for elements (#1268)
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.

### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset

### How it works

Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:

1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.

### Schema Changes

```
@dataclass
class ElementMetadata:
      coordinates: Optional[CoordinatesMetadata] = None
      data_source: Optional[DataSourceMetadata] = None
      filename: Optional[str] = None
      file_directory: Optional[str] = None
      last_modified: Optional[str] = None
      filetype: Optional[str] = None
      attached_to_filename: Optional[str] = None
+     parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+     category_depth: Optional[int] = None

...
```

### Testing
```
from unstructured.partition.auto import partition
from typing import List

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

for element in elements:
    print(
        f"Category:  {getattr(element, 'category', '')}\n"\
        f"Text:      {getattr(element, 'text', '')}\n"
        f"ID:        {element.id}\n" \
        f"Parent ID: {element.metadata.parent_id}\n"\
        f"Depth:     {element.metadata.category_depth}\n" \
    )
```

### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.

I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
Amanda Cameron
7fd81dc7df
Table processing test for RTF (#1388)
This PR does two things:
1. Adds test case (and alters sample doc) for rtf and epub files with
table
2. Adds `xls/x` file extension to `skip_infer_table_types` default list

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-09-12 18:27:05 -07:00
Yao You
1a0b737e9c
revert pdf changes and add new pdf for empty page testing (#1255)
- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page says "this page is
intentionally left blank" for tests
2023-09-01 22:33:06 +00:00
Yao You
27773132b7
[issue 1247] fix element and bbox mismatch bug (#1250)
This PR resolves #1247 by using the matching elements and bbox for
coordinate computation.

This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )

Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.

This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
2023-08-30 23:34:55 +00:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.

### Testing

```python
from unstructured.partition_email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition_msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
2023-08-25 17:09:25 -07:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Klaijan
1524841cd9
feat: supports multipage tiff (#1131)
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and

- confirms that the function reads all the pages in the TIFF.

- page number is added to the metadata

This PR is branched from and developed on top of 6d6be99 commit.
2023-08-24 15:12:50 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes Github issue #1010

adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines
2023-08-15 09:35:54 -07:00
Mike Lay
79a1eb8683
Handle inline and lacking filename (#1109)
Handle Content-Disposition: inline and attachment without filename

* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
2023-08-14 18:38:53 +00:00
Mike Lay
2e0ab86c6a
Fix attachments with = in filename (#1110)
Fix attachments with = in filename

* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
2023-08-13 20:35:18 -07:00
Christine Straub
fc2699ff06
Fix/1057 etree parser error tsv (#1106)
* feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji
* chore: update changelog & version
2023-08-14 01:22:36 +00:00
Christine Straub
d26ab1deac
fix: etree parser error (#1077)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version
2023-08-10 23:28:57 +00:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
Christine Straub
b76d2ee745
feat: track emphasized text msword (#1048)
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph

* chore: add docstring

* chore: fix lint errors

* feat: ignore spaces when extracting emphasized texts from a paragraph

* feat: add functionality to track emphasized text (`bold/italic` formatting) from table

* test: add test case for grabbing emphasized texts from element metadata

* chore: fix lint errors

* chore: update changelog & version

* Update ingest test fixtures (#1047)
2023-08-04 17:04:12 -04:00
Hynek Kydlíček
47b20119c3
fix: extract emojis with partition_xlsx (#1009)
* 🐛 fixxed emoji xlsx bug

* update version and changelog

* check if beautifulsoup exists

* update docs

* fix html parser call

* fix failing attachment test

*   added emoji test, added requirment fixed dependency

* 🐛 dependency

* 🐛 correct depeendency

* linting, linting, linting

* check for bs4

* skip auto xls filename test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 10:14:08 -04:00
Sebastian Laverde Alfonso
084ead173a
chore: custom layout order example notebook (#1024)
* chore: CFR double column sample

Federal Regulations document for example notebook in `examples/custom-layout-order`

* chore: custom-layout-order example dir

* feat: helper methods to plot and reorder layouts

Helper methods: `plot_image_with_bounding_boxes_coloured` and `reorder_elements_in_double_columns`

* chore: delete __init__.py

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-08-02 18:29:04 -06:00
Yuming Long
d46c1c2d83
Chore: Pass table support param to partition image (#973)
* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict
2023-07-27 13:33:36 -04:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
2023-07-26 04:09:26 +00:00
John
f282a10715
enhancement: improve json detection by detect_filetype (#971)
* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version
2023-07-25 12:47:39 -04:00
Jason Scheirer
196efa09b1
chore: Add encoding param to ingest (#955)
* Add encoding param to ingest
2023-07-24 10:06:13 -07:00
Christine Straub
5b7ae29876
fix: 521 pdf2image memory error (#924)
Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy.

* Add functionality to convert a PDF in small chunks of pages at a time
* Add functionality to write images to computer storage temporarily instead of keeping them in memory
* Set the file's current position to the beginning after reading the file in convert_to_bytes
2023-07-14 15:08:33 -05:00
John
6173362620
fix: detect list items in MS Word documents (#909)
* fix merge conflict

* update changelog and version
2023-07-10 15:29:08 +00:00
qued
79f734d3f9
fix: better extractable check (#900)
auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.
2023-07-07 23:41:37 -05:00
Christine Straub
47bc4009a8
fix: adjust threshold for encoding detection (#894)
* chore: add example doc

* fix: adjust encoding recognition threshold value in `detect_file_encoding`

* test: add test cases for German characters

* chore: update changelog & version
2023-07-07 09:25:03 -04:00
Matt Robinson
52aced8677
fix: validate encodings from email headers (#881)
* add validate encoding function

* remove extraneous file

* added test case for malformed encoding

* version and changelog
2023-07-06 13:49:27 +00:00
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 15:26:01 -04:00
shreyanid
433d6af1bc
fix: format Arabic and Hebrew annotated encodings (#823)
* add modified arabic and hebrew encodings

* added calls to format_encoding_str so encoding is checked before use

* added formatting to detect_filetype()

* explicitly provided default value for null encoding parameter

* fixed format of annotated encodings list

* adding hebrew base64 test file

* small lint fixes

* update changelog

* bump version to -dev2
2023-06-27 18:15:02 -07:00
kravetsmic
58e988e110
feature(html partition): parse pre tag (#642)
* feature(html partition): parse pre tag

* chore: update CHANGELOG.md

* style: black format xml.py

* Added tests dor html with pre tag

* remove skip test, update parse pre tag

* fix style

* chore: spell check

* chore: update changelog & version

* chore: update ingest test fixtures

* chore: add exception handling if `element.text` is `None` in `_read_xml`

* test: add more sanity testing on the `.text` content of the element(s)

* refactor: move the conditional logic for <pre> outside of the `try/except` block

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-06-27 18:52:39 +00:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
Matt Robinson
8683e2695c
fix: enable partition_pdf to recursively grab text with fast strategy (#796)
* initial pass on text in figures

* refactor text extraction

* update tests

* fix title test

* add test for docs that require recursive text grab

* version and changelog

* ingest-test-fixtures-update

* there are 8 pdf files now
2023-06-22 11:19:54 -04:00
Christine Straub
743482b6d3
Bug/635 unicode decode error eml (#739)
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
   common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
2023-06-17 00:52:13 +00:00
Angus Sinclair
ec403e245c
fix malformed pptx issue (#761)
* fix malformed pptx issue

Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions.

* update changelog

* fix lint issue SIM102 nested ifs

* fix black linting
2023-06-15 19:52:44 +00:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix\ docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
Matt Robinson
053a6c6e5c
enhancement: extract headers and footers in partition_docx (#742)
* added tests for headers and footers

* add docs on headers and footers; tweak to metadata

* version and changelog
2023-06-14 09:42:59 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition (#710)
* add empty filetype

* add empty handling to partition

* changelog and version
2023-06-09 16:07:50 -04:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
ryannikolaidis
7d157c1ede
test: add benchmark script (#638) 2023-06-05 09:14:43 -07:00
Meir
74a61e33d8
fix: metadata.page_number of pptx files (#675)
* fix: metadata.page_number of pptx files

* update changelog
2023-06-02 13:22:43 +00:00
cshaddox
d23e0d6420
feat: table extraction for power points (#664)
* Handling tables

* updating changelog

* Adding accidentally removed code

* remove newline

* reuse table extraction function; add test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 18:26:32 +00:00
Christine Straub
5b5fb3e13b
Issue/encoding error eml (#639)
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the
encoding is raised and the user has not specified an encoding.
2023-05-30 10:24:02 -07:00
cragwolfe
c5d9469001
feat: add xls support (#632)
Add support for older .XLS files from the partition function in unstructured.partition.auto.

Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).
2023-05-26 01:55:32 -07:00
Christine Straub
a1fed6d4c6
Issue/unicode error (#608)
This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
2023-05-23 13:35:38 -07:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
2023-05-19 15:57:42 -04:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesnt need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00