1447 Commits

Author SHA1 Message Date
Charles
1ddf542e14
fix: Don't call extractable_elements if strategy is ocr_only (#1160)
- fixes #1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
2023-08-22 19:43:33 -07:00
cragwolfe
e9c649224e
chore: changelog repair (#1179)
chaos reigns in the changelog. whyyy

* there was no 0.10.3 release, so remove that from the CHANGELOG.
* fixup 0.10.5 with a couple that were added (in retrospect)
2023-08-22 16:48:18 -07:00
Austin Walker
e7d189fcc8
chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
0.10.5
2023-08-22 16:05:02 -07:00
Jack Retterer
05e311651a
doc: add delta tables connector reference (#1177)
Added delta tables to connectors page for users to discover
2023-08-22 12:50:27 -07:00
ryannikolaidis
ac2313a3fa
doc: fix get-api-key link (#1175) 2023-08-22 19:31:07 +00:00
ryannikolaidis
ab7fafcb41
doc: add pdf extra note (#1165) 2023-08-22 18:20:26 +00:00
Roman Isecke
4114022d9d
roman/ingest-custom-errors (#1152)
### Description
Adds three custom errors to ingest:
* `SourceConnectionError`
* `DestinationConnectionError`
* `PartitionError`

Included is a base custom error class that adds a wrapper. This wrapper
wraps any raised exception into the custom error.
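
A minimal sketch of how such a wrapper could look, assuming a decorator-style classmethod on the base class. The base class name `CustomError`, the `wrap` method, and `fetch_document` are hypothetical and not taken from the PR; only the three error names come from the description above.

```python
from functools import wraps


class CustomError(Exception):
    """Hypothetical base class (name assumed for illustration)."""

    @classmethod
    def wrap(cls, func):
        """Decorator that re-raises any exception from `func` as `cls`."""

        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except cls:
                # Already the target type: let it propagate unchanged.
                raise
            except Exception as error:
                # Chain the original exception so the traceback is preserved.
                raise cls(f"{type(error).__name__}: {error}") from error

        return wrapper


class SourceConnectionError(CustomError):
    pass


class DestinationConnectionError(CustomError):
    pass


class PartitionError(CustomError):
    pass


@SourceConnectionError.wrap
def fetch_document(path):
    # Hypothetical source call that always fails, for demonstration only.
    raise TimeoutError(f"could not reach {path}")
```

With this shape, callers can catch `SourceConnectionError` and still inspect the underlying failure through `__cause__`.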
2023-08-22 12:28:29 -04:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, it would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Matt Robinson
ad595d32f6
enhancement: tell users to install missing extras (#1167)
### Summary

Updates `partition` to let users know to install the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.

### Testing

First `pip uninstall ebooklib`. Then run

```python
from unstructured.partition.auto import partition

partition(filename="example-docs/winter-sports.epub")
```

The error should look like

```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
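
One way to produce this kind of message is to check for the optional dependency before dispatching. The sketch below is an assumption about the general pattern, not the code from this PR; `require_extra` and its parameters are hypothetical names.

```python
import importlib.util


def require_extra(module_name: str, partitioner: str, extra: str) -> None:
    """Raise an actionable ImportError if a top-level module is not installed.

    `module_name` should be a top-level package name such as "ebooklib".
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{partitioner} is not available. Install the {extra} dependencies "
            f'with pip install "unstructured[{extra}]"'
        )
```

Calling `require_extra("ebooklib", "partition_epub", "epub")` after uninstalling `ebooklib` would raise an error with the same wording as shown above.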
2023-08-22 03:00:21 +00:00
Jack Retterer
f639d04695
Fixed some typos (#1162)
The Wikipedia data connector was labeled as Airtable.
2023-08-21 18:03:15 -07:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00
Jack Retterer
a35ff890e0
Update docs jack (#1157)
Documentation Overhaul

- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
2023-08-21 10:27:32 -07:00
ryannikolaidis
6330278839
chore: add ingest diagrams and explanation of flow to ingest README (#1158) 2023-08-21 16:24:03 +00:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00
John
69edffb0c0
bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134)
### Summary
Closes #1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.

### Testing
```python
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email

filename = "example-docs/fake-email-attachment.msg"
elements = partition_msg(
    filename=filename,
    attachment_partitioner=partition_text,
    process_attachments=True,
    metadata_last_modified="0000-00-00",
)

# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified

email_filename = "example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(
    filename=email_filename,
    attachment_partitioner=partition_text,
    process_attachments=True,
    metadata_last_modified="0000-00-00",
)

email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
2023-08-18 23:21:11 +00:00
Austin Walker
dd243b4fd9
chore: pass ocr_mode in partition_pdf_or_image (#1154)
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).

I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
0.10.4
2023-08-18 20:59:08 +00:00
cragwolfe
1456f06b2d
chore: skip consistently failing test in main (#1150)
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .

This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
2023-08-18 10:06:17 -07:00
cragwolfe
6001e2fa62
chore: changelog repair (#1149)
0.10.2 had been released, but prior commit co-mingled 0.10.2 and 0.10.3. This corrects the changelog and intentionally skips over 0.10.3.

Bonus: remove accidental dupe line in 0.10.0.
2023-08-17 23:41:04 -07:00
Francisco Kurucz
d2a41f462d
doc: fix typo on partition_md function in bricks documentation (#1147) 2023-08-17 20:54:11 -07:00
ryannikolaidis
668d0f1b01
feat: per-process ingest connections (#1058)
* adds per process connections for Google Drive connector
2023-08-17 17:34:08 +00:00
cragwolfe
dd0f582585
build(deps): bump unstructured-inference==0.5.13 (#1141)
Bump to unstructured-inference==0.5.13, which includes:

Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
0.10.2
2023-08-17 06:25:00 +00:00
John
9f7bd6127b
enhancement: Add include_header kwarg for xlsx, default True (#1125)
Closes Github issue #1121

Adds include_header kwarg to partition_xlsx and change default behavior to True.
0.10.1
2023-08-17 04:16:23 +00:00
cragwolfe
22c12ef806
bump unstructured-inference (#1140)
Pulls in fix from unstructured-inference==0.5.12:

When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.
2023-08-16 22:29:37 +00:00
cragwolfe
6f1b8d5f28
build(deps): bump unstructured-inference to 0.5.11 (#1138)
* Bump unstructured-inference==0.5.11:
  - better defaults for DPI for hi_res and  Chipper
2023-08-16 20:52:40 +00:00
Christine Straub
0a23139720
enhancement: implement full-page OCR (#1133)
* implements full-page OCR as supported in unstructured-inference==0.5.11.
2023-08-16 19:16:35 +00:00
Newel H
be093d2e66
chore: Update dead links to correct pages (#1127)
Summary
Closes #1124

Updates dead links in repository README
- Quick Start > Install for local development
- Learn more > Batch Processing

Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass)

Testing
All tests pass
2023-08-16 10:43:37 -04:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes Github Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
0.10.0
2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso
fe5048a834
feat: chipper local inference notebook (#1116)
Download chipper model for local use and demonstrate how to partition a .pdf document
through the unstructured and unstructured_inference libraries.
2023-08-15 20:43:23 -07:00
cragwolfe
d19183f442
build(lint): don't check version in main against self (#1123)
If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version.

This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.
2023-08-15 17:57:59 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes Github issue #1010

adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines
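
The grouping idea can be sketched as follows. This is not the library's `group_bullet_paragraph` implementation; the function name `group_bullet_lines` and the set of bullet glyphs are assumptions chosen for illustration.

```python
import re
from typing import List

# Assumed bullet glyphs: "•", "·", "-", "*", each followed by whitespace.
BULLET_RE = re.compile(r"^\s*[\u2022\u00b7\-\*]\s+")


def group_bullet_lines(text: str) -> List[str]:
    """Merge continuation lines into the bullet item they belong to."""
    items: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if BULLET_RE.match(line) or not items:
            # A new bullet starts here; drop the glyph and keep the text.
            items.append(BULLET_RE.sub("", line).strip())
        else:
            # No bullet glyph: treat the line as a wrapped continuation.
            items[-1] += " " + stripped
    return items
```

Without this kind of grouping, a wrapped second line has no bullet marker and can be misclassified on its own, which is the symptom the issue describes.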
2023-08-15 09:35:54 -07:00
qued
cb923b96a2
build(deps): dependency cleanup (#1102)
Cleans up some pins that were prone to conflicts. All pins belong in constraints.in.
0.9.3
2023-08-15 05:15:44 +00:00
cragwolfe
d835fb1086
chore: bump pip version in published image (#1111)
for consistency with the development environment, i.e. the Makefile.
2023-08-14 21:59:31 +00:00
Mike Lay
79a1eb8683
Handle inline and lacking filename (#1109)
Handle Content-Disposition: inline and attachment without filename

* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
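
The handling described above can be sketched with the standard library `email` package. This is an illustration under assumptions, not the code from this PR: the function name `list_part_filenames`, the sample message, and the fallback name format are all hypothetical.

```python
from email import message_from_string

RAW = """\
From: a@example.com
To: b@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XX"

--XX
Content-Type: text/plain

body
--XX
Content-Type: text/plain
Content-Disposition: inline

inline part
--XX
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="report=final.bin"

data
--XX--
"""


def list_part_filenames(raw: str):
    """Return (disposition, filename) for inline and attachment parts."""
    msg = message_from_string(raw)
    found = []
    for index, part in enumerate(msg.walk()):
        # Returns "inline", "attachment", or None when the header is absent.
        disposition = part.get_content_disposition()
        if disposition not in ("inline", "attachment"):
            continue
        # Generate a placeholder name when no filename parameter is present.
        filename = part.get_filename() or f"unnamed-part-{index}"
        found.append((disposition, filename))
    return found
```

Letting the `email` package parse the header avoids splitting on `=` by hand, which is where both the bare `inline` value and filenames containing `=` cause trouble.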
2023-08-14 18:38:53 +00:00
Christine Straub
80266460fd
fix: GH issue 1057 etree parser error (csv) (#1112)
Addresses #1057 for CSV. Related to PR #1077.

* update partition_csv to always use soupparser_fromstring to parse html text
2023-08-14 17:48:57 +00:00
Mark Risher
612f9da6e8
Update news-of-the-day.ipynb - typo (#1113)
Fixed typo
2023-08-14 16:48:49 +00:00
Mike Lay
2e0ab86c6a
Fix attachments with = in filename (#1110)
Fix attachments with = in filename

* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
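
A minimal illustration of the fix, assuming the parameter arrives as a `key=value` string; `parse_param` is a hypothetical helper, not a function from the library.

```python
def parse_param(item: str):
    """Split a 'key=value' parameter on the first '=' only."""
    # maxsplit=1 guarantees exactly two parts even if the value contains '='.
    key, value = item.split("=", 1)
    return key.strip(), value.strip().strip('"')
```

An unbounded `item.split("=")` on `filename="a=b.txt"` yields three parts, so unpacking into two names fails; limiting the split keeps the remainder intact as the value.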
2023-08-13 20:35:18 -07:00
Christine Straub
fc2699ff06
Fix/1057 etree parser error tsv (#1106)
* feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji
* chore: update changelog & version
2023-08-14 01:22:36 +00:00
cragwolfe
b4b8ac4d8a
chore: run make pip-compile on mac (#1107)
so cuda deps removed.
2023-08-13 20:42:12 +00:00
Christine Straub
4a3176885f
Fix/1057 etree parser error xlsx (#1094)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version

* feat: add functionality to switch html text parser based on whether the html text contains emoji

* chore: update changelog & version

* fix lint errors

* test: revert the `EXPECTED_XLS_TEXT_LEN` value back

* feat: always use `soupparser_fromstring` to parse `html text`

* fix lint error
2023-08-13 12:20:33 -07:00
cragwolfe
02af625b93
chore: fix fickle test to not be so time sensitive (#1105) 2023-08-13 10:58:46 -07:00
Noah Greer
fa0a5afb71
docs: correct spelling of partition in docs (#1104)
Fixes a typo in several places where the word `partition` is misspelled
as `partiton`
2023-08-12 14:57:27 -07:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00
Ronny H
0d5b5a0e79
Revamp README & Bricks documentation (#1103)
Reorganize README.md
2023-08-12 19:58:51 +00:00
Roman Isecke
9d29f5dc2e
Add init file to make notion module discoverable (#1100)
One of the added modules was missing an __init__.py file which made it undiscoverable in the path when running as a cli command via console script rather than the PYTHONPATH=. python ... approach.
2023-08-12 12:21:07 -07:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
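
The difference between the two ID modes can be sketched as below. This is an assumption about the behavior, not the library's code: `element_id` is a hypothetical helper, and the content-hash default is assumed for contrast.

```python
import hashlib
import uuid


def element_id(text: str, unique_element_ids: bool = False) -> str:
    """Return an ID for an element with the given text."""
    if unique_element_ids:
        # Random UUID: distinct even when two elements share the same text.
        return str(uuid.uuid4())
    # Assumed default: deterministic ID derived from the element text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
```

Deterministic IDs are reproducible across runs but collide for repeated text; UUIDs trade reproducibility for uniqueness.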
2023-08-11 11:02:37 +00:00
Christine Straub
d26ab1deac
fix: etree parser error (#1077)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version
2023-08-10 23:28:57 +00:00
Ronny H
b31c62fa84
replace Weaviate nearText with BM25 query algorithm (#1078) 2023-08-10 22:15:27 +00:00
cragwolfe
6779918406
build(release): bump unstructured-inference (#1074)
* build(release): bump unstructured-inference

Related to downstream issue:
Unstructured-IO/unstructured-api#182

And upstream PR:
Unstructured-IO/unstructured-inference#165

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
0.9.2
2023-08-10 20:57:46 +00:00
Ahmet Melek
64a1930c46
chore[ingest]: fix confluence ingest diff tests (#1082)
* trigger CI

* trigger CI

* trigger CI

* do not ingest personal spaces in the diff test

* fix argument

* Update ingest test fixtures (#1083)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-10 17:45:17 +00:00