459 Commits

Author SHA1 Message Date
cragwolfe
aaea6358f6
build(deps): bump pip (#558) 2023-05-08 23:08:10 -07:00
ryannikolaidis
2fc4d37454
chore: pin inference version, bump deps, and update openssl (#551) 2023-05-08 17:02:55 -07:00
Matt Robinson
3d3f3df3ec
enhancement: add "ocr_only" strategy for PDFs (#553)
* add tests for validating strategy

* refactor into determine_pdf_strategy function

* refactor pdf strategies into strategies

* remove commented out code

* remove unreachable code

* add in handling for image types

* a little more refactoring

* import ocr partioning for images

* catch warnings, partition type for valid strategies

* fallback to ocr_only from fast

* fallback logic for hi_res

* test for fallback to ocr only

* fallback logic ofr ocr_only

* more tests for fallback logic

* update doc strings

* version and changelog

* linting, linting, linting

* update docs to include notes about strategy

* fix typos

* change back patched filename
0.6.4
2023-05-08 17:21:24 +00:00
Trevor Bossert
1ac72c6ee8
Fixes issue where detectron2 would not install on OSX (#552)
* Fixes issue where detectron2 would not install on OSX

Tested on Apple silicon based MacBook Pro.  This installs tensorboard which is required on OSX and arm based cpu’s for detectron2.

* Improve Arch detection for tensorboard

* remove makefile from commands in readme

pin tensorboard version
2023-05-05 17:16:28 -07:00
Matt Robinson
0fc0571c02
fix(ci): don't skip deploy for tags (#549) 2023-05-05 09:51:41 -04:00
Matt Robinson
392cccdbf7
enhancement: add ocr_only strategy for partition_image (#540)
* spike for ocr-only strategy for images

* fix for file processing

* extra space

* add korean to ci

* added test for ocr_only strategy

* added docs for ocr_only

* changelog and version

* added test for bad strategy

* skip korean test if in docker

* bump version

* version bump

* document valid strategies

* bump version for release

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.6.3
2023-05-04 20:23:51 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
981805e435
feat: stage_for_baseplate function (#546)
* added a staging brick for baseplate

* added a test for baseplate

* update documentation

* version and changelog
2023-05-04 11:05:38 -04:00
Matt Robinson
aa01cdfc7a
fix: group together text from the same bounding box in partition_pdf with fast strategy (#542)
* switch to using PDF objects

* linting, linting, linting

* couple more tweaks

* added test for chevron-page

* version and changelog

* linting, linting, linting

* now processing 4 files
2023-05-03 18:33:24 -04:00
Matt Robinson
7e43a25f07
feat: add partition_multiple_via_api function (#539)
* added function for multiple files via api

* make multiple work with files

* updated docs strings

* changelog and version

* docs and contextlib for open files

* tests for partition multiple

* add tests for error conditions

* add output example
2023-05-03 15:06:06 -04:00
Matt Robinson
3c3c59a726
build(deps): add pdfminer.six to dependencies (#537) 2023-05-02 15:36:12 +00:00
Matt Robinson
19488bf15f
ci: only build docs on tags (#538)
* ci: only build docs on tags

* add branch for docs builds
2023-05-02 15:15:23 +00:00
dependabot[bot]
61209b34bd
build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530)
Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2.
- [Release notes](https://github.com/aio-libs/yarl/releases)
- [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2)

---
updated-dependencies:
- dependency-name: yarl
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 18:18:45 -04:00
Matt Robinson
e805ed465d
docs: add slack and github links back into docs page (#535)
* stars and github link to top of page

* wording updates

* remove unnecessary font weight change

* remove next arrows

* buttons to bottom on sidebar
2023-05-01 18:17:52 -04:00
Matt Robinson
22ebfa6714
docs: add download badges to README (#536)
* downloads badge

* total downloads
2023-05-01 18:17:31 -04:00
dependabot[bot]
7f9ec8108d
build(deps): bump importlib-metadata in /requirements (#531)
Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0.
- [Release notes](https://github.com/python/importlib_metadata/releases)
- [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst)
- [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0)

---
updated-dependencies:
- dependency-name: importlib-metadata
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 21:49:26 +00:00
dependabot[bot]
8ed1627928
build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528)
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases)
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1)

---
updated-dependencies:
- dependency-name: huggingface-hub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-01 17:22:21 -04:00
Matt Robinson
1b8e9a353a version bump for release 0.6.2 2023-04-26 16:29:16 -04:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changlog and version bump
2023-04-26 13:52:47 -04:00
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
JaeyongLee
be8e6da884
fix: correct return types in exceeds_caps_ratio (#489)
* fix: fix text_type.py exceeds_cap_ratio() returns

There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected

* Update text_type.py exceeds_cap_ratio()

..

* Update text_type.py

..

* Update CHANGELOG.md

..

* linting, linting, linting ...

* update tests

* more test fixes

* Update text_type.py

..

* bump version and changelog

* add punctuation check

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-04-24 10:45:09 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
qued
5b6640a55a
chore: change table param name (#513)
Updated parameter names that controls whether we try to infer table structure.
0.6.1
2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso
ba59ad6b3a
chore: add copy-protected pdf to sample-docs (#512) 2023-04-21 18:02:38 +00:00
Matt Robinson
a7a9ccd3a4
ci: separate job for ingest tests (#511)
* separate job for ingest tests

* remove lint from description
2023-04-21 13:31:36 -04:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
0.6.0
2023-04-21 17:01:29 +00:00
Mallori Harrell
5d1e61cb3f
feat: add msg attachment support (#510)
* add msg function and fix bug in eml attachment function
2023-04-21 11:14:46 -05:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
natygyoon
db2f70dbc4
sync version-sync.sh with other repos (#508) 2023-04-21 05:48:38 +09:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
43854e367a
docs: fix incomplete hi_res docs (#505) 2023-04-20 09:43:33 -04:00
Amanda Cameron
db6e5b41b8
chore: updating readme with api announcement (#499)
* updating readme
2023-04-19 11:59:26 -07:00
Matt Robinson
87c6d5e679
build: version bump for 0.5.13 release (#501) 0.5.13 2023-04-19 14:35:45 -04:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500) 2023-04-19 18:16:44 +00:00
Matt Robinson
39b261aee6
fix: group broken paragraphs when using the fast strategy for PDFs (#485)
* group broken paragraphs with fast strategy

* changelog and version

* fix broken tests for text.py

* formatting for paragraph pattern re

* fix test

* fix whitespace substitution

* one more test tweak

* blurb to account for short lines

* fix for shorter paragraphs

* update changelog

* remove extra line break from auto

* retrigger ci

* trying skipping azure

* skip azure (test)

* updated github and azure fixtures

* update slack fixture
2023-04-19 13:54:17 -04:00
Shukri
396295fc04
fix: formatting error in sphinx docs (#498)
* fix: formatting error in sphinx docs
2023-04-17 23:13:09 -07:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes issue where .json files were recognized as "text/plain" rather than "application/json on
the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Including an xfail test for #492 .
2023-04-17 23:11:21 -07:00
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
qued
3a61046307
fix: Fix typo in function call (#491)
Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.
2023-04-17 23:37:50 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
pravin-unstructured
4020da56ad
Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484) 2023-04-17 11:53:25 -04:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, the fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
JaeyongLee
8456676fad
fix: fix text_type.py exceeds_cap_ratio() returns (#478)
There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-14 11:53:10 -07:00
cragwolfe
46ac2a2226
build(CI): add access token for github-ingest test (#482)
Avoids the occaisonal CI test failures in test-ingest-github.sh that were due to
rate-limited non-auth'ed requests against a GitHub repo.
2023-04-14 11:14:21 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and versoin

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
ec02d9298e
fix: only warn about fallback to fast in partition_pdf if hi_res is used (#479)
* only warn if detectron2 not available and hi_res is used

* changelog and version
2023-04-13 11:46:35 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
jonvet
7f0f33ddb0
fix: encode xml string if document_tree is None in _read_xml (#477)
* fix: encode xml string if document_tree is `None` in `_read_xml`

* don't encode text in test
2023-04-13 09:09:58 -04:00