805 Commits

Author SHA1 Message Date
Matt Robinson
0fc0571c02
fix(ci): don't skip deploy for tags (#549) 2023-05-05 09:51:41 -04:00
Matt Robinson
392cccdbf7
enhancement: add ocr_only strategy for partition_image (#540)
* spike for ocr-only strategy for images

* fix for file processing

* extra space

* add korean to ci

* added test for ocr_only strategy

* added docs for ocr_only

* changelog and version

* added test for bad strategy

* skip korean test if in docker

* bump version

* version bump

* document valid strategies

* bump version for release

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.6.3
2023-05-04 20:23:51 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
981805e435
feat: stage_for_baseplate function (#546)
* added a staging brick for baseplate

* added a test for baseplate

* update documentation

* version and changelog
2023-05-04 11:05:38 -04:00
Matt Robinson
aa01cdfc7a
fix: group together text from the same bounding box in partition_pdf with fast strategy (#542)
* switch to using PDF objects

* linting, linting, linting

* couple more tweaks

* added test for chevron-page

* version and changelog

* linting, linting, linting

* now processing 4 files
2023-05-03 18:33:24 -04:00
Matt Robinson
7e43a25f07
feat: add partition_multiple_via_api function (#539)
* added function for multiple files via api

* make multiple work with files

* updated docs strings

* changelog and version

* docs and contextlib for open files

* tests for partition multiple

* add tests for error conditions

* add output example
2023-05-03 15:06:06 -04:00
Matt Robinson
3c3c59a726
build(deps): add pdfminer.six to dependencies (#537) 2023-05-02 15:36:12 +00:00
Matt Robinson
19488bf15f
ci: only build docs on tags (#538)
* ci: only build docs on tags

* add branch for docs builds
2023-05-02 15:15:23 +00:00
dependabot[bot]
61209b34bd
build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530)
Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2.
- [Release notes](https://github.com/aio-libs/yarl/releases)
- [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2)

---
updated-dependencies:
- dependency-name: yarl
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 18:18:45 -04:00
Matt Robinson
e805ed465d
docs: add slack and github links back into docs page (#535)
* stars and github link to top of page

* wording updates

* remove unnecessary font weight change

* remove next arrows

* buttons to bottom on sidebar
2023-05-01 18:17:52 -04:00
Matt Robinson
22ebfa6714
docs: add download badges to README (#536)
* downloads badge

* total downloads
2023-05-01 18:17:31 -04:00
dependabot[bot]
7f9ec8108d
build(deps): bump importlib-metadata in /requirements (#531)
Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0.
- [Release notes](https://github.com/python/importlib_metadata/releases)
- [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst)
- [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0)

---
updated-dependencies:
- dependency-name: importlib-metadata
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 21:49:26 +00:00
dependabot[bot]
8ed1627928
build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528)
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases)
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1)

---
updated-dependencies:
- dependency-name: huggingface-hub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-01 17:22:21 -04:00
Matt Robinson
1b8e9a353a
version bump for release 0.6.2 2023-04-26 16:29:16 -04:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changelog and version bump
2023-04-26 13:52:47 -04:00
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
JaeyongLee
be8e6da884
fix: correct return types in exceeds_cap_ratio (#489)
* fix: fix text_type.py exceeds_cap_ratio() returns

In some cases is_possible_narrative_text receives an incorrect return value from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.

* Update text_type.py exceeds_cap_ratio()

..

* Update text_type.py

..

* Update CHANGELOG.md

..

* linting, linting, linting ...

* update tests

* more test fixes

* Update text_type.py

..

* bump version and changelog

* add punctuation check

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-04-24 10:45:09 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
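A minimal sketch of the fallback described in #514, assuming the caller already knows whether text can be extracted from the PDF. The function name `choose_pdf_strategy` and the `text_extractable` flag are hypothetical and are not taken from the library; only the strategy names "fast" and "hi_res" come from this log.

```python
def choose_pdf_strategy(requested: str, text_extractable: bool) -> str:
    """Pick the strategy to run, falling back to hi_res when needed.

    Hypothetical helper: the "fast" strategy relies on extractable text,
    so a copy-protected PDF is routed to "hi_res" instead.
    """
    if requested == "fast" and not text_extractable:
        return "hi_res"
    # Any other request is honored unchanged.
    return requested
```

With this shape the copy-protection check stays separate from the partitioning code, which only sees the final strategy name.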
qued
5b6640a55a
chore: change table param name (#513)
Updated the parameter names that control whether we try to infer table structure.
0.6.1
2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso
ba59ad6b3a
chore: add copy-protected pdf to sample-docs (#512) 2023-04-21 18:02:38 +00:00
Matt Robinson
a7a9ccd3a4
ci: separate job for ingest tests (#511)
* separate job for ingest tests

* remove lint from description
2023-04-21 13:31:36 -04:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
0.6.0
2023-04-21 17:01:29 +00:00
Mallori Harrell
5d1e61cb3f
feat: add msg attachment support (#510)
* add msg function and fix bug in eml attachment function
2023-04-21 11:14:46 -05:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
natygyoon
db2f70dbc4
sync version-sync.sh with other repos (#508) 2023-04-21 05:48:38 +09:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
43854e367a
docs: fix incomplete hi_res docs (#505) 2023-04-20 09:43:33 -04:00
Amanda Cameron
db6e5b41b8
chore: updating readme with api announcement (#499)
* updating readme
2023-04-19 11:59:26 -07:00
Matt Robinson
87c6d5e679
build: version bump for 0.5.13 release (#501) 0.5.13 2023-04-19 14:35:45 -04:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500) 2023-04-19 18:16:44 +00:00
Matt Robinson
39b261aee6
fix: group broken paragraphs when using the fast strategy for PDFs (#485)
* group broken paragraphs with fast strategy

* changelog and version

* fix broken tests for text.py

* formatting for paragraph pattern re

* fix test

* fix whitespace substitution

* one more test tweak

* blurb to account for short lines

* fix for shorter paragraphs

* update changelog

* remove extra line break from auto

* retrigger ci

* trying skipping azure

* skip azure (test)

* updated github and azure fixtures

* update slack fixture
2023-04-19 13:54:17 -04:00
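The idea behind #485 can be illustrated with a short regex sketch: blank lines separate paragraphs, and single line breaks inside a paragraph are treated as soft wraps. The function name `join_broken_lines` and both patterns are assumptions made for illustration; the library's actual pattern and its handling of short lines may differ.

```python
import re

# Assumed patterns: a blank line ends a paragraph, a lone newline is a soft wrap.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
LINE_BREAK = re.compile(r"\s*\n\s*")


def join_broken_lines(text: str) -> str:
    """Rejoin lines that were split mid-paragraph, keeping paragraph breaks."""
    paragraphs = PARAGRAPH_BREAK.split(text)
    # Collapse each soft wrap into a single space.
    cleaned = [LINE_BREAK.sub(" ", p).strip() for p in paragraphs]
    # Drop empty fragments and restore one blank line between paragraphs.
    return "\n\n".join(p for p in cleaned if p)
```

For example, `join_broken_lines("The big\nbrown fox\n\njumped over\nthe dog")` yields two paragraphs, each on a single line.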
Shukri
396295fc04
fix: formatting error in sphinx docs (#498)
* fix: formatting error in sphinx docs
2023-04-17 23:13:09 -07:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes an issue where .json files were recognized as "text/plain" rather than "application/json" on
the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Includes an xfail test for #492.
2023-04-17 23:11:21 -07:00
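One way to work around the detection problem in #493 (and the related #520) is to re-check content that was reported as "text/plain". The sketch below uses only the standard library; `looks_like_json` and `refine_mime_type` are hypothetical names, not the library's functions, and treating only objects and arrays as JSON is an assumption.

```python
import json


def looks_like_json(text: str) -> bool:
    """Return True if text parses as a JSON object or array."""
    stripped = text.strip()
    # Assumption: bare numbers or strings should remain plain text.
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(stripped)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def refine_mime_type(detected: str, text: str) -> str:
    """Upgrade a text/plain result to application/json when the content is JSON."""
    if detected == "text/plain" and looks_like_json(text):
        return "application/json"
    return detected
```

Only the "text/plain" case is re-examined, so results from a newer libmagic that already reports "application/json" pass through untouched.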
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
qued
3a61046307
fix: Fix typo in function call (#491)
Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.
2023-04-17 23:37:50 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
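The ordering described in #488 can be sketched as follows: report a non-zero exit code first, and only compare outputs when the ingest run itself succeeded. The actual test scripts are shell scripts; this Python version, including the name `check_ingest` and the `outputs_match` callback, is a hypothetical illustration of the same control flow.

```python
import subprocess
from typing import Callable, Sequence


def check_ingest(command: Sequence[str], outputs_match: Callable[[], bool]) -> str:
    """Run an ingest command and describe the first failure, if any."""
    result = subprocess.run(command)
    if result.returncode != 0:
        # The run itself failed, so a fixture comparison would be misleading.
        return f"ingest failed with exit code {result.returncode}"
    if not outputs_match():
        return "outputs differ from fixtures; regenerate fixtures if this is expected"
    return "ok"
```

Because the exit code is checked before the comparison, the fixture message can only appear after a successful run.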
pravin-unstructured
4020da56ad
Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484) 2023-04-17 11:53:25 -04:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
JaeyongLee
8456676fad
fix: fix text_type.py exceeds_cap_ratio() returns (#478)
In some cases is_possible_narrative_text receives an incorrect return value from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-14 11:53:10 -07:00
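To illustrate why a consistent return value matters in #478, here is a minimal sketch of a capitalization-ratio check that returns a bool on every path. The name `cap_ratio_exceeds`, the default threshold of 0.5, and the tokenization are assumptions; this is not the code from text_type.py.

```python
def cap_ratio_exceeds(text: str, threshold: float = 0.5) -> bool:
    """Return True if the share of capitalized words is above threshold."""
    # Only words that start with a letter are counted (assumption).
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        # No countable words: return False rather than None.
        return False
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold
```

A caller such as a narrative-text classifier can then rely on a plain True or False without special-casing empty or numeric input.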
cragwolfe
46ac2a2226
build(CI): add access token for github-ingest test (#482)
Avoids the occasional CI test failures in test-ingest-github.sh that were due to
rate-limited non-auth'ed requests against a GitHub repo.
2023-04-14 11:14:21 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and version

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
ec02d9298e
fix: only warn about fallback to fast in partition_pdf if hi_res is used (#479)
* only warn if detectron2 not available and hi_res is used

* changelog and version
2023-04-13 11:46:35 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
jonvet
7f0f33ddb0
fix: encode xml string if document_tree is None in _read_xml (#477)
* fix: encode xml string if document_tree is `None` in `_read_xml`

* don't encode text in test
2023-04-13 09:09:58 -04:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
0.5.12
2023-04-12 18:31:01 +00:00
qued
2110a266c8
fix: fix github issue formatting (#471)
Attempting to fix formatting of github issues transferred to Jira.

The old format attempted to use double backslashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.

Now attempting to use formatted text in the yaml with |. This worked in the test repo, but I guess that's no guarantee.
2023-04-12 16:59:12 +00:00
Austin Walker
4af4d33423
feat: add --partition-by-api and --partition-host to unstructured-ingest (#443)
* Add --partition-by-api and --partition-host args to ingest

* Fix error in make check

* Bump changelog

* Add a test ingest script

Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.

* Remove the content type workaround
2023-04-11 22:05:07 -07:00
cragwolfe
ba4dadaa98
build: skip biomed ingest tests 90% of time due to ftp connectivity (#467) 2023-04-11 11:27:38 -07:00