1447 Commits

Author SHA1 Message Date
dependabot[bot]
61209b34bd
build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530)
Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2.
- [Release notes](https://github.com/aio-libs/yarl/releases)
- [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2)

---
updated-dependencies:
- dependency-name: yarl
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 18:18:45 -04:00
Matt Robinson
e805ed465d
docs: add slack and github links back into docs page (#535)
* stars and github link to top of page

* wording updates

* remove unnecessary font weight change

* remove next arrows

* buttons to bottom on sidebar
2023-05-01 18:17:52 -04:00
Matt Robinson
22ebfa6714
docs: add download badges to README (#536)
* downloads badge

* total downloads
2023-05-01 18:17:31 -04:00
dependabot[bot]
7f9ec8108d
build(deps): bump importlib-metadata in /requirements (#531)
Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0.
- [Release notes](https://github.com/python/importlib_metadata/releases)
- [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst)
- [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0)

---
updated-dependencies:
- dependency-name: importlib-metadata
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 21:49:26 +00:00
dependabot[bot]
8ed1627928
build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528)
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases)
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1)

---
updated-dependencies:
- dependency-name: huggingface-hub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-01 17:22:21 -04:00
Matt Robinson
1b8e9a353a version bump for release 0.6.2 2023-04-26 16:29:16 -04:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changelog and version bump
2023-04-26 13:52:47 -04:00
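
To illustrate the check described in #520: a file reported as text/plain can be tested for JSON content by attempting to parse it. This is a minimal sketch, not the code from the commit. The helper name `looks_like_json` and the rule that only objects and arrays count as JSON documents are assumptions made for the example.

```python
import json
from typing import Union


def looks_like_json(content: Union[str, bytes]) -> bool:
    """Return True if content parses as a JSON object or array.

    Hypothetical helper: accepts str or bytes, mirroring the
    "bytes/strings handling" bullet in the commit message.
    """
    if isinstance(content, bytes):
        try:
            content = content.decode("utf-8")
        except UnicodeDecodeError:
            return False
    try:
        parsed = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Assumption: bare scalars such as "42" are treated as plain text.
    return isinstance(parsed, (dict, list))
```

A caller could run this only when the detected MIME type is text/plain, and report application/json when it returns True.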
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
JaeyongLee
be8e6da884
fix: correct return types in exceeds_cap_ratio (#489)
* fix: fix text_type.py exceeds_cap_ratio() returns

There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected

* Update text_type.py exceeds_cap_ratio()

..

* Update text_type.py

..

* Update CHANGELOG.md

..

* linting, linting, linting ...

* update tests

* more test fixes

* Update text_type.py

..

* bump version and changelog

* add punctuation check

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-04-24 10:45:09 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
qued
5b6640a55a
chore: change table param name (#513)
Updated the parameter name that controls whether we try to infer table structure.
0.6.1
2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso
ba59ad6b3a
chore: add copy-protected pdf to sample-docs (#512) 2023-04-21 18:02:38 +00:00
Matt Robinson
a7a9ccd3a4
ci: separate job for ingest tests (#511)
* separate job for ingest tests

* remove lint from description
2023-04-21 13:31:36 -04:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
0.6.0
2023-04-21 17:01:29 +00:00
Mallori Harrell
5d1e61cb3f
feat: add msg attachment support (#510)
* add msg function and fix bug in eml attachment function
2023-04-21 11:14:46 -05:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
natygyoon
db2f70dbc4
sync version-sync.sh with other repos (#508) 2023-04-21 05:48:38 +09:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
43854e367a
docs: fix incomplete hi_res docs (#505) 2023-04-20 09:43:33 -04:00
Amanda Cameron
db6e5b41b8
chore: updating readme with api announcement (#499)
* updating readme
2023-04-19 11:59:26 -07:00
Matt Robinson
87c6d5e679
build: version bump for 0.5.13 release (#501) 0.5.13 2023-04-19 14:35:45 -04:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500) 2023-04-19 18:16:44 +00:00
Matt Robinson
39b261aee6
fix: group broken paragraphs when using the fast strategy for PDFs (#485)
* group broken paragraphs with fast strategy

* changelog and version

* fix broken tests for text.py

* formatting for paragraph pattern re

* fix test

* fix whitespace substitution

* one more test tweak

* blurb to account for short lines

* fix for shorter paragraphs

* update changelog

* remove extra line break from auto

* retrigger ci

* trying skipping azure

* skip azure (test)

* updated github and azure fixtures

* update slack fixture
2023-04-19 13:54:17 -04:00
Shukri
396295fc04
fix: formatting error in sphinx docs (#498)
* fix: formatting error in sphinx docs
2023-04-17 23:13:09 -07:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes an issue where .json files were recognized as "text/plain" rather than "application/json" on
the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Includes an xfail test for #492.
2023-04-17 23:11:21 -07:00
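
One way a workaround like the one in #493 could look is a fallback on the file extension when an older libmagic reports text/plain. The sketch below is an assumption about the general approach, not the code from the commit; `resolve_mime_type` is a hypothetical name.

```python
import os


def resolve_mime_type(filename: str, detected_mime: str) -> str:
    """Correct a text/plain result for files with a .json extension.

    Hypothetical helper: detected_mime stands in for whatever libmagic
    reported; every other MIME type is passed through unchanged.
    """
    _, extension = os.path.splitext(filename)
    if detected_mime == "text/plain" and extension.lower() == ".json":
        return "application/json"
    return detected_mime
```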
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
qued
3a61046307
fix: Fix typo in function call (#491)
Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.
2023-04-17 23:37:50 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
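
The ordering described in #488 can be sketched as follows: report a non-zero exit code first, and only compare outputs against fixtures when the ingest run itself succeeded. The actual test scripts are shell scripts; this Python version and the names `run_ingest_check` and `outputs_match` are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Sequence


def run_ingest_check(command: Sequence[str], outputs_match: Callable[[], bool]) -> str:
    """Run an ingest command and return a single, relevant status message."""
    result = subprocess.run(command)
    if result.returncode != 0:
        # The run itself failed, so a fixture-mismatch hint would be misleading.
        return f"ingest failed with exit code {result.returncode}"
    if not outputs_match():
        return "outputs differ from fixtures; regenerate fixtures if the change is expected"
    return "ok"
```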
pravin-unstructured
4020da56ad
Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484) 2023-04-17 11:53:25 -04:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now; they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
JaeyongLee
8456676fad
fix: fix text_type.py exceeds_cap_ratio() returns (#478)
There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-14 11:53:10 -07:00
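
For context on #478, a capitalization-ratio check of this kind can be written so that every branch returns a plain boolean, which is what a caller such as is_possible_narrative_text relies on. The function below is a hypothetical reimplementation for illustration only: the name `exceeds_cap_ratio_sketch`, the 0.5 default threshold, and the word filtering are assumptions and may differ from text_type.py.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Return True if the share of capitalized words is above threshold."""
    # Assumption: only purely alphabetic tokens count as words.
    words = [word for word in text.split() if word.isalpha()]
    if not words:
        # No words to judge, so the text cannot exceed the ratio.
        return False
    capitalized = sum(1 for word in words if word.istitle() or word.isupper())
    return capitalized / len(words) > threshold
```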
cragwolfe
46ac2a2226
build(CI): add access token for github-ingest test (#482)
Avoids the occasional CI test failures in test-ingest-github.sh that were due to
rate-limited non-auth'ed requests against a GitHub repo.
2023-04-14 11:14:21 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and version

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
ec02d9298e
fix: only warn about fallback to fast in partition_pdf if hi_res is used (#479)
* only warn if detectron2 not available and hi_res is used

* changelog and version
2023-04-13 11:46:35 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
jonvet
7f0f33ddb0
fix: encode xml string if document_tree is None in _read_xml (#477)
* fix: encode xml string if document_tree is `None` in `_read_xml`

* don't encode text in test
2023-04-13 09:09:58 -04:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
0.5.12
2023-04-12 18:31:01 +00:00
qued
2110a266c8
fix: fix github issue formatting (#471)
Attempting to fix formatting of github issues transferred to Jira.

The old format was attempting to use double-slashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.

Now attempting to use formatted text in the yaml with |. This worked in the test repo, but I guess that's no guarantee.
2023-04-12 16:59:12 +00:00
Austin Walker
4af4d33423
feat: add --partition-by-api and --partition-host to unstructured-ingest (#443)
* Add --partition-by-api and --partition-host args to ingest

* Fix error in make check

* Bump changelog

* Add a test ingest script

Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.

* Remove the content type workaround
2023-04-11 22:05:07 -07:00
cragwolfe
ba4dadaa98
build: skip biomed ingest tests 90% of time due to ftp connectivity (#467) 2023-04-11 11:27:38 -07:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 .
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
cragwolfe
11f82a8b1b
fix(ingest): import connector-specific modules on demand (#460)
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
2023-04-08 11:35:35 -07:00
cragwolfe
bd01af2bac
build: add mimetypes DB to docker image (#455)
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.

Bonus: docx mimetype added in lookup table.
2023-04-07 13:59:29 -07:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
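
The idea behind #456, grouping paragraphs that were broken across lines by using regular expressions, can be sketched as below. This is not the library's implementation: the function name `group_broken_paragraphs_sketch` and both patterns are assumptions chosen to show the technique, where a blank line separates paragraphs and a single newline is treated as a soft wrap.

```python
import re

# Assumption: one or more blank lines mark a real paragraph boundary.
PARAGRAPH_BREAK = re.compile(r"\s*\n\s*\n\s*")
# Assumption: a single newline inside a paragraph is just a line wrap.
LINE_BREAK = re.compile(r"\s*\n\s*")


def group_broken_paragraphs_sketch(text: str) -> str:
    """Join wrapped lines within each paragraph, keeping paragraph breaks."""
    paragraphs = PARAGRAPH_BREAK.split(text.strip())
    cleaned = [LINE_BREAK.sub(" ", paragraph).strip() for paragraph in paragraphs]
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)
```

Splitting on the paragraph pattern first keeps real boundaries intact before the single line breaks are collapsed into spaces.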
ryannikolaidis
ee52a749c3
fix: docker smoke test on build (#457) 2023-04-06 10:03:42 -07:00
ryannikolaidis
ef9fb79ed4
chore: build with registry as cache (#454) 2023-04-06 00:34:07 -07:00
Matt Robinson
9b5cae49e1
fix: allow replace_mime_encodings to accept an encoding kwarg (#453)
* changelog and version

* added test
2023-04-05 22:53:38 +00:00