929 Commits

Author SHA1 Message Date
Matt Robinson
43854e367a
docs: fix incomplete hi_res docs (#505) 2023-04-20 09:43:33 -04:00
Amanda Cameron
db6e5b41b8
chore: updating readme with api announcement (#499)
* updating readme
2023-04-19 11:59:26 -07:00
Matt Robinson
87c6d5e679
build: version bump for 0.5.13 release (#501) 0.5.13 2023-04-19 14:35:45 -04:00
Matt Robinson
4e1cc5ab3d
fix: add slack to fixture update script (#500) 2023-04-19 18:16:44 +00:00
Matt Robinson
39b261aee6
fix: group broken paragraphs when using the fast strategy for PDFs (#485)
* group broken paragraphs with fast strategy

* changelog and version

* fix broken tests for text.py

* formatting for paragraph pattern re

* fix test

* fix whitespace substitution

* one more test tweak

* blurb to account for short lines

* fix for shorter paragraphs

* update changelog

* remove extra line break from auto

* retrigger ci

* trying skipping azure

* skip azure (test)

* updated github and azure fixtures

* update slack fixture
2023-04-19 13:54:17 -04:00
Shukri
396295fc04
fix: formatting error in sphinx docs (#498)
* fix: formatting error in sphinx docs
2023-04-17 23:13:09 -07:00
cragwolfe
bfba2bb1eb
fix: workaround .json file detection with old libmagic installs (#493)
Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on
the Unstructured image (and other installs that may have an older libmagic).

Also adds missing json auto partition tests.

Including an xfail test for #492.
2023-04-17 23:11:21 -07:00
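A minimal sketch of the kind of fallback described in #493, assuming the workaround combines the file extension with a parse check. The helper name `refine_mime_type` and its signature are hypothetical and are not taken from the unstructured code base.

```python
import json


def refine_mime_type(detected: str, filename: str, text: str) -> str:
    """Hypothetical helper: upgrade "text/plain" to "application/json".

    Older libmagic builds may report JSON files as plain text, so the
    detected type is only trusted when it is something other than text/plain.
    """
    if detected != "text/plain":
        return detected
    if not filename.lower().endswith(".json"):
        return detected
    try:
        # json.JSONDecodeError is a subclass of ValueError.
        json.loads(text)
    except ValueError:
        return detected
    return "application/json"
```

Checking both the extension and the content keeps ordinary text files that happen to be named `.json` from being reclassified.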
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
qued
3a61046307
fix: Fix typo in function call (#491)
Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.
2023-04-17 23:37:50 +00:00
cragwolfe
5657378602
test: avoid misleading output in ingest tests (#488)
Previously, if there was an error (non-zero exit code) in an ingest test script,
the script would still complete and echo a warning about mismatched outputs
and how to regenerate the fixtures. However, this statement is irrelevant and
misleading: if the ingest failed with a non-zero exit code in the first place,
that is the failure that should be debugged -- don't confuse the user with
a comment about outputs.
2023-04-17 21:57:44 +00:00
pravin-unstructured
4020da56ad
Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484) 2023-04-17 11:53:25 -04:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
cragwolfe
a11563fe63
fix: update ingest test fixtures, disable biomed test (#486)
* Update test fixtures that should have been updated in prior commit
* Disable biomed ingest tests for now, they fail more often than not
* Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.
2023-04-15 00:07:09 +00:00
JaeyongLee
8456676fad
fix: fix text_type.py exceeds_cap_ratio() returns (#478)
In some cases is_possible_narrative_text receives an incorrect return value from exceeds_cap_ratio and misclassifies the text, so some of the return values of exceeds_cap_ratio are corrected.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-14 11:53:10 -07:00
cragwolfe
46ac2a2226
build(CI): add access token for github-ingest test (#482)
Avoids the occasional CI test failures in test-ingest-github.sh that were due to
rate-limited non-auth'ed requests against a GitHub repo.
2023-04-14 11:14:21 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and version

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
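The cleaning brick in #481 targets strings where each character actually holds one raw byte, which is how emojis end up looking like `ð\x9f\x98\x80`. The following is a sketch of one way such a conversion can work, written for illustration; it is not copied from the library, and the default encoding is an assumption.

```python
def bytes_string_to_string(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret a string whose characters are raw byte values.

    Every character must have a code point below 256; otherwise bytes()
    raises ValueError, because the value does not fit in a single byte.
    """
    raw = bytes(ord(char) for char in text)
    return raw.decode(encoding)


# The four characters after "Hello " are the UTF-8 bytes of U+1F600.
garbled = "Hello \xf0\x9f\x98\x80"
print(bytes_string_to_string(garbled))
```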
Matt Robinson
9c1c6a13f6
fix: updates markdown code to process markdown with embedded html (#480)
* add carriage return to html if missing

* test on markdown with embedded html

* changelog and version

* check for html parser

* linting, linting, linting
2023-04-13 12:47:45 -04:00
Matt Robinson
ec02d9298e
fix: only warn about fallback to fast in partition_pdf if hi_res is used (#479)
* only warn if detectron2 not available and hi_res is used

* changelog and version
2023-04-13 11:46:35 -04:00
Matt Robinson
b628fa8048
feat: allow headers in partition (#473)
* feat: allow headers in `partition`

* warning if header is set and url is not

* update emoji test
2023-04-13 15:04:15 +00:00
jonvet
7f0f33ddb0
fix: encode xml string if document_tree is None in _read_xml (#477)
* fix: encode xml string if document_tree is `None` in `_read_xml`

* don't encode text in test
2023-04-13 09:09:58 -04:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
0.5.12
2023-04-12 18:31:01 +00:00
qued
2110a266c8
fix: fix github issue formatting (#471)
Attempting to fix formatting of github issues transferred to Jira.

The old format was attempting to use double-slashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.

Now attempting to use formatted text in the yaml with |. This worked in the test repo, but I guess that's no guarantee.
2023-04-12 16:59:12 +00:00
Austin Walker
4af4d33423
feat: add --partition-by-api and --partition-host to unstructured-ingest (#443)
* Add --partition-by-api and --partition-host args to ingest

* Fix error in make check

* Bump changelog

* Add a test ingest script

Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.

* Remove the content type workaround
2023-04-11 22:05:07 -07:00
cragwolfe
ba4dadaa98
build: skip biomed ingest tests 90% of time due to ftp connectivity (#467) 2023-04-11 11:27:38 -07:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for issue #346 (Some PDF's with scanned images return empty elements).
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
cragwolfe
11f82a8b1b
fix(ingest): import connector-specific modules on demand (#460)
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
2023-04-08 11:35:35 -07:00
cragwolfe
bd01af2bac
build: add mimetypes DB to docker image (#455)
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.

Bonus: docx mimetype added in lookup table.
2023-04-07 13:59:29 -07:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
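A rough sketch of regex-based paragraph grouping in the spirit of #456, assuming paragraphs are separated by blank lines and single newlines inside a paragraph are layout breaks. The function name `group_paragraphs` and both patterns are illustrative and may differ from what the library ships.

```python
import re

# A blank line (optionally padded with whitespace) separates paragraphs.
PARAGRAPH_BREAK = re.compile(r"\s*\n\s*\n\s*")
# Any remaining single newline is treated as a broken line inside a paragraph.
LINE_BREAK = re.compile(r"\s*\n\s*")


def group_paragraphs(text: str) -> str:
    """Join hard-wrapped lines while keeping paragraph boundaries."""
    paragraphs = PARAGRAPH_BREAK.split(text.strip())
    joined = [LINE_BREAK.sub(" ", p).strip() for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


sample = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane,\nthe fox met a bear."
)
print(group_paragraphs(sample))
```

Splitting on paragraph breaks first means the line-break pattern never has to distinguish one newline from two.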
ryannikolaidis
ee52a749c3
fix: docker smoke test on build (#457) 2023-04-06 10:03:42 -07:00
ryannikolaidis
ef9fb79ed4
chore: build with registry as cache (#454) 2023-04-06 00:34:07 -07:00
Matt Robinson
9b5cae49e1
fix: allow replace_mime_encodings to accept an encoding kwarg (#453)
* changelog and version

* added test
2023-04-05 22:53:38 +00:00
Matt Robinson
b855fd269f
fix: fix html encoding to support foreign characters (#452)
* fix: fix html encoding to support foreign characters

* version and changelog
0.5.11
2023-04-05 20:18:54 +00:00
Siddartha Naidu
1dca0db6b0
fix: guard against style attribute being None (#449)
Some document elements may have a null style element which triggers an exception
when trying to access the name of the style.

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-05 19:22:43 +00:00
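The guard in #449 can be illustrated with stand-in classes. `Style`, `Paragraph`, `style_name`, and the `"Normal"` default below are all hypothetical; they only mimic the shape of a document element whose style may be missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Style:
    name: str


@dataclass
class Paragraph:
    text: str
    style: Optional[Style] = None  # some elements carry no style at all


def style_name(paragraph: Paragraph, default: str = "Normal") -> str:
    """Return the style name, falling back to a default when style is None."""
    if paragraph.style is None:
        return default
    return paragraph.style.name
```

Without the `None` check, `paragraph.style.name` raises `AttributeError` for unstyled elements.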
ryannikolaidis
d298f57b8f
fix: issue when filename is provided but file is not on disk (#446) 0.5.10 2023-04-05 17:54:11 +00:00
natygyoon
e6d6509d54
feat: add --download-only parameter to unstructured-ingest (#416)
Add a --download-only parameter that downloads files if they are not already present (as usual, into either --download-dir or the default ~/.cache/... location when --download-dir is not specified) and skips processing them through unstructured.
2023-04-05 17:14:41 +00:00
qued
5398cdf4e1
fix: change url to html_url (#451) 2023-04-05 11:13:23 -05:00
qued
92aac29cab
chore: add github link to autogenerated issue (#445)
Added a link back to the original Github issue from which the Jira issue was created for tracking purposes.
2023-04-05 10:03:30 -05:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414) 2023-04-05 02:59:06 +00:00
Matt Robinson
5ae895051a
feat: add sender and recipient info to element metadata for emails (#439)
* add header metadata for .eml messages

* sent to and from are lists

* add metadata for outlook emails

* version and changelog
2023-04-04 14:23:41 -04:00
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
Amanda Cameron
555b95b8f7
Fixing test for unstructured-api (#425)
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes, and the PARAGRAPH_PATTERN (a string) could not be matched against the bytes content.
0.5.9
2023-04-03 11:12:12 -07:00
Minty Mac
533241c274
Update README.md (#435)
Spelling error
2023-04-02 09:52:14 -07:00
ryannikolaidis
b746cb0ab4
docs: update README with notes about multi-platform (#433) 2023-04-01 17:31:32 -07:00
Matt Robinson
414883455b
fix: correct order of kwargs in pandoc (#421)
* fix: correct order of kwargs in pandoc

* only skip epub tests in Docker

* changelog

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.8
2023-03-30 20:54:29 +00:00
ryannikolaidis
59785e4332
chore: install all extras in Dockerfile (#419)
* Adds step to install all extras
* Adds smoke test of wikipedia ingest to validate in CI
2023-03-30 13:23:30 -07:00
cragwolfe
32c79caee3
chore: use only regex for contains_english_word. (#382)
Updates the characters to split on when creating candidate English words. Now uses a regex to parse out non-alphabetic characters for each word.

Note: This was originally an attempt to speed up contains_english_word() but there is no measurable change in performance.
2023-03-30 16:57:43 +00:00
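A small sketch of the regex-only approach described in #382, assuming candidates are produced by splitting on runs of non-alphabetic characters. The word set here is a tiny stand-in for a real vocabulary, and the minimum-length check is an assumption rather than documented behavior.

```python
import re

# Stand-in vocabulary; a real check would use a much larger English word list.
ENGLISH_WORDS = {"the", "parrot", "is", "dead"}
# One or more characters that are not lowercase ASCII letters.
NON_ALPHA = re.compile(r"[^a-z]+")


def contains_english_word(text: str) -> bool:
    """Return True if any alphabetic run in the text is a known word."""
    for candidate in NON_ALPHA.split(text.lower()):
        # Assumption: single letters are ignored as too ambiguous.
        if len(candidate) > 1 and candidate in ENGLISH_WORDS:
            return True
    return False
```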
Matt Robinson
e5dd9d5676
!fix: return a list of elements in stage_for_transformers (#420)
* update stage_for_transformers to return a list of elements

* bump changelog and version

* flag breaking change

* fix last word bug in chunk_by_attention_window
2023-03-30 12:27:11 -04:00
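The commit above does not say what the last-word bug in chunk_by_attention_window was. As an illustration only, the sketch below shows a window-based chunker where dropping the final flush would lose trailing words. It counts whitespace-separated words instead of tokenizer tokens, and the name `chunk_by_word_window` is hypothetical.

```python
from typing import List


def chunk_by_word_window(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        if len(current) == max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    # Flush the remainder; skipping this step would drop the trailing words.
    if current:
        chunks.append(" ".join(current))
    return chunks
```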
Umar Farooqi
2f5c61c178
fix: exclude empty tags during depth check (#379) 2023-03-30 10:03:50 -04:00
ryannikolaidis
19fb3031be
ci: update CI for deprecated set-output (#417) 2023-03-29 20:49:21 -07:00