459 Commits

Author SHA1 Message Date
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
0.5.12
2023-04-12 18:31:01 +00:00
qued
2110a266c8
fix: fix github issue formatting (#471)
Attempting to fix formatting of GitHub issues transferred to Jira.

The old format was attempting to use double backslashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo.

Now attempting to use formatted text in the YAML with |. This worked in the test repo, but I guess that's no guarantee.
2023-04-12 16:59:12 +00:00
Austin Walker
4af4d33423
feat: add --partition-by-api and --partition-host to unstructured-ingest (#443)
* Add --partition-by-api and --partition-host args to ingest

* Fix error in make check

* Bump changelog

* Add a test ingest script

Also add a workaround for the test causing 400s from our api. Seems we need to make sure
unstructured-api can handle getting a file.content_type of None.

* Remove the content type workaround
2023-04-11 22:05:07 -07:00
cragwolfe
ba4dadaa98
build: skip biomed ingest tests 90% of time due to ftp connectivity (#467)
2023-04-11 11:27:38 -07:00
cragwolfe
7b44bcd6e0
build: script to update all ingest fixtures, add azure ingest fixtures (#367)
- Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.).
- Adds azure expected output fixtures for more useful reference points and as a repro for issue #346, "Some PDF's with scanned images return empty elements".
- Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details.
- Updates expected outputs with above script.
- Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.
2023-04-11 00:11:50 -07:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
cragwolfe
11f82a8b1b
fix(ingest): import connector-specific modules on demand (#460)
* fix(ingest): import connector-specific modules on demand
* unstructured-ingest --flatten-metadata supported for local connector.
* unstructured-ingest fix runtime error when using --metadata-include.
2023-04-08 11:35:35 -07:00
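The on-demand import named in this commit can be sketched roughly as below. This is a minimal illustration and not the code from the repository: the `CONNECTOR_MODULES` table and the `load_connector` helper are hypothetical, and stdlib modules stand in for the connector-specific dependencies.

```python
import importlib

# Hypothetical mapping from connector name to the module it needs.
# Stdlib modules stand in for heavy, connector-specific dependencies.
CONNECTOR_MODULES = {
    "local": "glob",
    "archive": "tarfile",
}


def load_connector(name: str):
    """Import the module for one connector only when it is requested."""
    try:
        module_name = CONNECTOR_MODULES[name]
    except KeyError:
        raise ValueError(f"unknown connector: {name}") from None
    # The import happens here, not at program start, so a missing optional
    # dependency only fails for users who actually select that connector.
    return importlib.import_module(module_name)
```

With this shape, `load_connector("local")` imports only the one module it needs, and an unknown name raises `ValueError`.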
cragwolfe
bd01af2bac
build: add mimetypes DB to docker image (#455)
The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does.

Bonus: docx mimetype added in lookup table.
2023-04-07 13:59:29 -07:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
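As a rough illustration of grouping broken paragraphs with regular expressions, the sketch below joins lines separated by a single newline and keeps blank lines as paragraph boundaries. The function name and both patterns are assumptions made for this example, not the library's actual implementation.

```python
import re

# Assumed convention: a blank line (two or more newlines) separates paragraphs.
PARAGRAPH_BREAK = re.compile(r"\s*\n\s*\n\s*")
# A single newline inside a paragraph is treated as a broken line.
LINE_BREAK = re.compile(r"\s*\n\s*")


def group_broken_paragraphs_sketch(text: str) -> str:
    """Rejoin lines that were split mid-paragraph."""
    paragraphs = PARAGRAPH_BREAK.split(text.strip())
    joined = [LINE_BREAK.sub(" ", paragraph) for paragraph in paragraphs if paragraph]
    return "\n\n".join(joined)
```

For example, `"The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."` comes back as two single-line paragraphs separated by one blank line.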
ryannikolaidis
ee52a749c3
fix: docker smoke test on build (#457)
2023-04-06 10:03:42 -07:00
ryannikolaidis
ef9fb79ed4
chore: build with registry as cache (#454)
2023-04-06 00:34:07 -07:00
Matt Robinson
9b5cae49e1
fix: allow replace_mime_encodings to accept an encoding kwarg (#453)
* changelog and version

* added test
2023-04-05 22:53:38 +00:00
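A minimal sketch of what an encoding keyword on a MIME-decoding helper might look like, using the stdlib `quopri` module. The function name and the `utf-8` default are assumptions for illustration and may not match the library's code.

```python
import quopri


def replace_mime_encodings_sketch(text: str, encoding: str = "utf-8") -> str:
    """Decode quoted-printable sequences such as =E2=80=99 using `encoding`."""
    raw = quopri.decodestring(text.encode(encoding))
    return raw.decode(encoding)
```

Passing `encoding="latin-1"` lets the same helper handle text whose escaped bytes are not valid UTF-8, for example `"caf=E9"`.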
Matt Robinson
b855fd269f
fix: fix html encoding to support foreign characters (#452)
* fix: fix html encoding to support foreign characters

* version and changelog
0.5.11
2023-04-05 20:18:54 +00:00
Siddartha Naidu
1dca0db6b0
fix: guard against style attribute being None (#449)
Some document elements may have a null style element which triggers an exception
when trying to access the name of the style.

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-04-05 19:22:43 +00:00
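The guard described in this commit can be pictured as follows. This is only a sketch: `style_name` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real document elements.

```python
from types import SimpleNamespace
from typing import Optional


def style_name(paragraph) -> Optional[str]:
    """Return the style name, or None when the element has no style."""
    style = getattr(paragraph, "style", None)
    if style is None:
        # Without this check, style.name would raise AttributeError.
        return None
    return style.name


# Stand-in objects for illustration only.
styled = SimpleNamespace(style=SimpleNamespace(name="Heading 1"))
unstyled = SimpleNamespace(style=None)
```

Here `style_name(styled)` gives the style's name while `style_name(unstyled)` gives `None` instead of raising.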
ryannikolaidis
d298f57b8f
fix: issue when filename is provided but file is not on disk (#446)
0.5.10
2023-04-05 17:54:11 +00:00
natygyoon
e6d6509d54
feat: add --download-only parameter to unstructured-ingest (#416)
Add a --download-only parameter that downloads files if they are not already present (as usual, in --download-dir, or in the default ~/.cache/... download location if --download-dir is not specified) and skips processing them through unstructured.
2023-04-05 17:14:41 +00:00
qued
5398cdf4e1
fix: change url to html_url (#451)
2023-04-05 11:13:23 -05:00
qued
92aac29cab
chore: add github link to autogenerated issue (#445)
Added a link back to the original GitHub issue from which the Jira issue was created, for tracking purposes.
2023-04-05 10:03:30 -05:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414)
2023-04-05 02:59:06 +00:00
Matt Robinson
5ae895051a
feat: add sender and recipient info to element metadata for emails (#439)
* add header metadata for .eml messages

* sent to and from are lists

* add metadata for outlook emails

* version and changelog
2023-04-04 14:23:41 -04:00
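To illustrate header metadata where the sender and recipients are lists, here is a small stdlib sketch. The function name and the `sent_from`, `sent_to`, and `subject` keys are assumptions chosen for this example and may differ from the library's metadata fields.

```python
from email import message_from_string
from email.utils import getaddresses


def header_metadata_sketch(raw_message: str) -> dict:
    """Collect sender and recipient addresses from an .eml-style message."""
    msg = message_from_string(raw_message)

    def addresses(header: str) -> list:
        # get_all returns every occurrence of the header; getaddresses
        # splits comma-separated values into (name, address) pairs.
        return [addr for _, addr in getaddresses(msg.get_all(header, []))]

    return {
        "sent_from": addresses("From"),
        "sent_to": addresses("To"),
        "subject": msg.get("Subject"),
    }
```

Both address fields are lists even when a header holds a single address, which keeps downstream handling uniform.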
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
Amanda Cameron
555b95b8f7
Fixing test for unstructured-api (#425)
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes, and the PARAGRAPH_PATTERN (a string pattern) could not be matched against the bytes content.
0.5.9
2023-04-03 11:12:12 -07:00
Minty Mac
533241c274
Update README.md (#435)
Spelling error
2023-04-02 09:52:14 -07:00
ryannikolaidis
b746cb0ab4
docs: update README with notes about multi-platform (#433)
2023-04-01 17:31:32 -07:00
Matt Robinson
414883455b
fix: correct order of kwargs in pandoc (#421)
* fix: correct order of kwargs in pandoc

* only skip epub tests in Docker

* changelog

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.8
2023-03-30 20:54:29 +00:00
ryannikolaidis
59785e4332
chore: install all extras in Dockerfile (#419)
* Adds step to install all extras
* Adds smoke test of wikipedia ingest to validate in CI
2023-03-30 13:23:30 -07:00
cragwolfe
32c79caee3
chore: use only regex for contains_english_word. (#382)
Updates the characters to split on when creating candidate English words. Now uses regex to strip non-alphabetic characters from each word.

Note: This was originally an attempt to speed up contains_english_word(), but there is no measurable change in performance.
2023-03-30 16:57:43 +00:00
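One way to picture the regex-based check is the sketch below. The tiny `ENGLISH_WORDS` set, the pattern, and the function name are stand-ins invented for this example. The real function presumably uses a much larger vocabulary.

```python
import re

# Tiny stand-in vocabulary; a set gives constant-time membership checks.
ENGLISH_WORDS = {"parrot", "dead", "is"}
# Anything that is not an ASCII letter is removed from each candidate word.
NON_ALPHA = re.compile(r"[^a-zA-Z]+")


def contains_english_word_sketch(text: str) -> bool:
    """Return True if any whitespace-separated token is a known word."""
    for token in text.lower().split():
        candidate = NON_ALPHA.sub("", token)
        if len(candidate) > 1 and candidate in ENGLISH_WORDS:
            return True
    return False
```

Punctuation attached to a word, as in `"Parrot!"`, is stripped before the lookup, while purely numeric or symbolic tokens reduce to an empty string and are skipped.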
Matt Robinson
e5dd9d5676
!fix: return a list of elements in stage_for_transformers (#420)
* update stage_for_transformers to return a list of elements

* bump changelog and version

* flag breaking change

* fix last word bug in chunk_by_attention_window
2023-03-30 12:27:11 -04:00
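The commit does not describe the last-word bug in detail. As a loose illustration of chunking that keeps the final word, the sketch below splits text by a plain word budget. The function name and the word-count budget are assumptions; the real code may instead budget by tokenizer tokens.

```python
from typing import List


def chunk_words_sketch(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words words, keeping every word."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    chunks = []
    # Stepping by max_words and slicing to the end of the list means the
    # trailing partial chunk, including the last word, is always emitted.
    for start in range(0, len(words), max_words):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Joining the returned chunks with spaces reproduces the original word sequence, which is a convenient property to assert in a regression test.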
Umar Farooqi
2f5c61c178
fix: exclude empty tags during depth check (#379)
2023-03-30 10:03:50 -04:00
ryannikolaidis
19fb3031be
ci: update CI for deprecated set-output (#417)
2023-03-29 20:49:21 -07:00
ryannikolaidis
77b6fb2792
ci: update dockerfile to also add models and nltk (#418)
2023-03-29 20:48:06 -07:00
natygyoon
7f6e094c1f
feat: add local file system connector for unstructured-ingest (#399)
* added local connector to unstructured-ingest
2023-03-29 15:53:23 -07:00
natygyoon
e6187b262f
enhancement: update elements_to_json to potentially return a string (#403)
* update elements_to_json to potentially return string if filename is not specified

* add text to elements_from_json
2023-03-29 12:38:30 -07:00
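The "return a string if no filename is given" behaviour could look roughly like this. It is a sketch under assumptions: elements are represented as plain dicts, and the function name and `indent` parameter are invented for the example.

```python
import json
from typing import List, Optional


def elements_to_json_sketch(
    elements: List[dict],
    filename: Optional[str] = None,
    indent: int = 4,
) -> Optional[str]:
    """Write JSON to filename if given; otherwise return the JSON string."""
    payload = json.dumps(elements, indent=indent)
    if filename is None:
        return payload
    with open(filename, "w", encoding="utf-8") as f:
        f.write(payload)
    return None
```

Callers that only need the serialized text can skip the temporary file entirely, while existing callers that pass a filename keep the write-to-disk behaviour.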
natygyoon
1da40806da
feat: add --max-docs parameter to unstructured-ingest (#402)
* added --max-docs parameter to unstructured-ingest
2023-03-30 03:24:12 +09:00
ryannikolaidis
65fec954ba
ci: publish amd and arm images (#404)
2023-03-29 07:02:39 +00:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
ryannikolaidis
e1a8db51ad
ci: test before publishing docker image (#390)
2023-03-27 13:16:48 -07:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.7
2023-03-24 16:32:45 -07:00
cragwolfe
8ffd31029e
clean doc text (#398)
2023-03-24 08:43:27 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds the pytest-mock requirement; those fixtures are nice to have!

Implements issue/feature #396.
2023-03-23 20:14:57 -07:00
natygyoon
a4394f6f16
feat: add --flatten-metadata to unstructured-ingest (#389)
* added --flatten-metadata to unstructured-ingest

* added unit tests for process_file()
2023-03-22 20:52:56 +00:00
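A small sketch of what flattening metadata could mean for one element, assuming it lifts the keys of a nested `metadata` dict to the top level and drops the `metadata` key itself. Both that interpretation and the function name are assumptions, since the commit message does not spell out the output shape.

```python
def flatten_metadata_sketch(element: dict) -> dict:
    """Lift keys from element['metadata'] to the top level of the element."""
    flat = {key: value for key, value in element.items() if key != "metadata"}
    # On a key collision the metadata value wins in this sketch.
    flat.update(element.get("metadata", {}))
    return flat
```

A flat record like this is easier to load into tabular tools, at the cost of possible key collisions between element fields and metadata fields.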
natygyoon
66a0369fb6
feat: add --fields-include to unstructured-ingest (#376)
* add --fields-include parameter to unstructured-ingest

* add unit tests for process_file()
2023-03-22 14:12:35 +00:00
cragwolfe
3467a2786d
Update patterns.py (#391)
2023-03-21 23:58:18 -07:00
natygyoon
6b17cb228e
refactor: use exactly one throughout code base (#385)
added `exactly_one` to additional places like unstructured/partition too.
2023-03-21 16:50:13 -07:00
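An "exactly one" argument check can be sketched as below. The name `exactly_one_sketch` and the rule that only `None` counts as "not provided" are assumptions for illustration. The helper in the code base may treat other values, such as empty strings, differently.

```python
def exactly_one_sketch(**kwargs) -> None:
    """Raise ValueError unless exactly one keyword argument is not None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified.")
```

A partition-style function accepting `filename`, `file`, or `text` could call `exactly_one_sketch(filename=filename, file=file, text=text)` as its first line to reject ambiguous input early.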
Amanda Cameron
a9da858fa3
chore: add tests for docker (#373)
2023-03-21 13:46:09 -07:00
Benjamin Torres
3c95b975fe
Fix: duplicated addition to elements list (#388)
0.5.6
2023-03-21 12:56:04 -07:00
natygyoon
c16862e7b3
feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368)
* added metadata in/exclude params

* updated process_file

* existing tests

* remove default behavior

* changelog and ci

* line length

* import

* import

* import sorted

* import

* type

* line length

* main

* ci

* json

* dict

* type ignore

* lint

* unit tests for process_file

* lint

* type changed to Optional(str)

* ci

* line length

* added mutex check

* nit
2023-03-22 03:30:53 +09:00
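The include/exclude filtering with a mutual-exclusion check mentioned in this commit might be sketched like this. The function name, the comma-separated string format, and the error message are assumptions made for the example.

```python
from typing import Optional


def filter_metadata_sketch(
    metadata: dict,
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> dict:
    """Keep or drop metadata keys named in a comma-separated string."""
    if include is not None and exclude is not None:
        # The two options express opposite intents, so passing both is an error.
        raise ValueError("metadata include and exclude are mutually exclusive")
    if include is not None:
        keep = set(include.split(","))
        return {key: value for key, value in metadata.items() if key in keep}
    if exclude is not None:
        drop = set(exclude.split(","))
        return {key: value for key, value in metadata.items() if key not in drop}
    return dict(metadata)
```

Rejecting the combination up front avoids having to define a precedence rule between the two options.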
ryannikolaidis
d5a0fce6a0
docs: update readme with notes about pulling and running the public Docker image. (#381)
2023-03-20 18:41:44 +00:00
cragwolfe
fbc7a69a53
feat: change english_words to set for performance gain (#380)
2023-03-19 22:51:32 +00:00