1393 Commits

Author SHA1 Message Date
qued
5398cdf4e1
fix: change url to html_url (#451) 2023-04-05 11:13:23 -05:00
qued
92aac29cab
chore: add github link to autogenerated issue (#445)
Added a link back to the original GitHub issue from which the Jira issue was created, for tracking purposes.
2023-04-05 10:03:30 -05:00
cragwolfe
3972c80c51
build(deps): bump requirements (#414) 2023-04-05 02:59:06 +00:00
Matt Robinson
5ae895051a
feat: add sender and receive info to element metadata for emails (#439)
* add header metadata for .eml messages

* sent to and from are lists

* add metadata for outlook emails

* version and changelog
2023-04-04 14:23:41 -04:00
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
Amanda Cameron
555b95b8f7
Fixing test for unstructured-api (#425)
Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes, and the PARAGRAPH_PATTERN (a string) could not then be matched against the bytes content.
0.5.9
2023-04-03 11:12:12 -07:00
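The bytes-versus-string mismatch described in #425 can be reproduced with the standard `re` module alone. This is a minimal sketch, not the actual fix: the pattern value and the `split_paragraphs` helper are hypothetical stand-ins, and UTF-8 is an assumed encoding.

```python
import re

# Hypothetical stand-in for the library's PARAGRAPH_PATTERN; the real value may differ.
PARAGRAPH_PATTERN = r"\n\s*\n"


def split_paragraphs(content):
    """Split text into paragraphs, decoding bytes first so the str pattern applies."""
    if isinstance(content, bytes):
        # Assumption: the file is UTF-8 encoded.
        content = content.decode("utf-8")
    return [p for p in re.split(PARAGRAPH_PATTERN, content) if p.strip()]


raw = b"First paragraph.\n\nSecond paragraph."

# Applying a str pattern directly to bytes raises TypeError.
mismatch_error = None
try:
    re.split(PARAGRAPH_PATTERN, raw)
except TypeError as err:
    mismatch_error = err

# Decoding first lets the same pattern work.
paragraphs = split_paragraphs(raw)
```

Decoding at the boundary where the file is read keeps every downstream pattern a plain `str`, which is one way to avoid this class of error.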
Minty Mac
533241c274
Update README.md (#435)
Spelling error
2023-04-02 09:52:14 -07:00
ryannikolaidis
b746cb0ab4
docs: update README with notes about multi-platform (#433) 2023-04-01 17:31:32 -07:00
Matt Robinson
414883455b
fix: correct order of kwargs in pandoc (#421)
* fix: correct order of kwargs in pandoc

* only skip epub tests in Docker

* changelog

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.8
2023-03-30 20:54:29 +00:00
ryannikolaidis
59785e4332
chore: install all extras in Dockerfile (#419)
* Adds step to install all extras
* Adds smoke test of wikipedia ingest to validate in CI
2023-03-30 13:23:30 -07:00
cragwolfe
32c79caee3
chore: use only regex for contains_english_word. (#382)
Updates the characters to split on when creating candidate English words. Now uses a regex to strip out non-alphabetic characters for each word.

Note: This was originally an attempt to speed up contains_english_word(), but there is no measurable change in performance.
2023-03-30 16:57:43 +00:00
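The regex-only approach described in #382 might look roughly like the sketch below. It is an illustration under stated assumptions, not the library's code: the vocabulary is a tiny hypothetical set, and the exact split pattern and minimum word length are guesses.

```python
import re

# Tiny hypothetical vocabulary; the real word list is far larger.
ENGLISH_WORDS = {"the", "parrot", "is", "dead"}

# Assumption: any run of non-alphabetic characters separates candidate words.
NON_ALPHA = re.compile(r"[^a-zA-Z]+")


def contains_english_word(text):
    """Return True if any alphabetic token in the text is a known English word."""
    candidates = NON_ALPHA.split(text.lower())
    # Skip empty strings and single letters produced by the split.
    return any(len(word) > 1 and word in ENGLISH_WORDS for word in candidates)
```

Because the vocabulary is a `set`, each membership check is a constant-time lookup on average, so the cost is dominated by the split itself.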
Matt Robinson
e5dd9d5676
!fix: return a list of elements in stage_for_transformers (#420)
* update stage_for_transformers to return a list of elements

* bump changelog and version

* flag breaking change

* fix last word bug in chunk_by_attention_window
2023-03-30 12:27:11 -04:00
Umar Farooqi
2f5c61c178
fix: exclude empty tags during depth check (#379) 2023-03-30 10:03:50 -04:00
ryannikolaidis
19fb3031be
ci: update CI for deprecated set-output (#417) 2023-03-29 20:49:21 -07:00
ryannikolaidis
77b6fb2792
ci: update dockerfile to also add models and nltk (#418) 2023-03-29 20:48:06 -07:00
natygyoon
7f6e094c1f
feat: add local file system connector for unstructured-ingest (#399)
* added local connector to unstructured-ingest
2023-03-29 15:53:23 -07:00
natygyoon
e6187b262f
enhancement: update elements_to_json to potentially return a string (#403)
* update elements_to_json to potentially return string if filename is not specified

* add text to elements_from_json
2023-03-29 12:38:30 -07:00
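The behavior described in #403, returning a string when no filename is given, can be sketched as follows. The function name, the plain-dict element representation, and the `indent` parameter are hypothetical; the real `elements_to_json` signature may differ.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def elements_to_json_sketch(
    elements: List[Dict], filename: Optional[str] = None, indent: int = 4
) -> Optional[str]:
    """Write elements to `filename` if given; otherwise return the JSON string."""
    serialized = json.dumps(elements, indent=indent)
    if filename is None:
        return serialized
    with open(filename, "w", encoding="utf-8") as f:
        f.write(serialized)
    return None


# Elements are modeled as plain dicts here purely for illustration.
elements = [{"type": "Title", "text": "Hello"}]

# No filename: the JSON comes back as a string.
as_string = elements_to_json_sketch(elements)

# With a filename: the JSON is written to disk and nothing is returned.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "elements.json")
    write_result = elements_to_json_sketch(elements, filename=path)
    with open(path, encoding="utf-8") as f:
        round_tripped = json.load(f)
```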
natygyoon
1da40806da
feat: add --max-docs parameter to unstructured-ingest (#402)
* added --max-docs parameter to unstructured-ingest
2023-03-30 03:24:12 +09:00
ryannikolaidis
65fec954ba
ci: publish amd and arm images (#404) 2023-03-29 07:02:39 +00:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
ryannikolaidis
e1a8db51ad
ci: test before publishing docker image (#390) 2023-03-27 13:16:48 -07:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.7
2023-03-24 16:32:45 -07:00
cragwolfe
8ffd31029e
clean doc text (#398) 2023-03-24 08:43:27 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396 .
2023-03-23 20:14:57 -07:00
natygyoon
a4394f6f16
feat: add --flatten-metadata to unstructured-ingest (#389)
* added --flatten-metadata to unstructured-ingest

* added unit tests for process_file()
2023-03-22 20:52:56 +00:00
natygyoon
66a0369fb6
feat: add --fields-include to unstructured-ingest (#376)
* add --fields-include parameter to unstructured-ingest

* add unit tests for process_file()
2023-03-22 14:12:35 +00:00
cragwolfe
3467a2786d
Update patterns.py (#391) 2023-03-21 23:58:18 -07:00
natygyoon
6b17cb228e
refactor: use exactly one throughout code base (#385)
added `exactly_one` to additional places like unstructured/partition too.
2023-03-21 16:50:13 -07:00
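A helper in the spirit of `exactly_one` from #385 could be written as below. This is a sketch under assumptions: the error message, the treatment of `None` as "not provided", and the `partition_sketch` caller are illustrative rather than the library's actual code.

```python
def exactly_one(**kwargs) -> None:
    """Raise ValueError unless exactly one keyword argument is not None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified.")


def partition_sketch(filename=None, file=None, text=None):
    """Hypothetical caller: validate the inputs, then report which source was used."""
    exactly_one(filename=filename, file=file, text=text)
    if filename is not None:
        return "filename"
    if file is not None:
        return "file"
    return "text"
```

Centralizing the check keeps each partition function's validation to a single line and makes the error message consistent across the code base.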
Amanda Cameron
a9da858fa3
chore: add tests for docker (#373) 2023-03-21 13:46:09 -07:00
Benjamin Torres
3c95b975fe
Fix: duplicated addition to elements list (#388) 0.5.6 2023-03-21 12:56:04 -07:00
natygyoon
c16862e7b3
feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368)
* added metadata in/exclude params

* updated process_file

* existing tests

* remove default behavior

* changelog and ci

* line length

* import

* import

* import sorted

* import

* type

* line length

* main

* ci

* json

* dict

* type ignore

* lint

* unit tests for process_file

* lint

* type changed to Optional(str)

* ci

* line length

* added mutex check

* nit
2023-03-22 03:30:53 +09:00
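The include/exclude filtering and the mutex check mentioned in #368 can be illustrated with a small helper. The function name, argument types, and error message are hypothetical; how unstructured-ingest parses the `--metadata-include` and `--metadata-exclude` values is not shown in this log.

```python
from typing import Any, Dict, Iterable, Optional


def filter_metadata(
    metadata: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Keep only `include` keys or drop `exclude` keys; the two are mutually exclusive."""
    if include is not None and exclude is not None:
        # Mutex check: the two options cannot be combined.
        raise ValueError("Specify at most one of metadata include and metadata exclude.")
    if include is not None:
        keep = set(include)
        return {k: v for k, v in metadata.items() if k in keep}
    if exclude is not None:
        drop = set(exclude)
        return {k: v for k, v in metadata.items() if k not in drop}
    # Neither option given: return an unmodified copy.
    return dict(metadata)
```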
ryannikolaidis
d5a0fce6a0
docs: update readme with notes about pulling and running the public Docker image. (#381) 2023-03-20 18:41:44 +00:00
cragwolfe
fbc7a69a53
feat: change english_words to set for performance gain (#380) 2023-03-19 22:51:32 +00:00
ryannikolaidis
1e39e1ac2a
ci: Adds workflow to publish docker builds (#377) 2023-03-19 21:53:05 +00:00
Sebastian Laverde Alfonso
c9c1b843d2
docs: Integrations LangChain code fix (#378) 2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso
b2f37c3eff
Docs: add Integrations section (#372)
* docs: update index, add integrations

* docs: fix typos

* docs: create integrations.rst section structure

* docs: descriptions and use for 8 integrations

* refactor: SEC example in Label Studio section

* Apply suggestions from code review

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* docs: change links order and refactor|paraphrase

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-17 19:11:38 +00:00
Matt Robinson
b47bfaf33a
fix: update test to pass on later label_studio_sdk versions (#369)
Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.
2023-03-17 17:57:09 +00:00
Mallori Harrell
ff63ad81d9
chore: Add note about python version (#375)
* add note about python version


---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
2023-03-17 11:22:49 -05:00
qued
f6d787d95b
ci: workflow to create JIRA issue on GH issue create (#370)
Created a GitHub workflow that creates a new issue in JIRA when a GitHub issue is created, mirroring the summary and description.

Pretty simplistic for now with a hardcoded project, and no support for any ongoing sync events.
2023-03-15 16:17:56 -05:00
natygyoon
e0eb66de52
feat: add staging brick to clean non-ascii characters from unicode (#366) 2023-03-14 21:31:51 -07:00
Amanda Cameron
edb847ce0b
adding Dockerfile (#359) 2023-03-14 13:40:01 -07:00
qued
a00c6feb9a
fix: changelog typo throwing off formatting (#365) 2023-03-14 16:30:53 +00:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.5.4
2023-03-14 15:52:21 +00:00
qued
aa494623a2
chore: bump versions (#352)
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
2023-03-14 09:40:30 -05:00
ryannikolaidis
a4726cb197
fix: open xml files in read only mode (#362) 2023-03-13 13:06:45 -07:00
cragwolfe
7b9475ef26
chore: rm competition announcement from the README (#361) 2023-03-13 09:34:26 -07:00
Matt Robinson
d17a94f395
chore: add libreoffice to ubuntu install script (#363) 2023-03-13 10:46:23 -04:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", which is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
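The fallback rule described in #357 reduces to a small decision function. The sketch below is an assumption about the shape of that logic, not the code in `partition_pdf`: the function name and the `detectron2_available` override are hypothetical, and availability is probed with the standard library's `importlib.util.find_spec`.

```python
import importlib.util
from typing import Optional


def choose_pdf_strategy(
    requested: str = "hi_res", detectron2_available: Optional[bool] = None
) -> str:
    """Return the strategy to use, falling back to "fast" when detectron2 is missing."""
    if requested not in ("hi_res", "fast"):
        raise ValueError(f"Unknown strategy: {requested}")
    if detectron2_available is None:
        # Probe for the package without importing it.
        detectron2_available = importlib.util.find_spec("detectron2") is not None
    if requested == "hi_res" and not detectron2_available:
        return "fast"
    return requested
```

Passing `detectron2_available` explicitly makes the rule easy to test on machines where detectron2 is not installed.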
Habeeb Shopeju
2ca843782c
Connector for Biomedical Literature (#345)
Introduces SimpleBiomedConfig, BiomedIngestDoc, and BiomedConnector, which ingest documents from the PDF Download.
2023-03-11 01:09:54 +00:00