929 Commits

Author SHA1 Message Date
ryannikolaidis
77b6fb2792
ci: update dockerfile to also add models and nltk (#418) 2023-03-29 20:48:06 -07:00
natygyoon
7f6e094c1f
feat: add local file system connector for unstructured-ingest (#399)
* added local connector to unstructured-ingest
2023-03-29 15:53:23 -07:00
natygyoon
e6187b262f
enhancement: update elements_to_json to potentially return a string (#403)
* update elements_to_json to potentially return string if filename is not specified

* add text to elements_from_json
2023-03-29 12:38:30 -07:00
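
The behavior described in #403 (return a string when no filename is given) could look roughly like the sketch below. It assumes the elements have already been converted to plain dictionaries. `elements_to_json_sketch` is a hypothetical stand-in for illustration, not the library's actual function or signature.

```python
import json
from typing import Any, Dict, List, Optional


def elements_to_json_sketch(
    elements: List[Dict[str, Any]],
    filename: Optional[str] = None,
    indent: int = 4,
) -> Optional[str]:
    """Hypothetical helper, not the unstructured API.

    Returns the JSON string when no filename is given; otherwise writes the
    JSON to that file and returns None.
    """
    serialized = json.dumps(elements, indent=indent)
    if filename is None:
        return serialized
    with open(filename, "w", encoding="utf-8") as f:
        f.write(serialized)
    return None
```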
natygyoon
1da40806da
feat: add --max-docs parameter to unstructured-ingest (#402)
* added --max-docs parameter to unstructured-ingest
2023-03-30 03:24:12 +09:00
ryannikolaidis
65fec954ba
ci: publish amd and arm images (#404) 2023-03-29 07:02:39 +00:00
Matt Robinson
09b52b4fc4
fix: text kwargs no longer fail with empty string (#413)
* fix: text kwargs no longer fail with empty string

* linting
2023-03-28 21:03:51 +00:00
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
ryannikolaidis
e1a8db51ad
ci: test before publishing docker image (#390) 2023-03-27 13:16:48 -07:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.7
2023-03-24 16:32:45 -07:00
cragwolfe
8ffd31029e
clean doc text (#398) 2023-03-24 08:43:27 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396.
2023-03-23 20:14:57 -07:00
natygyoon
a4394f6f16
feat: add --flatten-metadata to unstructured-ingest (#389)
* added --flatten-metadata to unstructured-ingest

* added unit tests for process_file()
2023-03-22 20:52:56 +00:00
natygyoon
66a0369fb6
feat: add --fields-include to unstructured-ingest (#376)
* add --fields-include parameter to unstructured-ingest

* add unit tests for process_file()
2023-03-22 14:12:35 +00:00
cragwolfe
3467a2786d
Update patterns.py (#391) 2023-03-21 23:58:18 -07:00
natygyoon
6b17cb228e
refactor: use exactly one throughout code base (#385)
added `exactly_one` to additional places like unstructured/partition too.
2023-03-21 16:50:13 -07:00
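
A minimal sketch of what an `exactly_one`-style check might look like, assuming "specified" means "not `None`". The name `exactly_one_sketch` and the error wording are illustrative assumptions, and the real helper may treat values such as empty strings differently.

```python
from typing import Any


def exactly_one_sketch(**kwargs: Any) -> None:
    """Raise ValueError unless exactly one keyword argument is not None.

    Hypothetical stand-in for illustration, not the library's implementation.
    """
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(
            f"Exactly one of {names} must be specified; got {len(provided)}."
        )
```

A partitioning function that accepts `filename`, `file`, or `text` could call `exactly_one_sketch(filename=filename, file=file, text=text)` once at the top instead of repeating the same conditional in each function.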
Amanda Cameron
a9da858fa3
chore: add tests for docker (#373) 2023-03-21 13:46:09 -07:00
Benjamin Torres
3c95b975fe
Fix: duplicated addition to elements list (#388) 0.5.6 2023-03-21 12:56:04 -07:00
natygyoon
c16862e7b3
feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368)
* added metadata in/exclude params

* updated process_file

* existing tests

* remove default behavior

* changelog and ci

* line length

* import

* import

* import sorted

* import

* type

* line length

* main

* ci

* json

* dict

* type ignore

* lint

* unit tests for process_file

* lint

* type changed to Optional(str)

* ci

* line length

* added mutex check

* nit
2023-03-22 03:30:53 +09:00
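
The include/exclude filtering with a mutual-exclusion check in #368 could be sketched as follows. This assumes each option arrives as a comma-separated string (the commit mentions an `Optional` string type, but the exact format is an assumption), and `filter_metadata_sketch` is a hypothetical name, not part of unstructured-ingest.

```python
from typing import Any, Dict, Optional, Set


def _split_fields(raw: str) -> Set[str]:
    """Split a comma-separated option value into a set of field names."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def filter_metadata_sketch(
    metadata: Dict[str, Any],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical filter: keep only `include` fields or drop `exclude` fields."""
    # Mutex check: the two options cannot be combined.
    if include is not None and exclude is not None:
        raise ValueError("Specify either include or exclude, not both.")
    if include is not None:
        keep = _split_fields(include)
        return {k: v for k, v in metadata.items() if k in keep}
    if exclude is not None:
        drop = _split_fields(exclude)
        return {k: v for k, v in metadata.items() if k not in drop}
    # Neither option given: return an unmodified copy.
    return dict(metadata)
```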
ryannikolaidis
d5a0fce6a0
docs: update readme with notes about pulling and running the public Docker image. (#381) 2023-03-20 18:41:44 +00:00
cragwolfe
fbc7a69a53
feat: change english_words to set for performance gain (#380) 2023-03-19 22:51:32 +00:00
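
For context on why #380 helps: membership tests on a Python `set` are constant time on average, while a `list` is scanned element by element. The sketch below uses a tiny made-up word list, and `count_english_words` is a hypothetical function, not the library's code.

```python
from typing import FrozenSet, Iterable

# Tiny illustrative vocabulary; the real word list is far larger.
ENGLISH_WORDS: FrozenSet[str] = frozenset(["the", "cat", "sat"])


def count_english_words(tokens: Iterable[str]) -> int:
    """Count tokens found in the vocabulary using O(1) average set lookups."""
    return sum(1 for token in tokens if token.lower() in ENGLISH_WORDS)
```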
ryannikolaidis
1e39e1ac2a
ci: Adds workflow to publish docker builds (#377) 2023-03-19 21:53:05 +00:00
Sebastian Laverde Alfonso
c9c1b843d2
docs: Integrations LangChain code fix (#378) 2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso
b2f37c3eff
Docs: add Integrations section (#372)
* docs: update index, add integrations

* docs: fix typos

* docs: create integrations.rst section structure

* docs: descriptions and use for 8 integrations

* refactor: SEC example in Label Studio section

* Apply suggestions from code review

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* docs: change links order and refactor|paraphrase

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-17 19:11:38 +00:00
Matt Robinson
b47bfaf33a
fix: update test to pass on later label_studio_sdk versions (#369)
Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.
2023-03-17 17:57:09 +00:00
Mallori Harrell
ff63ad81d9
chore: Add note about python version (#375)
* add note about python version


---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
2023-03-17 11:22:49 -05:00
qued
f6d787d95b
ci: workflow to create JIRA issue on GH issue create (#370)
Created a GitHub workflow to create a new issue in JIRA when a GitHub issue is created, mirroring the summary and description.

Pretty simplistic for now with a hardcoded project, and no support for any ongoing sync events.
2023-03-15 16:17:56 -05:00
natygyoon
e0eb66de52
feat: add staging brick to clean non-ascii characters from unicode (#366) 2023-03-14 21:31:51 -07:00
Amanda Cameron
edb847ce0b
adding Dockerfile (#359) 2023-03-14 13:40:01 -07:00
qued
a00c6feb9a
fix: changelog typo throwing off formatting (#365) 2023-03-14 16:30:53 +00:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.5.4
2023-03-14 15:52:21 +00:00
qued
aa494623a2
chore: bump versions (#352)
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
2023-03-14 09:40:30 -05:00
ryannikolaidis
a4726cb197
fix: open xml files in read only mode (#362) 2023-03-13 13:06:45 -07:00
cragwolfe
7b9475ef26
chore: rm competition announcement from the README (#361) 2023-03-13 09:34:26 -07:00
Matt Robinson
d17a94f395
chore: add libreoffice to ubuntu install script (#363) 2023-03-13 10:46:23 -04:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", which is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. There are other ways to accomplish this, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
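
The fallback rule described in #357 can be sketched as below. The strategy names "hi_res" and "fast" come from the commit message. The function name `resolve_pdf_strategy` and the `detectron2_available` parameter are assumptions for illustration, and the real `partition_pdf` may detect availability differently.

```python
import importlib.util
from typing import Optional

VALID_STRATEGIES = ("hi_res", "fast")


def resolve_pdf_strategy(
    requested: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Hypothetical helper: pick the strategy that will actually run."""
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy: {requested!r}")
    if detectron2_available is None:
        # Check whether the module can be found, without importing it.
        detectron2_available = importlib.util.find_spec("detectron2") is not None
    if requested == "hi_res" and not detectron2_available:
        # detectron2 is missing, so fall back to the pdfminer-based path.
        return "fast"
    return requested
```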
Habeeb Shopeju
2ca843782c
Connector for Biomedical Literature (#345)
The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.
2023-03-11 01:09:54 +00:00
Alvaro Bartolome
5291a96616
Add AzureBlobStorageConnector (#353)
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
  favor of `--remote-url`.

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-10 15:43:40 -08:00
Matt Robinson
30b5a4da65
fix: parsing for files with message/rfc822 MIME type; dir for unsupported files (#358)
Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.
2023-03-10 15:10:39 -08:00
Tom Aarsen
3d21b4098e
enhancement: improve detect_filetype warning to include filename (#355)
* Improve warning to include filename if provided

* Update changelog & version
2023-03-10 12:26:08 -05:00
Alvaro Bartolome
c51adb21e3
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)
As you may see, this is a pretty big PR that adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how remote filesystems are used within unstructured.

I've additionally renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
2023-03-10 06:15:19 +00:00
Matt Robinson
7c619f045b
feat: UNSTRUCTURED_LANGUAGE_CHECK env var to control (#351)
* environment variable to set language checks

* change log and version

* checks for if language checks are false

* update docs

* changelog type

* add assert to tests

* performance note in docstrings

* docstring tweaks
2023-03-09 17:33:48 +00:00
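
One way an environment-variable toggle like the one in #351 might be read is sketched below. The variable name is taken from the commit title as written. The accepted truthy values, the default when the variable is unset, and the function name `language_checks_enabled` are all assumptions, not the library's documented behavior.

```python
import os
from typing import Mapping, Optional

# Assumed set of values treated as "enabled".
_TRUTHY = {"1", "true", "yes", "on"}


def language_checks_enabled(
    environ: Optional[Mapping[str, str]] = None,
    default: bool = False,
) -> bool:
    """Hypothetical reader for the UNSTRUCTURED_LANGUAGE_CHECK toggle."""
    env = os.environ if environ is None else environ
    raw = env.get("UNSTRUCTURED_LANGUAGE_CHECK")
    if raw is None:
        # Variable not set: use the caller-supplied default.
        return default
    return raw.strip().lower() in _TRUTHY
```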
qued
e43e9178ae
feat: amazon linux 2 setup script (#350)
Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible.

Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.3
2023-03-09 14:52:24 +00:00
natygyoon
6be07a5260
feat: update auto.partition() function to recognize Unstructured json (#337) 2023-03-08 10:36:01 -08:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.

Prevent code duplication for functionality between GitHub and GitLab ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

These work for GitHub and GitLab.
2023-03-08 00:15:21 -08:00
Tom Aarsen
a9152313aa
refactor: Introduce 'exactly_one' to simplify partitioning functions (#343) 2023-03-07 12:27:08 -06:00
Tom Aarsen
70420b5c78
refactor: Fully move towards logging; remove if config.verbose conditionals (#321)
Move away from printing, use logging exclusively.
2023-03-07 01:21:27 -08:00
Umar Farooqi
78f4301872
fix: add formatter in an error string (#348) 2023-03-06 22:35:15 -08:00
Habeeb Shopeju
4117f57e14
Connector for Google Drive (#294)
Implements issue #244
2023-03-07 06:01:02 +00:00
cragwolfe
905e4ae8f6
chore: nicer error message (#341)
Show a more meaningful error message (and potentially useful for debugging)
when file type is not supported by the auto partition().

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-06 16:08:10 -08:00
Tom Aarsen
d4a1508ab8
chore: Remove file accidentally created/committed (#344)
* Remove file accidentally created/committed

* Fix CHANGELOG
2023-03-06 23:50:53 +00:00