100 Commits

Author SHA1 Message Date
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
2023-06-01 20:48:54 +00:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
2023-05-19 15:57:42 -04:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesn't need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00
Matt Robinson
b8037118c4
feat: add partition_xlsx for MSFT Excel files (#594)
* first pass on partition_xlsx

* add support for files

* add test for xlsx from filename

* added filetype metadata

* add xlsx to auto

* remove fake excel from unsupported

* version and changelog

* update docs

* update readme

* fix removed file reference

* fix some more tests

* pass in metadata filename

* add include_metadata flag
2023-05-16 19:40:40 +00:00
Nicolas
c62bee48ad
Update installing.rst (#590)
2023-05-16 02:08:01 +00:00
Matt Robinson
99aa346186
fix: make pytesseract a function level import (#581)
* make pytesseract a function level import

* version and changelog

* small docs formatting fix
2023-05-12 17:18:51 -05:00
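Note on #581: a function-level import moves the `import` statement into the body of the function that needs it, so the rest of the module can be imported even when the optional dependency is missing. The sketch below illustrates the pattern only; `image_to_text` and `word_count` are hypothetical names and are not taken from the commit.

```python
def image_to_text(image) -> str:
    """Run OCR on an image (hypothetical helper, for illustration only)."""
    # Function-level import: pytesseract is loaded only when OCR is requested,
    # so importing this module does not require pytesseract to be installed.
    import pytesseract

    return pytesseract.image_to_string(image)


def word_count(text: str) -> int:
    """A helper that never touches pytesseract and works without it."""
    return len(text.split())
```

With this layout, a missing pytesseract install surfaces as an `ImportError` only when `image_to_text` is actually called.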
Matt Robinson
727d366a94
enhancement: auto strategy for PDFs and images (#578)
* added functions for determining auto strategy

* change default strategy to auto

* tests for auto strategy

* update docs

* changelog and version

* bump version

* remove ingest file in wrong location

* update jpg output

* typo fix
2023-05-12 17:45:08 +00:00
Matt Robinson
3d3f3df3ec
enhancement: add "ocr_only" strategy for PDFs (#553)
* add tests for validating strategy

* refactor into determine_pdf_strategy function

* refactor pdf strategies into strategies

* remove commented out code

* remove unreachable code

* add in handling for image types

* a little more refactoring

* import ocr partitioning for images

* catch warnings, partition type for valid strategies

* fallback to ocr_only from fast

* fallback logic for hi_res

* test for fallback to ocr only

* fallback logic for ocr_only

* more tests for fallback logic

* update doc strings

* version and changelog

* linting, linting, linting

* update docs to include notes about strategy

* fix typos

* change back patched filename
2023-05-08 17:21:24 +00:00
Matt Robinson
392cccdbf7
enhancement: add ocr_only strategy for partition_image (#540)
* spike for ocr-only strategy for images

* fix for file processing

* extra space

* add korean to ci

* added test for ocr_only strategy

* added docs for ocr_only

* changelog and version

* added test for bad strategy

* skip korean test if in docker

* bump version

* version bump

* document valid strategies

* bump version for release

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-04 20:23:51 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
981805e435
feat: stage_for_baseplate function (#546)
* added a staging brick for baseplate

* added a test for baseplate

* update documentation

* version and changelog
2023-05-04 11:05:38 -04:00
Matt Robinson
7e43a25f07
feat: add partition_multiple_via_api function (#539)
* added function for multiple files via api

* make multiple work with files

* updated docs strings

* changelog and version

* docs and contextlib for open files

* tests for partition multiple

* add tests for error conditions

* add output example
2023-05-03 15:06:06 -04:00
Matt Robinson
e805ed465d
docs: add slack and github links back into docs page (#535)
* stars and github link to top of page

* wording updates

* remove unnecessary font weight change

* remove next arrows

* buttons to bottom on sidebar
2023-05-01 18:17:52 -04:00
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
2023-04-21 17:01:29 +00:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00
Matt Robinson
43854e367a
docs: fix incomplete hi_res docs (#505)
2023-04-20 09:43:33 -04:00
Shukri
396295fc04
fix: formatting error in sphinx docs (#498)
* fix: formatting error in sphinx docs
2023-04-17 23:13:09 -07:00
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
Matt Robinson
137b4b9a2e
feat: cleaning brick for normalizing bytes string output (#481)
* add cleaning brick for emojis

* changelog and version

* docs for bytes_string_to_string

* different test for bytes_string_to_string
2023-04-13 19:39:08 +00:00
Matt Robinson
e2e473dddd
feat: add url kwarg to partition (#470)
* added url option to auto partition

* add test for partition from url

* version and changelog

* update docs

* add url to element metadata
2023-04-12 18:31:01 +00:00
Matt Robinson
7ec85272b7
feat: add partition_rtf for rich text files (#466)
* refactor epub; add rtf

* added test for rtf files

* filetype detection for rtf files

* add rtf to auto

* update docs for group_broken_paragraphs

* add rtf to docs

* update file list in readme

* update stage_for_transformers docs

* changelog and version bump

* skip rtf if in docker

* skip test if rtf not supported

* docs tweaks
2023-04-10 21:25:03 +00:00
Matt Robinson
c99c099158
feat: enable grouping broken paragraphs in partition_text (#456)
* cleaning brick to group broken paragraphs

* docs for group_broken_paragraphs

* add docs for partition_text with grouper

* partition_text and auto with paragraph_grouper

* version and changelog

* typo in the docs

* linting, linting, linting

* switch to using regular expressions
2023-04-06 18:35:22 +00:00
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
natygyoon
e6187b262f
enhancement: update elements_to_json to potentially return a string (#403)
* update elements_to_json to potentially return string if filename is not specified

* add text to elements_from_json
2023-03-29 12:38:30 -07:00
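Note on #403: the behavior described above (return a string when no filename is given) can be sketched as follows. This is a minimal illustration operating on plain dictionaries; `dicts_to_json` is a hypothetical name, and the actual signature of `elements_to_json` is not shown in this log.

```python
import json
from typing import Dict, List, Optional


def dicts_to_json(
    element_dicts: List[Dict[str, str]],
    filename: Optional[str] = None,
    indent: int = 4,
) -> Optional[str]:
    """Write JSON to `filename` if given; otherwise return the JSON string."""
    payload = json.dumps(element_dicts, indent=indent)
    if filename is None:
        # No destination was specified, so hand the serialized text back.
        return payload
    with open(filename, "w", encoding="utf-8") as f:
        f.write(payload)
    return None
```

Calling `dicts_to_json([{"type": "Title", "text": "Hello"}])` returns the serialized text, while passing a `filename` writes the same text to disk and returns `None`.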
Matt Robinson
75cf233702
feat: add partition_msg for MSFT Outlook files (#412)
* added msg-parser dependency

* pass through kwargs in convert_file_to_text

* added partition_msg for processing msft outlook files

* version bump and changelog

* added tests for partition_msg

* added test for msg with plain text

* add partition_msg docs; fix underlines in integration docs

* add .msg to file list

* finish tests for auto msg

* linting, linting, linting
2023-03-28 20:15:22 +00:00
Amanda Cameron
71e035c34c
Adding content_type and file_filename to autopartition (#394)
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-24 16:32:45 -07:00
cragwolfe
8ffd31029e
clean doc text (#398)
2023-03-24 08:43:27 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396.
2023-03-23 20:14:57 -07:00
Sebastian Laverde Alfonso
c9c1b843d2
docs: Integrations LangChain code fix (#378)
2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso
b2f37c3eff
Docs: add Integrations section (#372)
* docs: update index, add integrations

* docs: fix typos

* docs: create integrations.rst section structure

* docs: descriptions and use for 8 integrations

* refactor: SEC example in Label Studio section

* Apply suggestions from code review

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* docs: change links order and refactor|paraphrase

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-17 19:11:38 +00:00
natygyoon
e0eb66de52
feat: add staging brick to clean non-ascii characters from unicode (#366)
2023-03-14 21:31:51 -07:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelog and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-14 15:52:21 +00:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
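Note on #357: the fallback rule described above can be sketched as a small resolver. This is an illustration under stated assumptions, not the library's code: `resolve_pdf_strategy`, `detectron2_installed`, and `VALID_STRATEGIES` are hypothetical names, and checking availability with `importlib.util.find_spec` is one possible approach that the commit does not confirm.

```python
import importlib.util
from typing import Optional

# Only the two strategies named in the commit message are modeled here.
VALID_STRATEGIES = ("hi_res", "fast")


def detectron2_installed() -> bool:
    """Return True if a top-level `detectron2` package can be located."""
    return importlib.util.find_spec("detectron2") is not None


def resolve_pdf_strategy(
    strategy: str = "hi_res",
    model_available: Optional[bool] = None,
) -> str:
    """Pick the strategy to run, falling back from "hi_res" to "fast"."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    if model_available is None:
        model_available = detectron2_installed()
    if strategy == "hi_res" and not model_available:
        # detectron2 is missing, so use the pdfminer-based path instead.
        return "fast"
    return strategy
```

Passing `model_available` explicitly keeps the rule testable on machines with or without detectron2 installed.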
Alvaro Bartolome
c51adb21e3
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)
As you may see, this is a pretty big PR that adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured.

I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.
2023-03-10 06:15:19 +00:00
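Note on #318: the adapter idea can be sketched with a connector that depends only on the small filesystem interface it needs. Everything named here (`FileSystemLike`, `ConnectorSketch`, `InMemoryFS`) is hypothetical and is not the PR's `FsspecConnector`. The sketch assumes a filesystem object exposing `ls(path, detail=False)` and `open(path, mode)`, which fsspec filesystems provide, and it ignores details such as nested directories.

```python
import io
from typing import Dict, Iterator, List, Protocol, Tuple


class FileSystemLike(Protocol):
    """The subset of a filesystem interface this sketch relies on."""

    def ls(self, path: str, detail: bool = False) -> List[str]: ...

    def open(self, path: str, mode: str = "rb"): ...


class ConnectorSketch:
    """Reads every file under `root` through whatever filesystem is plugged in."""

    def __init__(self, fs: FileSystemLike, root: str) -> None:
        self.fs = fs
        self.root = root

    def iter_documents(self) -> Iterator[Tuple[str, bytes]]:
        for path in self.fs.ls(self.root, detail=False):
            with self.fs.open(path, "rb") as f:
                yield path, f.read()


class InMemoryFS:
    """Stand-in filesystem so the sketch runs without any remote storage."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.files = files

    def ls(self, path: str, detail: bool = False) -> List[str]:
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self.files if p.startswith(prefix))

    def open(self, path: str, mode: str = "rb"):
        return io.BytesIO(self.files[path])
```

Because the connector only talks to the interface, swapping the in-memory stand-in for a real filesystem object would not change the connector code.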
Matt Robinson
7c619f045b
feat: UNSTRUCTURED_LANGUAGE_CHECK env var to control (#351)
* environment variable to set language checks

* change log and version

* checks for if language checks are false

* update docs

* changelog type

* add assert to tests

* performance note in docstrings

* docstring tweaks
2023-03-09 17:33:48 +00:00
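Note on #351: toggling behavior from an environment variable can be sketched as below. The variable name is copied from the commit title; the helper name `language_checks_enabled`, the accepted value ("true", case-insensitive), and the default of disabled are all assumptions made for illustration, and the library's actual parsing may differ.

```python
import os
from typing import Mapping, Optional

# Name as written in the commit title; parsing rules below are assumptions.
ENV_VAR = "UNSTRUCTURED_LANGUAGE_CHECK"


def language_checks_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the variable is set to "true" (any casing)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "false").strip().lower() == "true"
```

Accepting an optional mapping lets callers and tests supply values without mutating the real process environment.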
Matt Robinson
1cd1bd8eba
docs: more detailed bricks writeup; reorganize docs (#304)
* add print statement in readme

* elements before bricks

* new preamble to bricks section

* add preamble to bricks section

* add preamble to cleaning section

* descriptions of each documentation page

* non-brick helper functions to the bottom

* fix codeblock

* includes some optional kwargs

* code blocks

* typo fix
2023-02-27 23:11:49 +00:00
Matt Robinson
5db94fdee6
docs: add getting started section and remove outdated docs (#277)
* add getting started section to the docs

* remove old examples

* update example notebook

* change to convert_to_dict

* various and sundry edits
2023-02-27 15:10:53 +00:00
Tom Aarsen
9062d25d0d
Resolve numerous typos (#280)
* Resolve numerous typos

* Resolve typo in mime type
2023-02-24 17:48:23 -08:00
Matt Robinson
0d229f0a5e
fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map

* add element_id and coordinates

* added test for serialization

* added serialization for check boxes

* add dict_to_elements and convert_to_dict aliases

* helpers for serializing and deserializing elements

* bump version; changelog

* add Text to tests

* aliases for isd functions

* remove test elements json

* changelog updates

* make indent a kwarg

* update expected structured output

* docs update

* use new function in ingest code

* pop coordinates due to floating point differences

* pop coordinates
2023-02-23 21:58:59 +00:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
Matt Robinson
314924137f
docs: add quotes to local-inference install instructions (#245)
2023-02-21 09:58:26 -06:00
Matt Robinson
7472e1bb21
docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
2023-02-17 22:13:28 +00:00
Matt Robinson
601f250edc
feat: add partition_ppt for older power point docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Matt Robinson
558ee63e90
feat: ability to skip English language specific checks with env var (#224)
* add language env var

* update docs

* version and bump change log
2023-02-15 09:15:47 -05:00
Matt Robinson
a68dc35940
chore: default to local inference for partition_pdf and partition_image (#222)
* chore: default the url to None for pdf and images

* bump changelog and version
2023-02-14 16:16:33 -05:00