unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-01 11:21:13 +00:00

Author	SHA1	Message	Date
Matt Robinson	c35fff2972	feat: Add `stage_for_weaviate` and schema creation function (#672 ) * add weaviate docker compose * added staging brick and tests for weaviate * initial notebook and requirements file * add commentary to weaviate notebook * weaviate readme * update docs * version and change log * install weaviate client * install weaviate; skip for docker * linting, linting, linting * install weaviate client with deps * comments on weaviate client * fix module not found error for docker container * skipped wrong test in docker * fix typos * add in local-inference	2023-06-01 20:48:54 +00:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 13:50:15 -05:00
Matt Robinson	21c821d651	feat: add `partition_csv` function (#619 ) * add csv into filetype detection * first pass on csv * add tests for csv * add csv to auto * version bump * update readme and docs * fix doc strings	2023-05-19 15:57:42 -04:00
Matt Robinson	23ff32cc42	feat: add `partition_xml` for XML files (#596 ) * first pass on partition_xml * add option to keep xml tags * added tests for xml * fix filename * update filenames * remove outdated readme * add xml to auto * version and changelog * update readme and docs * pass through include_metadata * update include_metadata description * add README back in * linting, linting, linting * more linting * spooled to bytes doesnt need to be a tuple * Add tests for newly supported filetypes * Correct metadata filetype * doc typo Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * typo fix Co-authored-by: qued <64741807+qued@users.noreply.github.com> * keep_xml_tags -> xml_keep_tags --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-18 15:40:12 +00:00
Matt Robinson	b8037118c4	feat: add `partition_xlsx` for MSFT Excel files (#594 ) * first pass on partition_xlsx * add support for files * add test for xlsx from filename * added filetype metadata * add xlsx to auto * remove fake excel from unsupported * version and changelog * update docs * update readme * fix removed file reference * fix some more tests * pass in metadata filename * add include_metadata flag	2023-05-16 19:40:40 +00:00
Nicolas	c62bee48ad	Update installing.rst (#590 )	2023-05-16 02:08:01 +00:00
Matt Robinson	99aa346186	fix: make `pytesseract` a function level import (#581 ) * make pytesseract a function level import * version and changelog * small docs formatting fix	2023-05-12 17:18:51 -05:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto strategy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	3d3f3df3ec	enhancement: add "ocr_only" strategy for PDFs (#553 ) * add tests for validating strategy * refactor into determine_pdf_strategy function * refactor pdf strategies into strategies * remove commented out code * remove unreachable code * add in handling for image types * a little more refactoring * import ocr partitioning for images * catch warnings, partition type for valid strategies * fallback to ocr_only from fast * fallback logic for hi_res * test for fallback to ocr only * fallback logic for ocr_only * more tests for fallback logic * update doc strings * version and changelog * linting, linting, linting * update docs to include notes about strategy * fix typos * change back patched filename	2023-05-08 17:21:24 +00:00
Matt Robinson	392cccdbf7	enhancement: add ocr_only strategy for `partition_image` (#540 ) * spike for ocr-only strategy for images * fix for file processing * extra space * add korean to ci * added test for ocr_only strategy * added docs for ocr_only * changelog and version * added test for bad strategy * skip korean test if in docker * bump version * version bump * document valid strategies * bump version for release --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-05-04 20:23:51 +00:00
Matt Robinson	fae5f8fdde	feat: add `partition_odt` for open office docs (#548 ) * added filetype detection for odt * add function for partition odt documents * add odt files to auto * changelog and version * docs and readme * update installation docs * skip tests if not supported or in docker * import pytest * fix docs typos	2023-05-04 19:28:08 +00:00
Matt Robinson	981805e435	feat: `stage_for_baseplate` function (#546 ) * added a staging brick for baseplate * added a test for baseplate * update documentation * version and changelog	2023-05-04 11:05:38 -04:00
Matt Robinson	7e43a25f07	feat: add `partition_multiple_via_api` function (#539 ) * added function for multiple files via api * make multiple work with files * updated docs strings * changelog and version * docs and contextlib for open files * tests for partition multiple * add tests for error conditions * add output example	2023-05-03 15:06:06 -04:00
Matt Robinson	e805ed465d	docs: add slack and github links back into docs page (#535 ) * stars and github link to top of page * wording updates * remove unnecessary font weight change * remove next arrows * buttons to bottom on sidebar	2023-05-01 18:17:52 -04:00
Matt Robinson	4156cb12e0	feat: `partition_via_api` helper function (#518 ) * added function for partitioning via api * added tests for api function * changelog and version * add docs for partition_via_api	2023-04-26 09:05:35 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
Matt Robinson	bd1e540af9	feat: parameter to turn off SSL verification (#506 ) * add kwarg for ssl verification * update docs * update version and changelog * add verify kwarg to test	2023-04-20 11:13:56 -04:00
Matt Robinson	43854e367a	docs: fix incomplete hi_res docs (#505 )	2023-04-20 09:43:33 -04:00
Shukri	396295fc04	fix: formatting error in sphinx docs (#498 ) * fix: formatting error in sphinx docs	2023-04-17 23:13:09 -07:00
Shukri	8d4308af43	doc: typo (#495 ) XML/HTML Depenedencies -> XML/HTML Dependencies	2023-04-17 20:26:50 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata	2023-04-12 18:31:01 +00:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
natygyoon	e6187b262f	enhancement: update elements_to_json to potentially return a string (#403 ) * update elements_to_json to potentially return string if filename is not specified * add text to elements_from_json	2023-03-29 12:38:30 -07:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-24 16:32:45 -07:00
cragwolfe	8ffd31029e	clean doc text (#398 )	2023-03-24 08:43:27 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
Sebastian Laverde Alfonso	c9c1b843d2	docs: Integrations LangChain code fix (#378 )	2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso	b2f37c3eff	Docs: add Integrations section (#372 ) * docs: update index, add integrations * docs: fix typos * docs: create integrations.rst section structure * docs: descriptions and use for 8 integrations * refactor: SEC example in Label Studio section * Apply suggestions from code review Co-authored-by: qued <64741807+qued@users.noreply.github.com> * docs: change links order and refactor\|paraphrase --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-17 19:11:38 +00:00
natygyoon	e0eb66de52	feat: add staging brick to clean non-ascii characters from unicode (#366 )	2023-03-14 21:31:51 -07:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
Matt Robinson	1cd1bd8eba	docs: more detailed bricks writeup; reorganize docs (#304 ) * add print statement in readme * elements before bricks * new preamble to bricks section * add preamble to bricks section * add preamble to cleaning section * descriptions of each documentation page * non-brick helper functions to the bottom * fix codeblock * includes some optional kwargs * code blocks * typo fix	2023-02-27 23:11:49 +00:00
Matt Robinson	5db94fdee6	docs: add getting started section and remove outdated docs (#277 ) * add getting started section to the docs * remove old examples * update example notebook * change to convert_to_dict * various and sundry edits	2023-02-27 15:10:53 +00:00
Tom Aarsen	9062d25d0d	Resolve numerous typos (#280 ) * Resolve numerous typos * Resolve typo in mime type	2023-02-24 17:48:23 -08:00
Matt Robinson	0d229f0a5e	fix: preserve all elements when serialized; feat: helper functions for serialization (#273 ) * added type to text element map * add element_id and coordinates * added test for serialization * added serialization for check boxes * add dict_to_elements and convert_to_dict aliases * helpers for serializing and deserializing elements * bump version; changelog * add Text to tests * aliases for isd functions * remove test elements json * changelog updates * make indent a kwarg * update expected structured output * docs update * use new function in ingest code * pop coordinates due to floating point differences * pop coordinates	2023-02-23 21:58:59 +00:00
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246 ) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version	2023-02-23 17:19:13 +00:00
Matt Robinson	314924137f	docs: add quotes to local-inference install instructions (#245 )	2023-02-21 09:58:26 -06:00
Matt Robinson	7472e1bb21	docs: add a quick start page to the readme and docs (#240 ) * added quick start section to the readme * added quick start to docs * parenthetical on extra deps * typo * fix typo * fixed mixed tabs/spaces	2023-02-17 22:13:28 +00:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	558ee63e90	feat: ability to skip English language specific checks with env var (#224 ) * add language env var * update docs * version and bump change log	2023-02-15 09:15:47 -05:00
Matt Robinson	a68dc35940	chore: default to local inference for `partition_pdf` and `partition_image` (#222 ) * chore: default the url to None for pdf and images * bump changelog and version	2023-02-14 16:16:33 -05:00
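
The fallback described in commit 7c08450597 (#357) can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `resolve_pdf_strategy` and `is_detectron2_available` are hypothetical names, and only the two strategies named in that commit message ("hi_res" and "fast") are modeled.

```python
from importlib.util import find_spec
from typing import Optional

# Only the strategies mentioned in commit 7c08450597 are modeled here.
VALID_STRATEGIES = ("hi_res", "fast")


def is_detectron2_available() -> bool:
    """Return True if a top-level `detectron2` package can be located (hypothetical helper)."""
    return find_spec("detectron2") is not None


def resolve_pdf_strategy(
    requested: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Pick the strategy to run (hypothetical helper, not the library's API).

    "hi_res" is the default. If "hi_res" is requested but detectron2 is
    missing, fall back to "fast", as the commit message describes.
    """
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")
    if detectron2_available is None:
        # Probe the environment only when the caller did not say.
        detectron2_available = is_detectron2_available()
    if requested == "hi_res" and not detectron2_available:
        return "fast"
    return requested
```

Passing `detectron2_available` explicitly keeps the decision testable without installing or removing detectron2; leaving it as `None` probes the current environment instead.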

100 Commits