unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-27 02:06:49 +00:00

Author	SHA1	Message	Date
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367 ) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 . - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
cragwolfe	11f82a8b1b	fix(ingest): import connector-specific modules on demand (#460 ) * fix(ingest): import connector-specific modules on demand * unstructured-ingest --flatten-metadata supported for local connector. * unstructured-ingest fix runtime error when using --metadata-include.	2023-04-08 11:35:35 -07:00
cragwolfe	bd01af2bac	build: add mimetypes DB to docker image (#455 ) The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does. Bonus: docx mimetype added in lookup table.	2023-04-07 13:59:29 -07:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
ryannikolaidis	ee52a749c3	fix: docker smoke test on build (#457 )	2023-04-06 10:03:42 -07:00
ryannikolaidis	ef9fb79ed4	chore: build with registry as cache (#454 )	2023-04-06 00:34:07 -07:00
Matt Robinson	9b5cae49e1	fix: allow `replace_mime_encodings` to accept an `encoding` kwarg (#453 ) * changelog and version * added test	2023-04-05 22:53:38 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog 0.5.11	2023-04-05 20:18:54 +00:00
Siddartha Naidu	1dca0db6b0	fix: guard against style attribute being None (#449 ) Some document elements may have a null style element which triggers an exception when trying to access the name of the style. Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-05 19:22:43 +00:00
ryannikolaidis	d298f57b8f	fix: issue when filename is provided but file is not on disk (#446 ) 0.5.10	2023-04-05 17:54:11 +00:00
natygyoon	e6d6509d54	feat: add --download-only parameter to unstructured-ingest (#416 ) Add --download-only parameter so that files may be downloaded if they are not already present (as usual, in either --download-dir or the default download ~/.cache/... location if --download-dir is not specified) and skip processing them through unstructured.	2023-04-05 17:14:41 +00:00
qued	5398cdf4e1	fix: change url to html_url (#451 )	2023-04-05 11:13:23 -05:00
qued	92aac29cab	chore: add github link to autogenerated issue (#445 ) Added a link back to the original Github issue from which the Jira issue was created for tracking purposes.	2023-04-05 10:03:30 -05:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and receive info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file. 0.5.9	2023-04-03 11:12:12 -07:00
Minty Mac	533241c274	Update README.md (#435 ) Spelling error	2023-04-02 09:52:14 -07:00
ryannikolaidis	b746cb0ab4	docs: update README with notes about multi-platform (#433 )	2023-04-01 17:31:32 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.8	2023-03-30 20:54:29 +00:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	e5dd9d5676	!fix: return a list of elements in `stage_for_transformers` (#420 ) * update stage_for_transformers to return a list of elements * bump changelog and version * flag breaking change * fix last word bug in chunk_by_attention_window	2023-03-30 12:27:11 -04:00
Umar Farooqi	2f5c61c178	fix: exclude empty tags during depth check (#379 )	2023-03-30 10:03:50 -04:00
ryannikolaidis	19fb3031be	ci: update CI for deprecated set-output (#417 )	2023-03-29 20:49:21 -07:00
ryannikolaidis	77b6fb2792	ci: update dockerfile to also add models and nltk (#418 )	2023-03-29 20:48:06 -07:00
natygyoon	7f6e094c1f	feat: add local file system connector for unstructured-ingest (#399 ) * added local connector to unstructured-ingest	2023-03-29 15:53:23 -07:00
natygyoon	e6187b262f	enhancement: update elements_to_json to potentially return a string (#403 ) * update elements_to_json to potentially return string if filename is not specified * add text to elements_from_json	2023-03-29 12:38:30 -07:00
natygyoon	1da40806da	feat: add --max-docs parameter to unstructured-ingest (#402 ) * added --max-docs parameter to unstructured-ingest	2023-03-30 03:24:12 +09:00
ryannikolaidis	65fec954ba	ci: publish amd and arm images (#404 )	2023-03-29 07:02:39 +00:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
ryannikolaidis	e1a8db51ad	ci: test before publishing docker image (#390 )	2023-03-27 13:16:48 -07:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.7	2023-03-24 16:32:45 -07:00
cragwolfe	8ffd31029e	clean doc text (#398 )	2023-03-24 08:43:27 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
natygyoon	a4394f6f16	feat: add --flatten-metadata to unstructured-ingest (#389 ) * added --flatten-metadata to unstructured-ingest * added unit tests for process_file()	2023-03-22 20:52:56 +00:00
natygyoon	66a0369fb6	feat: add --fields-include to unstructured-ingest (#376 ) * add --fields-include parameter to unstructured-ingest * add unit tests for process_file()	2023-03-22 14:12:35 +00:00
cragwolfe	3467a2786d	Update patterns.py (#391 )	2023-03-21 23:58:18 -07:00
natygyoon	6b17cb228e	refactor: use `exactly one` throughout code base (#385 ) added `exactly_one` to additional places like unstructured/partition too.	2023-03-21 16:50:13 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Benjamin Torres	3c95b975fe	Fix: duplicated addition to elements list (#388 ) 0.5.6	2023-03-21 12:56:04 -07:00
natygyoon	c16862e7b3	feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368 ) * added metadata in/exclude params * updated process_file * existing tests * remove default behavior * changelog and ci * line length * import * import * import sorted * import * type * line length * main * ci * json * dict * type ignore * lint * unit tests for process_file * lint * type changed to Optional(str) * ci * line length * added mutex check * nit	2023-03-22 03:30:53 +09:00
ryannikolaidis	d5a0fce6a0	docs: update readme with notes about pulling and running the public Docker image. (#381 )	2023-03-20 18:41:44 +00:00
cragwolfe	fbc7a69a53	feat: change english_words to set for performance gain (#380 )	2023-03-19 22:51:32 +00:00
ryannikolaidis	1e39e1ac2a	ci: Adds workflow to publish docker builds (#377 )	2023-03-19 21:53:05 +00:00
Sebastian Laverde Alfonso	c9c1b843d2	docs: Integrations LangChain code fix (#378 )	2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso	b2f37c3eff	Docs: add Integrations section (#372 ) * docs: update index, add integrations * docs: fix typos * docs: create integrations.rst section structure * docs: descriptions and use for 8 integrations * refactor: SEC example in Label Studio section * Apply suggestions from code review Co-authored-by: qued <64741807+qued@users.noreply.github.com> * docs: change links order and refactor|paraphrase --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-17 19:11:38 +00:00
Matt Robinson	b47bfaf33a	fix: update test to pass on later `label_studio_sdk` versions (#369 ) Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.	2023-03-17 17:57:09 +00:00

805 Commits