unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-17 02:47:32 +00:00

Author	SHA1	Message	Date
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata 0.5.12	2023-04-12 18:31:01 +00:00
qued	2110a266c8	fix: fix github issue formatting (#471 ) Attempting to fix formatting of github issues transferred to Jira. The old format was attempting to use double backslashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo. Now attempting to use formatted text in the yaml with |. This worked in the test repo, but I guess that's no guarantee.	2023-04-12 16:59:12 +00:00
Austin Walker	4af4d33423	feat: add --partition-by-api and --partition-host to unstructured-ingest (#443 ) * Add --partition-by-api and --partition-host args to ingest * Fix error in make check * Bump changelog * Add a test ingest script Also add a workaround for the test causing 400s from our api. Seems we need to make sure unstructured-api can handle getting a file.content_type of None. * Remove the content type workaround	2023-04-11 22:05:07 -07:00
cragwolfe	ba4dadaa98	build: skip biomed ingest tests 90% of time due to ftp connectivity (#467 )	2023-04-11 11:27:38 -07:00
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367 ) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 . - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
cragwolfe	11f82a8b1b	fix(ingest): import connector-specific modules on demand (#460 ) * fix(ingest): import connector-specific modules on demand * unstructured-ingest --flatten-metadata supported for local connector. * unstructured-ingest fix runtime error when using --metadata-include.	2023-04-08 11:35:35 -07:00
cragwolfe	bd01af2bac	build: add mimetypes DB to docker image (#455 ) The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does. Bonus: docx mimetype added in lookup table.	2023-04-07 13:59:29 -07:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
ryannikolaidis	ee52a749c3	fix: docker smoke test on build (#457 )	2023-04-06 10:03:42 -07:00
ryannikolaidis	ef9fb79ed4	chore: build with registry as cache (#454 )	2023-04-06 00:34:07 -07:00
Matt Robinson	9b5cae49e1	fix: allow `replace_mime_encodings` to accept an `encoding` kwarg (#453 ) * changelog and version * added test	2023-04-05 22:53:38 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog 0.5.11	2023-04-05 20:18:54 +00:00
Siddartha Naidu	1dca0db6b0	fix: guard against style attribute being None (#449 ) Some document elements may have a null style element which triggers an exception when trying to access the name of the style. Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-05 19:22:43 +00:00
ryannikolaidis	d298f57b8f	fix: issue when filename is provided but file is not on disk (#446 ) 0.5.10	2023-04-05 17:54:11 +00:00
natygyoon	e6d6509d54	feat: add --download-only parameter to unstructured-ingest (#416 ) Add --download-only parameter so that files may be downloaded if they are not already present (as usual, in either --download-dir or the default download ~/.cache/... location if --download-dir is not specified) and skip processing them through unstructured.	2023-04-05 17:14:41 +00:00
qued	5398cdf4e1	fix: change url to html_url (#451 )	2023-04-05 11:13:23 -05:00
qued	92aac29cab	chore: add github link to autogenerated issue (#445 ) Added a link back to the original Github issue from which the Jira issue was created for tracking purposes.	2023-04-05 10:03:30 -05:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and recipient info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file. 0.5.9	2023-04-03 11:12:12 -07:00
Minty Mac	533241c274	Update README.md (#435 ) Spelling error	2023-04-02 09:52:14 -07:00
ryannikolaidis	b746cb0ab4	docs: update README with notes about multi-platform (#433 )	2023-04-01 17:31:32 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.8	2023-03-30 20:54:29 +00:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split on when creating candidate English words. Now uses regex to parse out non-alphabetic characters for each word. Note: This was originally an attempt to speed up contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	e5dd9d5676	!fix: return a list of elements in `stage_for_transformers` (#420 ) * update stage_for_transformers to return a list of elements * bump changelog and version * flag breaking change * fix last word bug in chunk_by_attention_window	2023-03-30 12:27:11 -04:00
Umar Farooqi	2f5c61c178	fix: exclude empty tags during depth check (#379 )	2023-03-30 10:03:50 -04:00
ryannikolaidis	19fb3031be	ci: update CI for deprecated set-output (#417 )	2023-03-29 20:49:21 -07:00
ryannikolaidis	77b6fb2792	ci: update dockerfile to also add models and nltk (#418 )	2023-03-29 20:48:06 -07:00
natygyoon	7f6e094c1f	feat: add local file system connector for unstructured-ingest (#399 ) * added local connector to unstructured-ingest	2023-03-29 15:53:23 -07:00
natygyoon	e6187b262f	enhancement: update elements_to_json to potentially return a string (#403 ) * update elements_to_json to potentially return string if filename is not specified * add text to elements_from_json	2023-03-29 12:38:30 -07:00
natygyoon	1da40806da	feat: add --max-docs parameter to unstructured-ingest (#402 ) * added --max-docs parameter to unstructured-ingest	2023-03-30 03:24:12 +09:00
ryannikolaidis	65fec954ba	ci: publish amd and arm images (#404 )	2023-03-29 07:02:39 +00:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
ryannikolaidis	e1a8db51ad	ci: test before publishing docker image (#390 )	2023-03-27 13:16:48 -07:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.7	2023-03-24 16:32:45 -07:00
cragwolfe	8ffd31029e	clean doc text (#398 )	2023-03-24 08:43:27 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
natygyoon	a4394f6f16	feat: add --flatten-metadata to unstructured-ingest (#389 ) * added --flatten-metadata to unstructured-ingest * added unit tests for process_file()	2023-03-22 20:52:56 +00:00
natygyoon	66a0369fb6	feat: add --fields-include to unstructured-ingest (#376 ) * add --fields-include parameter to unstructured-ingest * add unit tests for process_file()	2023-03-22 14:12:35 +00:00
cragwolfe	3467a2786d	Update patterns.py (#391 )	2023-03-21 23:58:18 -07:00
natygyoon	6b17cb228e	refactor: use `exactly one` throughout code base (#385 ) added `exactly_one` to additional places like unstructured/partition too.	2023-03-21 16:50:13 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Benjamin Torres	3c95b975fe	Fix: duplicated addition to elements list (#388 ) 0.5.6	2023-03-21 12:56:04 -07:00
natygyoon	c16862e7b3	feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368 ) * added metadata in/exclude params * updated process_file * existing tests * remove default behavior * changelog and ci * line length * import * import * import sorted * import * type * line length * main * ci * json * dict * type ignore * lint * unit tests for process_file * lint * type changed to Optional(str) * ci * line length * added mutex check * nit	2023-03-22 03:30:53 +09:00
ryannikolaidis	d5a0fce6a0	docs: update readme with notes about pulling and running the public Docker image. (#381 )	2023-03-20 18:41:44 +00:00
cragwolfe	fbc7a69a53	feat: change english_words to set for performance gain (#380 )	2023-03-19 22:51:32 +00:00
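
Two of the commits above (#380 and #382) describe keeping the English word list in a set and using a regex to split text into candidate words. The following is a minimal sketch of that general idea, not the project's actual code: the function name `contains_english_word_sketch`, the tiny `ENGLISH_WORDS` vocabulary, and the `NON_ALPHA` pattern are all assumptions made for illustration.

```python
import re

# Hypothetical miniature vocabulary; a real word list would be far larger.
# A set gives average O(1) membership checks, whereas a list is O(n).
ENGLISH_WORDS = {"the", "parser", "reads", "documents"}

# Assumed splitting rule: any run of non-letter characters separates words.
NON_ALPHA = re.compile(r"[^a-zA-Z]+")


def contains_english_word_sketch(text: str) -> bool:
    """Return True if any alphabetic token in `text` is in ENGLISH_WORDS."""
    for token in NON_ALPHA.split(text.lower()):
        # Splitting can yield empty strings at the edges; skip them.
        if token and token in ENGLISH_WORDS:
            return True
    return False
```

With this sketch, `contains_english_word_sketch("Parser/v2")` is true because "parser" survives the split, while a string of digits and punctuation yields no candidate words at all.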

459 Commits