unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-23 05:04:18 +00:00

Author	SHA1	Message	Date
Matt Robinson	43854e367a	docs: fix incomplete hi_res docs (#505 )	2023-04-20 09:43:33 -04:00
Amanda Cameron	db6e5b41b8	chore: updating readme with api announcement (#499 ) * updating readme	2023-04-19 11:59:26 -07:00
Matt Robinson	87c6d5e679	build: version bump for 0.5.13 release (#501 ) 0.5.13	2023-04-19 14:35:45 -04:00
Matt Robinson	4e1cc5ab3d	fix: add slack to fixture update script (#500 )	2023-04-19 18:16:44 +00:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
Shukri	396295fc04	fix: formatting error in sphinx docs (#498 ) * fix: formatting error in sphinx docs	2023-04-17 23:13:09 -07:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493 ) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492 .	2023-04-17 23:11:21 -07:00
Shukri	8d4308af43	doc: typo (#495 ) XML/HTML Depenedencies -> XML/HTML Dependencies	2023-04-17 20:26:50 -07:00
qued	3a61046307	fix: Fix typo in function call (#491 ) Closes GitHub Issue #487. Fixed typo in call to exactly_one in partition_json.	2023-04-17 23:37:50 +00:00
cragwolfe	5657378602	test: avoid misleading output in ingest tests (#488 ) Previously, if there was an error (non-zero exit code) in an ingest test script, the script would still complete and echo a warning about mismatched outputs and how to regenerate the fixtures. However, this statement is irrelevant and misleading: if the ingest failed with a non-zero exit code in the first place, that is the failure that should be debugged -- don't confuse the user with a comment about outputs.	2023-04-17 21:57:44 +00:00
pravin-unstructured	4020da56ad	Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484 )	2023-04-17 11:53:25 -04:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	a11563fe63	fix: update ingest test fixtures, disable biomed test (#486 ) * Update test fixtures that should have been updated in prior commit * Disable biomed ingest tests for now, they fail more often than not * Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.	2023-04-15 00:07:09 +00:00
JaeyongLee	8456676fad	fix: fix text_type.py exceeds_cap_ratio() returns (#478 ) There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-14 11:53:10 -07:00
cragwolfe	46ac2a2226	build(CI): add access token for github-ingest test (#482 ) Avoids the occasional CI test failures in test-ingest-github.sh that were due to rate-limited non-auth'ed requests against a GitHub repo.	2023-04-14 11:14:21 -07:00
Matt Robinson	137b4b9a2e	feat: cleaning brick for normalizing bytes string output (#481 ) * add cleaning brick for emojis * changelog and version * docs for bytes_string_to_string * different test for bytes_string_to_string	2023-04-13 19:39:08 +00:00
Matt Robinson	9c1c6a13f6	fix: updates markdown code to process markdown with embedded html (#480 ) * add carriage return to html if missing * test on markdown with embedded html * changelog and version * check for html parser * linting, linting, linting	2023-04-13 12:47:45 -04:00
Matt Robinson	ec02d9298e	fix: only warn about fallback to fast in `partition_pdf` if hi_res is used (#479 ) * only warn if detectron2 not available and hi_res is used * changelog and version	2023-04-13 11:46:35 -04:00
Matt Robinson	b628fa8048	feat: allow headers in `partition` (#473 ) * feat: allow headers in `partition` * warning if header is set and url is not * update emoji test	2023-04-13 15:04:15 +00:00
jonvet	7f0f33ddb0	fix: encode xml string if document_tree is `None` in `_read_xml` (#477 ) * fix: encode xml string if document_tree is `None` in `_read_xml` * don't encode text in test	2023-04-13 09:09:58 -04:00
Matt Robinson	e2e473dddd	feat: add `url` kwarg to `partition` (#470 ) * added url option to auto partition * add test for partition from url * version and changelog * update docs * add url to element metadata 0.5.12	2023-04-12 18:31:01 +00:00
qued	2110a266c8	fix: fix github issue formatting (#471 ) Attempting to fix formatting of github issues transferred to Jira. The old format was attempting to use double-slashes (\\) to specify line breaks. This worked in the test repo but didn't look right when merged to this repo. Now attempting to use formatted text in the yaml with \|. This worked in the test repo, but I guess that's no guarantee.	2023-04-12 16:59:12 +00:00
Austin Walker	4af4d33423	feat: add --partition-by-api and --partition-host to unstructured-ingest (#443 ) * Add --partition-by-api and --partition-host args to ingest * Fix error in make check * Bump changelog * Add a test ingest script Also add a workaround for the test causing 400s from our api. Seems we need to make sure unstructured-api can handle getting a file.content_type of None. * Remove the content type workaround	2023-04-11 22:05:07 -07:00
cragwolfe	ba4dadaa98	build: skip biomed ingest tests 90% of time due to ftp connectivity (#467 )	2023-04-11 11:27:38 -07:00
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367 ) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 . - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00
Matt Robinson	7ec85272b7	feat: add `partition_rtf` for rich text files (#466 ) * refactor epub; add rtf * added test for rtf files * filetype detection for rtf files * add rtf to auto * update docs for group_broken_paragraphs * add rtf to docs * update file list in readme * update stage_for_transformers docs * changelog and version bump * skip rtf if in docker * skip test if rtf not supported * docs tweaks	2023-04-10 21:25:03 +00:00
cragwolfe	11f82a8b1b	fix(ingest): import connector-specific modules on demand (#460 ) * fix(ingest): import connector-specific modules on demand * unstructured-ingest --flatten-metadata supported for local connector. * unstructured-ingest fix runtime error when using --metadata-include.	2023-04-08 11:35:35 -07:00
cragwolfe	bd01af2bac	build: add mimetypes DB to docker image (#455 ) The mailcap centos7 package provides the file /etc/mime.types, which is used by the mimetypes python package. That said, the unstructured code base does not make much use of this but the upstream unstructured-api does. Bonus: docx mimetype added in lookup table.	2023-04-07 13:59:29 -07:00
Matt Robinson	c99c099158	feat: enable grouping broken paragraphs in `partition_text` (#456 ) * cleaning brick to group broken paragraphs * docs for group_broken_paragraphs * add docs for partition_text with grouper * partition_text and auto with paragraph_grouper * version and changelog * typo in the docs * linting, linting, linting * switch to using regular expressions	2023-04-06 18:35:22 +00:00
ryannikolaidis	ee52a749c3	fix: docker smoke test on build (#457 )	2023-04-06 10:03:42 -07:00
ryannikolaidis	ef9fb79ed4	chore: build with registry as cache (#454 )	2023-04-06 00:34:07 -07:00
Matt Robinson	9b5cae49e1	fix: allow `replace_mime_encodings` to accept an `encoding` kwarg (#453 ) * changelog and version * added test	2023-04-05 22:53:38 +00:00
Matt Robinson	b855fd269f	fix: fix html encoding to support foreign characters (#452 ) * fix: fix html encoding to support foreign characters * version and changelog 0.5.11	2023-04-05 20:18:54 +00:00
Siddartha Naidu	1dca0db6b0	fix: guard against style attribute being None (#449 ) Some document elements may have a null style element which triggers an exception when trying to access the name of the style. Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-04-05 19:22:43 +00:00
ryannikolaidis	d298f57b8f	fix: issue when filename is provided but file is not on disk (#446 ) 0.5.10	2023-04-05 17:54:11 +00:00
natygyoon	e6d6509d54	feat: add --download-only parameter to unstructured-ingest (#416 ) Add --download-only parameter so that files may be downloaded if they are not already present (as usual, in either --download-dir or the default download ~/.cache/... location if --download-dir is not specified) and skip processing them through unstructured.	2023-04-05 17:14:41 +00:00
qued	5398cdf4e1	fix: change url to html_url (#451 )	2023-04-05 11:13:23 -05:00
qued	92aac29cab	chore: add github link to autogenerated issue (#445 ) Added a link back to the original Github issue from which the Jira issue was created for tracking purposes.	2023-04-05 10:03:30 -05:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	5ae895051a	feat: add sender and recipient info to element metadata for emails (#439 ) * add header metadata for .eml messages * sent to and from are lists * add metadata for outlook emails * version and changelog	2023-04-04 14:23:41 -04:00
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the line we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file. 0.5.9	2023-04-03 11:12:12 -07:00
Minty Mac	533241c274	Update README.md (#435 ) Spelling error	2023-04-02 09:52:14 -07:00
ryannikolaidis	b746cb0ab4	docs: update README with notes about multi-platform (#433 )	2023-04-01 17:31:32 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.8	2023-03-30 20:54:29 +00:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word. Note: This was originally an attempt to speed up contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	e5dd9d5676	!fix: return a list of elements in `stage_for_transformers` (#420 ) * update stage_for_transformers to return a list of elements * bump changelog and version * flag breaking change * fix last word bug in chunk_by_attention_window	2023-03-30 12:27:11 -04:00
Umar Farooqi	2f5c61c178	fix: exclude empty tags during depth check (#379 )	2023-03-30 10:03:50 -04:00
ryannikolaidis	19fb3031be	ci: update CI for deprecated set-output (#417 )	2023-03-29 20:49:21 -07:00


929 Commits