unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-12 00:23:35 +00:00

Author	SHA1	Message	Date
qued	4211dda360	build: sync detectron version (#440 ) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
Amanda Cameron	555b95b8f7	Fixing test for unstructured-api (#425 ) Ran into an error in tests for unstructured-api (see below for output). Somewhere along the lines we were reading a txt file into bytes and then the PARAGRAPH_PATTERN (a string) was not able to be compared to the bytes file. 0.5.9	2023-04-03 11:12:12 -07:00
Minty Mac	533241c274	Update README.md (#435 ) Spelling error	2023-04-02 09:52:14 -07:00
ryannikolaidis	b746cb0ab4	docs: update README with notes about multi-platform (#433 )	2023-04-01 17:31:32 -07:00
Matt Robinson	414883455b	fix: correct order of kwargs in pandoc (#421 ) * fix: correct order of kwargs in pandoc * only skip epub tests in Docker * changelog --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io> Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.8	2023-03-30 20:54:29 +00:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419 ) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
cragwolfe	32c79caee3	chore: use only regex for contains_english_word. (#382 ) Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.	2023-03-30 16:57:43 +00:00
Matt Robinson	e5dd9d5676	!fix: return a list of elements in `stage_for_transformers` (#420 ) * update stage_for_transformers to return a list of elements * bump changelog and version * flag breaking change * fix last word bug in chunk_by_attention_window	2023-03-30 12:27:11 -04:00
Umar Farooqi	2f5c61c178	fix: exclude empty tags during depth check (#379 )	2023-03-30 10:03:50 -04:00
ryannikolaidis	19fb3031be	ci: update CI for deprecated set-output (#417 )	2023-03-29 20:49:21 -07:00
ryannikolaidis	77b6fb2792	ci: update dockerfile to also add models and nltk (#418 )	2023-03-29 20:48:06 -07:00
natygyoon	7f6e094c1f	feat: add local file system connector for unstructured-ingest (#399 ) * added local connector to unstructured-ingest	2023-03-29 15:53:23 -07:00
natygyoon	e6187b262f	enhancement: update elements_to_json to potentially return a string (#403 ) * update elements_to_json to potentially return string if filename is not specified * add text to elements_from_json	2023-03-29 12:38:30 -07:00
natygyoon	1da40806da	feat: add --max-docs parameter to unstructured-ingest (#402 ) * added --max-docs parameter to unstructured-ingest	2023-03-30 03:24:12 +09:00
ryannikolaidis	65fec954ba	ci: publish amd and arm images (#404 )	2023-03-29 07:02:39 +00:00
Matt Robinson	09b52b4fc4	fix: text kwargs no longer fail with empty string (#413 ) * fix: text kwargs no longer fail with empty string * linting	2023-03-28 21:03:51 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
ryannikolaidis	e1a8db51ad	ci: test before publishing docker image (#390 )	2023-03-27 13:16:48 -07:00
Amanda Cameron	71e035c34c	Adding content_type and file_filename to autopartition (#394 ) Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.7	2023-03-24 16:32:45 -07:00
cragwolfe	8ffd31029e	clean doc text (#398 )	2023-03-24 08:43:27 -07:00
cragwolfe	ce9fc26009	feat: add ability to pass headers in partition_html (#397 ) Also adds pytest-mock requirement, those fixtures are nice to have! Implements issue/feature #396 .	2023-03-23 20:14:57 -07:00
natygyoon	a4394f6f16	feat: add --flatten-metadata to unstructured-ingest (#389 ) * added --flatten-metadata to unstructured-ingest * added unit tests for process_file()	2023-03-22 20:52:56 +00:00
natygyoon	66a0369fb6	feat: add --fields-include to unstructured-ingest (#376 ) * add --fields-include parameter to unstructured-ingest * add unit tests for process_file()	2023-03-22 14:12:35 +00:00
cragwolfe	3467a2786d	Update patterns.py (#391 )	2023-03-21 23:58:18 -07:00
natygyoon	6b17cb228e	refactor: use `exactly one` throughout code base (#385 ) added `exactly_one` to additional places like unstructured/partition too.	2023-03-21 16:50:13 -07:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373 )	2023-03-21 13:46:09 -07:00
Benjamin Torres	3c95b975fe	Fix: duplicated addition to elements list (#388 ) 0.5.6	2023-03-21 12:56:04 -07:00
natygyoon	c16862e7b3	feat: add --metadata-include and --metadata-exclude parameters to unstructured-ingest (#368 ) * added metadata in/exclude params * updated process_file * existing tests * remove default behavior * changelog and ci * line length * import * import * import sorted * import * type * line length * main * ci * json * dict * type ignore * lint * unit tests for process_file * lint * type changed to Optional(str) * ci * line length * added mutex check * nit	2023-03-22 03:30:53 +09:00
ryannikolaidis	d5a0fce6a0	docs: update readme with notes about pulling and running the public Docker image. (#381 )	2023-03-20 18:41:44 +00:00
cragwolfe	fbc7a69a53	feat: change english_words to set for performance gain (#380 )	2023-03-19 22:51:32 +00:00
ryannikolaidis	1e39e1ac2a	ci: Adds workflow to publish docker builds (#377 )	2023-03-19 21:53:05 +00:00
Sebastian Laverde Alfonso	c9c1b843d2	docs: Integrations LangChain code fix (#378 )	2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso	b2f37c3eff	Docs: add Integrations section (#372 ) * docs: update index, add integrations * docs: fix typos * docs: create integrations.rst section structure * docs: descriptions and use for 8 integrations * refactor: SEC example in Label Studio section * Apply suggestions from code review Co-authored-by: qued <64741807+qued@users.noreply.github.com> * docs: change links order and refactor|paraphrase --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-17 19:11:38 +00:00
Matt Robinson	b47bfaf33a	fix: update test to pass on later `label_studio_sdk` versions (#369 ) Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.	2023-03-17 17:57:09 +00:00
Mallori Harrell	ff63ad81d9	chore: Add note about python version (#375 ) * add note about python version --------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>	2023-03-17 11:22:49 -05:00
qued	f6d787d95b	ci: workflow to create JIRA issue on GH issue create (#370 ) Created a github workflow to create a new issue in JIRA when a github issue is created, mirroring the summary and description. Pretty simplistic for now with a hardcoded project, and no support for any ongoing sync events.	2023-03-15 16:17:56 -05:00
natygyoon	e0eb66de52	feat: add staging brick to clean non-ascii characters from unicode (#366 )	2023-03-14 21:31:51 -07:00
Amanda Cameron	edb847ce0b	adding Dockerfile (#359 )	2023-03-14 13:40:01 -07:00
qued	a00c6feb9a	fix: changelog typo throwing off formatting (#365 )	2023-03-14 16:30:53 +00:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> 0.5.4	2023-03-14 15:52:21 +00:00
qued	aa494623a2	chore: bump versions (#352 ) Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.	2023-03-14 09:40:30 -05:00
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362 )	2023-03-13 13:06:45 -07:00
cragwolfe	7b9475ef26	chore: rm competition announcement from the README (#361 )	2023-03-13 09:34:26 -07:00
Matt Robinson	d17a94f395	chore: add libreoffice to ubuntu install script (#363 )	2023-03-13 10:46:23 -04:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Habeeb Shopeju	2ca843782c	Connector for Biomedical Literature (#345 ) The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.	2023-03-11 01:09:54 +00:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	3d21b4098e	enhancement: improve `detect_filetype` warning to include filename (#355 ) * Improve warning to include filename if provided * Update changelog & version	2023-03-10 12:26:08 -05:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00


389 Commits