unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-18 21:10:01 +00:00

Author	SHA1	Message	Date
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879 ) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
David Potter	bec733cdf8	feat: add Dropbox connector (#844 )	2023-06-30 17:08:27 -07:00
qued	350bb1dad5	enhancement: clean pdf elements (bump unstructured-inference) (#790 ) More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4) Make large model available (from unstructured-inference bump to 0.5.3) Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2) --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructured.io>	2023-06-29 18:35:06 -07:00
ryannikolaidis	62e20442df	chore: refactor ingest tests (#814 ) - Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth - Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory - Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository - Construct paths as reusable variables declared at top of scripts - Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance) - OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind - Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true - Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)	2023-06-29 23:13:41 +00:00
ryannikolaidis	8ea5f6939e	fix: parameterized ingest test overwriting (#838 ) * sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh * fixes incorrect expected results * update expected results to properly parse Korean text * bonus: installs language pack for Korean in CI and ingest fixture workflows	2023-06-29 18:37:09 +00:00
ryannikolaidis	60fe231f08	fix: use api key where needed in tests (#843 ) * passes api key for unstructured-api to unit and ingest tests as needed. * adds check for env var CI to otherwise skip tests that require an api key	2023-06-29 17:31:01 +00:00
Roman Isecke	9882c2b83f	Avoid setting metadata in constructor signature for elements (#837 ) Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification). Bonus refactor for PageBreak to have text values of "". --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-29 03:14:05 +00:00
Matt Robinson	44411ecc59	enhancement: `max_partition` kwarg for limiting element size (#818 ) * add max partition size logic * work splitting logic into split_by_paragraph * pass through max_partition to other functions * added test for splitting long document * add type hint * add documentation * version and changelog * ingest-test-fixtures-update * Update ingest test fixtures (#819) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * retrigger ci * ingest-test-fixtures-update * ingest-test-fixtures-update * Update ingest test fixtures (#821) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * update default for partition_xml * update version for release * update msg doc string --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 15:26:01 -04:00
Matt Robinson	38457777fa	fix: ignore escaped commas in CSV checks (#832 ) * fix file content checking bug * skip counting commas in quotes for csv detection * add test for comma count * change file content grab to -1 * version and changelog * add csv to extension check * add file to tests * ingest-test-fixtures-update * Update ingest test fixtures (#833) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * fix typo * fix changelog wording --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 17:22:23 +00:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642 ) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests for html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
ryannikolaidis	a5c7e5b41e	chore: DRY ingest connectors (#769 )	2023-06-26 20:12:05 +00:00
Emily Chen	a8a19ceba0	chore: Add --ocr-languages parameter to unstructured ingest (#793 )	2023-06-23 12:38:33 -07:00
Matt Robinson	8683e2695c	fix: enable `partition_pdf` to recursively grab text with fast strategy (#796 ) * initial pass on text in figures * refactor text extraction * update tests * fix title test * add test for docs that require recursive text grab * version and changelog * ingest-test-fixtures-update * there are 8 pdf files now	2023-06-22 11:19:54 -04:00
David Potter	3b472cb7df	feat: add google cloud storage connector (#746 )	2023-06-21 15:14:50 -07:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
ryannikolaidis	4faa27ffe7	test: add google drive ingest test (#764 )	2023-06-16 16:28:24 +00:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
kravetsmic	7fd7d7afae	feat(biomed connector): added additional params (#468 ) (#623 ) Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay. --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-15 01:57:45 -07:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00
Yuming Long	80f0b4a132	Fix: Pass `strategy` parameter down from `partition` for `partition_image` (#708 ) * changelog and version * passing param down * test should be auto * doc nit * lint * update image output	2023-06-09 13:54:18 -04:00
ryannikolaidis	2094b976cf	feat: adds data_source metadata to ElementMetadata (#690 )	2023-06-07 21:22:18 -07:00
ryannikolaidis	29f0deda63	test: revive ingest unit tests (#688 )	2023-06-06 09:03:13 -07:00
Christine Straub	547bb38d86	fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 ) Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding. Change auto.py to have a None default for encoding Remove the unused parameter encoding from partition_pdf Add functionality to the read_txt_file utility function to handle file-like object from URL	2023-06-05 11:27:12 -07:00
ryannikolaidis	7d157c1ede	test: add benchmark script (#638 )	2023-06-05 09:14:43 -07:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 13:50:15 -05:00
Yuming Long	ab5f92dd79	Fix(ingest): Deprecate `--s3-url` in favor of `--remote-url` (#616 ) * deprecation s3-url * changelog and version * download dir not now	2023-05-19 12:11:40 -04:00
ryannikolaidis	7942bc9d5b	chore: refactor for ingest standard_config options (#599 )	2023-05-18 16:49:30 -07:00
Mallori Harrell	34d563c1fc	feat: Create spacy notebook example (#593 ) * add new notebook for spacy	2023-05-17 15:42:15 -05:00
Trevor Bossert	f4f40f58e3	Add discord token so tests run (#598 ) * Add discord token so tests run * install discord deps * Update expected results for discord test	2023-05-16 16:46:20 -07:00
Trevor Bossert	830d67f653	Feat: Discord connector (#515 ) * Initial commit of discord connector based off of initial work by @tnachen with modifications https://github.com/tnachen/unstructured/tree/tnachen/discord_connector * Add test file change format of imports * working version of the connector More work to be done to tidy it up and add any additional options * add to test fixtures update * fix spacing * tests working, switching to bot testing channel * add additional channel add reprocess to tests * add try clause to allow for exit on error Update changelog and bump version * add updated expected output files * add logic to check if --discord-period is an integer Add more to option description * fix lint error * Update discord reqs * PR feedback * add newline * another newline --------- Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>	2023-05-16 11:46:30 -07:00
Matt Robinson	bd6a8a3a40	enhancement: add `file_directory` to element metadata (#585 ) * enhancement: add `file_directory` to element metadata * update msg test * exclude file_directory * update slack output * added file directory tests on partition_x paths	2023-05-15 18:25:39 -04:00
Yuming Long	5b6f11bb88	Chore(ingest): Add `--partition-strategy` parameter in CLI (#582 ) * change strategy arg default to auto in partition * passing --partition-strategy down * add strategy="hi_res" to test (default changed) * made an error on param name, added note	2023-05-15 19:26:53 +00:00
qued	55272eeceb	enhancement: filetype in metadata (#583 ) Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata. Tests are added to make sure: * When partition is used, any content type or auto file type detection will override file-specific partition function metadata * Both auto and file-specific partitioning gives the desired filetype metadata Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .	2023-05-15 13:23:19 -05:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto strategy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	8da1ddc6ec	enhancement: add method for getting datetime; cleanup filename attribute (#575 ) * added method for extracting datetime * change filename metadata to the base filename * fix filename metadata for msg * changelog and bump version * fix expected structured output * newline back in file * reset output file * update filename output * update test fixtures * update fixture	2023-05-12 11:33:01 -04:00
ryannikolaidis	2fc4d37454	chore: pin inference version, bump deps, and update openssl (#551 )	2023-05-08 17:02:55 -07:00
Matt Robinson	aa01cdfc7a	fix: group together text from the same bounding box in `partition_pdf` with fast strategy (#542 ) * switch to using PDF objects * linting, linting, linting * couple more tweaks * added test for chevron-page * version and changelog * linting, linting, linting * now processing 4 files	2023-05-03 18:33:24 -04:00
Matt Robinson	894a190001	enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 ) * function to check if pdf is extractable * add fallback logic for unextractable pdfs * tests for docs with copy protection * add test for unprocessable pdf * update docs * changelog and version * update logic for images; reset file before proceeding * 3 files for api tests * docs update	2023-04-21 21:35:43 +00:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
Matt Robinson	39b261aee6	fix: group broken paragraphs when using the fast strategy for PDFs (#485 ) * group broken paragraphs with fast strategy * changelog and version * fix broken tests for text.py * formatting for paragraph pattern re * fix test * fix whitespace substitution * one more test tweak * blurb to account for short lines * fix for shorter paragraphs * update changelog * remove extra line break from auto * retrigger ci * trying skipping azure * skip azure (test) * updated github and azure fixtures * update slack fixture	2023-04-19 13:54:17 -04:00
cragwolfe	5657378602	test: avoid misleading output in ingest tests (#488 ) Previously, if there was an error (non-zero exit code) in an ingest test script, the script would still complete and echo a warning about mismatched outputs and how to regenerate the fixtures. However, this statement is irrelevant and misleading: if the ingest failed with a non-zero exit code in the first place, that is the failure that should be debugged -- don't confuse the user with a comment about outputs.	2023-04-17 21:57:44 +00:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	a11563fe63	fix: update ingest test fixtures, disable biomed test (#486 ) * Update test fixtures that should have been updated in prior commit * Disable biomed ingest tests for now, they fail more often than not * Bonus: echo `tesseract --version` in the update script, since that is a key thing that influences fixture outputs.	2023-04-15 00:07:09 +00:00
cragwolfe	46ac2a2226	build(CI): add access token for github-ingest test (#482 ) Avoids the occasional CI test failures in test-ingest-github.sh that were due to rate-limited non-auth'ed requests against a GitHub repo.	2023-04-14 11:14:21 -07:00
Austin Walker	4af4d33423	feat: add --partition-by-api and --partition-host to unstructured-ingest (#443 ) * Add --partition-by-api and --partition-host args to ingest * Fix error in make check * Bump changelog * Add a test ingest script Also add a workaround for the test causing 400s from our api. Seems we need to make sure unstructured-api can handle getting a file.content_type of None. * Remove the content type workaround	2023-04-11 22:05:07 -07:00
cragwolfe	ba4dadaa98	build: skip biomed ingest tests 90% of time due to ftp connectivity (#467 )	2023-04-11 11:27:38 -07:00
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367 ) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346 . - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00


321 Commits