unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-21 20:25:15 +00:00

Author	SHA1	Message	Date
ryannikolaidis	a5c7e5b41e	chore: DRY ingest connectors (#769 )	2023-06-26 20:12:05 +00:00
Amanda Cameron	95f02f290d	chore: update readme for api keys (#792 ) * api announcement * updating copy * version bump 0.7.9	2023-06-26 11:56:01 -07:00
ryannikolaidis	7f0f5fab04	ci: fix amd build issue (#804 )	2023-06-24 23:46:32 +00:00
MalteHB	030c56fcba	enhancement: better leaf element string check in XML parsing (#734 ) * Enhance leaf element string check in XML parsing * fix is_string check * changelog and version --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-06-23 20:44:50 +00:00
Emily Chen	a8a19ceba0	chore: Add --ocr-languages parameter to unstructured ingest (#793 )	2023-06-23 12:38:33 -07:00
Martin Mauch	752e78e803	feat: partition_org for Org Mode documents (#780 ) * feat: partition_org for Org Mode documents * update version	2023-06-23 18:45:31 +00:00
little_huang	5320aa681f	docs: fix indentation (#802 ) Co-authored-by: 黄宝成 <huangbc@publink.cn>	2023-06-23 09:50:31 -05:00
Christine Straub	5f5da65e0b	Fix/handle-spooled-temp-file-eml (#800 ) This PR makes the unstructured-api smoke tests pass. 0.7.8	2023-06-22 19:21:28 -07:00
Matt Robinson	901ef16835	fix: allow `partition_email` to process emails with no content (#797 ) * version and changelog * ingest-test-fixtures-update	2023-06-22 12:52:27 -04:00
Matt Robinson	8683e2695c	fix: enable `partition_pdf` to recursively grab text with fast strategy (#796 ) * initial pass on text in figures * refactor text extraction * update tests * fix title test * add test for docs that require recursive text grab * version and changelog * ingest-test-fixtures-update * there are 8 pdf files now	2023-06-22 11:19:54 -04:00
David Potter	3b472cb7df	feat: add google cloud storage connector (#746 )	2023-06-21 15:14:50 -07:00
shreyanid	21c346dab8	broken file link in quick start sample code (#789 )	2023-06-21 13:39:10 -07:00
Roman Isecke	61ea00a06f	Update Dockerfile to use multistage build and cache layers (#785 ) * Update Dockerfile to use multistage build and cache layers * Fix Dockerfile	2023-06-21 13:12:45 -04:00
ryannikolaidis	e08936b6fb	chore: update all bash scripts to use shebang: /usr/bin/env bash (#779 )	2023-06-20 16:00:55 -07:00
Matt Robinson	c53ce117bc	fix: enable `partition_html` to grab content outside of `<article>` tags (#772 ) * optionally dont assemble articles * add test for content outside of articles * pass kwargs in partition * changelog and version * update default to False * bump version for release * back to dev version to get another fix in the release 0.7.7	2023-06-20 17:07:30 +00:00
Matt Robinson	feaf1cb4df	fix: check for xml attribute when identifying pagebreaks (#778 )	2023-06-20 12:44:00 -04:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
Christine Straub	743482b6d3	Bug/635 unicode decode error eml (#739 ) * Adds functionality to extract charset info from eml files * Adds missed file-like object handling in detect_file_encoding * Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs * Organize the eml example files in the example-docs/eml directory	2023-06-17 00:52:13 +00:00
cragwolfe	2989f53358	chore: bump to python 3.8.17 (#766 ) The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.	2023-06-16 11:17:03 -07:00
cragwolfe	68f04159bc	chore: rm old detectron2 install from makefile (#767 ) * chore: remove vestigial Makefile target and tensorboard	2023-06-16 10:05:36 -07:00
ryannikolaidis	4faa27ffe7	test: add google drive ingest test (#764 )	2023-06-16 16:28:24 +00:00
Yuming Long	a611532e3c	Chore: convert fast strategy to ocr_only for images (#735 ) * fall back to ocr only * more note * add test case * maybe remove skipping dockertest for kor ocr? * bump again * clean up flag * empty commit 0.7.6	2023-06-16 10:59:13 -04:00
Matt Robinson	4ea716837d	feat: add ability to extract extra metadata with regex (#763 ) * first pass on regex metadata * fix typing for regex metadata * add dataclass back in * add decorators * fix tests * update docs * add tests for regex metadata * add process metadata to tsv * changelog and version * docs typos * consolidate to using a single kwarg * fix test	2023-06-16 10:10:56 -04:00
Angus Sinclair	ec403e245c	fix malformed pptx issue (#761 ) * fix malformed pptx issue Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions. * update changelog * fix lint issue SIM102 nested ifs * fix black linting	2023-06-15 19:52:44 +00:00
Yuming Long	5bf78c077d	Fix: remove fake api key in test (#762 ) * no fake api key * changelog and version * remove kwarg since we have default	2023-06-15 19:18:22 +00:00
John	a9b9b873b1	feat: partition_tsv for tab separated value files (#758 ) * first pass at partition_tsv * working tests * create constants for tests and debug `make test` failure * make check and tidy * undo changes for testing locally * update changelog and version * fix bricks.rst * refactor if statements * make tidy * fix README and change try/except to if/else * update changelog and version * fix docstring	2023-06-15 18:50:53 +00:00
Matt Robinson	075bf0bdba	fix test that requires api key	2023-06-15 14:34:57 -04:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
kravetsmic	7fd7d7afae	feat(biomed connector): added additional params (#468 ) (#623 ) Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay. --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-15 01:57:45 -07:00
Matt Robinson	e0c477de68	docs: update slack invite link (#749 )	2023-06-14 10:06:45 -04:00
Matt Robinson	053a6c6e5c	enhancement: extract headers and footers in `partition_docx` (#742 ) * added tests for headers and footers * add docs on headers and footers; tweak to metadata * version and changelog	2023-06-14 09:42:59 -04:00
cragwolfe	3fe7e1b6ca	fix: pdf2image library is core requirement (#745 ) 0.7.5	2023-06-13 23:04:41 -07:00
kravetsmic	8258dbb25f	feat: add --api-key parameter to unstructured-ingest (#644 )	2023-06-14 05:05:18 +00:00
ryannikolaidis	9443bd40e2	ci: add set up python to test_unit job (#743 )	2023-06-14 01:50:37 +00:00
ryannikolaidis	9d3f7183fd	ci: add cache version to ingest-test-fixtures-update-pr workflow (#737 )	2023-06-13 18:15:35 -07:00
ryannikolaidis	a753370dc7	ci: update ingest fixtures from gh workflow (#702 )	2023-06-13 10:27:32 -07:00
fran-unstructured	a313c02f69	docs: sort functions in bricks.rst in alphabetical order v2 (#728 ) Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>	2023-06-12 18:22:23 -04:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00
Matt Robinson	3f80301964	fix: handling for emails without datetimes (#724 ) * add empty filetype * add empty handling to partition * changelog and version * handling for when there is no datetime * changelog and version	2023-06-12 17:11:04 +00:00
Yuming Long	b354e8eec6	Chore: Allow passing kwargs to request data field (#716 ) * bump again :( * update to kwarg * add test case * rename to request_kwargs * remove install detectron2 * pip compile * add changelog for remove detectron2 install * resolve weaviate import issue on python 3.9 0.7.4	2023-06-12 12:39:58 -04:00
John	fc53277826	fix: Enable MIME type detection if libmagic is not available (#714 ) * fix: Add filetype check if libmagic unavailable * make tidy * make check * fix: change mime_type error to warning * Update changelog and __version__ * fix: Add filetype to requirements	2023-06-09 17:06:21 -04:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Yuming Long	80f0b4a132	Fix: Pass `strategy` parameter down from `partition` for `partition_image` (#708 ) * changelog and version * passing param down * test should be auto * doc nit * lint * update image output	2023-06-09 13:54:18 -04:00
Matt Robinson	0289ca3ea7	fix: handle encoding for text file checks (#707 ) * fixed encoding issue for _is_text_file_a_json * changelog and version	2023-06-09 11:08:16 -04:00
John	b2b92ea79d	fix: filetype detection if a CSV has a text/plain MIME type (#691 ) * fix: Filetype detection if a CSV has a text/plain MIME type #621 * bug: fix csv detection and create _read_file_start_for_type_check func * fix: Make call to _is_text_file_a_csv from detect_filetype	2023-06-08 16:21:07 -04:00
Matt Robinson	c1ba090c34	fix: suppress file conversion warnings in `convert_office_doc` (#703 ) * test that output is suppressed * add test for error output * changelog and version	2023-06-08 12:33:06 -04:00
dependabot[bot]	559a5578ba	build(deps): bump label-studio-sdk in /requirements (#701 ) Bumps [label-studio-sdk](https://github.com/heartexlabs/label-studio-sdk) from 0.0.27 to 0.0.28. - [Commits](https://github.com/heartexlabs/label-studio-sdk/commits) --- updated-dependencies: - dependency-name: label-studio-sdk dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-06-08 11:41:40 -04:00
dependabot[bot]	8aa87fc3b7	build(deps): bump ruff from 0.0.270 to 0.0.272 in /requirements (#699 ) Bumps [ruff](https://github.com/charliermarsh/ruff) from 0.0.270 to 0.0.272. - [Release notes](https://github.com/charliermarsh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md) - [Commits](https://github.com/charliermarsh/ruff/compare/v0.0.270...v0.0.272) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-06-08 09:40:17 -04:00
dependabot[bot]	f681de8d74	build(deps): bump sphinx-rtd-theme from 1.2.1 to 1.2.2 in /requirements (#698 ) Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.1 to 1.2.2. - [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst) - [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.1...1.2.2) --- updated-dependencies: - dependency-name: sphinx-rtd-theme dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-06-08 09:39:35 -04:00

929 Commits