unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-04 23:03:11 +00:00

Author	SHA1	Message	Date
fran-unstructured	a313c02f69	docs: sort functions in bricks.rst in alphabetical order v2 (#728 ) Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>	2023-06-12 18:22:23 -04:00
Matt Robinson	c82fdb6a89	feat: `partition_rst` for ReStructured Text documents (#725 ) * add example rst file * filetype detection for rst files * add partition_rst function * add partition_rst to auto * update readme * update docs * changelog and version * pandocs -> pandoc * fix typo	2023-06-12 19:31:10 +00:00
Yuming Long	2fbb1ccd30	Chore(ingest) : add tests on PDFs with fast strategy (#614 ) Summary * Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted * Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh Updated ingest tests procedure: * Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name> * Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name> Test * Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.	2023-06-12 19:02:48 +00:00
Matt Robinson	3f80301964	fix: handling for emails without datetimes (#724 ) * add empty filetype * add empty handling to partition * changelog and version * handling for when there is no datetime * changelog and version	2023-06-12 17:11:04 +00:00
Yuming Long	b354e8eec6	Chore: Allow passing kwargs to request data field (#716 ) * bump again :( * update to kwarg * add test case * rename to request_kwargs * remove install detectron2 * pip compile * add changelog for remove detectron2 install * resolve weaviate import issue on python 3.9 0.7.4	2023-06-12 12:39:58 -04:00
John	fc53277826	fix: Enable MIME type detection if libmagic is not available (#714 ) * fix: Add filetype check if libmagic unavailable * make tidy * make check * fix: change mime_type error to warning * Update changelog and __version__ * fix: Add filetype to requirements	2023-06-09 17:06:21 -04:00
Matt Robinson	19ab6d960f	enhancement: handling for empty files in `detect_filetype` and `partition` (#710 ) * add empty filetype * add empty handling to partition * changelog and version	2023-06-09 16:07:50 -04:00
Yuming Long	80f0b4a132	Fix: Pass `strategy` parameter down from `partition` for `partition_image` (#708 ) * changelog and version * passing param down * test should be auto * doc nit * lint * update image output	2023-06-09 13:54:18 -04:00
Matt Robinson	0289ca3ea7	fix: handle encoding for text file checks (#707 ) * fixed encoding issue for _is_text_file_a_json * changelog and version	2023-06-09 11:08:16 -04:00
John	b2b92ea79d	fix: filetype detection if a CSV has a text/plain MIME type (#691 ) * fix: Filetype detection if a CSV has a text/plain MIME type #621 * bug: fix csv detection and create _read_file_start_for_type_check func * fix: Make call to _is_text_file_a_csv from detect_filetype	2023-06-08 16:21:07 -04:00
Matt Robinson	c1ba090c34	fix: suppress file conversion warnings in `convert_office_doc` (#703 ) * test that output is suppressed * add test for error output * changelog and version	2023-06-08 12:33:06 -04:00
dependabot[bot]	559a5578ba	build(deps): bump label-studio-sdk in /requirements (#701 ) Bumps [label-studio-sdk](https://github.com/heartexlabs/label-studio-sdk) from 0.0.27 to 0.0.28. - [Commits](https://github.com/heartexlabs/label-studio-sdk/commits) --- updated-dependencies: - dependency-name: label-studio-sdk dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-06-08 11:41:40 -04:00
dependabot[bot]	8aa87fc3b7	build(deps): bump ruff from 0.0.270 to 0.0.272 in /requirements (#699 ) Bumps [ruff](https://github.com/charliermarsh/ruff) from 0.0.270 to 0.0.272. - [Release notes](https://github.com/charliermarsh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md) - [Commits](https://github.com/charliermarsh/ruff/compare/v0.0.270...v0.0.272) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-06-08 09:40:17 -04:00
dependabot[bot]	f681de8d74	build(deps): bump sphinx-rtd-theme from 1.2.1 to 1.2.2 in /requirements (#698 ) Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.1 to 1.2.2. - [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst) - [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.1...1.2.2) --- updated-dependencies: - dependency-name: sphinx-rtd-theme dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-06-08 09:39:35 -04:00
Matt Robinson	aa4d4329db	fix: `partition_via_api` reflects actual filetype in metadata (#696 ) * fix: `partition_via_api` reflects actual filetype in metadata * added in list length check * changelog typo	2023-06-08 13:24:16 +00:00
ryannikolaidis	dabda67c8f	fix: ingest-test-fixtures-update script to pass env vars (#697 )	2023-06-08 04:48:49 +00:00
ryannikolaidis	2094b976cf	feat: adds data_source metadata to ElementMetadata (#690 )	2023-06-07 21:22:18 -07:00
Matt Robinson	6bc116887f	enhancement: add encoding to `elements_to_json` and `elements_from_json` (#694 ) * add encoding to elements_to_json and elements_from_json * version and changelog * add new test * fix version * revert test file * blank line to test * no blank line 0.7.2	2023-06-07 13:20:06 -04:00
Matt Robinson	c6dc466e79	docs: update capabilities table; fix mistake in para grouping docs (#683 ) * docs: update capabilities table with rtf/md/epub tables * fix regex in docs * revert bricks update --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-06-06 18:29:56 +00:00
Yuming Long	533689196b	Chore: bump base image to update tesseract version (#680 ) * dockerfile * changelog version * version bump	2023-06-06 17:01:16 +00:00
kravetsmic	7df31ead75	feat: if no params show help (#649 ) * feat: if no params show help * Remove comments * feat: update checking params * updated main script and changelog * version bump --------- Co-authored-by: yuming <305248291@qq.com>	2023-06-06 16:25:44 +00:00
ryannikolaidis	29f0deda63	test: revive ingest unit tests (#688 )	2023-06-06 09:03:13 -07:00
Sebastian Laverde Alfonso	508ce48d54	Feat: notebook for Elasticsearch integration (#681 ) * feat: nb elasticsearch unstructured sentiment * chore: refactor readme for elasticsearch nb * fix: update es-credentials.ini * chore: update es-credentials.ini * fix: type in nb load-into-es.ipynb exist --> exists * fix: typo 2 in nb load-into-es.ipynb obtaing --> obtain	2023-06-05 19:05:08 +00:00
Christine Straub	547bb38d86	fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 ) Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding. Change auto.py to have a None default for encoding Remove the unused parameter encoding from partition_pdf Add functionality to the read_txt_file utility function to handle file-like object from URL	2023-06-05 11:27:12 -07:00
ryannikolaidis	7d157c1ede	test: add benchmark script (#638 )	2023-06-05 09:14:43 -07:00
John	18aefc854a	chore: Re-enable test_upload_label_studio_data_with_sdk (#674 )	2023-06-02 23:38:43 +00:00
Matt Robinson	cf0ff91e37	fix: recognize code files with auto (#677 ) * add check for code mime type * add file extensions * add new tests * version and changelog	2023-06-02 20:09:43 +00:00
Matt Robinson	6c10d8f022	docs: update detectron2 instructions in readme (#678 )	2023-06-02 19:44:41 +00:00
Meir	74a61e33d8	fix: metadata.page_number of pptx files (#675 ) * fix: metadata.page_number of pptx files * update changelog	2023-06-02 13:22:43 +00:00
qued	01f76888e0	build(deps): add tabulate dependency (#673 ) tabulate is used by functions that extract tables from Microsoft documents, but there is nothing explicitly requiring the library. This was not caught by tests, because for some reason, tabulate is in base.txt. This PR adds the dependency to base.in (which also puts it in setup.py), and recompiles the dependencies.	2023-06-01 16:56:24 -05:00
ryannikolaidis	bdef4fd398	test: adds profiling script (#661 )	2023-06-01 21:26:05 +00:00
Matt Robinson	c35fff2972	feat: Add `stage_for_weaviate` and schema creation function (#672 ) * add weaviate docker compose * added staging brick and tests for weaviate * initial notebook and requirements file * add commentary to weaviate notebook * weaviate readme * update docs * version and change log * install weaviate client * install weaviate; skip for docker * linting, linting, linting * install weaviate client with deps * comments on weaviate client * fix module not found error for docker container * skipped wrong test in docker * fix typos * add in local-inference 0.7.1	2023-06-01 20:48:54 +00:00
Trevor Bossert	cf70c86574	Build from rocky base image (#665 ) * build from Rocky linux unstructured base image * add qemu for arm * comment out push while testing * remove quotes * Add arch * bump login action * add ARCH env var to the push step * run only subset of tests on arm image Tests on emulated arm are extremely slow. Likelihood of something breaking in arm image only, is minimal. I say that knowing I likely just jinxed us. * re-enable push from main * add a dnf cleanup * version bump * move from dev to minor version bump	2023-06-01 12:16:04 -07:00
dependabot[bot]	cd9fd9b395	build(deps): bump pygithub from 1.57.0 to 1.58.2 in /requirements (#669 ) Bumps [pygithub](https://github.com/pygithub/pygithub) from 1.57.0 to 1.58.2. - [Release notes](https://github.com/pygithub/pygithub/releases) - [Changelog](https://github.com/PyGithub/PyGithub/blob/master/doc/changes.rst) - [Commits](https://github.com/pygithub/pygithub/compare/v1.57...v1.58.2) --- updated-dependencies: - dependency-name: pygithub dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-06-01 18:45:47 +00:00
dependabot[bot]	1152fe4383	build(deps): bump sphinx-rtd-theme in /requirements (#670 ) Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.0rc3 to 1.2.1. - [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst) - [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.0rc3...1.2.1) --- updated-dependencies: - dependency-name: sphinx-rtd-theme dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-06-01 14:28:27 -04:00
Matt Robinson	be04e1b7c4	docs: tables supported for ppt now	2023-05-31 16:15:04 -04:00
qued	d3600dd5da	build(deps): update inference version (#662 ) Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay! --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> 0.7.0	2023-05-31 13:50:15 -05:00
cshaddox	d23e0d6420	feat: table extraction for power points (#664 ) * Handling tables * updating changelog * Adding accidentally removed code * remove newline * reuse table extraction function; add test --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 18:26:32 +00:00
Matt Robinson	52e5a5ca8d	fix: raise `ValueError` in `partition_via_api` if filename not present (#663 ) * raise value error if filename not specified for api * version and changelog	2023-05-31 18:09:58 +00:00
kravetsmic	795a9a0b4c	feat: add jupyter make commands (#651 ) Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-31 14:01:23 +00:00
John	c78c5b6adf	fix: `page_number` appears in `partition_html` metadata if `include_metadata=False` (#658 ) * fix: page_number appears in partition_html metadata if include_metadata=False * Update common.py * Update CHANGELOG --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2023-05-30 20:47:55 +00:00
Matt Robinson	f7cde5539a	fix: `page_number` should not always be 1 in the metadata (#657 ) * fix page number issue * add tests * changelog and version * update changelog	2023-05-30 15:10:14 -04:00
wesleysanjose	b8dcf437ee	fix: add `.log` to list of TXT filetypes	2023-05-30 14:13:58 -04:00
Christine Straub	5b5fb3e13b	Issue/encoding error eml (#639 ) This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.	2023-05-30 10:24:02 -07:00
Matt Robinson	3e983efce3	docs: add feature table to README (#655 ) * remove announcement * add table with filetypes * remove filetype specific examples * remove line break * remove easy gif * fix extra whitespace	2023-05-30 15:56:25 +00:00
Yuming Long	66058e76bf	changelog and version (#645 ) 0.6.11	2023-05-26 22:21:16 -04:00
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630 ) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab bashed on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
cragwolfe	c5d9469001	feat: add xls support (#632 ) Add support for older .XLS files from the partition function in unstructured.partition.auto. Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR). 0.6.10	2023-05-26 01:55:32 -07:00
ryannikolaidis	b767f6b0ec	fix(ci): prevent gha caching conflicts (#643 )	2023-05-25 17:20:28 -07:00
qued	c82bad1061	build(deps): avoid version conflicts (#636 ) Addresses #631. * Uses constraints to keep dependency versions more consistent. * Moves all dependencies to .in files which are then ingested by setup.py. * Adds script to check consistency of all extras. * Adds consistency check to CI. I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras) it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target) but the ones I could think of seemed a little overwrought, and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt. Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly. 0.6.9	2023-05-24 22:29:35 +00:00
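
Commits 547bb38d86 and 5b5fb3e13b above describe trying other common encodings when decoding fails and the user has not specified an encoding. The sketch below illustrates that general idea only and is not the library's implementation. The function name `decode_with_fallback` and the candidate list `FALLBACK_ENCODINGS` are hypothetical, since the commit messages do not say which encodings are tried or in what order.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the commit messages do not name the encodings.
FALLBACK_ENCODINGS: Tuple[str, ...] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    fallbacks: Sequence[str] = FALLBACK_ENCODINGS,
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the given raw bytes."""
    # A user-specified encoding is honored as-is; a decode error propagates.
    if encoding is not None:
        return data.decode(encoding), encoding

    # No encoding given: try each candidate in order until one succeeds.
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue

    raise ValueError("could not decode data with any fallback encoding")
```

Order matters in a list like this: a permissive single-byte codec such as latin-1 accepts any byte sequence, so it only makes sense as the last candidate.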

1393 Commits