unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-22 05:25:29 +00:00

Author	SHA1	Message	Date
qued	aa494623a2	chore: bump versions (#352 ) Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.	2023-03-14 09:40:30 -05:00
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362 )	2023-03-13 13:06:45 -07:00
cragwolfe	7b9475ef26	chore: rm competition announcement from the README (#361 )	2023-03-13 09:34:26 -07:00
Matt Robinson	d17a94f395	chore: add libreoffice to ubuntu install script (#363 )	2023-03-13 10:46:23 -04:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Habeeb Shopeju	2ca843782c	Connector for Biomedical Literature (#345 ) The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.	2023-03-11 01:09:54 +00:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	3d21b4098e	enhancement: improve `detect_filetype` warning to include filename (#355 ) * Improve warning to include filename if provided * Update changelog & version	2023-03-10 12:26:08 -05:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
qued	e43e9178ae	feat: amazon linux 2 setup script (#350 ) Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible. Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.3	2023-03-09 14:52:24 +00:00
natygyoon	6be07a5260	feat: update auto.partition() function to recognize Unstructured json (#337 )	2023-03-08 10:36:01 -08:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349 ) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Tom Aarsen	a9152313aa	refactor: Introduce 'exactly_one' to simplify partitioning functions (#343 )	2023-03-07 12:27:08 -06:00
Tom Aarsen	70420b5c78	refactor: Fully move towards logging; remove `if config.verbose` conditionals (#321 ) Move away from printing, use logging exclusively.	2023-03-07 01:21:27 -08:00
Umar Farooqi	78f4301872	fix: add formatter in an error string (#348 )	2023-03-06 22:35:15 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294 ) Implements issue #244	2023-03-07 06:01:02 +00:00
cragwolfe	905e4ae8f6	chore: nicer error message (#341 ) Show a more meaningful error message (and potentially useful for debugging) when file type is not supported by the auto partition(). Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-06 16:08:10 -08:00
Tom Aarsen	d4a1508ab8	chore: Remove file accidentally created/committed (#344 ) * Remove file accidentally created/committed * Fix CHANGELOG	2023-03-06 23:50:53 +00:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Ikko Eltociear Ashimine	213077e2ab	docs: update sec-sentiment-analysis.ipynb (#342 ) Huggingface -> Hugging Face	2023-03-06 15:16:14 +00:00
Alvaro Bartolome	2979e17aa4	feat: add `.pre-commit-config.yaml` to let users enable `pre-commit` hooks (#320 ) Per the README, provides an optional `pre-commit` configuration file to ensure code matches the formatting and linting standards used in `unstructured`.	2023-03-05 20:23:39 +00:00
Tom Aarsen	f5af87a540	feat: Expose Wikipedia `auto_suggest` argument to the ingest CLI (#336 ) * Add support for '--wikipedia-auto-suggest' to the unstructured-ingest CLI	2023-03-02 12:31:29 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output 0.5.2	2023-03-02 14:03:13 -05:00
qued	ed074b5828	fix: set through env to avoid interpretation as command (#329 ) When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment. Added the env command so there's no misinterpretation. Tested in docker as both root and user.	2023-03-01 12:56:37 -06:00
dependabot[bot]	fcaed15b14	build(deps): Bump actions/checkout from 2 to 3 (#325 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v2...v3) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-01 13:11:42 -05:00
Alvaro Bartolome	707f92f717	feat: improve caching mechanism for `download_dir` on ingest (#314 ) * `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest` rather than a "tmp-ingest-" dir in the working directory. * `unstructured-ingest` no longer re-downloads files when --preserve-downloads is used without --download-dir.	2023-03-01 09:19:32 -08:00
Tom Aarsen	95109db6b0	refactor: For S3 Ingest, write to file directly using `json.dump` (#312 ) * Write to file directly using json.dump No changelog entry due to the simplicity of the change	2023-02-28 22:56:45 -08:00
cragwolfe	a6f8256148	bump: release commit (#317 ) * update github ingest outputs * CHANGELOG, test github ingest more often in CI * more changelog detail 0.5.1	2023-03-01 11:12:52 +11:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313 ) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Tom Aarsen	1ccbc05b10	Fix: Resolve several issues with the require dependencies decorator (#315 ) Fix several issues re. the requires_dependencies decorator: * There was a missing space between the sentences. * Crucial brackets were missing in making the error message. * "pygithub" was used where "github" should have been used.	2023-02-28 20:21:59 +00:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309 ) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog 0.5.0	2023-02-28 15:36:08 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302) * Add `requires_dependencies` decorator * Use `required_dependencies` on Reddit & S3 * Fix bug in `requires_dependencies` To use named args the decorator needs to be also wrapped * Add `requires_dependencies` integration tests * Add `requires_dependencies` in `Competition.md` * Update `CHANGELOG.md` * Bump version 0.4.16-dev5 * Ignore `F401` unused imports in `requires_dependencies` tests * Apply suggestions from code review * Add `functools.wraps` to keep docs, & annotations * Use `requires_dependencies` in `GitHubConnector`	2023-02-28 14:50:39 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Alvaro Bartolome	a74d389fa7	fix: `process_document` behavior when exception is raised (#298 )	2023-02-28 00:04:26 -08:00
cragwolfe	c7eba1636d	build(deps): make pip-compile (#307 ) * build: pip-compile, skip test deps * s	2023-02-28 17:28:14 +11:00
cragwolfe	5eaf4490fd	build: Release commit for version 0.4.16 (#305 ) 0.4.16	2023-02-28 15:48:48 +11:00
qued	d566f9b56a	Inject DEBIAN_FRONTEND into sudo env (#290 ) Gets rid of the interactive prompt when tzdata gets installed.	2023-02-28 02:27:58 +00:00
Matt Robinson	1cd1bd8eba	docs: more detailed bricks writeup; reorganize docs (#304) * add print statement in readme * elements before bricks * new preamble to bricks section * add preamble to bricks section * add preamble to cleaning section * descriptions of each documentation page * non-brick helper functions to the bottom * fix codeblock * includes some optional kwargs * code blocks * typo fix	2023-02-27 23:11:49 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Alvaro Bartolome	c89bba100f	Update `Competition.md` (#297 ) Minor edits, fix local installation URL.	2023-02-27 10:52:39 -08:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	5db94fdee6	docs: add getting started section and remove outdated docs (#277 ) * add getting started section to the docs * remove old examples * update example notebook * change to convert_to_dict * various and sundry edits	2023-02-27 15:10:53 +00:00
cragwolfe	ee8739dfa6	fix: pip-compile statement for ingest-s3 (#296 )	2023-02-27 10:19:03 +01:00
Tom Aarsen	486c7987fc	feat: Add Reddit ingest connector (#293 ) Add Reddit data connector for ingest. * The connector can process a subreddit. * Either via a search query, * or via hot posts. * The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos). * The number of posts to fetch can be changed with the CLI.	2023-02-27 00:11:04 -08:00
cragwolfe	0a51f28e7d	fix: Ingest main: actually initialize the connector (#285 )	2023-02-26 14:53:51 -08:00
qued	30ac3e6daa	Changes so script runs as root in docker (#287 )	2023-02-25 13:48:48 -08:00
cragwolfe	0e3440ac08	fix: add libmagic dep to ubuntu script (#281 )	2023-02-25 19:53:38 +00:00
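
Several commits above (#302, #315) concern a `requires_dependencies` decorator that checks for optional packages before a connector runs. The following is a minimal sketch of that general pattern, not the code from the repository: the signature, the use of `importlib.util.find_spec`, and the wording of the error message are all assumptions made for illustration. It follows two details mentioned in the log: the decorator takes arguments, so it wraps an inner decorator, and it uses `functools.wraps` to keep the docstring and annotations.

```python
import functools
import importlib.util


def requires_dependencies(dependencies):
    """Hypothetical sketch: raise a clear ImportError if a module is missing.

    `dependencies` is a top-level module name or a list of such names.
    """
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func):
        @functools.wraps(func)  # preserve the docstring and annotations of func
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be found
            missing = [
                name for name in dependencies
                if importlib.util.find_spec(name) is None
            ]
            if missing:
                # Illustrative message only; the real wording may differ
                raise ImportError(
                    f"Following dependencies are missing: {', '.join(missing)}. "
                    f"Please install them using `pip install {' '.join(missing)}`."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")
def load_config(text: str) -> dict:
    """Parse a JSON string (stdlib only, so the check always passes)."""
    import json

    return json.loads(text)


@requires_dependencies(["no_such_module_xyz"])
def needs_missing_module() -> None:
    """Never runs: the dependency check fails first."""
```

Calling `needs_missing_module()` raises `ImportError` before the function body executes, while `load_config` behaves like the undecorated function and keeps its original name and docstring.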


249 Commits