unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-18 03:17:21 +00:00

Author	SHA1	Message	Date
ryannikolaidis	1e39e1ac2a	ci: Adds workflow to publish docker builds (#377 )	2023-03-19 21:53:05 +00:00
Sebastian Laverde Alfonso	c9c1b843d2	docs: Integrations LangChain code fix (#378 )	2023-03-17 22:59:22 +01:00
Sebastian Laverde Alfonso	b2f37c3eff	Docs: add Integrations section (#372 ) * docs: update index, add integrations * docs: fix typos * docs: create integrations.rst section structure * docs: descriptions and use for 8 integrations * refactor: SEC example in Label Studio section * Apply suggestions from code review Co-authored-by: qued <64741807+qued@users.noreply.github.com> * docs: change links order and refactor\|paraphrase --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-17 19:11:38 +00:00
Matt Robinson	b47bfaf33a	fix: update test to pass on later `label_studio_sdk` versions (#369 ) Closes #200. Fixes the failing test for label_studio_sdk>0.0.17 using the suggestion found in this comment. The vcr fixture on the test needed allow_playback_repeats=True. Unpinned label_studio_sdk and pip-compiled.	2023-03-17 17:57:09 +00:00
Mallori Harrell	ff63ad81d9	chore: Add note about python version (#375 ) * add note about python version --------- Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>	2023-03-17 11:22:49 -05:00
qued	f6d787d95b	ci: workflow to create JIRA issue on GH issue create (#370 ) Created a github workflow to create a new issue in JIRA when a github issue is created, mirroring the summary and description. Pretty simplistic for now with a hardcoded project, and no support for any ongoing sync events.	2023-03-15 16:17:56 -05:00
natygyoon	e0eb66de52	feat: add staging brick to clean non-ascii characters from unicode (#366 )	2023-03-14 21:31:51 -07:00
Amanda Cameron	edb847ce0b	adding Dockerfile (#359 )	2023-03-14 13:40:01 -07:00
qued	a00c6feb9a	fix: changelog typo throwing off formatting (#365 )	2023-03-14 16:30:53 +00:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> 0.5.4	2023-03-14 15:52:21 +00:00
qued	aa494623a2	chore: bump versions (#352 ) Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.	2023-03-14 09:40:30 -05:00
ryannikolaidis	a4726cb197	fix: open xml files in read only mode (#362 )	2023-03-13 13:06:45 -07:00
cragwolfe	7b9475ef26	chore: rm competition announcement from the README (#361 )	2023-03-13 09:34:26 -07:00
Matt Robinson	d17a94f395	chore: add libreoffice to ubuntu install script (#363 )	2023-03-13 10:46:23 -04:00
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Habeeb Shopeju	2ca843782c	Connector for Biomedical Literature (#345 ) The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.	2023-03-11 01:09:54 +00:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Matt Robinson	30b5a4da65	fix: parsing for files with `message/rfc822` MIME type; dir for unsupported files (#358 ) Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.	2023-03-10 15:10:39 -08:00
Tom Aarsen	3d21b4098e	enhancement: improve `detect_filetype` warning to include filename (#355 ) * Improve warning to include filename if provided * Update changelog & version	2023-03-10 12:26:08 -05:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
qued	e43e9178ae	feat: amazon linux 2 setup script (#350 ) Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible. Co-authored-by: cragwolfe <crag@unstructured.io> 0.5.3	2023-03-09 14:52:24 +00:00
natygyoon	6be07a5260	feat: update auto.partition() function to recognize Unstructured json (#337 )	2023-03-08 10:36:01 -08:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349 ) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Tom Aarsen	a9152313aa	refactor: Introduce 'exactly_one' to simplify partitioning functions (#343 )	2023-03-07 12:27:08 -06:00
Tom Aarsen	70420b5c78	refactor: Fully move towards logging; remove `if config.verbose` conditionals (#321 ) Move away from printing, use logging exclusively.	2023-03-07 01:21:27 -08:00
Umar Farooqi	78f4301872	fix: add formatter in an error string (#348 )	2023-03-06 22:35:15 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294 ) Implements issue #244	2023-03-07 06:01:02 +00:00
cragwolfe	905e4ae8f6	chore: nicer error message (#341 ) Show a more meaningful error message (and potentially useful for debugging) when file type is not supported by the auto partition(). Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-06 16:08:10 -08:00
Tom Aarsen	d4a1508ab8	chore: Remove file accidentally created/committed (#344 ) * Remove file accidentally created/committed * Fix CHANGELOG	2023-03-06 23:50:53 +00:00
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Ikko Eltociear Ashimine	213077e2ab	docs: update sec-sentiment-analysis.ipynb (#342 ) Huggingface -> Hugging Face	2023-03-06 15:16:14 +00:00
Alvaro Bartolome	2979e17aa4	feat: add `.pre-commit-config.yaml` to let users enable `pre-commit` hooks (#320 ) Per the README, provides an optional `pre-commit` configuration file to ensure code matches the formatting and linting standards used in `unstructured`.	2023-03-05 20:23:39 +00:00
Tom Aarsen	f5af87a540	feat: Expose Wikipedia `auto_suggest` argument to the ingest CLI (#336 ) * Add support for '--wikipedia-auto-suggest' to the unstructured-ingest CLI	2023-03-02 12:31:29 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output 0.5.2	2023-03-02 14:03:13 -05:00
qued	ed074b5828	fix: set through env to avoid interpretation as command (#329 ) When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment. Added the env command so there's no misinterpretation. Tested in docker as both root and user.	2023-03-01 12:56:37 -06:00
dependabot[bot]	fcaed15b14	build(deps): Bump actions/checkout from 2 to 3 (#325 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v2...v3) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-01 13:11:42 -05:00
Alvaro Bartolome	707f92f717	feat: improve caching mechanism for `download_dir` on ingest (#314 ) * `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest` rather than a "tmp-ingest-" dir in the working directory. * `unstructured-ingest` no longer re-downloads files when --preserve-downloads is used without --download-dir.	2023-03-01 09:19:32 -08:00
Tom Aarsen	95109db6b0	refactor: For S3 Ingest, write to file directly using `json.dump` (#312 ) * Write to file directly using json.dump No changelog entry due to the simplicity of the change	2023-02-28 22:56:45 -08:00
cragwolfe	a6f8256148	bump: release commit (#317 ) * update github ingest outputs * CHANGELOG, test github ingest more often in CI * more changelog detail 0.5.1	2023-03-01 11:12:52 +11:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313 ) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Tom Aarsen	1ccbc05b10	Fix: Resolve several issues with the require dependencies decorator (#315 ) Fix several issues re. the requires_dependencies decorator: * There was a missing space between the sentences. * Crucial brackets were missing in making the error message. * "pygithub" was used where "github" should have been used.	2023-02-28 20:21:59 +00:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309 ) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog 0.5.0	2023-02-28 15:36:08 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302) * Add `requires_dependencies` decorator * Use `requires_dependencies` on Reddit & S3 * Fix bug in `requires_dependencies` To use named args the decorator needs to be also wrapped * Add `requires_dependencies` integration tests * Add `requires_dependencies` in `Competition.md` * Update `CHANGELOG.md` * Bump version 0.4.16-dev5 * Ignore `F401` unused imports in `requires_dependencies` tests * Apply suggestions from code review * Add `functools.wraps` to keep docs, & annotations * Use `requires_dependencies` in `GitHubConnector`	2023-02-28 14:50:39 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Alvaro Bartolome	a74d389fa7	fix: `process_document` behavior when exception is raised (#298 )	2023-02-28 00:04:26 -08:00
cragwolfe	c7eba1636d	build(deps): make pip-compile (#307 ) * build: pip-compile, skip test deps * s	2023-02-28 17:28:14 +11:00
cragwolfe	5eaf4490fd	build: Release commit for version 0.4.16 (#305 ) 0.4.16	2023-02-28 15:48:48 +11:00
qued	d566f9b56a	Inject DEBIAN_FRONTEND into sudo env (#290 ) Gets rid of the interactive prompt when tzdata gets installed.	2023-02-28 02:27:58 +00:00
Matt Robinson	1cd1bd8eba	docs: more detailed bricks writeup; reorganize docs (#304) * add print statement in readme * elements before bricks * new preamble to bricks section * add preamble to bricks section * add preamble to cleaning section * descriptions of each documentation page * non-brick helper functions to the bottom * fix codeblock * includes some optional kwargs * code blocks * typo fix	2023-02-27 23:11:49 +00:00
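
Commits e52dd5c179 and 1ccbc05b10 above describe a `requires_dependencies` decorator that accepts arguments (so it needs an extra wrapping layer), preserves the wrapped function's docs and annotations via `functools.wraps`, and raises an error naming the missing packages. The following is only a rough sketch of that general pattern, not the code from those commits: the signature, the error wording, and the `pip install` hint are assumptions made for illustration.

```python
import functools
import importlib
from typing import Any, Callable, List, Union


def requires_dependencies(dependencies: Union[str, List[str]]) -> Callable:
    """Hypothetical sketch: fail with a clear message if a module is missing."""
    # Accept either a single module name or a list of names.
    names = [dependencies] if isinstance(dependencies, str) else list(dependencies)

    # Because the decorator takes arguments, it returns the real decorator.
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the docstring, name, and annotations
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            missing = []
            for name in names:
                try:
                    importlib.import_module(name)
                except ImportError:
                    missing.append(name)
            if missing:
                # Assumed wording; note the space between the two sentences.
                raise ImportError(
                    f"Following dependencies are missing: {', '.join(missing)}. "
                    f"Please install them using `pip install {' '.join(missing)}`."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

The check runs when the decorated function is called rather than at import time, so a module that defines optional connectors can still be imported when their dependencies are absent.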


459 Commits