unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-21 04:06:27 +00:00

Author	SHA1	Message	Date
Amanda Cameron	64efcc0e50	Adding optional encoding arg, and text_partition tests (#339 )	2023-03-06 15:07:33 -08:00
Ikko Eltociear Ashimine	213077e2ab	docs: update sec-sentiment-analysis.ipynb (#342 ) Huggingface -> Hugging Face	2023-03-06 15:16:14 +00:00
Alvaro Bartolome	2979e17aa4	feat: add `.pre-commit-config.yaml` to let users enable `pre-commit` hooks (#320 ) Per the README, provides an optional `pre-commit` configuration file to ensure code matches the formatting and linting standards used in `unstructured`.	2023-03-05 20:23:39 +00:00
Tom Aarsen	f5af87a540	feat: Expose Wikipedia `auto_suggest` argument to the ingest CLI (#336 ) * Add support for '--wikipedia-auto-suggest' to the unstructured-ingest CLI	2023-03-02 12:31:29 -08:00
Matt Robinson	a5da3de43b	fix: ensure all text is maintained in html output (#335 ) * fix: ensure all text is maintained in html pages * add back in replace unicode quotes * changelog and version bump * apt-get update in ci * white space differences in output 0.5.2	2023-03-02 14:03:13 -05:00
qued	ed074b5828	fix: set through env to avoid interpretation as command (#329 ) When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment. Added the env command so there's no misinterpretation. Tested in docker as both root and user.	2023-03-01 12:56:37 -06:00
dependabot[bot]	fcaed15b14	build(deps): Bump actions/checkout from 2 to 3 (#325 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v2...v3) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io>	2023-03-01 13:11:42 -05:00
Alvaro Bartolome	707f92f717	feat: improve caching mechanism for `download_dir` on ingest (#314 ) * `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest` rather than a "tmp-ingest-" dir in the working directory. * `unstructured-ingest` no longer re-downloads files when --preserve-downloads is used without --download-dir.	2023-03-01 09:19:32 -08:00
Tom Aarsen	95109db6b0	refactor: For S3 Ingest, write to file directly using `json.dump` (#312 ) * Write to file directly using json.dump No changelog entry due to the simplicity of the change	2023-02-28 22:56:45 -08:00
cragwolfe	a6f8256148	bump: release commit (#317 ) * update github ingest outputs * CHANGELOG, test github ingest more often in CI * more changelog detail 0.5.1	2023-03-01 11:12:52 +11:00
Tom Aarsen	350c4230ee	fix: Remove JavaScript from HTML reader output (#313 ) * Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.	2023-02-28 14:24:24 -08:00
Tom Aarsen	1ccbc05b10	Fix: Resolve several issues with the require dependencies decorator (#315 ) Fix several issues re. the requires_dependencies decorator: * There was a missing space between the sentences. * Crucial brackets were missing in making the error message. * "pygithub" was used where "github" should have been used.	2023-02-28 20:21:59 +00:00
Matt Robinson	69661788cf	fix: track narrative text and figure captions in HTML documents (#309 ) * fix for missing narrative text in partition_html * fixes so existing tests pass * tests for figure caption and narrative text * bump version; changelog 0.5.0	2023-02-28 15:36:08 +00:00
Alvaro Bartolome	e52dd5c179	feat: add `requires_dependencies` decorator (#302 ) * Add `requires_dependencies` decorator * Use `requires_dependencies` on Reddit & S3 * Fix bug in `requires_dependencies` To use named args the decorator also needs to be wrapped * Add `requires_dependencies` integration tests * Add `requires_dependencies` in `Competition.md` * Update `CHANGELOG.md` * Bump version 0.4.16-dev5 * Ignore `F401` unused imports in `requires_dependencies` tests * Apply suggestions from code review * Add `functools.wraps` to keep docs & annotations * Use `requires_dependencies` in `GitHubConnector`	2023-02-28 14:50:39 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Alvaro Bartolome	a74d389fa7	fix: `process_document` behavior when exception is raised (#298 )	2023-02-28 00:04:26 -08:00
cragwolfe	c7eba1636d	build(deps): make pip-compile (#307 ) * build: pip-compile, skip test deps * s	2023-02-28 17:28:14 +11:00
cragwolfe	5eaf4490fd	build: Release commit for version 0.4.16 (#305 ) 0.4.16	2023-02-28 15:48:48 +11:00
qued	d566f9b56a	Inject DEBIAN_FRONTEND into sudo env (#290 ) Gets rid of the interactive prompt when tzdata gets installed.	2023-02-28 02:27:58 +00:00
Matt Robinson	1cd1bd8eba	docs: more detailed bricks writeup; reorganize docs (#304 ) * add print statement in readme * elements before bricks * new preamble to bricks section * add preamble to bricks section * add preamble to cleaning section * descriptions of each documentation page * non-brick helper functions to the bottom * fix codeblock * includes some optional kwargs * code blocks * typo fix	2023-02-27 23:11:49 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Alvaro Bartolome	c89bba100f	Update `Competition.md` (#297 ) Minor edits, fix local installation URL.	2023-02-27 10:52:39 -08:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Matt Robinson	5db94fdee6	docs: add getting started section and remove outdated docs (#277 ) * add getting started section to the docs * remove old examples * update example notebook * change to convert_to_dict * various and sundry edits	2023-02-27 15:10:53 +00:00
cragwolfe	ee8739dfa6	fix: pip-compile statement for ingest-s3 (#296 )	2023-02-27 10:19:03 +01:00
Tom Aarsen	486c7987fc	feat: Add Reddit ingest connector (#293 ) Add Reddit data connector for ingest. * The connector can process a subreddit. * Either via a search query, * or via hot posts. * The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos). * The number of posts to fetch can be changed with the CLI.	2023-02-27 00:11:04 -08:00
cragwolfe	0a51f28e7d	fix: Ingest main: actually initialize the connector (#285 )	2023-02-26 14:53:51 -08:00
qued	30ac3e6daa	Changes so script runs as root in docker (#287 )	2023-02-25 13:48:48 -08:00
cragwolfe	0e3440ac08	fix: add libmagic dep to ubuntu script (#281 )	2023-02-25 19:53:38 +00:00
Tom Aarsen	e61ce2cc00	Skip posix_path test on Windows (#283 )	2023-02-25 08:31:34 +00:00
qued	a79b365ab4	feat: add ubuntu setup script (#279 )	2023-02-24 20:05:26 -06:00
Tom Aarsen	9062d25d0d	Resolve numerous typos (#280 ) * Resolve numerous typos * Resolve typo in mime type	2023-02-24 17:48:23 -08:00
grungyfeline998	956f04d770	feat: detect filetype with extension if libmagic is unavailable (#268 ) * included the previous PR changes and verified black * resolved the issues mentioned * make tidy and add tests --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2023-02-24 15:23:29 +00:00
cragwolfe	e419ba1d33	doc: Announce the competition! (#274 )	2023-02-23 16:52:34 -08:00
Matt Robinson	0d229f0a5e	fix: preserve all elements when serialized; feat: helper functions for serialization (#273 ) * added type to text element map * add element_id and coordinates * added test for serialization * added serialization for check boxes * add dict_to_elements and convert_to_dict aliases * helpers for serializing and deserializing elements * bump version; changelog * add Text to tests * aliases for isd functions * remove test elements json * changelog updates * make indent a kwarg * update expected structured output * docs update * use new function in ingest code * pop coordinates due to floating point differences * pop coordinates 0.4.15	2023-02-23 21:58:59 +00:00
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246 ) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version 0.4.14	2023-02-23 17:19:13 +00:00
cragwolfe	83f04545df	fix: Adds missing __init__.py (#259 ) 0.4.13	2023-02-22 21:31:34 -08:00
cragwolfe	80c0fab215	build: new release (#249 ) Cut a release that has the unstructured-ingest command line included in the unstructured package. Bonus tweak to the Ingest checklist. 0.4.12	2023-02-23 03:44:05 +00:00
Viktor Zhemchuzhnikov	60abac2c4b	feat: allow custom parsers in partition_html (#251 ) This will allow partition_html to use a custom XMLParser or HTMLParser. It can be useful if one needs to specify additional arguments to these parsers (not only the built-in remove_comments=True). --------- Co-authored-by: Viktor Zhemchuzhnikov <v.zhemchuzhnikov@xsolla.com>	2023-02-23 01:57:42 +00:00
cragwolfe	1b8bf318b8	refactor: move processing logic to IngestDoc (#248 ) Moves the logic to partition a raw document to the IngestDoc level to allow for easier overrides for subclasses of IngestDoc.	2023-02-22 01:02:05 +00:00
cragwolfe	69acb083bd	refactor: break up logic from one line to 2 (#247 ) Separate elements out into separate variable to allow for conditional logic based on the instance type of the doc (or other properties).	2023-02-21 17:44:58 -06:00
cragwolfe	87fd0d01dc	feat: Ingest refactors, doc updates (#243 ) - Creates ABCs for ingest connectors - Updates the s3_connector classes to inherit from ABCs - Moves s3 test script to its own file to establish pattern for additional connectors - Rewrites the Ingest.md doc, including instructions on how to add a connector - Updates the example s3 ingest script to use the new location for main.py Note that there were no logic changes, this is essentially a refactoring PR. Test instructions: Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.	2023-02-21 10:15:33 -08:00
Matt Robinson	314924137f	docs: add quotes to local-inference install instructions (#245 )	2023-02-21 09:58:26 -06:00
noahdemoes	f205e6f3ae	build: add Python 3.9 and Python 3.10 to the CI test job (#235 ) * add python 3.9 3.10 * run on branch * run on branch * run on branch * run on branch * revert * update all jobs * update all jobs * update all jobs	2023-02-20 14:08:46 -08:00
Matt Robinson	7472e1bb21	docs: add a quick start page to the readme and docs (#240 ) * added quick start section to the readme * added quick start to docs * parenthetical on extra deps * typo * fix typo * fixed mixed tabs/spaces	2023-02-17 22:13:28 +00:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older PowerPoint docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace 0.4.11	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	9bbd4a1d56	docs: file exploration training notebook (#221 )	2023-02-16 20:33:02 +00:00
Matt Robinson	f5ff140d7c	fix: `ElementMetadata` serializes when the filename is a `Path` object (#233 ) 0.4.10	2023-02-16 17:20:51 +00:00

929 Commits