929 Commits

Author SHA1 Message Date
Amanda Cameron
64efcc0e50
Adding optional encoding arg, and text_partition tests (#339) 2023-03-06 15:07:33 -08:00
Ikko Eltociear Ashimine
213077e2ab
docs: update sec-sentiment-analysis.ipynb (#342)
Huggingface -> Hugging Face
2023-03-06 15:16:14 +00:00
Alvaro Bartolome
2979e17aa4
feat: add .pre-commit-config.yaml to let users enable pre-commit hooks (#320)
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
2023-03-05 20:23:39 +00:00
Tom Aarsen
f5af87a540
feat: Expose Wikipedia auto_suggest argument to the ingest CLI (#336)
* Add support for '--wikipedia-auto-suggest' to the unstructured-ingest CLI
2023-03-02 12:31:29 -08:00
Matt Robinson
a5da3de43b
fix: ensure all text is maintained in html output (#335)
* fix: ensure all text is maintained in html pages

* add back in replace unicode quotes

* changelog and version bump

* apt-get update in ci

* white space differences in output
0.5.2
2023-03-02 14:03:13 -05:00
qued
ed074b5828
fix: set through env to avoid interpretation as command (#329)
When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment.

Added the env command so there's no misinterpretation. Tested in docker as both root and user.
2023-03-01 12:56:37 -06:00
dependabot[bot]
fcaed15b14
build(deps): Bump actions/checkout from 2 to 3 (#325)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-01 13:11:42 -05:00
Alvaro Bartolome
707f92f717
feat: improve caching mechanism for download_dir on ingest (#314)
* `unstructured-ingest` now uses a default `--download-dir` of `$HOME/.cache/unstructured/ingest`
rather than a "tmp-ingest-" dir in the working directory.
* `unstructured-ingest` no longer re-downloads files when --preserve-downloads
is used without --download-dir.
2023-03-01 09:19:32 -08:00
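The caching behavior described in the commit above could be sketched roughly as follows. This is an illustration only: `default_download_dir` and `needs_download` are hypothetical helper names, not functions from `unstructured`, and the real CLI may decide what to re-download differently.

```python
from pathlib import Path


def default_download_dir() -> Path:
    """Return the default cache location named in the commit message."""
    # Path.home() resolves to $HOME on POSIX systems.
    return Path.home() / ".cache" / "unstructured" / "ingest"


def needs_download(download_dir: Path, filename: str) -> bool:
    """Hypothetical check: skip files that already exist in the download dir."""
    return not (download_dir / filename).is_file()
```

Because the default directory is stable across runs, a file fetched once can be found again on the next run instead of being downloaded into a fresh "tmp-ingest-" directory.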
Tom Aarsen
95109db6b0
refactor: For S3 Ingest, write to file directly using json.dump (#312)
* Write to file directly using json.dump

No changelog entry due to the simplicity of the change
2023-02-28 22:56:45 -08:00
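For readers unfamiliar with the distinction, the refactor above amounts to something like the following. This is a minimal sketch with made-up sample data and hypothetical function names; it assumes the earlier code serialized to a string first, which the commit message does not spell out.

```python
import json
import tempfile
from pathlib import Path

elements = [{"type": "Title", "text": "Example"}]  # made-up sample data


def write_via_string(path: Path, data: list) -> None:
    """Serialize to a string first, then write the string (assumed 'before')."""
    with open(path, "w") as f:
        f.write(json.dumps(data))


def write_directly(path: Path, data: list) -> None:
    """Let json.dump write straight to the file handle ('after')."""
    with open(path, "w") as f:
        json.dump(data, f)


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "output.json"
    write_directly(out_path, elements)
    round_trip = json.loads(out_path.read_text())
```

Both helpers produce the same file contents; `json.dump` simply avoids building the intermediate string.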
cragwolfe
a6f8256148
bump: release commit (#317)
* update github ingest outputs

* CHANGELOG, test github ingest more often in CI

* more changelog detail
0.5.1
2023-03-01 11:12:52 +11:00
Tom Aarsen
350c4230ee
fix: Remove JavaScript from HTML reader output (#313)
* Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.
2023-02-28 14:24:24 -08:00
Tom Aarsen
1ccbc05b10
Fix: Resolve several issues with the require dependencies decorator (#315)
Fix several issues re. the requires_dependencies decorator:
* There was a missing space between the sentences.
* Crucial brackets were missing in making the error message.
* "pygithub" was used where "github" should have been used.
2023-02-28 20:21:59 +00:00
Matt Robinson
69661788cf
fix: track narrative text and figure captions in HTML documents (#309)
* fix for missing narrative text in partition_html

* fixes so existing tests pass

* tests for figure caption and narrative text

* bump version; changelog
0.5.0
2023-02-28 15:36:08 +00:00
Alvaro Bartolome
e52dd5c179
feat: add requires_dependencies decorator (#302)
* Add `requires_dependencies` decorator

* Use `requires_dependencies` on Reddit & S3

* Fix bug in `requires_dependencies`

To use named args, the decorator also needs to be wrapped

* Add `requires_dependencies` integration tests

* Add `requires_dependencies` in `Competition.md`

* Update `CHANGELOG.md`

* Bump version 0.4.16-dev5

* Ignore `F401` unused imports in `requires_dependencies` tests

* Apply suggestions from code review

* Add `functools.wraps` to keep docs & annotations

* Use `requires_dependencies` in `GitHubConnector`
2023-02-28 14:50:39 +00:00
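A decorator of this kind might look roughly like the sketch below. It is not the code from `unstructured`: the signature, the error message wording, and the `extras` install hint are all assumptions made for illustration. It does show the two points the commit mentions, namely an outer function so the decorator can take named arguments, and `functools.wraps` to preserve docstrings and annotations.

```python
import functools
import importlib
from typing import Callable, List, Optional, Union


def requires_dependencies(
    dependencies: Union[str, List[str]],
    extras: Optional[str] = None,  # hypothetical: name of a pip extra to suggest
) -> Callable:
    """Raise a helpful ImportError if any named module cannot be imported."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keeps the wrapped function's docs and annotations
        def wrapper(*args, **kwargs):
            missing = []
            for dep in dependencies:
                try:
                    importlib.import_module(dep)
                except ImportError:
                    missing.append(dep)
            if missing:
                message = f"Missing dependencies: {', '.join(missing)}."
                if extras:
                    # Assumed wording; the real message may differ.
                    message += f' Try: pip install "unstructured[{extras}]"'
                raise ImportError(message)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies(["json"])
def load(text: str):
    """Parse JSON text."""
    import json

    return json.loads(text)
```

The check runs at call time rather than import time, so a module that defines optional connectors can still be imported when their dependencies are absent.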
Tom Aarsen
54a6db1c2c
feat: Add Wikipedia ingest connector (#299)
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
2023-02-28 08:25:11 +00:00
Alvaro Bartolome
a74d389fa7
fix: process_document behavior when exception is raised (#298) 2023-02-28 00:04:26 -08:00
cragwolfe
c7eba1636d
build(deps): make pip-compile (#307)
* build: pip-compile, skip test deps

* s
2023-02-28 17:28:14 +11:00
cragwolfe
5eaf4490fd
build: Release commit for version 0.4.16 (#305) 0.4.16 2023-02-28 15:48:48 +11:00
qued
d566f9b56a
Inject DEBIAN_FRONTEND into sudo env (#290)
Gets rid of the interactive prompt when tzdata gets installed.
2023-02-28 02:27:58 +00:00
Matt Robinson
1cd1bd8eba
docs: more detailed bricks writeup; reorganize docs (#304)
* add print statement in readme

* elements before bricks

* new preamble to bricks section

* add preamble to bricks section

* add preamble to cleaning section

* descriptions of each documentation page

* non-brick helper functions to the bottom

* fix codeblock

* includes some optional kwargs

* code blocks

* typo fix
2023-02-27 23:11:49 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284) 2023-02-27 14:36:44 -08:00
Alvaro Bartolome
c89bba100f
Update Competition.md (#297)
Minor edits, fix local installation URL.
2023-02-27 10:52:39 -08:00
Matt Robinson
9b0dbc7026
build(deps): bump dependencies; resolve security issues in example dependencies (#300)
* bump cryptography version

* re pip-compile for latest versions

* update argilla example requirements

* dependency updates

* bump versions

* pin unstructured-inference due to multithreading issue

* linting, linting, linting

* dependency on one line
2023-02-27 12:45:28 -05:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match
0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
5db94fdee6
docs: add getting started section and remove outdated docs (#277)
* add getting started section to the docs

* remove old examples

* update example notebook

* change to convert_to_dict

* various and sundry edits
2023-02-27 15:10:53 +00:00
cragwolfe
ee8739dfa6
fix: pip-compile statement for ingest-s3 (#296) 2023-02-27 10:19:03 +01:00
Tom Aarsen
486c7987fc
feat: Add Reddit ingest connector (#293)
Add Reddit data connector for ingest.
* The connector can process a subreddit.
* Either via a search query,
* or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
2023-02-27 00:11:04 -08:00
cragwolfe
0a51f28e7d
fix: Ingest main: actually initialize the connector (#285) 2023-02-26 14:53:51 -08:00
qued
30ac3e6daa
Changes so script runs as root in docker (#287) 2023-02-25 13:48:48 -08:00
cragwolfe
0e3440ac08
fix: add libmagic dep to ubuntu script (#281) 2023-02-25 19:53:38 +00:00
Tom Aarsen
e61ce2cc00
Skip posix_path test on Windows (#283) 2023-02-25 08:31:34 +00:00
qued
a79b365ab4
feat: add ubuntu setup script (#279) 2023-02-24 20:05:26 -06:00
Tom Aarsen
9062d25d0d
Resolve numerous typos (#280)
* Resolve numerous typos

* Resolve typo in mime type
2023-02-24 17:48:23 -08:00
grungyfeline998
956f04d770
feat: detect filetype with extension if libmagic is unavailable (#268)
* included the previous PR changes and verified black

* resolved the issues mentioned

* make tidy and add tests

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-02-24 15:23:29 +00:00
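The fallback described in this commit could be approximated as below. This is a sketch under assumptions: it uses the `python-magic` package's `magic.from_file` when the import succeeds and the standard library's `mimetypes` otherwise, and the function names are hypothetical. The actual implementation in `unstructured` may map extensions differently.

```python
import mimetypes
from typing import Optional

try:
    import magic  # python-magic; importing fails if libmagic is not installed

    LIBMAGIC_AVAILABLE = True
except ImportError:
    LIBMAGIC_AVAILABLE = False


def guess_mime_from_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone."""
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type


def detect_mime_type(filename: str) -> Optional[str]:
    """Prefer content-based detection; fall back to the extension."""
    if LIBMAGIC_AVAILABLE:
        return magic.from_file(filename, mime=True)
    return guess_mime_from_extension(filename)
```

Extension-based guessing is less reliable than inspecting file contents, which is why it serves only as the fallback here.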
cragwolfe
e419ba1d33
doc: Announce the competition! (#274) 2023-02-23 16:52:34 -08:00
Matt Robinson
0d229f0a5e
fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map

* add element_id and coordinates

* added test for serialization

* added serialization for check boxes

* add dict_to_elements and convert_to_dict aliases

* helpers for serializing and deserializing elements

* bump version; changelog

* add Text to tests

* aliases for isd functions

* remove test elements json

* changelog updates

* make indent a kwarg

* update expected structured output

* docs update

* use new function in ingest code

* pop coordinates due to floating point differences

* pop coordinates
0.4.15
2023-02-23 21:58:59 +00:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
0.4.14
2023-02-23 17:19:13 +00:00
cragwolfe
83f04545df
fix: Adds missing __init__.py (#259) 0.4.13 2023-02-22 21:31:34 -08:00
cragwolfe
80c0fab215
build: new release (#249)
Cut a release that has the unstructured-ingest command line included in the unstructured package.

Bonus tweak to the Ingest checklist.
0.4.12
2023-02-23 03:44:05 +00:00
Viktor Zhemchuzhnikov
60abac2c4b
feat: allow custom parsers in partition_html (#251)
This will allow partition_html to use a custom XMLParser or HTMLParser.
It can be useful if one needs to specify additional arguments to these parsers (not only built-in remove_comments=True).
---------

Co-authored-by: Viktor Zhemchuzhnikov <v.zhemchuzhnikov@xsolla.com>
2023-02-23 01:57:42 +00:00
cragwolfe
1b8bf318b8
refactor: move processing logic to IngestDoc (#248)
Moves the logic to partition a raw document to the IngestDoc level to
allow for easier overrides for subclasses of IngestDoc.
2023-02-22 01:02:05 +00:00
cragwolfe
69acb083bd
refactor: break up logic from one line to 2 (#247)
Separate elements out into separate variable to allow for conditional logic based on the instance type of the doc (or other properties).
2023-02-21 17:44:58 -06:00
cragwolfe
87fd0d01dc
feat: Ingest refactors, doc updates (#243)
- Creates ABCs for ingest connectors
- Updates the s3_connector classes to inherit from the ABCs
- Moves s3 test script to its own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py

Note that there were no logic changes, this is essentially a refactoring PR.

Test instructions:

Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
2023-02-21 10:15:33 -08:00
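As a rough picture of what abstract base classes for connectors provide, consider the sketch below. The class and method names are hypothetical and chosen for illustration; they are not taken from the `unstructured` source.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AbstractIngestDoc(ABC):
    """Hypothetical interface for a single document to ingest."""

    @abstractmethod
    def get_file(self) -> str:
        """Fetch the raw document and return its text."""


class AbstractConnector(ABC):
    """Hypothetical interface that each data source would implement."""

    @abstractmethod
    def get_ingest_docs(self) -> List[AbstractIngestDoc]:
        """Return the documents this connector can process."""


class InMemoryDoc(AbstractIngestDoc):
    def __init__(self, text: str) -> None:
        self.text = text

    def get_file(self) -> str:
        return self.text


class InMemoryConnector(AbstractConnector):
    """Toy connector backed by a dict instead of a remote source."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = docs

    def get_ingest_docs(self) -> List[AbstractIngestDoc]:
        return [InMemoryDoc(text) for text in self.docs.values()]
```

A new connector only has to subclass the base classes and fill in the abstract methods; forgetting one raises a `TypeError` at instantiation instead of failing later in the pipeline.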
Matt Robinson
314924137f
docs: add quotes to local-inference install instructions (#245) 2023-02-21 09:58:26 -06:00
noahdemoes
f205e6f3ae
build: add Python 3.9 and Python 3.10 to the CI test job (#235)
* add python 3.9 3.10

* run on branch

* run on branch

* run on branch

* run on branch

* revert

* update all jobs

* update all jobs

* update all jobs
2023-02-20 14:08:46 -08:00
Matt Robinson
7472e1bb21
docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
2023-02-17 22:13:28 +00:00
Matt Robinson
601f250edc
feat: add partition_ppt for older PowerPoint docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
0.4.11
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Matt Robinson
9bbd4a1d56
docs: file exploration training notebook (#221) 2023-02-16 20:33:02 +00:00
Matt Robinson
f5ff140d7c
fix: ElementMetadata serializes when the filename is a Path object (#233) 0.4.10 2023-02-16 17:20:51 +00:00
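The underlying problem is that `json` cannot serialize `pathlib.Path` objects. One way to handle it is sketched below; `MetadataSketch` is a hypothetical stand-in, not the real `ElementMetadata` class, and the actual fix may convert the value at a different point.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for a metadata class with a filename field."""

    filename: Optional[Union[str, Path]] = None
    page_number: Optional[int] = None

    def __post_init__(self) -> None:
        # Normalize Path objects to strings so the metadata is JSON-safe.
        if isinstance(self.filename, Path):
            self.filename = str(self.filename)

    def to_dict(self) -> dict:
        # Omit unset fields from the serialized output.
        return {k: v for k, v in self.__dict__.items() if v is not None}


serialized = json.dumps(MetadataSketch(filename=Path("example.txt")).to_dict())
```

Without the conversion in `__post_init__`, the `json.dumps` call would raise a `TypeError`.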