249 Commits

Author SHA1 Message Date
qued
aa494623a2
chore: bump versions (#352)
Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.
2023-03-14 09:40:30 -05:00
ryannikolaidis
a4726cb197
fix: open xml files in read only mode (#362)
2023-03-13 13:06:45 -07:00
cragwolfe
7b9475ef26
chore: rm competition announcement from the README (#361)
2023-03-13 09:34:26 -07:00
Matt Robinson
d17a94f395
chore: add libreoffice to ubuntu install script (#363)
2023-03-13 10:46:23 -04:00
Matt Robinson
7c08450597
feat: add "fast" strategy for PDF parsing; fallback to "fast" if detectron2 is not available (#357)
Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res", the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to the "fast" strategy. The implementation uses pdfminer because it is already installed as a dependency with the local-inference extra. Other options exist, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.
2023-03-11 03:16:05 +00:00
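The fallback described in this commit can be illustrated with a minimal sketch. This is not the code from the PR: `resolve_pdf_strategy`, `is_module_available`, and the `detectron2_available` parameter are hypothetical names introduced here, and only the strategy names "hi_res" and "fast" come from the commit message.

```python
import importlib.util
from typing import Optional

# Strategy names taken from the commit message.
VALID_STRATEGIES = ("hi_res", "fast")


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def resolve_pdf_strategy(
    strategy: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Hypothetical helper: pick the strategy that will actually be used.

    "hi_res" needs detectron2; if it is missing, fall back to "fast".
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}")
    if detectron2_available is None:
        # Assumption: availability is decided by whether the module is installed.
        detectron2_available = is_module_available("detectron2")
    if strategy == "hi_res" and not detectron2_available:
        return "fast"
    return strategy
```

Passing `detectron2_available` explicitly keeps the decision testable without installing or removing detectron2.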
Habeeb Shopeju
2ca843782c
Connector for Biomedical Literature (#345)
The implementation introduces SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector, which ingest documents from the PDF Download.
2023-03-11 01:09:54 +00:00
Alvaro Bartolome
5291a96616
Add AzureBlobStorageConnector (#353)
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
  favor of `--remote-url`.

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-10 15:43:40 -08:00
Matt Robinson
30b5a4da65
fix: parsing for files with message/rfc822 MIME type; dir for unsupported files (#358)
Adds the ability to process files with a message/rfc822 MIME type, which previously caused failures for example-docs/fake-email-header.eml.
2023-03-10 15:10:39 -08:00
Tom Aarsen
3d21b4098e
enhancement: improve detect_filetype warning to include filename (#355)
* Improve warning to include filename if provided

* Update changelog & version
2023-03-10 12:26:08 -05:00
Alvaro Bartolome
c51adb21e3
feat: add FsspecConnector to easily integrate new connectors with a fsspec implementation available (#318)
This is a fairly large PR that adds an "adapter" for easily plugging in any connector with an available fsspec implementation. It standardizes how remote filesystems are used within unstructured.

I've also renamed s3_connector.py to s3.py for readability and consistency, and tested that the current approach works as expected.
2023-03-10 06:15:19 +00:00
Matt Robinson
7c619f045b
feat: UNSTRUCTURED_LANGUAGE_CHECK env var to control (#351)
* environment variable to set language checks

* change log and version

* checks for if language checks are false

* update docs

* changelog type

* add assert to tests

* performance note in docstrings

* docstring tweaks
2023-03-09 17:33:48 +00:00
qued
e43e9178ae
feat: amazon linux 2 setup script (#350)
Added Amazon Linux 2 setup script. Also updated Ubuntu setup script to keep the scripts as aligned as possible.

Co-authored-by: cragwolfe <crag@unstructured.io>
0.5.3
2023-03-09 14:52:24 +00:00
natygyoon
6be07a5260
feat: update auto.partition() function to recognize Unstructured json (#337)
2023-03-08 10:36:01 -08:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.

Prevent code duplication for functionality between GitHub and GitLab ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

These work for GitHub and GitLab.
2023-03-08 00:15:21 -08:00
Tom Aarsen
a9152313aa
refactor: Introduce 'exactly_one' to simplify partitioning functions (#343)
2023-03-07 12:27:08 -06:00
Tom Aarsen
70420b5c78
refactor: Fully move towards logging; remove if config.verbose conditionals (#321)
Move away from printing, use logging exclusively.
2023-03-07 01:21:27 -08:00
Umar Farooqi
78f4301872
fix: add formatter in an error string (#348)
2023-03-06 22:35:15 -08:00
Habeeb Shopeju
4117f57e14
Connector for Google Drive (#294)
Implements issue #244
2023-03-07 06:01:02 +00:00
cragwolfe
905e4ae8f6
chore: nicer error message (#341)
Show a more meaningful error message (potentially useful for debugging)
when the file type is not supported by the auto partition().

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-06 16:08:10 -08:00
Tom Aarsen
d4a1508ab8
chore: Remove file accidentally created/committed (#344)
* Remove file accidentally created/committed

* Fix CHANGELOG
2023-03-06 23:50:53 +00:00
Amanda Cameron
64efcc0e50
Adding optional encoding arg, and text_partition tests (#339)
2023-03-06 15:07:33 -08:00
Ikko Eltociear Ashimine
213077e2ab
docs: update sec-sentiment-analysis.ipynb (#342)
Huggingface -> Hugging Face
2023-03-06 15:16:14 +00:00
Alvaro Bartolome
2979e17aa4
feat: add .pre-commit-config.yaml to let users enable pre-commit hooks (#320)
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
2023-03-05 20:23:39 +00:00
Tom Aarsen
f5af87a540
feat: Expose Wikipedia auto_suggest argument to the ingest CLI (#336)
* Add support for '--wikipedia-auto-suggest' to the unstructured-ingest CLI
2023-03-02 12:31:29 -08:00
Matt Robinson
a5da3de43b
fix: ensure all text is maintained in html output (#335)
* fix: ensure all text is maintained in html pages

* add back in replace unicode quotes

* changelog and version bump

* apt-get update in ci

* white space differences in output
0.5.2
2023-03-02 14:03:13 -05:00
qued
ed074b5828
fix: set through env to avoid interpretation as command (#329)
When I took the changes to the Ubuntu setup script and propagated them to other scripts that run in slightly different contexts, the script failed at line 45 as DEBIAN_FRONTEND=noninteractive was interpreted as a command rather than a variable assignment.

Added the env command so there's no misinterpretation. Tested in docker as both root and user.
2023-03-01 12:56:37 -06:00
dependabot[bot]
fcaed15b14
build(deps): Bump actions/checkout from 2 to 3 (#325)
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-03-01 13:11:42 -05:00
Alvaro Bartolome
707f92f717
feat: improve caching mechanism for download_dir on ingest (#314)
* `unstructured-ingest` now uses a default `--download-dir` of `$HOME/.cache/unstructured/ingest`
rather than a "tmp-ingest-" dir in the working directory.
* `unstructured-ingest` no longer re-downloads files when --preserve-downloads
is used without --download-dir.
2023-03-01 09:19:32 -08:00
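The caching behavior described in this commit can be sketched roughly as follows. The function names `resolve_download_dir` and `needs_download` are hypothetical and not taken from the PR; only the default path `$HOME/.cache/unstructured/ingest` comes from the commit message, and treating "file already exists" as the cache hit condition is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_download_dir(download_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: use the given directory, else the shared cache dir."""
    if download_dir is not None:
        return Path(download_dir)
    # Default named in the commit message: $HOME/.cache/unstructured/ingest
    return Path.home() / ".cache" / "unstructured" / "ingest"


def needs_download(local_path: Path) -> bool:
    """Assumption: a file that already exists locally is treated as cached."""
    return not local_path.is_file()
```

Because the default directory is stable across runs, a second run can find earlier downloads there, which a fresh "tmp-ingest-" directory per run would not allow.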
Tom Aarsen
95109db6b0
refactor: For S3 Ingest, write to file directly using json.dump (#312)
* Write to file directly using json.dump

No changelog entry due to the simplicity of the change
2023-02-28 22:56:45 -08:00
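A minimal sketch of the pattern this refactor names: serializing straight into the open file with `json.dump` rather than building a string with `json.dumps` and writing it afterwards. The function name `write_elements` and the `indent` setting are assumptions for illustration, not details from the PR.

```python
import json
from typing import Any, Dict, List


def write_elements(elements: List[Dict[str, Any]], path: str) -> None:
    """Hypothetical helper: write a list of element dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # json.dump streams into the file object; no intermediate string is built.
        json.dump(elements, f, indent=4)
```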
cragwolfe
a6f8256148
bump: release commit (#317)
* update github ingest outputs

* CHANGELOG, test github ingest more often in CI

* more changelog detail
0.5.1
2023-03-01 11:12:52 +11:00
Tom Aarsen
350c4230ee
fix: Remove JavaScript from HTML reader output (#313)
* Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.
2023-02-28 14:24:24 -08:00
Tom Aarsen
1ccbc05b10
Fix: Resolve several issues with the requires_dependencies decorator (#315)
Fix several issues re. the requires_dependencies decorator:
* There was a missing space between the sentences.
* Crucial brackets were missing in making the error message.
* "pygithub" was used where "github" should have been used.
2023-02-28 20:21:59 +00:00
Matt Robinson
69661788cf
fix: track narrative text and figure captions in HTML documents (#309)
* fix for missing narrative text in partition_html

* fixes so existing tests pass

* tests for figure caption and narrative text

* bump version; changelog
0.5.0
2023-02-28 15:36:08 +00:00
Alvaro Bartolome
e52dd5c179
feat: add requires_dependencies decorator (#302)
* Add `requires_dependencies` decorator

* Use `requires_dependencies` on Reddit & S3

* Fix bug in `requires_dependencies`

To use named args, the decorator also needs to be wrapped

* Add `requires_dependencies` integration tests

* Add `requires_dependencies` in `Competition.md`

* Update `CHANGELOG.md`

* Bump version 0.4.16-dev5

* Ignore `F401` unused imports in `requires_dependencies` tests

* Apply suggestions from code review

* Add `functools.wraps` to keep docs & annotations

* Use `requires_dependencies` in `GitHubConnector`
2023-02-28 14:50:39 +00:00
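The idea behind a `requires_dependencies` decorator can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code from the PR: the signature, the error wording, and the `_is_installed` helper are hypothetical. It does show the two points the commit bullets mention: the decorator takes arguments, so it needs an extra wrapping layer, and `functools.wraps` preserves the wrapped function's docstring and annotations.

```python
import functools
import importlib.util
from typing import Any, Callable, List, Union


def _is_installed(module_name: str) -> bool:
    """Check whether a module can be found without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ModuleNotFoundError, ValueError):
        # Raised for dotted names whose parent package is missing, or odd names.
        return False


def requires_dependencies(dependencies: Union[str, List[str]]) -> Callable:
    """Hypothetical sketch: fail with a clear error if modules are missing."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    # Outer layer receives the arguments; this layer receives the function.
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keeps __name__, __doc__ and annotations
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            missing = [dep for dep in dependencies if not _is_installed(dep)]
            if missing:
                raise ImportError(
                    f"Following dependencies are missing: {', '.join(missing)}. "
                    "Please install them before calling this function."
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

The check runs when the decorated function is called rather than at import time, so modules with optional dependencies can still be imported.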
Tom Aarsen
54a6db1c2c
feat: Add Wikipedia ingest connector (#299)
The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key is required.
Also adds a test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
2023-02-28 08:25:11 +00:00
Alvaro Bartolome
a74d389fa7
fix: process_document behavior when exception is raised (#298)
2023-02-28 00:04:26 -08:00
cragwolfe
c7eba1636d
build(deps): make pip-compile (#307)
* build: pip-compile, skip test deps

* s
2023-02-28 17:28:14 +11:00
cragwolfe
5eaf4490fd
build: Release commit for version 0.4.16 (#305)
0.4.16
2023-02-28 15:48:48 +11:00
qued
d566f9b56a
Inject DEBIAN_FRONTEND into sudo env (#290)
Gets rid of the interactive prompt when tzdata gets installed.
2023-02-28 02:27:58 +00:00
Matt Robinson
1cd1bd8eba
docs: more detailed bricks writeup; reorganize docs (#304)
* add print statement in readme

* elements before bricks

* new preamble to bricks section

* add preamble to bricks section

* add preamble to cleaning section

* descriptions of each documentation page

* non-brick helper functions to the bottom

* fix codeblock

* includes some optional kwargs

* code blocks

* typo fix
2023-02-27 23:11:49 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284)
2023-02-27 14:36:44 -08:00
Alvaro Bartolome
c89bba100f
Update Competition.md (#297)
Minor edits, fix local installation URL.
2023-02-27 10:52:39 -08:00
Matt Robinson
9b0dbc7026
build(deps): bump dependencies; resolve security issues in example dependencies (#300)
* bump cryptography version

* re pip-compile for latest versions

* update argilla example requirements

* dependency updates

* bump versions

* pin unstructured-inference due to multithreading issue

* linting, linting, linting

* dependency on one line
2023-02-27 12:45:28 -05:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match 0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
Matt Robinson
5db94fdee6
docs: add getting started section and remove outdated docs (#277)
* add getting started section to the docs

* remove old examples

* update example notebook

* change to convert_to_dict

* various and sundry edits
2023-02-27 15:10:53 +00:00
cragwolfe
ee8739dfa6
fix: pip-compile statement for ingest-s3 (#296)
2023-02-27 10:19:03 +01:00
Tom Aarsen
486c7987fc
feat: Add Reddit ingest connector (#293)
Add Reddit data connector for ingest.
* The connector can process a subreddit, either via a search query or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
2023-02-27 00:11:04 -08:00
cragwolfe
0a51f28e7d
fix: Ingest main: actually initialize the connector (#285)
2023-02-26 14:53:51 -08:00
qued
30ac3e6daa
Changes so script runs as root in docker (#287)
2023-02-25 13:48:48 -08:00
cragwolfe
0e3440ac08
fix: add libmagic dep to ubuntu script (#281)
2023-02-25 19:53:38 +00:00