24 Commits

Author SHA1 Message Date
Alvaro Bartolome
5291a96616
Add AzureBlobStorageConnector (#353)
* Add `AzureBlobStorageConnector`, based on its `fsspec` implementation and inheriting
from `FsspecConnector`
* Start the deprecation life cycle for the `unstructured-ingest --s3-url` option, to be
  replaced by `--remote-url`.

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-10 15:43:40 -08:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

This introduces general Git functionality shared by the GitHub and GitLab data connectors, which prevents code duplication between them.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. The renamed options work for both GitHub and GitLab.
2023-03-08 00:15:21 -08:00
Habeeb Shopeju
4117f57e14
Connector for Google Drive (#294)
Implements issue #244
2023-03-07 06:01:02 +00:00
Tom Aarsen
54a6db1c2c
feat: Add Wikipedia ingest connector (#299)
The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key is required.

Also add a test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
2023-02-28 08:25:11 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284) 2023-02-27 14:36:44 -08:00
Tom Aarsen
5eb1466acc
Resolve various style issues to improve overall code quality (#282)
* Apply import sorting

ruff . --select I --fix

* Remove unnecessary open mode parameter

ruff . --select UP015 --fix

* Use f-string formatting rather than .format

* Remove extraneous parentheses

Also use "" instead of str()

* Resolve missing trailing commas

ruff . --select COM --fix

* Rewrite list() and dict() calls using literals

ruff . --select C4 --fix

* Add () to pytest.fixture, use tuples for parametrize, etc.

ruff . --select PT --fix

* Simplify code: merge conditionals, context managers

ruff . --select SIM --fix

* Import without unnecessary alias

ruff . --select PLR0402 --fix

* Apply formatting via black

* Rewrite ValueError somewhat

Slightly unrelated to the rest of the PR

* Apply formatting to tests via black

* Update expected exception message to match 0d81564

* Satisfy E501 line too long in test

* Update changelog & version

* Add ruff to make tidy and test deps

* Run 'make tidy'

* Update changelog & version

* Update changelog & version

* Add ruff to 'check' target

Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.
2023-02-27 11:30:54 -05:00
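To illustrate the kind of rewrites listed in the style commit above (#282), here is a small sketch written in the "after" style. The function `summarize` and its inputs are hypothetical and are not taken from the repository; the comments describe what an equivalent "before" version might have looked like.

```python
from typing import Dict, List


def summarize(names: List[str]) -> Dict[str, str]:
    """Hypothetical helper written in the post-cleanup style."""
    # Literal instead of a dict() call (the C4 family of rules).
    summary: Dict[str, str] = {}
    for name in names:
        # f-string instead of "{}: {} characters".format(name, len(name)).
        summary[name] = f"{name}: {len(name)} characters"
    return summary


print(summarize(["README.md", "setup.py"]))
```

The behavior is the same either way; the rewrites only remove indirection that the linter can flag automatically.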
cragwolfe
ee8739dfa6
fix: pip-compile statement for ingest-s3 (#296) 2023-02-27 10:19:03 +01:00
Tom Aarsen
486c7987fc
feat: Add Reddit ingest connector (#293)
Add Reddit data connector for ingest.
* The connector can process a subreddit, either via a search query or via hot posts.
* The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos).
* The number of posts to fetch can be changed with the CLI.
2023-02-27 00:11:04 -08:00
qued
a79b365ab4
feat: add ubuntu setup script (#279) 2023-02-24 20:05:26 -06:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector (#218) 2023-02-14 12:27:45 -08:00
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word (#188)
* added check for english words

* update docs

* at least one word needs to have multiple characters

* bump change log
2023-02-01 09:10:48 -05:00
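The check described in the commit above (#188) could look roughly like the following sketch. It is not the repository's implementation: the function name, the tiny sample vocabulary, and the reading of "at least one word needs to have multiple characters" as "single-letter tokens do not count" are all assumptions made for illustration.

```python
import re
from typing import Set

# Tiny illustrative vocabulary; an assumption standing in for a real English word list.
SAMPLE_ENGLISH_WORDS: Set[str] = {"a", "the", "is", "of", "annual", "report"}


def contains_english_word(text: str, vocabulary: Set[str] = SAMPLE_ENGLISH_WORDS) -> bool:
    """Return True if at least one multi-character token is in the vocabulary."""
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        # Single-letter tokens such as "a" are skipped (assumed interpretation).
        if len(token) > 1 and token in vocabulary:
            return True
    return False


print(contains_english_word("The annual report"))
print(contains_english_word("a 1 2 3"))
```

Under this reading, a candidate title made up only of numbers, symbols, or stray single letters would be rejected.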
Matt Robinson
f36e514c6d
build(deps): weekly dependency bump (#183) 2023-01-30 11:05:48 -05:00
Matt Robinson
1ce8447ba7
build(deps): bump unstructured inference; compile from setup.py (#176)
* bump unstructured inference; compile from setup.py

* bump version

* compile the local-inference extra

* linting, linting, linting
2023-01-25 16:32:57 +00:00
Matt Robinson
7b3b594ee5
fix: correct make install-ci target (#138)
* fix install-ci make target

* add note to readme about libmagic

* remove mydoc.docx

* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
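Several bullets in the commit above (#132) mention falling back to the file extension when the detected MIME type is `application/octet-stream`. A minimal sketch of that idea follows. The function name and both lookup tables are hypothetical, and the MIME type is passed in as an argument rather than detected, so the example runs without libmagic installed.

```python
import os
from typing import Optional

# Assumed mappings for illustration only; the real project may differ.
MIME_TO_FILETYPE = {
    "application/pdf": "pdf",
    "text/html": "html",
    "message/rfc822": "eml",
}
EXTENSION_TO_FILETYPE = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".eml": "eml",
    ".html": "html",
    ".pdf": "pdf",
}


def resolve_filetype(mime_type: str, filename: str) -> Optional[str]:
    """Map a MIME type to a file type, using the extension for octet streams."""
    if mime_type == "application/octet-stream":
        # The MIME type is uninformative here, so look at the extension instead.
        _, extension = os.path.splitext(filename)
        return EXTENSION_TO_FILETYPE.get(extension.lower())
    return MIME_TO_FILETYPE.get(mime_type)


print(resolve_filetype("application/octet-stream", "report.docx"))
```

Returning `None` for an unknown type leaves it to the caller to decide whether to raise an error, which mirrors the "test for unknown type" bullet only loosely.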
qued
a75499d465
feat: local inference (#125)
Splits `partition_pdf` into two paths: one for local inference when `url` is `None`, and another for inference via the API when `url` is a string.
2023-01-04 16:19:05 -06:00
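The two-path split described in the local inference commit above (#125) can be sketched as a simple dispatch on `url`. The helper names `_partition_locally` and `_partition_via_api`, the return values, and the example URL are hypothetical placeholders, not the library's actual functions.

```python
from typing import List, Optional


def _partition_locally(filename: str) -> List[str]:
    # Hypothetical stand-in for running the model on the local machine.
    return [f"local:{filename}"]


def _partition_via_api(filename: str, url: str) -> List[str]:
    # Hypothetical stand-in for sending the file to a remote inference API.
    return [f"api:{url}:{filename}"]


def partition_pdf_sketch(filename: str, url: Optional[str] = None) -> List[str]:
    """Dispatch on `url`: None selects local inference, a string selects the API."""
    if url is None:
        return _partition_locally(filename)
    return _partition_via_api(filename, url)


print(partition_pdf_sketch("example.pdf"))
print(partition_pdf_sketch("example.pdf", url="https://example.invalid/layout"))
```

Using `None` as the switch keeps a single entry point while letting callers opt in to remote inference by supplying a URL.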
Matt Robinson
b1cce16c16
feat: translate_text cleaning brick (#101)
* initial implementation for translate brick

* more input validation

* tests for translate brick

* added docs

* bumped version

* Chinese and Arabic tests

* re-run pip-compile

* add torch to dependencies

* cleanup doc string

* fix long string

* fix typo in docs

* take out empty string check

* return string if string is empty

* added huggingface into make install
2022-12-15 15:35:15 -05:00
Mallori Harrell
53fcf4e912
chore: Remove PDF parsing code and dependencies (#75)
Remove PDF parsing code and dependencies.
2022-11-21 11:47:29 -06:00
dependabot[bot]
8936ab21a7
build(deps): Bump mypy from 0.982 to 0.990 in /requirements (#73)
* build(deps): Bump mypy from 0.982 to 0.990 in /requirements

Bumps [mypy](https://github.com/python/mypy) from 0.982 to 0.990.
- [Release notes](https://github.com/python/mypy/releases)
- [Commits](https://github.com/python/mypy/compare/v0.982...v0.990)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix typing issues

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-14 17:57:05 +00:00
Matt Robinson
fb16847946
feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile

* first pass on chunk by attention window

* test for chunking function

* completed tests for chunk_by_attention_window

* change default buffer size to 2

* wrapper function for staging

* added docs for transformers

* fix wording and typos

* updated change log and bumped the version

* added docs on huggingface dependencies

* fix typo

* re pip-compile
2022-10-13 11:18:27 -04:00
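As a rough illustration of the chunking idea in the commit above (#34), the sketch below splits text so that each chunk fits within a token window, with a buffer reserved for extra tokens. It is an assumption-heavy simplification: whitespace-separated words stand in for real tokenizer output, and the function name and parameters are hypothetical. Only the default buffer of 2 comes from the commit message.

```python
from typing import List


def chunk_by_window_sketch(text: str, max_tokens: int, buffer: int = 2) -> List[str]:
    """Greedily pack words into chunks of at most `max_tokens - buffer` words."""
    budget = max_tokens - buffer
    if budget <= 0:
        raise ValueError("max_tokens must be larger than buffer")
    # Whitespace splitting is a stand-in for a real tokenizer (assumption).
    words = text.split()
    return [" ".join(words[i : i + budget]) for i in range(0, len(words), budget)]


print(chunk_by_window_sketch("one two three four five six seven", max_tokens=5))
```

A real tokenizer may split one word into several tokens, so a production version would need to count tokens from the model's own tokenizer rather than words.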
qued
1d3076a4b2
feat: keep version synchronized (#25)
* Added script to check/sync versions using CHANGELOG.md as a source of truth.
* Script currently only syncs __version__.py but can easily be extended to cover other files by adding the files to an array in the script.
* Also updated sphinx conf.py to get version dynamically from __version__.py
2022-10-10 13:11:48 -05:00
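The version sync described in the commit above (#25) treats CHANGELOG.md as the source of truth. A minimal sketch of that approach is shown below, operating on strings rather than files so that it is easy to try out. The heading format (`## 0.5.3` or `## 0.5.3-dev1`), the function names, and the regular expressions are assumptions, not the actual script.

```python
import re
from typing import Optional

# Assumed changelog heading format, e.g. "## 0.5.3" or "## 0.5.3-dev1".
_HEADING = re.compile(r"^##\s+(\d+\.\d+\.\d+(?:-dev\d+)?)\s*$", re.MULTILINE)
# Matches an assignment such as: __version__ = "0.5.2"
_ASSIGN = re.compile(r"""(__version__\s*=\s*)(['"])(.*?)\2""")


def changelog_version(changelog_text: str) -> Optional[str]:
    """Return the first (most recent) version heading in the changelog text."""
    match = _HEADING.search(changelog_text)
    return match.group(1) if match else None


def sync_version(changelog_text: str, version_py_text: str) -> str:
    """Rewrite the __version__ assignment to match the changelog."""
    version = changelog_version(changelog_text)
    if version is None:
        raise ValueError("no version heading found in changelog text")
    return _ASSIGN.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{version}{m.group(2)}",
        version_py_text,
        count=1,
    )


changelog = "## 0.5.3\n\n* Add connector\n\n## 0.5.2\n"
print(sync_version(changelog, '__version__ = "0.5.2"\n'))
```

Extending the sync to other files, as the commit message suggests, would amount to applying a similar substitution to each file in a list.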
Yuming Long
8eba1b6006
feat: Add shellcheck to CI and Make target (#10) 2022-09-29 15:24:28 -04:00
Matt Robinson
5f40c78f25
Initial Release 2022-09-26 14:55:20 -07:00