unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Yuming Long	fc59a043b7	Chore: Support epub tests in docker image (#630) * docker works * more epub tests * changelog version * support epub + odt + rtf * update dockerfile * revert.. * install pandoc on ci env * pandoc docker grab bashed on arch * move arch into image * move back to base image	2023-05-26 15:38:48 -04:00
qued	c82bad1061	build(deps): avoid version conflicts (#636) Addresses #631. * Uses constraints to keep dependency versions more consistent. * Moves all dependencies to .in files which are then ingested by setup.py. * Adds script to check consistency of all extras. * Adds consistency check to CI. I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras) it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it in the order given in the make pip-compile target) but the ones I could think of seemed a little overwrought, and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt. Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly.	2023-05-24 22:29:35 +00:00
Trevor Bossert	830d67f653	Feat: Discord connector (#515) * Initial commit of discord connector based off of initial work by @tnachen with modifications https://github.com/tnachen/unstructured/tree/tnachen/discord_connector * Add test file change format of imports * working version of the connector More work to be done to tidy it up and add any additional options * add to test fixtures update * fix spacing * tests working, switching to bot testing channel * add additional channel add reprocess to tests * add try clause to allow for exit on error Update changelog and bump version * add updated expected output filtes * add logic to check if --discord-period is an integer Add more to option description * fix lint error * Update discord reqs * PR feedback * add newline * another newline --------- Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>	2023-05-16 11:46:30 -07:00
cragwolfe	aaea6358f6	build(deps): bump pip (#558)	2023-05-08 23:08:10 -07:00
Trevor Bossert	1ac72c6ee8	Fixes issue where detectron2 would not install on OSX (#552) * Fixes issue where detectron2 would not install on OSX Tested on Apple silicon based MacBook Pro. This installs tensorboard which is required on OSX and arm based cpu’s for detectron2. * Improve Arch detection for tensorboard * remove makefile from commands in readme pin tensorboard version	2023-05-05 17:16:28 -07:00
natygyoon	db2f70dbc4	sync version-sync.sh with other repos (#508)	2023-04-21 05:48:38 +09:00
cragwolfe	bfba2bb1eb	fix: workaround .json file detection with old libmagic installs (#493) Fixes issue where .json files were recognized as "text/plain" rather than "application/json" on the Unstructured image (and other installs that may have an older libmagic). Also adds missing json auto partition tests. Including an xfail test for #492.	2023-04-17 23:11:21 -07:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	7b44bcd6e0	build: script to update all ingest fixtures, add azure ingest fixtures (#367) - Updates CI to install tesseract version 5.3.0 (better than 4.x in various ways incl. perf.). - Adds azure expected output fixtures for more useful reference points and as a repro for Some PDF's with scanned images return empty elements #346. - Adds a script to regenerate ingest test fixtures that is run in an ubuntu docker container (like CI), with the same version of tesseract. See the comments in scripts/ingest-test-fixtures-update.sh for details. - Updates expected outputs with above script. - Updates individual test-ingest scripts to update expected .json output if OVERWRITE_FIXTURES=true.	2023-04-11 00:11:50 -07:00
ryannikolaidis	ee52a749c3	fix: docker smoke test on build (#457)	2023-04-06 10:03:42 -07:00
ryannikolaidis	ef9fb79ed4	chore: build with registry as cache (#454)	2023-04-06 00:34:07 -07:00
qued	4211dda360	build: sync detectron version (#440) * Update detectron2 version in Dockerfile * Update detectron2 version in docs	2023-04-03 18:47:43 -05:00
ryannikolaidis	59785e4332	chore: install all extras in Dockerfile (#419) * Adds step to install all extras * Adds smoke test of wikipedia ingest to validate in CI	2023-03-30 13:23:30 -07:00
ryannikolaidis	65fec954ba	ci: publish amd and arm images (#404)	2023-03-29 07:02:39 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Amanda Cameron	a9da858fa3	chore: add tests for docker (#373)	2023-03-21 13:46:09 -07:00
Amanda Cameron	edb847ce0b	adding Dockerfile (#359)	2023-03-14 13:40:01 -07:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294) Implements issue #244	2023-03-07 06:01:02 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required. Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284)	2023-02-27 14:36:44 -08:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
cragwolfe	ee8739dfa6	fix: pip-compile statement for ingest-s3 (#296)	2023-02-27 10:19:03 +01:00
Tom Aarsen	486c7987fc	feat: Add Reddit ingest connector (#293) Add Reddit data connector for ingest. * The connector can process a subreddit. * Either via a search query, * or via hot posts. * The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos). * The number of posts to fetch can be changed with the CLI.	2023-02-27 00:11:04 -08:00
qued	a79b365ab4	feat: add ubuntu setup script (#279)	2023-02-24 20:05:26 -06:00
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version	2023-02-23 17:19:13 +00:00
cragwolfe	ab542ca3c6	feat: Sample ingest project with S3 connector (#218)	2023-02-14 12:27:45 -08:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
Matt Robinson	f36e514c6d	build(deps): weekly dependency bump (#183)	2023-01-30 11:05:48 -05:00
Matt Robinson	1ce8447ba7	build(deps): bump unstructured inference; compile from setup.py (#176) * bump unstructured inference; compile from setup.py * bump version * compile the local-inference extra * linting, linting, linting	2023-01-25 16:32:57 +00:00
Matt Robinson	7b3b594ee5	fix: correct `make install-ci` target (#138) * fix install-ci make target * add note to readme about libmagic * remove mydoc.docx * remove local-inference	2023-01-09 17:03:09 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
qued	a75499d465	feat: local inference (#125) Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.	2023-01-04 16:19:05 -06:00
Matt Robinson	b1cce16c16	feat: `translate_text` cleaning brick (#101) * initial implementation for translate brick * more input validation * tests for translate brick * added docs * bumped version * chinese and arabic tests * re-run pip-compile * add torch to dependencies * cleanup doc string * fix long string * fix typo in docs * take out empty string check * return string if string is empty * added huggingface into make install	2022-12-15 15:35:15 -05:00
Mallori Harrell	53fcf4e912	chore: Remove PDF parsing code and dependencies (#75) Remove PDF parsing code and dependencies.	2022-11-21 11:47:29 -06:00
dependabot[bot]	8936ab21a7	build(deps): Bump mypy from 0.982 to 0.990 in /requirements (#73) * build(deps): Bump mypy from 0.982 to 0.990 in /requirements Bumps [mypy](https://github.com/python/mypy) from 0.982 to 0.990. - [Release notes](https://github.com/python/mypy/releases) - [Commits](https://github.com/python/mypy/compare/v0.982...v0.990) --- updated-dependencies: - dependency-name: mypy dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * fix typing issues Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-14 17:57:05 +00:00
Matt Robinson	fb16847946	feat: Staging brick for attention window chunking (#34) * add huggingface dependencies and re pip-compile * first pass on chunk by attention window * test for chunking function * completed tests for chunk_by_attention_window * change default buffer size to 2 * wrapper function for staging * added docs for transformers * fix wording and typos * updated change log and bumped the version * added docs on huggingface dependencies * fix typo * re pip-compile	2022-10-13 11:18:27 -04:00
qued	1d3076a4b2	feat: keep version synchronized (#25) * Added script to check/sync versions using CHANGELOG.md as a source of truth. * Script currently only syncs __version__.py but can easily be extended to cover other files by adding the files to an array in the script. * Also updated sphinx conf.py to get version dynamically from __version__.py	2022-10-10 13:11:48 -05:00
Yuming Long	8eba1b6006	feat: Add shellcheck to CI and Make target (#10)	2022-09-29 15:24:28 -04:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

41 Commits