unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Matt Robinson	3c3c59a726	build(deps): add pdfminer.six to dependencies (#537 )	2023-05-02 15:36:12 +00:00
qued	dc4147d7df	feat: extract tables (#503 ) Exposes table extraction through partition and partition_pdf.	2023-04-21 17:01:29 +00:00
Matt Robinson	6874df91ef	feat: allow users to pass OCR language into `partition` (#509 ) * pip-compile new reqs * bump inference version * add language to pdf and image calls * tests for passing in language * version bump and changelog * update docs * pass ocr_languages in auto * updated test fixtures * typo in doc string	2023-04-21 13:41:26 +00:00
Trevor Bossert	cff7f4fd5a	Slack connector (#462 ) This connector takes a slack channel id, token and other options to pull conversation history for a channel and store it as a text file that is then processed by unstructured into expected output.	2023-04-16 19:34:43 +00:00
cragwolfe	3972c80c51	build(deps): bump requirements (#414 )	2023-04-05 02:59:06 +00:00
Matt Robinson	75cf233702	feat: add `partition_msg` for MSFT Outlook files (#412 ) * added msg-parser dependency * pass through kwargs in convert_file_to_text * added partition_msg for processing msft outlook files * version bump and changelog * added tests for partition_msg * added test for msg with plain text * add partition_msg docs; fix underlines in integration docs * add .msg to file list * finish tests for auto msg * linting, linting, linting	2023-03-28 20:15:22 +00:00
Matt Robinson	e43cb0e6e0	feat: add `partition_epub` function (#364 ) * add pypandoc dependency * added epub partitioner and file conversion * test for partition_epub * tests for file conversion * add epub to filetype detection * added epub to auto partition * update bricks docs * updated installing docs * changelog and version * add pandoc to dependencies * add pandoc to debian dependencies * linting, linting, linting * typo fix * typo fix * file conversion type hints * more type hints --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-03-14 15:52:21 +00:00
qued	aa494623a2	chore: bump versions (#352 ) Update versions of dependencies, including unpinning the unstructured-inference dependency that's causing conflicts in repos like pipeline-oer that want the newer version.	2023-03-14 09:40:30 -05:00
Alvaro Bartolome	5291a96616	Add `AzureBlobStorageConnector` (#353 ) * Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector` * Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`. --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>	2023-03-10 15:43:40 -08:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00
Tom Aarsen	1580c1bf8e	feat: Add GitLab ingest connector (#349 ) Add GitLab data connector for ingest. Involves more general Git functionality that is shared between the GitHub and GitLab data connectors. Prevent code duplication for functionality between GitHub and GitLab ingest connectors. Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively. These work for GitHub and GitLab.	2023-03-08 00:15:21 -08:00
Habeeb Shopeju	4117f57e14	Connector for Google Drive (#294 ) Implements issue #244	2023-03-07 06:01:02 +00:00
Tom Aarsen	54a6db1c2c	feat: Add Wikipedia ingest connector (#299 ) The connector can process a Wikipedia page and output the HTML, the plain text contents, and the summary. No API key required Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).	2023-02-28 08:25:11 +00:00
Tom Aarsen	ded60afda9	feat: Add GitHub data connector; add Markdown partitioner (#284 )	2023-02-27 14:36:44 -08:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
Tom Aarsen	5eb1466acc	Resolve various style issues to improve overall code quality (#282 ) * Apply import sorting ruff . --select I --fix * Remove unnecessary open mode parameter ruff . --select UP015 --fix * Use f-string formatting rather than .format * Remove extraneous parentheses Also use "" instead of str() * Resolve missing trailing commas ruff . --select COM --fix * Rewrite list() and dict() calls using literals ruff . --select C4 --fix * Add () to pytest.fixture, use tuples for parametrize, etc. ruff . --select PT --fix * Simplify code: merge conditionals, context managers ruff . --select SIM --fix * Import without unnecessary alias ruff . --select PLR0402 --fix * Apply formatting via black * Rewrite ValueError somewhat Slightly unrelated to the rest of the PR * Apply formatting to tests via black * Update expected exception message to match 0d81564 * Satisfy E501 line too long in test * Update changelog & version * Add ruff to make tidy and test deps * Run 'make tidy' * Update changelog & version * Update changelog & version * Add ruff to 'check' target Doing so required me to also fix some non-auto-fixable issues. Two of them I fixed with a noqa: SIM115, but especially the one in __init__ may need some attention. That said, that refactor is out of scope of this PR.	2023-02-27 11:30:54 -05:00
Tom Aarsen	486c7987fc	feat: Add Reddit ingest connector (#293 ) Add Reddit data connector for ingest. * The connector can process a subreddit. * Either via a search query, * or via hot posts. * The texts in the submissions are converted to markdown files including the post title and the text body, if any (i.e. no images or videos). * The number of posts to fetch can be changed with the CLI.	2023-02-27 00:11:04 -08:00
cragwolfe	87fd0d01dc	feat: Ingest refactors, doc updates (#243 ) - Creates ABC's for ingest connectors - Updates the s3_connector classes to inherit from ABC's - Moves s3 test script to its own file to establish pattern for additional connectors - Rewrites the Ingest.md doc, including instructions on how to add a connector - Updates the example s3 ingest script to use the new location for main.py Note that there were no logic changes, this is essentially a refactoring PR. Test instructions: Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.	2023-02-21 10:15:33 -08:00
cragwolfe	ab542ca3c6	feat: Sample ingest project with S3 connector (#218 )	2023-02-14 12:27:45 -08:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191 ) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	1ce8447ba7	build(deps): bump unstructured inference; compile from setup.py (#176 ) * bump unstructured inference; compile from setup.py * bump version * compile the local-inference extra * linting, linting, linting	2023-01-25 16:32:57 +00:00
Matt Robinson	59f972d739	build(deps): add `requests` as a base dependency (#162 ) * build(deps): add `requests` as a base dependency * linting, linting, linting * changelog typo	2023-01-18 16:36:23 +00:00
Matt Robinson	419c0867d3	build(deps): bump `unstructured_inference` version range (#151 ) * bump unstructured-inference to 0.2.3 * bump version	2023-01-13 22:21:36 +00:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
qued	a75499d465	feat: local inference (#125 ) Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.	2023-01-04 16:19:05 -06:00
Matt Robinson	17045aed80	feat: add `convert_to_dataframe` staging brick (#127 ) * add pandas to deps; pip-compile * staging brick to convert elements to dataframe * bump version * add convert_to_dataframe docs * bump wheel version * typo fix * typo fix 2!	2023-01-04 12:04:59 -05:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Matt Robinson	407f700b20	build(deps): bump `certifi` to incorporate security patches (#105 ) * pin certifi in base and huggingface * pinning for build and docs	2022-12-19 14:47:15 -05:00
Matt Robinson	b1cce16c16	feat: `translate_text` cleaning brick (#101 ) * initial implementation for translate brick * more input validation * tests for translate brick * added docs * bumped version * chinese and arabic tests * re-run pip-compile * add torch to dependencies * cleanup doc string * fix long string * fix typo in docs * take out empty string check * return string if string is empty * added huggingface into make install	2022-12-15 15:35:15 -05:00
Matt Robinson	5c4428413a	build(deps): Bump jupyter-core library (#85 )	2022-11-30 10:04:56 -05:00
asymness	2170a2aae2	feat: Implement Argilla staging brick (#81 ) * Add argilla to dependencies and run pip-compile * Implement Argilla staging brick and add unit tests * Update version and changelog * Update docs with description and usage for Argilla staging brick * Remove unused fixtures and fix typo in Argilla tests * add missing quote in docs * changelog tweak * doc tweaks Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-28 14:41:48 +00:00
Mallori Harrell	53fcf4e912	chore: Remove PDF parsing code and dependencies (#75 ) Remove PDF parsing code and dependencies.	2022-11-21 11:47:29 -06:00
Yuming Long	7c61639f23	python_require (#65 )	2022-11-11 12:15:23 -05:00
Matt Robinson	2715950d6f	chore: Add long description content type; bump version (#59 )	2022-11-08 16:55:41 -05:00
benjats07	6b3e86c508	docs: Added long description to PyPi (#58 ) * docs: Added long description to PyPi * Added fields for description in PyPi	2022-11-08 15:22:43 -06:00
Matt Robinson	fb16847946	feat: Staging brick for attention window chunking (#34 ) * add huggingface dependencies and re pip-compile * first pass on chunk by attention window * test for chunking function * completed tests for chunk_by_attention_window * change default buffer size to 2 * wrapper function for staging * added docs for transformers * fix wording and typos * updated change log and bumped the version * added docs on huggingface dependencies * fix typo * re pip-compile	2022-10-13 11:18:27 -04:00
Matt Robinson	5f40c78f25	Initial Release	2022-09-26 14:55:20 -07:00

38 Commits