unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-02 11:52:25 +00:00

Author	SHA1	Message	Date
Matt Robinson	7c08450597	feat: add `"fast"` strategy for PDF parsing; fallback to `"fast"` if `detectron2` is not available (#357 ) Adds a "fast" strategy for partitioning PDFs that uses pdfminer. The default strategy is "hi_res" and is the original partitioning logic that uses detectron2. If detectron2 is not available and the "hi_res" strategy is selected, partition_pdf falls back to using the "fast" strategy. The implementation uses pdfminer because that's already installed as a dependency with the local-inference extra. There are other options for accomplishing this as well, but they would entail adding a new dependency. The "fast" strategy substantially speeds up processing.	2023-03-11 03:16:05 +00:00
Alvaro Bartolome	c51adb21e3	feat: add `FsspecConnector` to easily integrate new connectors with a `fsspec` implementation available (#318 ) So as you may see this is a pretty big PR, that basically adds an "adapter" to easily plug in any connector with an available fsspec implementation. This is a way to standardize how the remote filesystems are used within unstructured. I've additionally renamed s3_connector.py to s3.py for readability and consistency and tested that the current approach works as expected and is aligned with the expectations.	2023-03-10 06:15:19 +00:00
Matt Robinson	7c619f045b	feat: `UNSTRUCTURED_LANGUAGE_CHECK` env var to control (#351 ) * environment variable to set language checks * change log and version * checks for if language checks are false * update docs * changelog type * add assert to tests * performance note in docstrings * docstring tweaks	2023-03-09 17:33:48 +00:00
Alvaro Bartolome	2979e17aa4	feat: add `.pre-commit-config.yaml` to let users enable `pre-commit` hooks (#320 ) Per the README, provides an optional `pre-commit` configuration file to ensure code matches the formatting and linting standards used in `unstructured`.	2023-03-05 20:23:39 +00:00
Matt Robinson	1cd1bd8eba	docs: more detailed bricks writeup; reorganize docs (#304 ) * add print statement in readme * elements before bricks * new preamble to bricks section * add preamble to bricks section * add preamble to cleaning section * descriptions of each documentation page * non-brick helper functions to the bottom * fix codeblock * includes some optional kwargs * code blocks * typo fix	2023-02-27 23:11:49 +00:00
Matt Robinson	9b0dbc7026	build(deps): bump dependencies; resolve security issues in example dependencies (#300 ) * bump cryptography version * re pip-compile for latest versions * update argilla example requirements * dependency updates * bump versions * pin unstructured-inference due to multithreading issue * linting, linting, linting * dependency on one line	2023-02-27 12:45:28 -05:00
Matt Robinson	5db94fdee6	docs: add getting started section and remove outdated docs (#277 ) * add getting started section to the docs * remove old examples * update example notebook * change to convert_to_dict * various and sundry edits	2023-02-27 15:10:53 +00:00
Tom Aarsen	9062d25d0d	Resolve numerous typos (#280 ) * Resolve numerous typos * Resolve typo in mime type	2023-02-24 17:48:23 -08:00
Matt Robinson	0d229f0a5e	fix: preserve all elements when serialized; feat: helper functions for serialization (#273 ) * added type to text element map * add element_id and coordinates * added test for serialization * added serialization for check boxes * add dict_to_elements and convert_to_dict aliases * helpers for serializing and deserializing elements * bump version; changelog * add Text to tests * aliases for isd functions * remove test elements json * changelog updates * make indent a kwarg * update expected structured output * docs update * use new function in ingest code * pop coordinates due to floating point differences * pop coordinates	2023-02-23 21:58:59 +00:00
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246 ) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version	2023-02-23 17:19:13 +00:00
Matt Robinson	314924137f	docs: add quotes to local-inference install instructions (#245 )	2023-02-21 09:58:26 -06:00
Matt Robinson	7472e1bb21	docs: add a quick start page to the readme and docs (#240 ) * added quick start section to the readme * added quick start to docs * parenthetical on extra deps * typo * fix typo * fixed mixed tabs/spaces	2023-02-17 22:13:28 +00:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	558ee63e90	feat: ability to skip English language specific checks with env var (#224 ) * add language env var * update docs * version and bump change log	2023-02-15 09:15:47 -05:00
Matt Robinson	a68dc35940	chore: default to local inference for `partition_pdf` and `partition_image` (#222 ) * chore: default the url to None for pdf and images * bump changelog and version	2023-02-14 16:16:33 -05:00
cragwolfe	ab542ca3c6	feat: Sample ingest project with S3 connector (#218 )	2023-02-14 12:27:45 -08:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	d0c6d50962	note on local inference	2023-02-09 15:16:14 -05:00
Matt Robinson	7f9aefc549	update partition_pdf section; added partition_image	2023-02-09 15:13:26 -05:00
Matt Robinson	24c90a03dc	docs: switch theme and style refresh (#209 ) * add furo theme * switch theme to furo * css for custom sidebar * remove unnecessary images * removed unnecessary fonts * fix logo background * hide package name * add favico, tweak colors * copyright 2023 * update copyright years * update hover colors * fix title tab	2023-02-09 10:40:28 -05:00
djacobs7	15b0dffdb0	docs: correct kwarg in bricks.rst (#206 ) Changed whitespace to extra_whitespace in documentation, to match options text.	2023-02-08 18:21:58 +00:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205 ) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
sparkbrains	2b88890210	docs: customize sphinx doc theme (#192 ) * feature: adding a feature for customizing color theme of sphinx docs * fix: adding changelog and comments * Adding css for changing colors of sidebar * fix: removing changelog description	2023-02-06 17:30:55 +00:00
Matt Robinson	782b4352ec	build(deps): weekly dependency update; reduce dependabot frequency (#194 ) * deps: pip-compile to update dependencies * bump version * linting, linting, linting * typo	2023-02-06 16:39:29 +00:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191 ) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	0589344ff7	fix: require a minimum prop of alpha characters for titles and narrative text (#190 ) * added alpha ratio check * added tests for alpha ratio * bump changelog and update docs * update changelog/version; update docs * ofr -> or	2023-02-02 14:59:04 +00:00
Matt Robinson	1230a163fd	feat: set a user controlled max word length for titles (#189 ) * update the docs * add option for title max word length * bump version; update changelog * change max length to 12 * docs updates * to -> too	2023-02-01 19:32:16 +00:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188 ) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
Matt Robinson	f36e514c6d	build(deps): weekly dependency bump (#183 )	2023-01-30 11:05:48 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
qued	d2909ac688	chore: update all deps (#172 )	2023-01-23 13:03:02 -06:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * partition pptx and tests * add partition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	f12240c5e7	feat: add support for `.txt` files in `partition` (#150 ) * added partition_text for auto * rename partition_text tests * bump version and update docs	2023-01-13 16:39:53 -05:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for email * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Matt Robinson	fee95b643c	feat: add `partition_docx` for Word documents (#131 ) * first pass on docx parsing * linting, linting, linting * test docx with filename * added documentation * more tests; version bump * typo * another typo * another typo! * it -> its * save -> saved * remove None since it's the default argument	2023-01-05 20:13:39 +00:00
Matt Robinson	33b983fbf0	docs: instructions on how to install on Windows + `conda` (#129 ) * add environment.yml * instructions on how to install base package and detectron2 * added instructions on paddleocr * remove covers * install -> to install * specified the shell * updated example snippets * update environment.yml * updated the repo reference * no more ands!	2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso	5a47eb06e9	feat: new bricks for removing and extracting ordered bullets (#128 ) * feat: new cleaning brick for ordered bullets * test: add test for cleaning ordered bullets * feat: new brick for extracting ordered bullets * test: add test for extracting ordered bullets * docs: update CHANGELOG and bump new dev version * chore: change extract ordered bullets return type to tuple * chore: made tidy * chore: regex to split on pattern instead of built-in * chore: catch ValueError, made tidy and fix incompatible type * chore: assertion statements in one line of code * docs: add documentation for new clean and extract bricks to bricks.rst * docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets * docs: update CHANGELOG 0.3.6-dev0 changes and bump version Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2023-01-05 17:06:26 +01:00
qued	a75499d465	feat: local inference (#125 ) Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.	2023-01-04 16:19:05 -06:00
Matt Robinson	17045aed80	feat: add `convert_to_dataframe` staging brick (#127 ) * add pandas to deps; pip-compile * staging brick to convert elements to dataframe * bump version * add convert_to_dataframe docs * bump wheel version * typo fix * typo fix 2!	2023-01-04 12:04:59 -05:00
Matt Robinson	445533745c	feat: helper functions to identify and extract phone numbers (#124 ) * added pattern for finding phone numbers * added cleaning brick for extracting phone numbers * add docs * changelog and bump version * switch to us phone numbers * bump dev version	2023-01-03 13:31:05 -05:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Matt Robinson	407f700b20	build(deps): bump `certifi` to incorporate security patches (#105 ) * pin certifi in base and huggingface * pinning for build and docs	2022-12-19 14:47:15 -05:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
Matt Robinson	1d68bb2482	feat: `apply` method to apply cleaning bricks to elements (#102 ) * add apply method to apply cleaners to elements * bump version * add check for string output * documentations for the apply method * change interface to *cleaners	2022-12-15 22:19:02 +00:00
Matt Robinson	b1cce16c16	feat: `translate_text` cleaning brick (#101 ) * initial implementation for translate brick * more input validation * tests for translate brick * added docs * bumped version * chinese and arabic tests * re-run pip-compile * add torch to dependencies * cleanup doc string * fix long string * fix typo in docs * take out empty string check * return string if string is empty * added huggingface into make install	2022-12-15 15:35:15 -05:00
Matt Robinson	3c19c7cd8a	feat: Add partition_html brick (#91 ) * update readme * updated sphinx docs * bump version; changelog * clear cache; retrigger ci * rename test file * switch default parameters to None * typo in the changelog * add in text output	2022-12-12 14:22:10 +00:00


73 Commits
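
The fallback behaviour described in commit 7c08450597 ("hi_res" is the default, and "fast" is used when detectron2 is not available) can be sketched roughly as follows. This is a minimal illustration and not the code from that commit: `resolve_pdf_strategy` and its `detectron2_available` parameter are hypothetical names introduced here, and the availability check via `importlib.util.find_spec` is an assumption about how such a check might be done.

```python
import importlib.util
from typing import Optional

# Strategy names taken from the commit message; everything else is illustrative.
VALID_STRATEGIES = ("hi_res", "fast")


def resolve_pdf_strategy(
    requested: str = "hi_res",
    detectron2_available: Optional[bool] = None,
) -> str:
    """Hypothetical helper: pick the strategy that would actually run.

    Returns "fast" when "hi_res" is requested but detectron2 cannot be found.
    """
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")

    # Assumption: availability is decided by whether the module can be located.
    if detectron2_available is None:
        detectron2_available = importlib.util.find_spec("detectron2") is not None

    if requested == "hi_res" and not detectron2_available:
        return "fast"
    return requested
```

Passing `detectron2_available` explicitly keeps the sketch deterministic; leaving it as `None` makes the result depend on what is installed in the current environment.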