unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-08-04 23:03:11 +00:00

Author	SHA1	Message	Date
Matt Robinson	354eff1e2b	build(deps): automatically download `nltk` models when required (#246 ) * code for downloading nltk packages * don't run nltk make command in ci * test for model downloads * remove nltk install from docs * update changelog and bump version 0.4.14	2023-02-23 17:19:13 +00:00
cragwolfe	83f04545df	fix: Adds missing __init__.py (#259 ) 0.4.13	2023-02-22 21:31:34 -08:00
cragwolfe	80c0fab215	build: new release (#249 ) Cut a release that has the unstructured-ingest command line included in the unstructured package. Bonus tweak to the Ingest checklist. 0.4.12	2023-02-23 03:44:05 +00:00
Viktor Zhemchuzhnikov	60abac2c4b	feat: allow custom parsers in partition_html (#251 ) This will allow partition_html to use a custom XMLParser or HTMLParser. It can be useful if one needs to specify additional arguments to these parsers (not only the built-in remove_comments=True). --------- Co-authored-by: Viktor Zhemchuzhnikov <v.zhemchuzhnikov@xsolla.com>	2023-02-23 01:57:42 +00:00
cragwolfe	1b8bf318b8	refactor: move processing logic to IngestDoc (#248 ) Moves the logic to partition a raw document to the IngestDoc level to allow for easier overrides for subclasses of IngestDoc.	2023-02-22 01:02:05 +00:00
cragwolfe	69acb083bd	refactor: break up logic from one line to 2 (#247 ) Separate elements out into separate variable to allow for conditional logic based on the instance type of the doc (or other properties).	2023-02-21 17:44:58 -06:00
cragwolfe	87fd0d01dc	feat: Ingest refactors, doc updates (#243 ) - Creates ABCs for ingest connectors - Updates the s3_connector classes to inherit from ABCs - Moves s3 test script to its own file to establish pattern for additional connectors - Rewrites the Ingest.md doc, including instructions on how to add a connector - Updates the example s3 ingest script to use the new location for main.py Note that there were no logic changes; this is essentially a refactoring PR. Test instructions: Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.	2023-02-21 10:15:33 -08:00
Matt Robinson	314924137f	docs: add quotes to local-inference install instructions (#245 )	2023-02-21 09:58:26 -06:00
noahdemoes	f205e6f3ae	build: add Python 3.9 and Python 3.10 to the CI test job (#235 ) * add python 3.9 3.10 * run on branch * run on branch * run on branch * run on branch * revert * update all jobs * update all jobs * update all jobs	2023-02-20 14:08:46 -08:00
Matt Robinson	7472e1bb21	docs: add a quick start page to the readme and docs (#240 ) * added quick start section to the readme * added quick start to docs * parenthetical on extra deps * typo * fix typo * fixed mixed tabs/spaces	2023-02-17 22:13:28 +00:00
Matt Robinson	601f250edc	feat: add `partition_ppt` for older power point docs (#238 ) * added partition_ppt function and tests * add ppt support to auto * version bump * update docs * doc fixes * update changelog * `.docx` -> `.pptx` * its -> their * remove whitespace 0.4.11	2023-02-17 16:57:08 +00:00
Matt Robinson	6036af33e7	feat: add `partition_doc` for `.doc` files (#236 ) * first pass on doc partitioning * add libreoffice to deps * update docs and readme * add .doc to auto * changelog bump * value error with missing doc * doc updates	2023-02-17 09:30:23 -05:00
Matt Robinson	9bbd4a1d56	docs: file exploration training notebook (#221 )	2023-02-16 20:33:02 +00:00
Matt Robinson	f5ff140d7c	fix: `ElementMetadata` serializes when the filename is a `Path` object (#233 ) 0.4.10	2023-02-16 17:20:51 +00:00
cragwolfe	3c1b089071	feat: Ingest CLI flags and test fixture updates (#227 ) * Many command line options added. The sample ingest project is now an easy to use CLI (no code editing necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md. * Fixes issue where text fixtures had been truncated * Adds a check to make sure this doesn't happen again * Moves fixture outputs for the existing connector one subdir lower, to make room for future connector outputs.	2023-02-16 16:45:50 +00:00
Matt Robinson	74e6b84b41	feat: add metadata tracking to document elements (#225 ) * add metadata field to elements * metadata tracking for pdf/image * metadata for html * update expected outputs * metadata for the rest of the document types * take out file metadata for now * add url to tables * added metadata to test_auto * bump version * added coordinates to __init__ * fix coordinates in tests 0.4.9	2023-02-15 18:26:20 +00:00
Ethan Steininger	b8dce6109b	doc: update README with local-inference instructions	2023-02-15 14:49:40 +00:00
Matt Robinson	558ee63e90	feat: ability to skip English language specific checks with env var (#224 ) * add language env var * update docs * version and bump change log	2023-02-15 09:15:47 -05:00
Matt Robinson	a68dc35940	chore: default to local inference for `partition_pdf` and `partition_image` (#222 ) * chore: default the url to None for pdf and images * bump changelog and version	2023-02-14 16:16:33 -05:00
cragwolfe	ab542ca3c6	feat: Sample ingest project with S3 connector (#218 )	2023-02-14 12:27:45 -08:00
qued	6d1d50d218	docs: update make targets (#217 )	2023-02-14 06:08:29 +00:00
qued	5d0743ff8b	docs: add info about os dependencies (#216 )	2023-02-14 05:31:52 +00:00
natygyoon	a920e55405	fix: remove comments when parsing XML or HTML (#210 ) * Update xml.py remove comments while parsing * change logged in CHANGELOG and edited version * make tidy * edited version * new version 0.4.8-dev1 * edited version * Update CHANGELOG.md Co-authored-by: cragwolfe <crag@unstructuredai.io> --------- Co-authored-by: cragwolfe <crag@unstructuredai.io> 0.4.8	2023-02-11 02:52:13 +09:00
Matt Robinson	962de78def	fix: remove response text when the HTML status code is an error (#213 ) * release: version 0.4.7 * remove response text from url error 0.4.7	2023-02-10 11:39:56 -05:00
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	d0c6d50962	note on local inference	2023-02-09 15:16:14 -05:00
Matt Robinson	7f9aefc549	update partition_pdf section; added partition_image	2023-02-09 15:13:26 -05:00
Matt Robinson	24c90a03dc	docs: switch theme and style refresh (#209 ) * add furo theme * switch theme to furo * css for custom sidebar * remove unnecessary images * removed unnecessary fonts * fix logo background * hide package name * add favico, tweak colors * copyright 2023 * update copyright years * update hover colors * fix title tab	2023-02-09 10:40:28 -05:00
Matt Robinson	7fb3797165	docs: core concepts training notebook (#207 ) * added to_dict to elements * first training notebook * bump changelog, rerun notebook * remove coordinates and id * rerun notebook * has -> have * partitioning -> partition * various and sundry typos * switch to using convert_to_isd	2023-02-09 14:34:34 +00:00
Matt Robinson	47ab808e0f	feat: file info dataframe from filenames and file content (#204 ) * added function for exploring a list of files * file info from file contents * added tests for file info from contents * bump version and add tests * add dev to version	2023-02-08 20:48:39 +00:00
djacobs7	15b0dffdb0	docs: correct kwarg in bricks.rst (#206 ) Changed whitespace to extra_whitespace in documentation, to match options text.	2023-02-08 18:21:58 +00:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205 ) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
Sebastian Laverde Alfonso	46b023f454	docs: update colab notebook link (#203 )	2023-02-07 18:50:03 +01:00
Matt Robinson	ee9f15483f	feat: `partition_html` directly from a url (#202 ) * added tests for html from url * bump version * added types-requests * and -> an	2023-02-07 14:09:34 +00:00
sparkbrains	2b88890210	docs: customize sphinx doc theme (#192 ) * feature: adding a feature for customizing color theme of sphinx docs * fix: adding changelog and comments * Adding css for changing colors of sidebar * fix: removing changelog description	2023-02-06 17:30:55 +00:00
Matt Robinson	782b4352ec	build(deps): weekly dependency update; reduce dependabot frequency (#194 ) * deps: pip-compile to update dependencies * bump version * linting, linting, linting * typo	2023-02-06 16:39:29 +00:00
Matt Robinson	014585e872	fix: preserve the order of shapes in `partition_pptx` output (#193 ) * order the shapes top to bottom and left to right * added tests for ordering * update change log and bump version * more tests * don't need enumerate * n -> on 0.4.6	2023-02-03 22:12:33 +00:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191 ) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	0589344ff7	fix: require a minimum prop of alpha characters for titles and narrative text (#190 ) * added alpha ratio check * added tests for alpha ratio * bump changelog and update docs * update changelog/version; update docs * ofr -> or	2023-02-02 14:59:04 +00:00
Matt Robinson	1230a163fd	feat: set a user controlled max word length for titles (#189 ) * update the docs * add option for title max word length * bump version; update changelog * change max length to 12 * docs updates * to -> too	2023-02-01 19:32:16 +00:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188 ) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
Matt Robinson	d0bf8904fa	docs: example notebooks from community repo (#187 )	2023-01-31 10:37:32 -05:00
sparkbrains	243bf7ed5e	test: Increase coverage (#181 )	2023-01-30 22:47:09 -08:00
Matt Robinson	f36e514c6d	build(deps): weekly dependency bump (#183 )	2023-01-30 11:05:48 -05:00
Matt Robinson	e6cfde5c4a	fix: no `UserWarning` when `partition_pdf` is called (#179 )	2023-01-27 12:08:18 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	1ce8447ba7	build(deps): bump unstructured inference; compile from setup.py (#176 ) * bump unstructured inference; compile from setup.py * bump version * compile the local-inference extra * linting, linting, linting 0.4.4	2023-01-25 16:32:57 +00:00
Matt Robinson	26a5546152	fix: handle xml filetype detection on amazon linux (#173 ) * fix: handle xml filetype detection on amazon linux * option for html or xml * fix typo * back to dev tag	2023-01-25 11:20:01 -05:00
Matt Robinson	3b6546515d	docs: add links to linkedin and slack (#175 )	2023-01-24 13:51:10 -08:00
qued	d2909ac688	chore: update all deps (#172 )	2023-01-23 13:03:02 -06:00

1393 Commits