unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-15 18:17:09 +00:00

Author	SHA1	Message	Date
Matt Robinson	f890972139	docs: add bricks training notebook (#211 ) * added bricks notebook * more unicode quotes; isd dataframe column fix * fix remove_punctuation docs * typo fixes * put staging bricks in code	2023-02-10 14:39:14 +00:00
Matt Robinson	7fb3797165	docs: core concepts training notebook (#207 ) * added to_dict to elements * first training notebook * bump changelog, rerun notebook * remove coordinates and id * rerun notebook * has -> have * partitioning -> partition * various and sundry typos * switch to using convert_to_isd	2023-02-09 14:34:34 +00:00
Matt Robinson	47ab808e0f	feat: file info dataframe from filenames and file content (#204 ) * added function for exploring a list of files * file info from file contents * added tests for file info from contents * bump version and add tests * add dev to version	2023-02-08 20:48:39 +00:00
Matt Robinson	e73cf09977	feat: optional page breaks for `.pptx`, `.pdf`, `.html` and images (#205 ) * page breaks for pptx * added page breaks for image/pdf * tests for images with page breaks * page breaks for html documents * linting, linting, linting * changelog and bump version * update docs * fix typo * refactor reusable code to common.py * add type back in	2023-02-08 15:11:15 +00:00
Matt Robinson	ee9f15483f	feat: `partition_html` directly from a url (#202 ) * added tests for html from url * bump version * added types-requests * and -> an	2023-02-07 14:09:34 +00:00
Matt Robinson	782b4352ec	build(deps): weekly dependency update; reduce dependabot frequency (#194 ) * deps: pip-compile to update dependencies * bump version * linting, linting, linting * typo	2023-02-06 16:39:29 +00:00
Matt Robinson	014585e872	fix: preserve the order of shapes in `partition_pptx` output (#193 ) * order the shapes top to bottom and left to right * added tests for ordering * update change log and bump version * more tests * don't need enumerate * n -> on	2023-02-03 22:12:33 +00:00
Matt Robinson	a7ca58e0bc	fix: more english words; split on punctuation (#191 ) * add a bigger list of english words * update thresholds and add tests * update docs; bump version * fix version * add additional english words back in * linting, linting, linting * add slashes * work -> word	2023-02-02 17:25:47 +00:00
Matt Robinson	0589344ff7	fix: require a minimum prop of alpha characters for titles and narrative text (#190 ) * added alpha ratio check * added tests for alpha ratio * bump changelog and update docs * update changelog/version; update docs * ofr -> or	2023-02-02 14:59:04 +00:00
Matt Robinson	1230a163fd	feat: set a user controlled max word length for titles (#189 ) * update the docs * add option for title max word length * bump version; update changelog * change max length to 12 * docs updates * to -> too	2023-02-01 19:32:16 +00:00
Matt Robinson	2d08fcbf83	fix: titles and narrative text need at least one english word (#188 ) * added check for english words * update docs * at least one word needs to have multiple characters * bump change log	2023-02-01 09:10:48 -05:00
sparkbrains	243bf7ed5e	test: Increase coverage (#181 )	2023-01-30 22:47:09 -08:00
Matt Robinson	e6cfde5c4a	fix: no `UserWarning` when `partition_pdf` is called (#179 )	2023-01-27 12:08:18 -05:00
Matt Robinson	339c133326	fix: cleanup from live `.docx` tests (#177 ) * add env var for cap threshold; raise default threshold * update docs and tests * added check for ending in a comma * update docs * no caps check for all upper text * capture Text in html and text * check category in Text equality check * lower case all caps before checking for verbs * added check for us city/state/zip * added address type * add address to html * add address to text * fix for text tests; escape for large text segments * refactor regex for readability * update comment * additional test for text with linebreaks * update docs * update changelog * update elements docs * remove old comment * case -> cast * type fix	2023-01-26 15:52:25 +00:00
Matt Robinson	26a5546152	fix: handle xml filetype detection on amazon linux (#173 ) * fix: handle xml filetype detection on amazon linux * option for html or xml * fix typo * back to dev tag	2023-01-25 11:20:01 -05:00
Matt Robinson	8b6c5fac9d	feat: basic PowerPoint parsing in `partition_pptx` (#166 ) * parition pptx and tests * add parition_pptx to auto * update doc types in readme * add pptx docs * bump version * remove extra whitespace * partition -> partitioning	2023-01-23 17:03:09 +00:00
Matt Robinson	8d3e616846	feat: add ability to parse `LayoutElement` lists (#165 ) * added ability to split list items * changelog and version bump * retrigger ci	2023-01-20 08:55:11 -05:00
Matt Robinson	c1822911a5	chore: return `Element` objects in `partition_pdf` and `partition_image` (#164 ) * helper function to convert to element * test for element types * fix for healthcheck url * version bump * note on coordinates * mention FigureCaption * test_shared -> test_common * add check boxes for checkbox template * update changelog	2023-01-19 14:29:28 +00:00
Matt Robinson	74ce2ae6e5	fix: update `detect_filetype` to properly handle older office files (#161 )	2023-01-18 11:18:20 -05:00
Mallori Harrell	08ccee0acb	chore: Fix parse received data (#143 ) * fix parse_received data	2023-01-17 16:36:44 -06:00
Matt Robinson	749f9c6be8	fix: avoid divide by zero in `exceeds_cap_ratio` (#160 )	2023-01-17 15:22:12 -05:00
Matt Robinson	9c3c14e94d	fix: resolves `UnicodeDecodeError` in `partition_email` for emails with attachments (#158 ) * split emails by \n= * added test for equivalence betweent html and plain text * changelog and bump version * add check for content disposition	2023-01-17 11:33:45 -05:00
qued	8abf1f119d	feat: partition image (#144 ) Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.	2023-01-13 22:24:13 -06:00
Matt Robinson	f12240c5e7	feat: add support for `.txt` files in `partition` (#150 ) * added partition_text for auto * rename partition_text tests * bump version and update docs	2023-01-13 16:39:53 -05:00
Matt Robinson	eba4c80b1e	feat: `get_directory_file_info` for exploring a directory of files (#142 ) * added python-pptx to requirements * added filetype detection for powerpoint * add more filetypes to detect * more tests * added tests for filetype * reorder document types * tests for get_directory_file_info * added docs for get_directory_file_info * bump version * Word -> Office * added test for filetype * add group by filetype example	2023-01-11 12:40:50 -05:00
Mallori Harrell	e0feba83f6	feat: Add Image element and `find_embedded_image` function (#130 ) * add find_embedded_image	2023-01-09 19:49:19 -06:00
Matt Robinson	5376bc510f	feat: generic `partition` brick with filetype detection (#132 ) * add python-magic * first pass on filetype detection * tests for filetype detection * more tests for file detection * added tests for error conditions * install libmagic dev in github * libmagic install instructions * pattern for checking email files * support reading .eml in rb mode * add auto partition function * auto tests for emal * auto tests for docx * added tests for html * add pdf and html tests * linting, linting, linting * added docs for auto partitioning * update readme with generic partition brick * bumped version * added test for bad type * detect .docx files from application/octet-stream * linting, linting, linting * identify xlsx from octet stream * install poppler in ci * fix mocks; test for unknown type * install poppler utils * install in one line * only poppler-utils * file extension logic from application/octet-stream * install local inference for ci * install detectron2 * removing unused dockerfile	2023-01-09 16:15:14 -05:00
Mallori Harrell	d7a00046a9	feat: Add new functionality to parse text and header of emails (#111 ) * partition_text function	2023-01-09 17:08:08 +00:00
Matt Robinson	fee95b643c	feat: add `partition_docx` for Word documents (#131 ) * first pass on docx parsing * linting, linting, linting * test docx with filename * added documentation * more tests; version bump * typo * another typo * another typo! * it -> its * save -> saved * remove None since it's the default argument	2023-01-05 20:13:39 +00:00
Sebastian Laverde Alfonso	5a47eb06e9	feat: new bricks for removing and extracting ordered bullets (#128 ) * feat: new cleaning brick for ordered bullets * test: add test for cleaning ordered bullets * feat: new brick for extracting ordered bullets * test: add test for extracting ordered bullets * docs: update CHANGELOG and bump new dev version * chore: change extract ordered bullets return type to tuple * chore: made tidy * chore: regex to split on pattern instead of built-in * chore: catch ValueError, made tidy and fix incompatible type * chore: assertion statements in one line of code * docs: add documentation for new clean and extract bricks to bricks.rst * docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets * docs: update CHANGELOG 0.3.6-dev0 changes and bump version Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>	2023-01-05 17:06:26 +01:00
qued	a75499d465	feat: local inference (#125 ) Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.	2023-01-04 16:19:05 -06:00
Matt Robinson	17045aed80	feat: add `convert_to_dataframe` staging brick (#127 ) * add pandas to deps; pip-compile * staging brick to convert elements to dataframe * bump version * add convert_to_dataframe docs * bump wheel version * typo fix * typo fix 2!	2023-01-04 12:04:59 -05:00
Matt Robinson	445533745c	feat: helper functions to identify and extract phone numbers (#124 ) * added pattern for finding phone numbers * added cleaning brick for extracting phone numbers * add docs * changelog and bump version * switch to us phone numbers * bump dev version	2023-01-03 13:31:05 -05:00
Mallori Harrell	509ad4951c	feat: Add `extract_attachment_info` (#112 ) * Adds function to extract attachments and their metadata from eml files	2023-01-03 11:41:54 -06:00
Matt Robinson	b14f6ac9bd	feat: extract metadata from `.docx`, `.xlsx`, and `.jpg` (#113 ) * add python-docx dependency * added function for extracting metadata from word documents * add openpyxl * added get_jpg_metadata; fixed typing * bump changelog * added pillow to dependencies	2022-12-26 09:34:36 -05:00
Mallori Harrell	e0a76effff	feat: Added `EmailElement` for email documents (#103 ) * new EmailElement data structure	2022-12-21 16:03:44 -06:00
Matt Robinson	4f6fc29b54	fix: `partition_html` should process container divs that include text (#110 ) * check for containers with text * added tests for containers with text * changelog and version bump	2022-12-21 21:51:04 +00:00
Mallori Harrell	6f4d9ad06c	chore: add new pattern for dash bullet (#109 ) * add new pattern for dash bullet	2022-12-21 10:23:51 -06:00
Yuming Long	4803281861	chore: logger should not be setting up a BasicConfig (#106 ) * feat: simple logger * doc: changelog and version	2022-12-20 10:39:02 -05:00
Matt Robinson	7a74cdda86	feat: add `partition_email` cleaning brick (#104 ) * fix for processing deeply embedded list elements * fix types in mime encodings cleaner * first pass on partition_email * tests for email * test for mime encodings * changelog bump * added note about \n= * linting, linting, linting * added email docs * add partition_email to the readme * add one more test	2022-12-19 18:02:44 +00:00
Matt Robinson	1d68bb2482	feat: `apply` method to apply cleaning bricks to elements (#102 ) * add apply method to apply cleaners to elements * bump version * add check for string output * documentations for the apply method * change interface to *cleaners	2022-12-15 22:19:02 +00:00
Matt Robinson	b1cce16c16	feat: `translate_text` cleaning brick (#101 ) * initial implementation for translate brick * more input validation * tests for translate brick * added docs * bumped version * chinese and arabic tests * re-run pip-compile * add torch to dependencies * cleanup doc string * fix long string * fix typo in docs * take out empty string check * return string if string is empty * added huggingface into make install	2022-12-15 15:35:15 -05:00
Matt Robinson	3c19c7cd8a	feat: Add partition_html brick (#91 ) * update readme * updated sphinx docs * bump version; changelog * clear cache; retrigger ci * rename test file * switch default parameters to None * typo in the changelog * add in text output	2022-12-12 14:22:10 +00:00
Matt Robinson	0658744c38	test: mock model api calls; full coverage for partition_pdf (#88 ) * test: mock model api calls; full coverage for partition_pdf * bump version	2022-11-30 16:34:24 -05:00
Matt Robinson	77cd5cc01f	feat: text2text and token classification for argilla (#87 ) * add support for text2text * add support for token classification datasets * bump versions * updated docs * remove extra comment * fix wording in docs * fix some more wording	2022-11-30 20:07:42 +00:00
Matt Robinson	c62f18c0d0	feat: Add html escape quotes to cleaning brick (#84 ) * feat: Add html escape quotes to cleaning brick * bump changelog	2022-11-29 10:58:31 -05:00
asymness	2170a2aae2	feat: Implement Argilla staging brick (#81 ) * Add argilla to dependencies and run pip-compile * Implement Argilla staging brick and add unit tests * Update version and changelog * Update docs with description and usage for Argilla staging brick * Remove unused fixtures and fix typo in Argilla tests * add missing quote in docs * changelog tweak * doc tweaks Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2022-11-28 14:41:48 +00:00
Matt Robinson	b041b0197d	feat: Add entities kwarg to datasaur bricks (#77 ) * added entities to datasaur * add tests for datasaur with entities * update docs * fix missing imports * bump version * remove accidental file	2022-11-22 19:50:19 +00:00
Matt Robinson	08e091c5a9	chore: Reorganize partition bricks under partition directory (#76 ) * move partition_pdf to partition folder * move partition.py * refactor partioning bricks into partition diretory * import to nlp for backward compatibility * update docs * update version and bump changelog * fix typo in changelog * update readme reference	2022-11-21 22:27:23 +00:00
Mallori Harrell	53fcf4e912	chore: Remove PDF parsing code and dependencies (#75 ) Remove PDF parsing code and dependencies.	2022-11-21 11:47:29 -06:00

368 Commits