unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-14 09:27:35 +00:00

Author	SHA1	Message	Date
ryannikolaidis	66bf4b0198	feat: support extracting image url in html (#3955) Also removes the mimetype when base64 is not included in image metadata. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-13 22:41:10 +00:00
ryannikolaidis	c0457c1cc3	feat: include images when partitioning html (#3945) Currently we filter img tags (`2addb19473/unstructured/partition/html/partition.py`, L226-L229) before tags are converted to Elements by the html partitioner. More importantly, we also don’t currently have a defined “block” / mapping to support them. This adds those mappings and the logic to process them. It also respects `extract_image_block_types` and `extract_image_block_to_payload` (as we do with pdfs) to determine whether base64 is included in the metadata. The partitioned Image Elements set their text to the img tag’s alt text if available. The partitioned Image Elements include the [url in the metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209) (rather than image_base64) if the img tag src is a url. ## Testing Unit tests have been added for explicit coverage. Existing integration tests and other unit test fixtures have been updated to account for `Image` elements now present. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2025-03-08 01:25:21 +00:00
Steve Canny	b3a2dd4755	fix: html incorrectly categorizing text (#3841) Fixes #3666 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-12-18 18:46:54 +00:00
Steve Canny	208c7edc52	rfctr(csv): minify HTML and table text is cct (#3733) Summary: Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_csv()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context: - CSV `.metadata.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements). - `table.text` is clean-concatenated-text (CCT) of table. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-10-19 06:49:09 +00:00
MiXiBo	0506aff788	add support for `start_index` in `html` links extraction (#2600) Adds support for start_index in html links extraction (closes #2625). Testing: ``` from unstructured.partition.html import partition_html; from unstructured.staging.base import elements_to_json; html_text = """<html> <p>Hello there I am a <a href="/link">very important link!</a></p> <p>Here is a list of my favorite things</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li> <li>Dogs</li> </ul> <a href="/loner">A lone link!</a> </html>"""; elements = partition_html(text=html_text); print(elements_to_json(elements)) ``` --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-04-12 06:14:20 +00:00
Matt Robinson	b4d9ad8130	enhancement: detect headers in `partition_pdf` with fast strategy (#2455) ### Summary Detects headers and footers when using `partition_pdf` with the fast strategy. Identifies elements that are positioned in the top or bottom 5% of the page as headers or footers. If no coordinate information is available, an element won't be detected as a header or footer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-02-23 16:56:09 +00:00
Roman Isecke	b37b4689bc	drop python3.8 (#2372) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-01-09 23:37:30 +00:00
Roman Isecke	cc05e948ff	chore: sensitive info connector audit (#2227) ### Description All other connectors that were not included in https://github.com/Unstructured-IO/unstructured/pull/2194 are now updated to follow the new pattern and mark any variables as sensitive where it makes sense. Core changes: * All connectors now support an `AccessConfig` to mark data that's needed for auth (e.g. username, password), and fields that are sensitive are designated appropriately using the new enhanced field. * All cli configs on the cli definition now inherit from the base config in the connector file to reuse the variables set on that dataclass. * The base writer class was updated to better generalize the new approach given better use of dataclasses. * The base cli classes were refactored to also take into account the need for a connector and write config when creating the respective runner/writer classes. * Any mismatches between the cli field name and the dataclass field name were updated on the dataclass side so as not to impact the user while maintaining consistency. * Add custom redaction logic for mongodb URIs, since the password is expected to be a part of the URI. Now `"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"` appears as `"mongodb+srv://ingest-test-user:*REDACTED@ingest-test.hgaig.mongodb.net/"` in the logs. * Bundle all fsspec based files into their own packages. * Refactor custom `_decode_dataclass` used for enhanced json mixin by using a monkey-patch approach. The original approach was breaking on optional nested dataclasses when serializing since the other methods in `dataclasses_json_core` weren't using the new method. By monkey-patching the original method with a new one, all other methods in that library would use the new one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-11 17:37:49 +00:00
John	9500d04791	detect document language across all partitioners (#1627) ### Summary Closes #1534 and #1535 Detects document language using `langdetect` package. Creates new kwargs for user to set the document language (`languages`) or detect the language at the element level instead of the default document level (`detect_language_per_element`) --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Austin Walker <austin@unstructured.io>	2023-10-11 01:47:56 +00:00
Yao You	19d8bff275	feat: change default hi_res model to yolox quantized (#1607)	2023-10-04 03:28:47 +00:00
Klaijan	d26d591d6a	feat: get embedded url, associate text and start index for pdf (#1539) Executive Summary: Adds PDF functionality to capture hyperlinks (external or internal) for the pdf fast strategy, along with their associated text. Technical Details: - `pdfminer` associates `annotation` (links and uris) with a bounding box rather than with text. Therefore, the link and text matching is not a perfect pairing but a logic-based calculation from bounding box overlap. - There is no word-level bounding box, only character-level (accessed using `LTChar`). Thus, in order to get to word level, a window slices through the text. Alphanumeric and non-alphanumeric characters are captured separately, meaning a word that contains both will be split at the first non-alphanumeric character. - The bounding box is calculated using the start and stop coordinates of the corresponding word obtained above. The calculation simply uses the distance between two points. The result now contains `links` in `metadata` as shown below: ``` "links": [ { "text": "link", "url": "https://github.com/Unstructured-IO/unstructured", "start_index": 12 }, { "text": "email", "url": "mailto:unstructuredai@earlygrowth.com", "start_index": 30 }, { "text": "phone number", "url": "tel:6505124019", "start_index": 49 } ] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>	2023-09-27 13:43:32 -04:00
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324) Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning. To test, check out this branch and go into an ipython shell: ``` In [1]: from unstructured.partition.html import partition_html In [2]: path_to_html = "{html sample file with table}" In [3]: elements = partition_html(path_to_html) ``` You should see the table in the elements list!	2023-09-11 11:14:11 -07:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
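
Note on b4d9ad8130: the commit message describes the header/footer rule only in words (top or bottom 5% of the page, skipped when there are no coordinates). The following is a minimal sketch of that rule, not the code from the commit. The `Box` class, the `classify_by_position` name, the downward-growing y axis, and the choice to require the whole element to sit inside the margin are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """Hypothetical element bounding box; y grows downward from the page top."""
    top: float
    bottom: float


def classify_by_position(box: Optional[Box], page_height: float,
                         margin: float = 0.05) -> Optional[str]:
    """Return "Header", "Footer", or None for one element on a page."""
    if box is None or page_height <= 0:
        # No coordinate information: never classified as header or footer.
        return None
    if box.bottom <= page_height * margin:
        # Element lies entirely within the top 5% of the page.
        return "Header"
    if box.top >= page_height * (1 - margin):
        # Element lies entirely within the bottom 5% of the page.
        return "Footer"
    return None
```

For example, on a page 1000 units tall, an element spanning y=0 to y=30 would be labeled a header under these assumptions, while one in the middle of the page would be left alone.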

13 Commits
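
Note on cc05e948ff: the commit message shows the before and after of MongoDB URI redaction but not the logic. Below is a rough sketch of one way such redaction could work, using a regular expression over the `user:password@` portion of the URI. The function name `redact_mongo_uri` and the regex are hypothetical and are not taken from the repository; the default placeholder is copied from the example as it appears in the log above.

```python
import re

# Hypothetical pattern: scheme://user: | password | @ for mongodb and
# mongodb+srv URIs. A literal "@" in a password must be percent-encoded
# in a MongoDB URI, so the password group stops at the first "@".
_MONGO_CREDENTIALS = re.compile(r"(mongodb(?:\+srv)?://[^:/@\s]+:)([^@\s]+)(@)")


def redact_mongo_uri(text: str, placeholder: str = "*REDACTED") -> str:
    """Replace the password in any MongoDB URI found in text."""
    return _MONGO_CREDENTIALS.sub(
        lambda m: m.group(1) + placeholder + m.group(3), text
    )


if __name__ == "__main__":
    # Made-up credentials, for illustration only.
    print(redact_mongo_uri("mongodb+srv://user:secret@host.example/"))
```

A URI without credentials, such as `mongodb://localhost:27017/`, contains no `@` after the port and is therefore left unchanged by this pattern.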