unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-19 19:23:46 +00:00

Author	SHA1	Message	Date
Amanda Cameron	1fb464235a	chore: Table chunking (#1540 ) This change is adding to our `add_chunking_strategy` logic so that we are able to chunk Table elements' `text` and `text_as_html` params. In order to keep the functionality under the same `by_title` chunking strategy we have renamed the `combine_under_n_chars` to `max_characters`. It functions the same way for the combining elements under Title's, as well as specifying a chunk size (in chars) for TableChunk elements. *renaming the variable to `max_characters` will also reflect the 'hard max' we will implement for large elements in followup PRs Additionally -> some lint changes snuck in when I ran `make tidy` hence the minor changes in unrelated files :) TODO: ✅ add unit tests --> note: added where I could to unit tests! Some unit tests I just clarified that the chunking strategy was now 'by_title' because we don't have a file example that has Table elements to test the 'by_num_characters' chunking strategy ✅ update changelog To manually test: ``` In [1]: filename="example-docs/example-10k.html" In [2]: from unstructured.chunking.title import chunk_table_element In [3]: from unstructured.partition.auto import partition In [4]: elements = partition(filename) # element at -2 happens to be a Table, and we'll get chunks of char size 4 here In [5]: chunks = chunk_table_element(elements[-2], 4) # examine text and text_as_html params ln [6]: for c in chunks: print(c.text) print(c.metadata.text_as_html) ``` --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-10-03 09:40:34 -07:00
Newel H	55315cf645	Feat: Native hierarchies for docx element types (#1505 ) Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents. Hierarchy can now be detected from an indentation level for list bullets/numbers and by style name (e.g. Heading 1, List Bullet 2, List Number). Hierarchy detection is improved by determining category depth via the following: 1. Check if the paragraph item has an indentation level (ilvl) xpath - these are typically on list bullet/numbers. Return the indentation level if it exists 2. Check the name of the paragraph style if it contains any category depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List Bullet 2). Return the category depth if found, else default to depth of 0. 3. Check the paragraph ilvl via the paragraph's style name. Outside of the paragraph's metadata, docx stores default ilvls for various style names, which requires a complex lookup. This check is yet to be implemented, as the above methods cover most usecases but the implementation is stubbed out. --- Co-authored-by: Steve Canny <stcanny@gmail.com>	2023-09-27 11:32:46 -04:00
rvztz	2f52df180f	Adds data source properties to onedrive, reddit and slack (#1281 )	2023-09-20 04:26:36 +00:00
Steve Canny	b54994ae95	rfctr: docx partitioning (#1422 ) Reviewers: I recommend reviewing commit-by-commit or just looking at the final version of `partition/docx.py` as View File. This refactor solves a few problems but mostly lays the groundwork to allow us to refine further aspects such as page-break detection, list-item detection, and moving python-docx internals upstream to that library so our work doesn't depend on that domain-knowledge.	2023-09-19 15:32:46 -07:00
John	de4d496fcf	Fix bbox coordinates for ocr_only strategy (#1325 ) ### Summary Duplicate PR of #1259 because of issues with checks Closes #1227, which found that `nan` values were present in the coordinates being generated for some elements. This breaks logic out from `add_pytesseract_bbox_to_elements` to new functions `_get_element_box` and `convert_multiple_coordinates_to_new_system`. It also updates the logic to check that the current bounding box matches the first character of the element's text (as to avoid the `~` characters that `pytesseract.image_to_boxes` includes, but are not present in `pytesseract.image_to_string`. ### Testing ``` from unstructured.partition.image import partition_image from PIL import Image, ImageDraw filename="example-docs/layout-parser-paper-with-table.jpg" elements = partition_image(filename=filename, strategy="ocr_only") image = Image.open(filename) draw = ImageDraw.Draw(image) for i, element in enumerate(elements): print(i, element.metadata.coordinates) if element.metadata.coordinates: draw.polygon(element.metadata.coordinates.points, outline="red", width=2) output = "example-docs/box-layout-parser-paper-with-table.jpg" image.save(output) image.close() ``` --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-09-15 15:11:16 -05:00
John	c58b261feb	chunk_by_title decorator (#1304 ) ### Summary Partial solution to #1185. Related to #1222. Creates decorator from `chunk_by_title` cleaning brick. Breaks a document into sections based on the presence of Title elements. Also starts a new section under the following conditions: - If metadata changes, indicating a change in section or page or a switch to processing attachments. If `multipage_sections=True`, sections can span pages. `multipage_sections` defaults to True. - If the length of the section exceeds `new_after_n_chars` characters. The default is 1500. The chunking function does not split individual elements, so it's possible for a section to exceed that threshold if an individual element if over `new_after_n_chars characters`, which could occur with a long NarrativeText element. Combines sections under these conditions - Sections under `combine_under_n_chars` characters are combined. The default is 500. ### Testing from unstructured.partition.html import partition_html url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0" chunks = partition_html(url=url, chunking_strategy="by_title") for chunk in chunks: print(chunk) print("\n\n" + "-"*80) input()	2023-09-11 21:00:14 +00:00
Klaijan	675a10ea69	fix: update test_json to not use auto partition (#1187 ) Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.	2023-08-29 16:59:26 -04:00
Newel H	e4aa7373e2	test: create CI pipelines for verifying base and extras pass respective tests (#1137 ) Summary Closes #747 * Create CI Pipeline for running text, xml, email, and html doc tests against the library installed without extras * Create CI Pipeline for running each library extra against their respective tests	2023-08-19 12:56:13 -04:00

8 Commits