519 Commits

Author SHA1 Message Date
Alan Bertl
a46becc185 Add tests for new functions 2025-06-17 17:11:28 -05:00
Alan Bertl
e5bfc750ae Update tests 2025-05-22 15:40:42 -05:00
jordan-homan
570ee078a4
fix: throw validation error when json is passed with invalid unstructured json (#4002)
### Notes
Adds validation that rejects `json` / `ndjson` input that does not match the unstructured schema.

### Testing
Manually tested serverless API with example json:

```

test_length = [] = 200

test_invalid = [{"invalid": "schema"}] = 422
test_invalid_ndjson ={"hi": "there"} = 422

test_chunk = [{"type":"Header","element_id":"a23fdadef9277f217563e217ebd074d5" ... = 200

```
2025-05-19 18:24:44 +00:00
Austin Walker
e3417d7e98
fix: Fix for Pillow error when extracting PNG images (#3998)
When I tried to partition a PNG file and extract images, I got an error
from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG can carry an alpha channel (mode `RGBA`), which
cannot be saved in JPEG format. We can fix this with a quick conversion.
I added a PNG test case that is now passing with this fix.
2025-05-08 21:57:05 +00:00
Philippe PRADOS
d570f4624b
Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
`sort_page_element()` uses the element id to sort the elements.
As a result, two executions of the same code on the same file produce
different results: the order of the elements is random.
This makes it impossible to write stable unit tests, for example, or to
obtain reproducible results.
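
A minimal sketch of the idea, not the actual `sort_page_element()` code: sorting on a deterministic key with the original index as tie-breaker gives the same order on every run, while a randomly generated id plays no part in the ordering. The `Elem` class and its `top`/`left` fields are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Elem:
    # Hypothetical stand-in for a page element; not the library's class.
    text: str
    top: float
    left: float
    # A random id, regenerated on every run.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def sort_elements(elements):
    # Key on position only; the enumerate index breaks ties deterministically,
    # so the random ids never influence the order.
    indexed = enumerate(elements)
    ordered = sorted(indexed, key=lambda p: (p[1].top, p[1].left, p[0]))
    return [elem for _, elem in ordered]


def reading_order():
    elems = [Elem("b", 10.0, 5.0), Elem("a", 10.0, 5.0), Elem("title", 0.0, 0.0)]
    return [e.text for e in sort_elements(elems)]
```

Two elements at the same position keep their input order, so repeated runs on the same file agree.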
2025-04-07 15:57:20 +00:00
cragwolfe
dfa17bd3a0
fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975) 2025-04-04 14:38:23 -07:00
Antonio Jose Jimeno Yepes
0fa5174bd7
Image within div or span with no text is annotated as Image (#3962)
Ticket: https://unstructured-ai.atlassian.net/browse/ML-942

The following uncompressed HTML document can be used to test the
transformation using the `partition_html` function from the VLM
partitioner.


[recalibrating-risk-report.pdf.json.html.zip](https://github.com/user-attachments/files/19330528/recalibrating-risk-report.pdf.json.html.zip)
2025-03-20 04:09:02 +00:00
Yao You
7de630e45e
Feat/bump numpy to 2 (#3961)
This PR updates a few dependencies so that they are compatible with
`numpy>=2`.
2025-03-18 21:33:48 +00:00
Yao You
4e424efd22
feat: use lxml instead of bs4 to parse hOCR data (#3960)
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x

Example runtime profiling locally using the same `hocr` data from 1 page
pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main
and `agent.hocr_to_dataframe` is the PR's method.

![Screenshot 2025-03-17 at 12 14 59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)
2025-03-18 00:36:19 +00:00
ryannikolaidis
66bf4b0198
feat: support extracting image url in html (#3955)
also removes mimetype when base64 is not included in image metadata

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-13 22:41:10 +00:00
Yao You
2dceac34b5
Feat/remove reference of PageLayout.elements (#3943)
This PR removes usage of `PageLayout.elements` from partition function,
except for when `analysis=True`. This PR updates the partition logic so
that `PageLayout.elements_array` is used everywhere to save memory and
cpu cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.

`PageLayout.elements` uses a list to store layout elements' data while
`elements_array` uses a `numpy` array to store the data, which has much
lower memory requirements. Measured with `memory_profiler`, the
difference is usually around 10x.
2025-03-12 15:21:21 +00:00
Yao You
8759b0aac9
feat: allow passing down of ocr agent and table agent (#3954)
This PR allows passing down both `ocr_agent` and `table_ocr_agent` as
parameters to specify the `OCRAgent` class for the page and tables, if
any, respectively. Both default to `tesseract`, consistent
with the present default behavior.

We used to rely on env variables to specify the agents, but the OS env
can be changed during runtime outside of the caller's control. Passing
the agents down as parameters ensures that the specification is
independent of env changes.

## testing

Run partition on `example-docs/img/layout-parser-paper-with-table.jpg`
with two different settings. Note that this test requires the
`paddleocr` extra.

```python
from unstructured.partition.auto import partition
from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE

f = "example-docs/img/layout-parser-paper-with-table.jpg"  # path from the description above
# Tesseract for the page, Paddle for tables
elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE)
# The reverse assignment
elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT)
```

Both calls should finish, with slight differences in the table element's
text attribute.
2025-03-11 16:36:31 +00:00
ryannikolaidis
0001a33dba
fix: pass extract image args to all partitioners (#3950)
This is needed in order for the user to specify whether to extract the
base64 for images, which are now parsed by the html partitioner.

## Testing

Adds test that validates this by calling the auto-partitioner with
appropriate arguments partitioning an html file with base64 embedded
image.
2025-03-10 04:15:08 +00:00
ryannikolaidis
c0457c1cc3
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly, we don’t currently have a defined “block” / mapping to
support these. This PR adds the mappings and the logic to process them.

It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.

The partitioned Image Elements set their text to the img tag’s alt text
if available.

The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.

## Testing

unit tests have been added for explicit coverage.
existing integration tests and other unit test fixtures have been
updated to account for `Image` elements now present

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-08 01:25:21 +00:00
Pluto
74b0647aa2
Fix json bytes content type detection (#3941)
Fixes order of content type detection strategies for byte-encoded jsons.

Example
```
json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf") 
```

Before
PDF

Now
JSON
2025-03-07 10:33:33 +00:00
Nathan Van Gheem
19373de5ff
Enable dynamic file type registration (#3946)
The purpose of this PR is to enable registering new file types
dynamically.

The PR enables this through 2 primary functions:

1. `unstructured.file_utils.model.create_file_type` This registers the
new `FileType` enum which enables the rest of unstructured to understand
a new type of file
2. `unstructured.file_utils.model.register_partitioner` Decorator that
enables registering a partitioner function to run for a file type.
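
The decorator-based registration described above can be sketched as follows. This is a stand-alone illustration only: the `_PARTITIONERS` dict, the string file-type key, and the signatures are assumptions, and the real functions in `unstructured.file_utils.model` may look different.

```python
from typing import Callable

# Hypothetical registry; the exact signatures of the real functions in
# unstructured.file_utils.model are not shown in this commit message.
_PARTITIONERS: dict[str, Callable[..., list[str]]] = {}


def register_partitioner(file_type: str):
    """Decorator that maps a file-type name to a partitioner function."""
    def decorator(func):
        _PARTITIONERS[file_type] = func
        return func
    return decorator


@register_partitioner("foo")
def partition_foo(text: str, **kwargs) -> list[str]:
    # Toy partitioner: one "element" per non-empty line.
    return [line for line in text.splitlines() if line.strip()]


def partition(text: str, file_type: str, **kwargs) -> list[str]:
    # Look up the partitioner registered for this file type and dispatch to it.
    try:
        partitioner = _PARTITIONERS[file_type]
    except KeyError:
        raise ValueError(f"no partitioner registered for {file_type!r}") from None
    return partitioner(text, **kwargs)
```

With this shape, supporting a new file type means adding one decorated function, with no change to the dispatch code.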

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2025-03-06 22:09:42 +00:00
Marek Połom
f333d7fe7f
feat: Json elements to HTML converter (#3936)
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`
2025-03-04 13:57:35 +00:00
Yao You
43b682ad3f
feat: allow extraction of camel cased element type names (#3938)
This PR allows element types with CamelCase names to be extractable
using `extract_image_block_types` variable.

Before: specifying `extract_image_block_types=["NarrativeText"]` (or any
casing of `NarrativeText`) would raise a warning that it doesn't match
any available types, and no image would be extracted for this element
type

Now: specifying `extract_image_block_types=["NarrativeText"]` extracts
images for this element type
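
As a rough illustration of case-insensitive matching (not the code from this PR), requested names can be compared against the known type names after lowercasing both sides. The `ELEMENT_TYPES` tuple and the `resolve_block_types` helper are hypothetical.

```python
# Illustrative subset of element type names; not the library's full list.
ELEMENT_TYPES = ("Image", "Table", "NarrativeText", "Title")


def resolve_block_types(requested):
    """Match requested names against known types, ignoring case.

    Returns (matched canonical names, names that matched nothing).
    """
    by_lower = {name.lower(): name for name in ELEMENT_TYPES}
    matched, unknown = [], []
    for name in requested:
        canonical = by_lower.get(name.lower())
        if canonical is not None:
            matched.append(canonical)
        else:
            unknown.append(name)
    return matched, unknown
```

Here `"narrativetext"`, `"NarrativeText"`, and `"NARRATIVETEXT"` all resolve to the same CamelCase type, and only genuinely unknown names would trigger a warning.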

## testing

```python
from unstructured.partition.auto import partition
f = "example-docs/pdf/embedded-images-tables.pdf"
elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"])
```

Without this PR no figures would be extracted. With this PR a local
folder is created containing images of the narrative text elements,
at paths like `./figures/figure-1-1.jpg`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-03-04 01:33:05 +00:00
Pluto
0df50fe6e8
Fix file detection when spooled file is passed (#3932)
This pull request fixes the scenario where a `SpooledTemporaryFile` is
passed to `detect_filetype`. In such cases an arbitrary number was
assigned as `name` (and it couldn't be overwritten, as
`SpooledTemporaryFile` can't have fields assigned 😩), so I added another
scenario to our object factory that handles this type of file.
For `BytesIO`, the `name` attr is None, as it should be, and other
metadata fields are leveraged for file type recognition
2025-02-20 13:00:25 +00:00
Pluto
3973a30b8c
Feat: Add pdfminer parameters configuration (#3918)
This pull request adds the ability to configure multiple pdfminer
parameters (and can easily be extended with additional parameters).
One of the parameters overwrites the default from the `LAParams`
config class.

Example:
```python3
partition(
    filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
    pdfminer_line_margin=1.123,
    pdfminer_char_margin=None,
    pdfminer_line_overlap=0.0123,
    pdfminer_word_margin=3.21,
)
assert pdfminer_mock.call_args.kwargs == {
    "line_margin": 1.123,
    "line_overlap": 0.0123,
    "word_margin": 3.21,
}
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: plutasnyy <plutasnyy@users.noreply.github.com>
2025-02-17 11:41:20 +00:00
Philippe PRADOS
b521bce9c6
Add password with PDF files (#3721)
Add password with PDF files
Must be combined with [PR 392 in
unstructured-inference](https://github.com/Unstructured-IO/unstructured-inference/pull/392)

---------

Co-authored-by: John J <43506685+Coniferish@users.noreply.github.com>
2025-02-11 17:39:16 +00:00
qued
b10379c14c
Fix: plug security issue partition system files via include (#3908)
#### Summary

A recent security review showed that it was possible to partition
arbitrary local files in cases where the filetype supports an "include"
functionality that brings in the content of files external to the
partitioned file. This affects `rst` and `org` files.

#### Fix

This PR fixes the above issue by passing the parameter `sandbox=True` in
all cases where `pypandoc.convert_file` is called.

Note I also added the parameter to a call to this method in the ODT
code. I haven't investigated whether there was a security issue with ODT
files, but it seems better to use pandoc in sandbox mode given the
security issues we know about.

#### Testing

To verify that the tests that are added with this PR find the relevant
issue:
- Remove the `sandbox=True` text from
`unstructured/file_utils/file_conversion.py` line 17.
- Run the tests
`test_unstructured.partition.test_rst.test_rst_wont_include_external_files`
and
`test_unstructured.partition.test_org.test_org_wont_include_external_files`.
Both should fail due to the partitioning containing the word "wombat",
which only appears in a file external to the partitioned file.
- Add the parameter back in, and the tests pass.
2025-02-06 03:27:18 +00:00
Pluto
5bb95b5841
Fix parsing table cells (#3904)
This PR:
- Fixes a bug where HTML tags inside <td> cells were removed
- Applies whitespace stripping everywhere in the same consistent way. A
stripping function limited to table cells was hard to implement in a
straightforward way (you can't modify `descendants` in-place), so
instead of patching only the table-cell case I added stripping
everywhere. This is why some tests needed small edits removing one
white-space in each tag. I believe this won't cause any problems for
downstream tasks.

Tested HTML:
```html
<table class="Table">
    <tbody>
        <tr>
            <td colspan="2">
                Some text                                        
            </td>
            <td>
                <input checked="" class="Checkbox" type="checkbox"/>
            </td>
        </tr>
    </tbody>
</table>
```
Before & After
```html
'<table class="Table" id="..."> <tbody> <tr> <td colspan="2">Some text</td><td></td></tr></tbody></table>'
'<table class="Table" id="..."><tbody><tr><td colspan="2">Some text</td><td><input checked="" type="checkbox"/></td></tr></tbody></table>'
```
2025-02-05 15:28:49 +00:00
Yao You
9d58b34ab4
Fix/fix table id checking logic (#3898)
- there is a bug in deciding whether a page has tables before performing
table extraction. The logic checks whether the id associated with the
Table element type is truthy
- however, it should compare the id against `None`, because sometimes
the id can be 0 (the first type of element in the page)
- the fix updates the logic
- adds a unit test for this specific case
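
The pitfall can be reproduced in a few lines. This is a generic sketch of the truthiness problem, with hypothetical helper names, rather than the library's code:

```python
def find_table_id(type_names):
    """Return the index of "Table" in the type list, or None if absent."""
    for idx, name in enumerate(type_names):
        if name == "Table":
            return idx
    return None


def has_tables_buggy(type_names):
    # Bug: index 0 is falsy, so a Table type listed first is missed.
    return bool(find_table_id(type_names))


def has_tables_fixed(type_names):
    # Fix: compare against None explicitly, so 0 counts as a valid id.
    return find_table_id(type_names) is not None
```

With `["Table", "Text"]` the buggy check reports no tables, while the `is not None` check reports them correctly.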
2025-01-31 10:19:14 -08:00
Yao You
a9ff1e70b2
Fix/fix ocr region to elements bug (#3891)
This PR fixes a bug in `build_layout_elements_from_ocr_regions` where
texts are joined in the wrong order.

The bug is due to incorrect masking of the `ocr_regions` after some are
already selected as one of the final groups. The fix uses a simpler
method: the same indices that add the regions to the final groups are
also used to mask them, so they are not considered again.

## Testing

This PR adds a unit test aimed specifically at this bug. Without the
fix the test would fail.
Additionally, any PDF file with repeated text can potentially
trigger this bug. For example, create a simple PDF using the test text

```python
"LayoutParser: \n\nA Unified Toolkit for Deep Learning Based Document Image\n\nLayoutParser for Deep Learning"
```
Partitioning it with `ocr_only` mode on the main branch hits this bug
and outputs text where the position of the second "LayoutParser" is incorrect.
```python
[
    'LayoutParser:', 
    'A Unified Toolkit for Deep Learning Based Document Image',
    'for Deep Learning LayoutParser',
]
```
2025-01-29 12:11:17 +00:00
David Huggins-Daines
9e5ff225f6
fix: Correctly patch pdfminer to avoid unnecessarily and unsuccessfully repairing PDFs with long content streams, causing needless and endless OCR (#3822)
Fixes: #3815 

Verified on my very large documents that it doesn't unnecessarily and
unsuccessfully "repair" them.

You may or may not wish to keep the version check in `patch_psparser`.
Since ~you're pinning the version of pdfminer.six and since it isn't
guaranteed that the bug in question will be fixed in the next
pdfminer.six release (but it is rather serious, so I should hope so),
then perhaps you just want to unconditionally patch it.~ it seems like
pinning of versions is only operative when running from Docker (good!)
so never mind! Keep that version check!

Also corrected an import so that if you do feel like using a newer
version of pdfminer.six, it won't break on you.

---------

Authored-by: David Huggins-Daines <dhdaines@logisphere.ca>
2025-01-24 14:27:25 -06:00
Yao You
8f2a719873
Feat/refactor layoutelement textregion to vectorized data structure (#3881)
This PR refactors the data structure for `list[LayoutElement]` and
`list[TextRegion]` used in partition pdf/image files.

- new data structure replaces a list of objects with one object with
`numpy` array to store data
- this only affects partition internal steps and it doesn't change input
or output signature of `partition` function itself, i.e., `partition`
still returns `list[Element]`
- internally `list[LayoutElement]` -> `LayoutElements`;
`list[TextRegion]` -> `TextRegions`
- the current refactor stops before the clean-up of pdfminer elements
inside inferred layout elements: the clean-up algorithm needs to be
refactored before the data structure refactor can move forward. So the
current refactor converts the array data structure into the list data
structure with an `element_array.as_list()` call. This is the last step
before turning `list[LayoutElement]` into `list[Element]` as the return
value
- a future PR will update this last step so that we build
`list[Element]` from the `LayoutElements` data structure instead.

The goal of this PR is to replace the data structure as much as possible
without changing the underlying logic. There are a few places where the
slicing or filtering logic was simple enough to be converted into vector
data structure operations. Those are refactored to be vector based. As a
result, some small improvements are observed in the ingest tests. This is
likely because the vector operations cleaned up some previous
inconsistencies in data types and operations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-01-23 17:11:38 +00:00
Yao You
27cd53bd45
fix: fix multiple values for infer_table_structure (#3870)
This PR fixes a bug when using `partition` to partition an email with
image attachments with hi_res and allow table structure inference -> the
partitioning of the image would encounter a value error: `got multiple
values for keyword argument 'infer_table_structure'`.

This happens because, when `kwargs` are passed on to partition "other"
types of files in this
[block](50ea6fe7fc/unstructured/partition/auto.py (L270-L280)),
`infer_table_structure` is packaged into `partitioning_kwargs`. For
email at least, when there are attachments that can be partitioned with
`hi_res`, we pass that dict of `kwargs` right back into the `partition`
entry point, so when we get
[here](50ea6fe7fc/unstructured/partition/auto.py (L222-L235))
we are both specifying `infer_table_structure` explicitly and have it in
the `kwargs` variable

The fix is to detect first if `kwargs` already contains
`infer_table_structure` and if yes use that and pop it from `kwargs`.
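
A stripped-down reproduction of the failure mode and of the pop-based fix, using hypothetical stand-in functions rather than the real `partition` code. Note that plain Python reports this situation as a `TypeError`.

```python
def partition_image(infer_table_structure=False, **kwargs):
    # Stand-in for a downstream partitioner.
    return infer_table_structure


def dispatch_buggy(partitioning_kwargs, infer_table_structure=False):
    # Raises TypeError when partitioning_kwargs also carries the key:
    # the argument is then supplied twice.
    return partition_image(
        infer_table_structure=infer_table_structure, **partitioning_kwargs
    )


def dispatch_fixed(partitioning_kwargs, infer_table_structure=False):
    kwargs = dict(partitioning_kwargs)  # copy so the caller's dict is untouched
    # Prefer the value already in kwargs and pop it, so it is passed only once.
    infer_table_structure = kwargs.pop("infer_table_structure", infer_table_structure)
    return partition_image(infer_table_structure=infer_table_structure, **kwargs)
```

Popping the key keeps the value the caller intended while guaranteeing the keyword reaches the partitioner exactly once.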

---------

Co-authored-by: Kamil Plucinski <kamil.plucinski@deepsense.ai>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2025-01-17 18:41:04 +00:00
Pluto
8685905bd1
Character confidence threshold (#3860)
This change adds the ability to filter out characters predicted by
Tesseract with low confidence scores.
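
As a minimal sketch of the filtering idea (the function name, the input shape, and the 0-to-1 confidence scale are assumptions, not the PR's implementation):

```python
def filter_characters(chars, threshold=0.0):
    """Keep characters whose confidence is at least `threshold`.

    `chars` is a list of (character, confidence) pairs with confidence
    in [0, 1]. A threshold of 0.0 keeps everything, i.e. the filter is
    effectively disabled.
    """
    return "".join(ch for ch, conf in chars if conf >= threshold)
```

For example, with a threshold of 0.9 a character predicted at 0.40 confidence is dropped, while the default threshold leaves the text unchanged.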

Some notes:
- I intentionally disabled it by default; I think some low score (like
0.9-0.95 for Tesseract) could be a safe choice though
- I wanted to use character bboxes and combine them into word bbox
later. However, a bug in Tesseract returns incorrect character bboxes
in some specific scenarios (unit tests caught it 🥳). More in a comment
in the code
2025-01-13 13:12:46 +00:00
Roman Isecke
50ea6fe7fc
feat: add ndjson support (#3845)
### Description
Add ndjson file type support and treat it the same as json files.
2024-12-19 14:39:26 +00:00
Steve Canny
b3a2dd4755
fix: html incorrectly categorizing text (#3841)
Fixes #3666

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-18 18:46:54 +00:00
Steve Canny
9ece0b5ad2
fix: improve false-positive Title elements on Chinese text (#3836)
**Summary**
Improve element-type mapping for Chinese text. Fixes bug where Chinese
text would produce large numbers of false-positive `Title` elements.

Fixes #3084

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2024-12-18 01:16:42 +00:00
Steve Canny
10f0d54ac2
build: remove ruff version upper bound (#3829)
**Summary**
Remove pin on `ruff` linter and fix the handful of lint errors a newer
version catches.
2024-12-16 23:01:22 +00:00
Steve Canny
3b718ec89a
rfctr: prep for pluggable partitioners (#3806)
**Summary**
Prepare auto-partitioning for pluggable partitioners.

Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.

**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.

In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when the argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to partitioner functions because they already do
precisely this.

So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, kwargs) ->
list[Element]` then they can be dispatched to without customization.
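
The "self-selecting" behaviour can be sketched with toy partitioners. The function names and the string return values are illustrative assumptions; real partitioners return `list[Element]`.

```python
def partition_a(filename=None, file=None, *, languages=("eng",), **kwargs):
    # Uses `languages`, applies its own default, ignores everything else.
    return f"a:{','.join(languages)}"


def partition_b(filename=None, file=None, *, strategy="fast", **kwargs):
    # Uses `strategy` only; unknown arguments fall into **kwargs unused.
    return f"b:{strategy}"


def dispatch(partitioner, filename=None, file=None, **kwargs):
    # Forward every argument untouched; no per-partitioner hand-picking.
    return partitioner(filename=filename, file=file, **kwargs)
```

Because the dispatcher never inspects the arguments, a partitioner registered later only has to accept `**kwargs` to be callable through the same path.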
2024-12-10 20:44:34 +00:00
Steve Canny
4379d883a3
chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-09 18:57:22 +00:00
Pluto
e48d79eca1
image alt support (#3797) 2024-11-26 16:20:23 +00:00
Pluto
e1babf0660
Define default HTML to ontology mapping (#3784) 2024-11-20 13:01:28 +00:00
Pluto
c2d17b1ca4
Fix extracting value from field (#3774) 2024-11-07 18:21:39 +00:00
Pluto
66d1e5a5cb
Add max recursion limit and fix to_text() method (#3773) 2024-11-07 15:08:16 +00:00
Christine Straub
df156ebe5a
feat: support pdf link extraction in hi_res strategy (#3753)
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.

### Summary
- Added functionalities to support link extraction in hi_res flow
- Enhanced the word extraction functionality used for link extraction in
both `fast` and `hi_res` flows, resulting in more accurate `start_index`
and `text` in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test

### Testing
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2024-10-31 16:52:27 +00:00
Pluto
1953b8699f
Ml 415/merge inline elements (#3749) 2024-10-31 12:17:25 +00:00
Maksymilian Operlejn
eb1b294b73
ML-405/ML-427 - OntologyElement improvements (#3758)
- the "value" attribute from the <input/> tag is now taken into account
and processed as "text" in the ontology
- tables are now parsed without any ids and classes. There are several
reasons for that; for example, embeddings with ids and
classes can lose some semantic value. Also, more tokens = more expensive
LLM call
- cleaned up to_html, created to_text for OntologyElement
2024-10-31 01:30:53 +00:00
Pluto
5a91f0cda9
Fix layout parsing (#3754) 2024-10-25 14:42:06 +00:00
Pluto
2417f8ed84
Fix when parent id is none for first element in v2 notion: (#3752) 2024-10-25 09:43:36 +00:00
Pawel Kmiecik
bdfcc14e3d
fix: fix partition_via_api retry mechanism when the default SDK's retry config is empty. (#3746) 2024-10-24 09:37:22 +00:00
Pluto
03a3ed8d3b
Add parsing HTML to unstructured elements (#3732)
> This is a POC change; not everything is working correctly and code
quality could be improved significantly

This PR adds parsing of HTML to unstructured elements and back. How does
it work?

HTML has a tree structure, while Unstructured Elements form a list.
The HTML structure is traversed in DFS order, creating Elements and
adding them to the list, so the reading order from the HTML is preserved.
To be able to compose the tree again, all elements have IDs, and
metadata.parent_id is leveraged
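
A simplified sketch of that traversal, assuming a hypothetical `Node` type and integer ids instead of the real ontology classes in `unstructured/documents/transformations.py`:

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Node:
    # Hypothetical tree node standing in for a parsed HTML tag.
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def flatten(root: Node) -> list[dict]:
    """Depth-first, pre-order traversal producing a flat list with parent ids."""
    ids = count()
    out = []

    def visit(node: Node, parent_id: Optional[int]) -> None:
        node_id = next(ids)
        out.append(
            {"id": node_id, "parent_id": parent_id, "tag": node.tag, "text": node.text}
        )
        for child in node.children:
            visit(child, node_id)

    visit(root, None)
    return out
```

The list order is the reading order, and each entry's `parent_id` is enough to rebuild the tree afterwards.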

How is HTML preserved when there are 'layout' tags without text, or
deeply nested HTML that is just text from the point of view of an
Unstructured Element?
Each element is parsed back to HTML using the metadata.text_as_html field.
For layout elements only the html_tag is there; for long text elements
there is everything required to recreate the HTML. You can see examples
in the unit tests or the .json file I attached.

Pros of solution:
- Nothing had to be changed in element types 

Cons:
- There are elements without Text which may be confusing (they could be
replaced by some special type)

Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`

Known bugs (they are minor):
- sometimes html tag is changed incorrectly
- metadata.category_depth and metadata.page_number are not set
- page break is not added between pages 

How to test. Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```

Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import parse_html_to_ontology, ontology_to_unstructured_elements, \
    unstructured_elements_to_ontology
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)

    doc_path = Path("out_dir/example_doc.html")

    html_content = doc_path.read_text()

    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)

    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())

    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```

I attached example doc before and after running these scripts

[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
2024-10-23 12:28:07 +00:00
Pawel Kmiecik
6bceac1749
feat: expose retry params in partition via api (#3724)
This PR:
- adds parameters to control the retry-mechanism behaviour for
`partition_via_api`:
```
    retries_initial_interval: Optional[int] = None,
    retries_max_interval: Optional[int] = None,
    retries_exponent: Optional[float] = None,
    retries_max_elapsed_time: Optional[int] = None,
    retries_connection_errors: Optional[bool] = None,
```
- adds tests that check using them according to defaults
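
To show how such parameters typically interact, here is a generic exponential-backoff sketch. It is an assumption-laden illustration (no jitter, unspecified time unit, hypothetical function name) and not the SDK's actual retry algorithm.

```python
def backoff_intervals(initial_interval, max_interval, exponent, max_elapsed_time):
    """Yield successive wait times until the elapsed budget would be exceeded.

    Each wait grows by `exponent` per attempt and is capped at `max_interval`.
    """
    elapsed = 0
    attempt = 0
    while True:
        wait = min(initial_interval * exponent ** attempt, max_interval)
        if elapsed + wait > max_elapsed_time:
            return
        yield wait
        elapsed += wait
        attempt += 1
```

With an initial interval of 500, a cap of 4000, an exponent of 2.0, and a budget of 10000, the waits double until they hit the cap and stop once the next wait would exceed the budget.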
2024-10-22 14:43:28 +00:00
Yao You
a11ad22609
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in
the cache.

A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, the two cases go in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-10-21 21:55:08 +00:00
Steve Canny
3240e3d17a
rfctr(pptx): minify HTML and table.text is cct (#3734)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is
removed from `base.in`.
2024-10-21 16:23:15 +00:00
Steve Canny
208c7edc52
rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-19 06:49:09 +00:00