**Summary**
Prepare auto-partitioning for pluggable partitioners.
Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.
**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.
In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when an argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to the partitioner functions because they already do
precisely this.
So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, **kwargs) ->
list[Element]`, they can be dispatched to without customization.
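A minimal sketch of the pattern this enables (the `partition_foo` name
and the registry are hypothetical, not the actual implementation):
```python3
from typing import IO, Optional

from unstructured.documents.elements import Element


def partition_foo(
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    # Self-selects the arguments it is sensitive to, applies its own
    # defaults, and simply ignores anything else in kwargs.
    languages = kwargs.get("languages") or ["eng"]
    ...


# Dispatch no longer needs to know any partitioner's parameter list:
# elements = registry[detected_filetype](filename=filename, file=file, **kwargs)
```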
I noticed the ipv4 regex is wrong (it only captures one- or two-digit
octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it.
If you wish I can break out the ipv4 test to its own case, so we don't
interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction
test.
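For reference, a corrected pattern might look like the following
(illustrative only, not necessarily the exact pattern committed):
```python3
import re

# Match full 0-255 octets rather than only one- or two-digit octets.
IPV4_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_PATTERN = rf"{IPV4_OCTET}(?:\.{IPV4_OCTET}){{3}}"

assert re.fullmatch(IPV4_PATTERN, "192.168.100.254")  # three-digit octets now match
assert re.fullmatch(IPV4_PATTERN, "999.1.1.1") is None
```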
Side note: The comment at `unstructured/nlp/patterns.py#95` includes a
bad ipv4 address example (the last octet is wrongly left-padded with a
zero). I left it as is because I'm not sure whether the intention is to
include "non-conventional" ipv4 addresses, like octal or hexadecimal
octets.
**Summary**
Relax the table-segregation rule applied during chunking such that
`Table` and `Text`-subtype elements can be combined into a single chunk
when the chunking window allows.
**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.
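For illustration, a chunk produced with a large window can now hold a
table plus its surrounding narrative (the document path here is
hypothetical):
```python3
from unstructured.chunking.title import chunk_by_title
from unstructured.partition.auto import partition

elements = partition("example-docs/report-with-table.pdf")  # hypothetical file
chunks = chunk_by_title(elements, max_characters=2000)
# Before this change, a chunk containing a Table contained nothing else.
# Now a Table and adjacent Text-subtype elements can share a chunk when
# they fit together within the 2000-character window.
```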
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
- per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under metrics/evaluate.py that
incorrectly retrieves the file extension from paths like '*.pdf.txt'
before the conversion to the cct file (see the screenshot below)
- the current status is shown in the top example
- the correct version is shown in the bottom example of the
screenshot.

- in addition, I also observed that the doctypes returned are not
aligned: some return with the dot ('.*') and some without
- therefore, I aligned them all to output the same version, which is
'.*' (a minimal sketch of the extension fix appears below)
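A minimal sketch of the fix described above, assuming the doubled
suffix produced by the cct conversion:
```python3
from pathlib import Path

path = Path("example.pdf.txt")  # cct conversion appends '.txt'
wrong_doctype = path.suffix             # '.txt' -- current behavior
right_doctype = Path(path.stem).suffix  # '.pdf' -- desired behavior
```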
Script to render HTML from unstructured elements.
NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs that have
non-empty `metadata.text_as_html`.
TODO: It was noted that the `unstructured_elements_to_ontology` func
always returns a single page.
This script uses helper functions to handle multiple pages. I am not
sure whether this was intended or it is a bug; if it is a bug it would
require a bit longer debugging, so to make the script usable fast I used
workarounds.
Usage: test with any outputs with non-empty `metadata.text_as_html`.
Example files attached.
[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)
[Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)
This PR uses a weighted average (weighted by the number of actual
tables) instead of an unweighted average for table metrics.
- for pages with ground truth tables, the weight is proportional to the
number of ground truth tables on that page
- pages that have no ground truth tables but have predicted tables
(false positives) are assigned one table's worth of weight for the whole
page when calculating the mean value of `table_level_acc` (see the
sketch after this list)
- pages with false positive tables do not contribute to table structural
or table content metrics
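A minimal sketch of the weighting scheme just described (variable names
are illustrative, not the actual code in metrics/evaluate.py):
```python3
page_scores = [0.9, 0.5, 0.0]  # per-page table_level_acc
gt_tables = [2, 1, 0]          # ground-truth tables per page; last page is FP-only

# A false-positive-only page gets one table's worth of weight.
weights = [n if n > 0 else 1 for n in gt_tables]
weighted_acc = sum(s * w for s, w in zip(page_scores, weights)) / sum(weights)
# (0.9*2 + 0.5*1 + 0.0*1) / 4 = 0.575
```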
## Test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
This simplest solution doesn't drop HTML from metadata when merging
Elements from HTML input. We still need to address how to handle nested
elements, whether we want `LayoutElements` in the metadata of Composite
Elements, and a unit test showing the current behavior.
Note: metadata still contains `orig_elements` which has all the
metadata.
This PR adds support for link extraction in the pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.
### Summary
- Added functionality to support link extraction in the hi_res flow
- Enhanced the word extraction functionality used for link extraction in
both the `fast` and `hi_res` flows, resulting in more correct
`start_index` and `text` in the `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test
### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
- the "value" attribute from the `<input/>` tag will be taken into
account and processed as "text" in the ontology
- tables will now be parsed without any ids and classes; we have
different reasons for that: for example, embeddings with ids and classes
can lose some semantic value, and more tokens mean a more expensive LLM
call
- cleaned `to_html`, created `to_text` for `OntologyElement`
This ticket ensures that the CCT metric will not be sensitive to
differences in whitespace (including newlines).
All whitespace in the strings is changed to a single space `" "` in both
GT and PRED before the metric is computed.
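The normalization amounts to something like this (illustrative sketch,
not the exact code):
```python3
import re


def normalize_whitespace(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", text)


assert normalize_whitespace("a\n b\t\tc") == "a b c"
```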
Additional changes in CHANGELOG due to auto-formatting.
> This is a POC change; not everything works correctly and code
quality could be improved significantly
This ticket adds parsing of HTML to unstructured elements and back. How
does it work?
HTML has a tree structure, while Unstructured Elements form a list.
The HTML structure is traversed in DFS order, creating Elements and
adding them to the list, so the reading order from the HTML is
preserved. To be able to compose the tree again, all elements have IDs,
and `metadata.parent_id` is leveraged
How is the HTML preserved if there are 'layout' elements without text,
or deeply nested HTML that is just text from the point of view of an
Unstructured Element?
Each element is parsed back to HTML using the `metadata.text_as_html`
field. For layout elements only the html_tag is there; for long text
elements there is everything required to recreate the HTML. You can see
examples in the unit tests or the .json file I attached.
Pros of the solution:
- Nothing had to be changed in the element types

Cons:
- There are elements without text, which may be confusing (they could be
replaced by some special type)
Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`
Known bugs (they are minor):
- sometimes the html tag is changed incorrectly
- `metadata.category_depth` and `metadata.page_number` are not set
- a page break is not added between pages
How to test. Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)
    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()
    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))
    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached example doc before and after running these scripts
[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
This PR:
- adds parameters to control the retry-mechanism behaviour for
`partition_via_api` (an illustrative call follows this list):
```
retries_initial_interval: Optional[int] = None,
retries_max_interval: Optional[int] = None,
retries_exponent: Optional[float] = None,
retries_max_elapsed_time: Optional[int] = None,
retries_connection_errors: Optional[bool] = None,
```
- adds tests that check they are applied according to the defaults
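An illustrative call using the new parameters (the numeric values are
made up, and their units follow the underlying SDK's retry
configuration):
```python3
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/pdf/embedded-link.pdf",
    retries_initial_interval=500,
    retries_max_interval=60_000,
    retries_exponent=1.5,
    retries_max_elapsed_time=900_000,
    retries_connection_errors=True,
)
```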
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in
the cache.
A few document ingest results changed:
- two places for `biomed-api` (actually processed locally on the runner)
are due to very small changes in the numerical results of the bounding
box areas: one results in a duplicated page number/header and another
results in deduplication of a word in a sentence that starts on a new
line (yes, the two cases go in opposite directions)
- the layout parser paper now outputs the code lines, with the page
number inside the code box, as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
This PR rounds the floating-point numbers associated with coordinates in
`pdfminer_processing.py`. This helps eliminate machine-precision-caused
randomness in bounding box overlap detection. Currently the rounding is
set to the nearest machine precision obtained from `np.finfo(float)`
(i.e., `np.float64`), which yields resolution = `1e-15`.
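A sketch of the effect (not the exact code in `pdfminer_processing.py`):
```python3
import numpy as np

DECIMALS = 15  # np.finfo(float).resolution == 1e-15

x = np.float64(0.1) + np.float64(0.2)  # 0.30000000000000004: machine-precision noise
assert x != 0.3
assert np.round(x, DECIMALS) == 0.3  # rounding removes the noise before comparison
```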
## Future work
We should reduce the rounding to only 6 digits after the decimal point,
since the data type `float32` has a resolution of only `1e-6`. However,
that would break tests. A follow-up is required to tune the threshold
values in `pdfminer_processing.py` so that they work with `1e-6`
resolution.