This PR uses an average weighted by the number of ground truth tables,
instead of an unweighted average, for table metrics.
- for pages that have ground truth tables, the weight is proportional
to the number of ground truth tables on that page
- pages that have no ground truth tables but do have predicted tables
(false positives) are assigned a weight of 1 table for the whole
page when calculating the mean value of `table_level_acc`
- pages with false positive tables do not contribute to table structural
or table content metrics
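A minimal sketch of the weighting described above, assuming per-page scores are already available. `PageScore`, `page_weight`, and the two averaging helpers are hypothetical names used for illustration; they are not the evaluation code in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageScore:
    """Per-page table scores (hypothetical container for illustration)."""

    n_gt_tables: int  # number of ground truth tables on the page
    n_pred_tables: int  # number of predicted tables on the page
    table_level_acc: float
    structure_score: Optional[float] = None  # None when there is nothing to compare


def page_weight(page: PageScore) -> int:
    """Weight of a page when averaging `table_level_acc`."""
    if page.n_gt_tables > 0:
        return page.n_gt_tables
    # False-positive page: no ground truth tables but some predicted tables.
    return 1 if page.n_pred_tables > 0 else 0


def weighted_table_level_acc(pages: List[PageScore]) -> Optional[float]:
    total = sum(page_weight(p) for p in pages)
    if total == 0:
        return None
    return sum(page_weight(p) * p.table_level_acc for p in pages) / total


def weighted_structure_score(pages: List[PageScore]) -> Optional[float]:
    # Only pages with ground truth tables contribute to structural metrics.
    scored = [p for p in pages if p.n_gt_tables > 0 and p.structure_score is not None]
    total = sum(p.n_gt_tables for p in scored)
    if total == 0:
        return None
    return sum(p.n_gt_tables * p.structure_score for p in scored) / total
```

In this sketch a false-positive page lowers the mean `table_level_acc` (it carries one table's worth of weight) but is ignored by the structural average.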
## test
This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- tests that the weighted average is written to the report
This is the simplest solution: it doesn't drop HTML from metadata when merging
Elements from HTML input. We still need to address how to handle nested
elements and whether we want to have `LayoutElements` in the metadata of
Composite Elements; a unit test shows the current behavior.
Note: metadata still contains `orig_elements` which has all the
metadata.
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.
### Summary
- Added functionality to support link extraction in the `hi_res` flow
- Enhanced the word extraction used for link extraction in
both `fast` and `hi_res` flows, resulting in more accurate `start_index`
and `text` values in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test
### Testing
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
assert len(elements[0].metadata.links) == 3
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
- the "value" attribute of an `<input/>` tag is now taken into account and
processed as "text" in the ontology
- tables are now parsed without any ids and classes. There are several
reasons for this; for example, embeddings with ids and
classes can lose some semantic value. Also, more tokens mean a more expensive
LLM call
- cleaned up `to_html` and created `to_text` for `OntologyElement`
This ticket ensures that the CCT metric is not sensitive to differences
in whitespace (including newlines).
All whitespace in a string is changed to a single space `" "` in both GT
and PRED before the metric is computed.
Additional changes in CHANGELOG due to auto-formatting.
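As an illustration of the whitespace handling described above, here is a minimal sketch. It assumes that each run of whitespace is collapsed to one space, which is one reading of the description; `normalize_whitespace` is a hypothetical helper, not the function used in the metric code.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text)


# Ground truth and prediction differ only in whitespace.
gt = "Total revenue\n\nincreased by 5%"
pred = "Total revenue increased\tby 5%"
same_after_normalization = normalize_whitespace(gt) == normalize_whitespace(pred)
```

Applying the same normalization to both GT and PRED means line-wrapping differences no longer count as edits.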
> This is a POC change; not everything is working correctly and code
> quality could be improved significantly
This ticket adds parsing of HTML to unstructured elements and back. How does it
work?
HTML has a tree structure, while Unstructured Elements form a list.
The HTML structure is traversed in DFS order, creating Elements and adding
them to the list, so the reading order from the HTML is preserved. To be able to
compose the tree again, all elements have IDs and `metadata.parent_id` is
leveraged.
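The idea can be sketched with the standard library alone. This is a toy illustration, not the code in `unstructured/documents/transformations.py`: `FlattenHTML`, `flatten`, and `rebuild` are hypothetical names, elements are plain dicts, and the input is assumed to be well-formed HTML with every tag explicitly closed, no attributes, and text placed before any child tags.

```python
from html.parser import HTMLParser
from itertools import count


class FlattenHTML(HTMLParser):
    """Collect tags in document (DFS pre-order) order, recording parent ids."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._stack = []  # ids of the tags that are currently open
        self._ids = count(1)

    def handle_starttag(self, tag, attrs):
        element = {
            "id": next(self._ids),
            "parent_id": self._stack[-1] if self._stack else None,
            "tag": tag,
            "text": "",
        }
        self.elements.append(element)
        self._stack.append(element["id"])

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            # ids are 1-based positions in self.elements
            self.elements[self._stack[-1] - 1]["text"] += data.strip()


def flatten(html):
    parser = FlattenHTML()
    parser.feed(html)
    parser.close()
    return parser.elements


def rebuild(elements):
    """Recompose the HTML tree from the flat list using only id/parent_id."""
    children = {}
    for element in elements:
        children.setdefault(element["parent_id"], []).append(element)

    def render(element):
        inner = element["text"] + "".join(render(c) for c in children.get(element["id"], []))
        return f"<{element['tag']}>{inner}</{element['tag']}>"

    return "".join(render(e) for e in children.get(None, []))
```

Because the list keeps document order and every entry knows its parent, the round trip `rebuild(flatten(html))` reproduces simple inputs such as `<div><h1>Title</h1><p>Body</p></div>`.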
How is HTML preserved if there are 'layout' elements without text, or
deeply nested HTML that is just text from the point of view of an
Unstructured Element?
Each element is parsed back to HTML using the `metadata.text_as_html` field.
For layout elements only the html_tag is stored there; for long text elements
it holds everything required to recreate the HTML. You can see examples in
the unit tests or the .json file I attached.
Pros of solution:
- Nothing had to be changed in element types
Cons:
- There are elements without Text which may be confusing (they could be
replaced by some special type)
Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`
Known bugs (they are minor):
- sometimes html tag is changed incorrectly
- metadata.category_depth and metadata.page_number are not set
- page break is not added between pages
How to test. Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)
    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()

    # HTML -> ontology -> flat list of unstructured elements
    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    # Unstructured elements -> ontology -> HTML
    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached an example doc from before and after running these scripts:
[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
This PR:
- adds parameters to control the retry-mechanism behaviour for
`partition_via_api`:
```
retries_initial_interval: Optional[int] = None,
retries_max_interval: Optional[int] = None,
retries_exponent: Optional[float] = None,
retries_max_elapsed_time: Optional[int] = None,
retries_connection_errors: Optional[bool] = None,
```
- adds tests that check they are applied according to the defaults
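For intuition, parameters with these names usually describe an exponential backoff schedule. The sketch below is a hypothetical illustration of that idea, not the SDK's retry implementation: `backoff_intervals` is an invented helper, the time units are left abstract, and real implementations often add random jitter.

```python
def backoff_intervals(initial_interval, max_interval, exponent, max_elapsed_time):
    """Yield the wait before each retry until the elapsed-time budget is spent."""
    if initial_interval <= 0:
        raise ValueError("initial_interval must be positive")
    elapsed = 0
    interval = initial_interval
    while elapsed + interval <= max_elapsed_time:
        yield interval
        elapsed += interval
        # Grow the wait geometrically, but never beyond the configured cap.
        interval = min(interval * exponent, max_interval)


schedule = list(backoff_intervals(initial_interval=1, max_interval=8, exponent=2, max_elapsed_time=20))
```

Under these assumptions the waits grow from the initial interval up to the cap, and retrying stops once another wait would exceed the total budget.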
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
a vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in
the cache.
A few document ingest results have changed:
- two changes for `biomed-api` (actually processed locally on the runner) are
due to very small changes in the numerical results of the bounding box
areas: one results in a duplicated page number/header and the other
results in the deduplication of a word in a sentence that starts on a new
line. (Yes, the two cases go in opposite directions.)
- the layout parser paper now outputs the code lines with a page number
inside the code box as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is
removed from `base.in`.
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.
**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove the `encoding` parameter as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.
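To illustrate why per-part decoding matters, here is a small standalone sketch using only the stdlib `email` package. It is not the rewritten `partition_email()` module; `build_sample_message` and `extract_text_parts` are hypothetical helpers that build a message whose two body parts declare different charsets and then decode each part by its own declared charset.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_message() -> bytes:
    """Serialize a multipart/alternative email whose parts use different charsets."""
    msg = EmailMessage()
    msg["Subject"] = "Grüße"
    msg["To"] = "a@example.com"
    msg.set_content("Plain: café", charset="iso-8859-1")
    msg.add_alternative("<p>HTML: café ☕</p>", subtype="html", charset="utf-8")
    return msg.as_bytes()


def extract_text_parts(raw: bytes) -> dict:
    """Parse raw bytes and decode each text part using its own declared charset."""
    msg = message_from_bytes(raw, policy=policy.default)
    parts = {}
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() honors the part's transfer-encoding and charset.
            parts[part.get_content_type()] = part.get_content()
    return parts


parts = extract_text_parts(build_sample_message())
```

Note that the input is `bytes`, not `str`: no single text encoding could describe the whole message, which is the reason given above for dropping the `text` and `encoding` arguments.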
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.
This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).
**Default Behavior**:
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.
**Purpose**:
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.
**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.
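The failure mode can be reproduced with a simplified stand-in. In this sketch `get_instance`, `AgentWithoutLanguage`, and `AgentWithLanguage` are hypothetical toy versions, assuming only that the factory always passes a language argument to the agent constructor; the real `OCRAgent.get_instance` is more involved.

```python
from typing import Optional


class AgentWithoutLanguage:
    """Mimics the old constructor: it accepts no language argument."""

    def __init__(self):
        self.language = None


class AgentWithLanguage:
    """Mimics the fixed constructor: language is optional and defaults to None."""

    def __init__(self, language: Optional[str] = None):
        self.language = language


def get_instance(agent_cls, language):
    # Toy factory: always forwards the language to the constructor.
    return agent_cls(language)


try:
    get_instance(AgentWithoutLanguage, "eng")
    old_constructor_error = None
except TypeError as exc:
    old_constructor_error = exc  # unexpected positional argument

fixed_agent = get_instance(AgentWithLanguage, "eng")
```

Defaulting the new parameter to `None` keeps existing callers working while satisfying the factory's expectation.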
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
**Summary**
Remove double-decoration from EML and MSG.
**Additional Context**
- These needed to wait until the end because `partition_email()` and
`partition_msg()` can use any other partitioner for one of their
attachments.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Install new `@apply_metadata()` on TXT.
**Additional Context**
- Both EML and MSG delegate to both HTML and TXT to partition the
message-body, depending on which MIME-part body payload is selected
(`text/plain` or `text/html`). This PR prepares the way to remove
decorators from EML and MSG in the next PR.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Install new `@apply_metadata()` on HTML and remove decorators from
delegating partitioners EPUB, MD, ORG, RST, and RTF.
**Additional Context**
- All five of these delegating partitioners delegate to
`partition_html()` so they're something of a matched set. EML and MSG
also partially delegate to HTML but that's a harder problem (they also
delegate to all other partitioners for attachments) that we'll address a
couple of PRs later.
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
`partition_html()`.
- Remove all decorators from delegating partitioners; this removes the
"double-decorating".
**Summary**
Install new `@apply_metadata()` on PPTX, TSV, XLSX, and XML and remove
decoration from PPT.
**Additional Context**
- Alphabetical order turns out to be hard, so this is the remaining
"easy" delegating partitioner and the remaining principal partitioners.
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
principal partitioners (those that do not delegate to other
partitioners).
- Remove all decorators from delegating partitioners (PPT in this case);
this removes the "double-decorating".
**Summary**
Install new `@apply_metadata()` on CSV and DOCX and remove decoration
from DOC and ODT.
**Additional Context**
- Working in alphabetical order and keeping PR size manageable, replace
use of `@process_metadata()` and `@add_metadata_with_filetype()`
decorators with `@apply_metadata()` on principal partitioners (those
that do not delegate to other partitioners).
- Remove all decorators from delegating partitioners (DOC and ODT in
this case); this removes the "double-decorating".
**Summary**
Refine `@apply_metadata()` replacement decorator. Note it has not been
installed yet.
- Apply `metadata_last_modified` arg with the `@apply_metadata()`
decorator. No need for redundant processing in each partitioner.
- Add "unique-ify" step to fix any cases where the same `Element` or
`ElementMetadata` instance was used more than once in the element
stream. This prevents unexpected "multi-mutation" in downstream
processes.
- Apply "global" metadata items before computing hash-ids. In
particular, `.metadata.filename` is used in the hash computation and
will produce different results if that's not already settled.
- Compute hash-ids _before_ computing `.metadata.parent_id`. This
removes the need for mapping UUID element-ids to their hash counterpart
and doing a fixup of `.parent_id` after applying hash-ids to elements.
**Additional Context**
- The `@apply_metadata()` decorator replaces the four metadata-related
decorators: `@process_metadata()`, `@add_metadata_with_filetype()`,
`@add_metadata()`, and `@add_filetype()`.
- It will be installed on each partitioner in a series of following PRs.
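The "unique-ify" step mentioned above can be pictured with a short sketch. This is an assumption about the general approach (copy any instance that appears a second time), not the decorator's actual code; `Element`, `Metadata`, and `uniqueify` here are minimal hypothetical stand-ins.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Metadata:
    filename: str = ""


@dataclass
class Element:
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def uniqueify(elements):
    """Return a list in which no Element or Metadata instance appears twice."""
    seen_elements = set()
    seen_metadata = set()
    result = []
    for element in elements:
        if id(element) in seen_elements:
            # Same element object repeated in the stream: copy it whole.
            element = copy.deepcopy(element)
        elif id(element.metadata) in seen_metadata:
            # Distinct element sharing a metadata object: copy the metadata only.
            element.metadata = copy.deepcopy(element.metadata)
        seen_elements.add(id(element))
        seen_metadata.add(id(element.metadata))
        result.append(element)
    return result
```

After this pass, mutating one element's metadata downstream cannot silently change another element.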
This is a fix for this
[bug](https://github.com/Unstructured-IO/unstructured/issues/3674): auto-partition fails on text files that are empty or contain only whitespace.
Inference of the .txt file type fails if the file contains only whitespace.
To Reproduce:
```
from tempfile import NamedTemporaryFile

from unstructured.partition.auto import partition

with NamedTemporaryFile(mode="w", suffix=".txt") as f:
    f.write(" \n")
    f.seek(0)  # seeking also flushes the buffered write to disk
    elements = partition(filename=f.name)
```
**Summary**
In preparation for pluggable auto-partitioners, add a new metadata
decorator to replace the four existing ones.
**Additional Context**
"Global" metadata items, those applied to all elements on all
partitioners, are applied using a decorator.
Currently there are four decorators where there only needs to be one.
Consolidate those into a single metadata decorator.
One or two additional behaviors of the new decorator will allow us to
remove decorators from delegating partitioners which is a prerequisite
for pluggable auto-partitioners.
**Summary**
Remove unused `include_metadata` parameter.
**Additional Context**
- The `include_metadata` parameter was originally added circa v0.7.12 as
a mechanism for avoiding the "double-decorating" problem on delegating
partitioners.
- It turns out it doesn't fully address that problem, is now unused, and
is unnecessary for the solution we'll be adding as part of pluggable
partitioners.
- Remove the unnecessary complexity introduced by this unused parameter.
**Summary**
Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata`
field from `ElementMetadata`.
**Additional Context**
- "regex-metadata" was an experimental feature that didn't pan out.
- It's implemented by one of the post-partitioning metadata decorators,
so get rid of it as part of the cleanup before consolidating those
decorators.
This PR fixes an occasional `KeyError` when calling
`assign_and_map_hash_ids`.
- This happens when the input `elements` has duplicated element
instances or metadata.
- When there are duplications the logic to iterate through all elements
and map their parent ids will raise an error when an already mapped
parent id is up for mapping.
- The fix adds logic to check whether the parent id exists in
`old_to_new_mapping`; if it doesn't, we skip mapping it
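A minimal sketch of the guard described above, assuming a mapping from old ids to new hash ids has already been built. `Element` and `remap_parent_ids` are hypothetical stand-ins, not the body of `assign_and_map_hash_ids`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    id: str
    parent_id: Optional[str] = None


def remap_parent_ids(elements: List[Element], old_to_new_mapping: Dict[str, str]) -> List[Element]:
    for element in elements:
        parent_id = element.parent_id
        # A duplicated instance is visited twice; on the second visit its
        # parent id has already been replaced and is no longer a key.
        if parent_id is None or parent_id not in old_to_new_mapping:
            continue
        element.parent_id = old_to_new_mapping[parent_id]
    return elements
```

Without the membership check, the second visit of a duplicated element would look up an already-mapped id and raise `KeyError`.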
## test
This PR adds a unit test on this case and the test would fail without
the fix.
Wrap the `shared.PartitionParameters` usage with
`operations.PartitionRequest`. Passing the parameters without this wrapper has been deprecated since
v0.23.0 of the SDK, and will be unsupported in v0.26.0.
**Summary**
In preparation for pluggable auto-partitioners simplify metadata as
discussed.
**Additional Context**
- Pluggable auto-partitioners require partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
passing the proxy partitioner the arguments the metadata decorators are
sensitive to. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
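The "double-decorating" problem described above can be demonstrated with a toy example. Everything here is hypothetical: `count_processing` stands in for a non-idempotent metadata decorator, and the `toy_partition_*` functions stand in for a proxy partitioner and a delegating partitioner.

```python
import functools


def count_processing(func):
    """Toy non-idempotent post-processing decorator: each run changes the result."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        elements = func(*args, **kwargs)
        for element in elements:
            element["times_processed"] = element.get("times_processed", 0) + 1
        return elements

    return wrapper


@count_processing
def toy_partition_html(text):
    # Proxy partitioner: does the actual partitioning.
    return [{"text": text}]


@count_processing
def toy_partition_epub_double_decorated(text):
    # Delegating partitioner that is also decorated: post-processing runs twice.
    return toy_partition_html(text)


def toy_partition_epub(text):
    # Undecorated delegating partitioner: the proxy handles all metadata once.
    return toy_partition_html(text)
```

Leaving the delegating partitioner undecorated is the simple fix: the result is post-processed exactly once, regardless of which arguments are forwarded.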
**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house metadata-specific
objects shared by partitioners.
**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that converts its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
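A consistent signature is what makes a registry-style dispatcher possible. The sketch below is a hypothetical illustration of that design, not the planned implementation: `register`, `auto_partition`, and `toy_partition_txt` are invented names, and elements are plain strings rather than `Element` objects.

```python
import io
from typing import IO, Any, Callable, Dict, List, Optional

Partitioner = Callable[..., List[str]]

_REGISTRY: Dict[str, Partitioner] = {}


def register(file_type: str, partitioner: Partitioner) -> None:
    """Plug in a partitioner for a file type at runtime."""
    _REGISTRY[file_type] = partitioner


def toy_partition_txt(
    filename: Optional[str] = None, *, file: Optional[IO[bytes]] = None, **kwargs: Any
) -> List[str]:
    """Toy partitioner following the `partition_x(filename, *, file, **kwargs)` shape."""
    if file is not None:
        data = file.read()
    elif filename is not None:
        with open(filename, "rb") as f:
            data = f.read()
    else:
        raise ValueError("either filename or file must be provided")
    text = data.decode(kwargs.get("encoding", "utf-8"))
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def auto_partition(
    file_type: str, filename: Optional[str] = None, *, file: Optional[IO[bytes]] = None, **kwargs: Any
) -> List[str]:
    # The dispatcher needs no per-file-type knowledge of which arguments to forward.
    return _REGISTRY[file_type](filename, file=file, **kwargs)


register("txt", toy_partition_txt)
result = auto_partition("txt", file=io.BytesIO(b"first\n\nsecond"))
```

Since every partitioner accepts `**kwargs`, the dispatcher can forward everything and let each partitioner pick what it understands.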
This PR enhances `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
This PR vectorizes the computation of element overlap to speed up
the deduplication process for extracted elements.
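For readers unfamiliar with the technique, a vectorized pairwise IOU can be sketched with NumPy broadcasting. This is a generic illustration under the assumption that boxes are `(x1, y1, x2, y2)` rows; `pairwise_iou` is a hypothetical name and not necessarily how the functions in this PR are written.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IOU between every box in boxes_a (N, 4) and every box in boxes_b (M, 4)."""
    a = np.asarray(boxes_a, dtype=float)[:, None, :]  # shape (N, 1, 4)
    b = np.asarray(boxes_b, dtype=float)[None, :, :]  # shape (1, M, 4)

    # Width and height of each pairwise intersection, clipped at zero.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h  # shape (N, M)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])  # shape (N, 1)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])  # shape (1, M)
    union = area_a + area_b - inter

    # Avoid division by zero for degenerate boxes.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

All N×M overlaps are computed in a handful of array operations instead of a nested Python loop, which is where the speedup for documents with many elements comes from.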
## test
This PR adds unit tests for the new vectorized IOU and subregion
computation functions.
In addition, running partition on large files with many elements like
this slide:
[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)
shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.
Profiling results show that the new implementation greatly reduces the
time cost of computation, and most of the time is now spent on getting
the coordinates from a list of bboxes.
