1739 Commits

Author SHA1 Message Date
Christine Straub
3bca724624 feat: update standardize_quote() 2024-12-05 13:24:57 -08:00
Christine Straub
ef1c85ef0f feat: enhance quote standardization tests with additional Unicode scenarios 2024-12-05 11:35:19 -08:00
Christine Straub
7d06c120dc
Merge branch 'main' into ML-593/quote-standardization 2024-12-05 10:27:26 -08:00
Christine Straub
a4be1d6100 fix: modify changelog.md 2024-12-05 10:17:08 -08:00
Christine Straub
f44862a5ea fix:confict error 2024-12-05 09:56:52 -08:00
Christine Straub
00a999153c fix: confict error 2024-12-05 09:54:36 -08:00
Christine Straub
5e60942960 test:fix lint errors 2024-12-05 09:40:50 -08:00
Marianna
4140f625d0
add script to render html from unstructured elements (#3799)
Script to render HTML from unstructured elements.

NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs with
non-empty `metadata.text_as_html`.

TODO: It was noted that unstructured_elements_to_ontology func always
returns a single page
This script is using helper functions to handle multiple pages. I am not
sure if this was intended, or it is a bug - if it is a bug it would
require bit longer debugging - to make it usable fast I used
workarounds.

Usage: test with any outputs with non-empty `metadata.text_as_html`.
Example files attached.

`[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)`


[Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)
2024-12-04 19:46:51 -08:00
Christine Straub
bb26603a30 test: enhance quote standardization tests with additional Unicode scenarios 2024-12-04 13:16:00 -08:00
Christine Straub
c0c3fd673f test: enhance quote standardization tests with additional Unicode scenarios 2024-12-04 13:02:07 -08:00
Christine Straub
9038b88b8e fix: ensure newline at end of file in standardize_quotes function 2024-12-04 12:48:15 -08:00
Christine Straub
c821f12d29 test: update string tests for consistent quote handling 2024-12-04 12:17:16 -08:00
Christine Straub
4e0f7cdbc0 Feat: enhance quote standardization with comprehensive Unicode coverage and update tests 2024-12-04 11:33:03 -08:00
Christine Straub
371cb7528d Feat: add quote standardization and update edit distance calculation 2024-12-03 21:21:39 -08:00
Nathan Van Gheem
0fb814db61
Use native ntlk download (#3796)
This PR changes how we download NLTK data to use the native nltk
downloader.

We had moved to our own hosted NLTK dataset because of this CVE:
https://nvd.nist.gov/vuln/detail/CVE-2024-39705

Ref: https://github.com/Unstructured-IO/unstructured/pull/3361

Latest versions of NLTK have fixed this issue:
https://github.com/nltk/nltk/blob/develop/ChangeLog
0.16.9
2024-12-02 19:30:28 +00:00
cragwolfe
9445a2dd01
chore: fix CHANGELOG formatting (#3800)
Fixes formatting in CHANGELOG.md where most of the page was bold and
indented. (verify the branch version here:
https://github.com/Unstructured-IO/unstructured/blob/crag/tables-tweak/CHANGELOG.md)

Bonus tweak: u-table-inspect.sh is more robust to adding borders for
visualizations
2024-11-26 15:38:42 -08:00
Pluto
0fe6ac60aa
Add backward compatibility for metric calculation (#3798)
Co-authored-by: cragwolfe <crag@unstructured.io>
0.16.8
2024-11-26 18:14:16 +00:00
Pluto
e48d79eca1
image alt support (#3797) 0.16.7 2024-11-26 16:20:23 +00:00
ryannikolaidis
626f73af5b
chore: remove dev and release as 0.16.6 (#3793) 0.16.6 2024-11-21 23:59:38 +00:00
Yao You
3b9b01c502
Feat: weighted average table metrics (#3348)
This PR uses (number of actual table) weighted average instead of
average without weights for table metrics.

- pages where there are ground truth tables the weight is proportional
to the number of ground truth tables in that page
- pages where there are no ground truth tables but has predicted tables
(false positive) are assigned as 1 table worth of weight for the whole
page for calculating the mean value of `table_level_acc`
- pages with false positive tables do not contribute to table structural
or table content metrics

## test

This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- test the weighted average is written to the report
2024-11-20 17:14:57 +00:00
Pluto
85ecdab077
Add text as html to orig elements chunks (#3779)
This simplest solution doesn't drop HTML from metadata when merging
Elements from HTML input. We still need to address how to handle nested
elements, and if we want to have `LayoutElements` in the metadata of
Composite Elements, a unit test showing the current behavior.
Note: metadata still contains `orig_elements` which has all the
metadata.
2024-11-20 13:27:17 +00:00
Pluto
e1babf0660
Define default HTML to ontology mapping (#3784) 2024-11-20 13:01:28 +00:00
Pluto
ca27b8aa97
Set <table> to be ontology.Table not UncategorizedText (#3782) 2024-11-15 14:30:48 +00:00
Yao You
a6aefee0cb
chore: remove dev and release as 0.16.5 (#3775) 0.16.5 2024-11-07 14:31:50 -06:00
Pluto
c2d17b1ca4
Fix extracting value from field (#3774) 2024-11-07 18:21:39 +00:00
Pluto
66d1e5a5cb
Add max recursion limit and fix to_text() method (#3773) 2024-11-07 15:08:16 +00:00
Christine Straub
df156ebe5a
feat: support pdf link extraction in hi_res strategy (#3753)
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.

### Summary
- Added functionalities to support link extraction in hi_res flow
- Enhanced word extraction functionality used for link extraction in
both `fast` and `hi_res` flows, resulted in more correct `start_index`
and `text` in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test

### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
0.16.4
2024-10-31 16:52:27 +00:00
Pluto
1953b8699f
Ml 415/merge inline elements (#3749) 2024-10-31 12:17:25 +00:00
Maksymilian Operlejn
eb1b294b73
ML-405/ML-427 - OntologyElement improvements (#3758)
- the "value" attribute from <input/> tag will be taken into account and
processed as "text" in ontology
- the tables will now be parsed without any ids and classes - we have
different reasons behind that, for example, embeddings with ids and
classes can lose some semantic value. Also, more tokens = more expensive
LLM call
-  cleaned to_html, created to_text for OntologyElement
2024-10-31 01:30:53 +00:00
ryannikolaidis
d0be1151a1
chore: pin unstructured-ingest (#3764) 2024-10-30 23:25:47 +00:00
Tracy Shen
340a07f18b
[Merge] release to 0.16.3 (#3755)
- bump version to 0.16.3 based on Pluto's fix on layout parsing
- update unstructured-inference version to 0.8.1 in
0.16.3
2024-10-25 13:23:41 -07:00
Pluto
5a91f0cda9
Fix layout parsing (#3754) 2024-10-25 14:42:06 +00:00
Pluto
2417f8ed84
Fix when parent id is none for first element in v2 notion: (#3752) 2024-10-25 09:43:36 +00:00
Marianna
9835fe4d5b
set version 0.16.2 (#3748) 0.16.2 2024-10-24 16:29:43 +00:00
Marianna
aa5935b357
Ml 384/whitespaces in cct (#3747)
This ticket ensures that CCT metric will not be sensitive to differences
in whitespace (including newline).
All whitespaces in string are changed to single space `" "` in both GT
and PRED before the metric is computed.

Additional changes in CHANGELOG due to auto-formatting.
2024-10-24 13:02:34 +00:00
Pawel Kmiecik
bdfcc14e3d
fix: fix partition_via_api retry mechanism when the default SDK's retry config is empty. (#3746) 2024-10-24 09:37:22 +00:00
Pluto
0b4c72a618
Set version to 0.16.1 (#3745) 0.16.1 2024-10-23 12:24:44 -07:00
Pluto
03a3ed8d3b
Add parsing HTML to unstructured elements (#3732)
> This is POC change; not everything is working correctly and code
quality could be improved significantly

This ticket add parsing HTML to unstructured element and back. How is it
working?

HTML has a tree structure, Unstructured Elements is a list.
HTML structure is traversed in DFS order, creating Elements and adding
them to list. So the reading order from HTML is preserved. To be able to
compose tree again all elements has IDs, and metadata.parent_id is
leveraged

How html is preserved if there are 'layout' without text, or there are
deeply nested HTMLs that are just text from the point of view of
Unstructured Element?
Each element is parsed back to HTML using metadata.text_as_html field.
For layout elements only html_tag are there, for long text elements
there is everything required to recreate HTML - you can see examples in
unit tests or .json file I attached.

Pros of solution:
- Nothing had to be changed in element types 

Cons:
- There are elements without Text which may be confusing (they could be
replaced by some special type)

Core transformation logic can be found in 2 functions in
`unstructured/documents/transformations.py`

Knowns bugs (they are minor):
- sometimes html tag is changed incorrectly
- metadata.category_depth and metadata.page_number are not set
- page break is not added between pages 

How to test. Generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```

Then parse to unstructured elements and back to html
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import parse_html_to_ontology, ontology_to_unstructured_elements, \
    unstructured_elements_to_ontology
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)

    doc_path = Path("out_dir/example_doc.html")

    html_content = doc_path.read_text()

    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)

    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())

    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```

I attached example doc before and after running these scripts

[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
2024-10-23 12:28:07 +00:00
Pawel Kmiecik
6bceac1749
feat: expose retry params in partition via api (#3724)
This PR:
- adds parameters to control the retry-mechanism behaviour for
`partition_via_api`:
```
    retries_initial_interval: [int] = None,
    retries_max_interval: Optional[int] = None,
    retries_exponent: Optional[float] = None,
    retries_max_elapsed_time: Optional[int] = None,
    retries_connection_errors: Optional[bool] = None,
```
- adds tests that check using them according to defaults
2024-10-22 14:43:28 +00:00
Yao You
a11ad22609
bump unstructured-inference (#3711)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that has repeated definition
of env variables or missing installation of testing dependencies in
cache.

A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, two cases goes in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-10-21 21:55:08 +00:00
Yao You
e764bc503c
feat: round numbers to reduce undeterministic behavior (#3740)
This PR rounds the floating point number associated with coordinates in
`pdfminer_processing.py`. This helps to eliminate machine precision
caused randomness in bounding box overlap detection. Currently the
rounding is set to the nearest machine precision for `np.float32` using
`np.finfo(float)`, which yields resolution = `1e-15`.

## future work

We should reduce the rounding to only 6 digits after floating point
since the data type `float32` has a resolution of only `1e-6`. However
it would break tests. A followup is required to tune the threshold
values in `pdfminer_processing.py` so that it works with `1e-6`
resolution.
2024-10-21 18:15:20 +00:00
Steve Canny
3240e3d17a
rfctr(pptx): minify HTML and table.text is cct (#3734)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is
removed from `base.in`.
2024-10-21 16:23:15 +00:00
Yao You
3dea723656
chore: pin upper limit for unstructured-client (#3743)
- 0.26.0 breaks tests and reason is unknown
- 0.26.0 introduces async request but the sync version interface still
exists
2024-10-21 01:06:21 +00:00
Steve Canny
208c7edc52
rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-19 06:49:09 +00:00
Steve Canny
c85f29e6ca
fix(xlsx): XLSX emits std minified .text_as_html (#3558)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-17 22:05:11 +00:00
Nathan Van Gheem
b092d45816
Remove unsupported chipper model (#3728)
The chipper model is no longer supported.
2024-10-17 17:40:45 +00:00
Steve Canny
1eceac26c8
rfctr(email): eml partitioner rewrite (#3694)
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.

**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove `encoding` parameters as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes #3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-16 02:02:33 +00:00
Roman Isecke
9049e4e2be
feat/remove ingest code, use new dep for tests (#3595)
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
0.16.0
2024-10-15 10:01:34 -05:00
David Blore
ecf0267b85
fix: add language to OCRAgentGoogleVision constructor (#3696)
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.

This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).


**Default Behavior**: 
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.

**Purpose**: 
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.

**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-10-14 05:35:05 +00:00
Christine Straub
6ba376ab7e
build(release): release commit for 0.15.14 (#3709)
### Summary
- cut release for version `0.15.14`
- ignore `vectara` ingest test due to a weird error occurring in:
https://github.com/Unstructured-IO/unstructured/actions/runs/11256744351/job/31317150581?pr=3709
0.15.14
2024-10-10 20:03:49 +00:00