1653 Commits

Author SHA1 Message Date
Steve Canny
9ece0b5ad2
fix: improve false-positive Title elements on Chinese text (#3836)
**Summary**
Improve element-type mapping for Chinese text. Fixes bug where Chinese
text would produce large numbers of false-positive `Title` elements.

Fixes #3084

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2024-12-18 01:16:42 +00:00
Ribhu Lahiri
9a9bf4c4f5
Added contributing from archived repo (#3616)
Added `CONTRIBUTING.md` from the archived repo as mentioned in the
issue: https://github.com/Unstructured-IO/unstructured/issues/3540

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
2024-12-17 02:53:17 +00:00
Steve Canny
b5ff79d8db
fix: refine filetype detection (#3828)
**Summary**
Fixes a bug where a CSV file with asserted content-type
`application/vnd.ms-excel` was incorrectly identified as an XLS file and
failed partitioning.

**Additional Context**
The `content_type` argument to partitioning is often authored by the
client system (e.g. Unstructured SDK) and is both unreliable and outside
the control of the user. In this case the `.csv -> XLS` mapping is
correct for certain purposes (Excel is often used to load and edit CSV
files) but not for partitioning, and the user has no readily available
way to override the mapping.

XLS files as well as seven other common binary file types can be
efficiently detected 100% of the time (at least 99.999%) using code we
already have in the file detector.

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use
that file-type.
- When one of those types is NOT detected, clear the asserted
`content_type` when it matches any of those types. This prevents the
problem seen in the bug where the asserted content type was used to
determine the file-type.
- The remaining content_type, guess MIME-type, and filename-extension
mapping strategies are tried, in that order, only when direct inspection
fails. This is largely the same as it was before.
- Fix #3781 while we were in the neighborhood.
- Fix #3596 as well, essentially an earlier report of #3781.
2024-12-17 00:56:21 +00:00
Steve Canny
10f0d54ac2
build: remove ruff version upper bound (#3829)
**Summary**
Remove pin on `ruff` linter and fix the handful of lint errors a newer
version catches.
2024-12-16 23:01:22 +00:00
Steve Canny
b092fb7f47
fix: add .grype.yaml (#3834)
**Summary**
CVE-2024-11053 https://curl.se/docs/CVE-2024-11053.html (severity Low)
was published on Dec 11, 2024 and began failing CI builds on open-core
on Dec 13, 2024 when it appeared in `grype` apparently misclassified as
a critical vulnerability.

The severity reported on the CVE is "Low" so it should not fail builds.
Add a `.grype.yaml` file to ignore this CVE until grype is updated.
2024-12-16 19:39:55 +00:00
Steve Canny
3b718ec89a
rfctr: prep for pluggable partitioners (#3806)
**Summary**
Prepare auto-partitioning for pluggable partitioners.

Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.

**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.

In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when the argument is omitted, and
simply ignore any arguments it doesn't use. Note that achieving this
requires no changes to partitioner functions because they already do
precisely this.

So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, kwargs) ->
list[Element]` then they can be dispatched to without customization.
2024-12-10 20:44:34 +00:00
Steve Canny
b981d7197f
release: prepare release 0.16.11 (#3819)
Release only, no code changes.
0.16.11
2024-12-09 23:48:00 +00:00
Magnus F
1e2da6df46
fix: ipv4 address regex (#3808)
I noticed the ipv4 regex is wrong (it only capture one or two-digit
octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it.

If you wish I can break out the ipv4 test to its own case, so we don't
interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction
test.

Side note: The comment at `unstructured/nlp/patterns.py#95` includes a
bad ipv4 address example (last octet is wrongfully left-padded with a
zero). I left it as it is because I'm not sure if the intention is to
include "non-conventional" ipv4 addresses, like octal or hexadecimal
octets.
2024-12-09 14:19:13 -08:00
Steve Canny
4379d883a3
chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-12-09 18:57:22 +00:00
Christine Straub
18d6c81c47
Update CHANGELOG.md (#3818) 2024-12-09 14:47:53 +00:00
Christine Straub
2f06d5a2a2 test: fix lint error 2024-12-07 19:51:43 -08:00
Christine Straub
9076d56d9f fix: resolve mergeing conflict error 2024-12-07 19:40:11 -08:00
Tracy Shen
59e6cff611
release 0.16.10 (#3816)
release 0.16.10 so that competitor-eval can install and take advantage
of the latest change in the metric calculation
0.16.10
2024-12-07 17:24:45 +00:00
Tracy Shen
8c58bc57db
fix doctype parsing error (#3811)
- per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under metrics/evaluate.py that
incorrectly retrieves the file extension before the conversion to cct
file from paths like '*.pdf.txt' . (see below screenshot)
    - the current status is in the top example
- we should have the correct version in the bottom example of the
screenshot.
   

![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d)

- in addition, i also observe the doctype returned are not aligned, some
returning '.*' and some are returning without the dot.
- therefore, i just aligned them to be output into the same version
which is '.*".
2024-12-06 23:55:01 +00:00
Christine Straub
3bca724624 feat: update standardize_quote() 2024-12-05 13:24:57 -08:00
Christine Straub
ef1c85ef0f feat: enhance quote standardization tests with additional Unicode scenarios 2024-12-05 11:35:19 -08:00
Christine Straub
7d06c120dc
Merge branch 'main' into ML-593/quote-standardization 2024-12-05 10:27:26 -08:00
Christine Straub
a4be1d6100 fix: modify changelog.md 2024-12-05 10:17:08 -08:00
Christine Straub
f44862a5ea fix:confict error 2024-12-05 09:56:52 -08:00
Christine Straub
00a999153c fix: confict error 2024-12-05 09:54:36 -08:00
Christine Straub
5e60942960 test:fix lint errors 2024-12-05 09:40:50 -08:00
Marianna
4140f625d0
add script to render html from unstructured elements (#3799)
Script to render HTML from unstructured elements.

NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs with
non-empty `metadata.text_as_html`.

TODO: It was noted that unstructured_elements_to_ontology func always
returns a single page
This script is using helper functions to handle multiple pages. I am not
sure if this was intended, or it is a bug - if it is a bug it would
require bit longer debugging - to make it usable fast I used
workarounds.

Usage: test with any outputs with non-empty `metadata.text_as_html`.
Example files attached.

`[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)`


[Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)
2024-12-04 19:46:51 -08:00
Christine Straub
bb26603a30 test: enhance quote standardization tests with additional Unicode scenarios 2024-12-04 13:16:00 -08:00
Christine Straub
c0c3fd673f test: enhance quote standardization tests with additional Unicode scenarios 2024-12-04 13:02:07 -08:00
Christine Straub
9038b88b8e fix: ensure newline at end of file in standardize_quotes function 2024-12-04 12:48:15 -08:00
Christine Straub
c821f12d29 test: update string tests for consistent quote handling 2024-12-04 12:17:16 -08:00
Christine Straub
4e0f7cdbc0 Feat: enhance quote standardization with comprehensive Unicode coverage and update tests 2024-12-04 11:33:03 -08:00
Christine Straub
371cb7528d Feat: add quote standardization and update edit distance calculation 2024-12-03 21:21:39 -08:00
Nathan Van Gheem
0fb814db61
Use native ntlk download (#3796)
This PR changes how we download NLTK data to use the native nltk
downloader.

We had moved to our own hosted NLTK dataset because of this CVE:
https://nvd.nist.gov/vuln/detail/CVE-2024-39705

Ref: https://github.com/Unstructured-IO/unstructured/pull/3361

Latest versions of NLTK have fixed this issue:
https://github.com/nltk/nltk/blob/develop/ChangeLog
0.16.9
2024-12-02 19:30:28 +00:00
cragwolfe
9445a2dd01
chore: fix CHANGELOG formatting (#3800)
Fixes formatting in CHANGELOG.md where most of the page was bold and
indented. (verify the branch version here:
https://github.com/Unstructured-IO/unstructured/blob/crag/tables-tweak/CHANGELOG.md)

Bonus tweak: u-table-inspect.sh is more robust to adding borders for
visualizations
2024-11-26 15:38:42 -08:00
Pluto
0fe6ac60aa
Add backward compatibility for metric calculation (#3798)
Co-authored-by: cragwolfe <crag@unstructured.io>
0.16.8
2024-11-26 18:14:16 +00:00
Pluto
e48d79eca1
image alt support (#3797) 0.16.7 2024-11-26 16:20:23 +00:00
ryannikolaidis
626f73af5b
chore: remove dev and release as 0.16.6 (#3793) 0.16.6 2024-11-21 23:59:38 +00:00
Yao You
3b9b01c502
Feat: weighted average table metrics (#3348)
This PR uses (number of actual table) weighted average instead of
average without weights for table metrics.

- pages where there are ground truth tables the weight is proportional
to the number of ground truth tables in that page
- pages where there are no ground truth tables but has predicted tables
(false positive) are assigned as 1 table worth of weight for the whole
page for calculating the mean value of `table_level_acc`
- pages with false positive tables do not contribute to table structural
or table content metrics

## test

This PR updates the existing test for evaluating table metrics:
- adds a second file with just 1 table vs. the existing file with 2
tables
- test the weighted average is written to the report
2024-11-20 17:14:57 +00:00
Pluto
85ecdab077
Add text as html to orig elements chunks (#3779)
This simplest solution doesn't drop HTML from metadata when merging
Elements from HTML input. We still need to address how to handle nested
elements, and if we want to have `LayoutElements` in the metadata of
Composite Elements, a unit test showing the current behavior.
Note: metadata still contains `orig_elements` which has all the
metadata.
2024-11-20 13:27:17 +00:00
Pluto
e1babf0660
Define default HTML to ontology mapping (#3784) 2024-11-20 13:01:28 +00:00
Pluto
ca27b8aa97
Set <table> to be ontology.Table not UncategorizedText (#3782) 2024-11-15 14:30:48 +00:00
Yao You
a6aefee0cb
chore: remove dev and release as 0.16.5 (#3775) 0.16.5 2024-11-07 14:31:50 -06:00
Pluto
c2d17b1ca4
Fix extracting value from field (#3774) 2024-11-07 18:21:39 +00:00
Pluto
66d1e5a5cb
Add max recursion limit and fix to_text() method (#3773) 2024-11-07 15:08:16 +00:00
Christine Straub
df156ebe5a
feat: support pdf link extraction in hi_res strategy (#3753)
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.

### Summary
- Added functionalities to support link extraction in hi_res flow
- Enhanced word extraction functionality used for link extraction in
both `fast` and `hi_res` flows, resulted in more correct `start_index`
and `text` in `links` metadata.
- Updated ingest fixture update workflow to not skip Astra DB source
test

### Testing
```
elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res"
)
assert len(elements[0].metadata.links) == 3
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
0.16.4
2024-10-31 16:52:27 +00:00
Pluto
1953b8699f
Ml 415/merge inline elements (#3749) 2024-10-31 12:17:25 +00:00
Maksymilian Operlejn
eb1b294b73
ML-405/ML-427 - OntologyElement improvements (#3758)
- the "value" attribute from <input/> tag will be taken into account and
processed as "text" in ontology
- the tables will now be parsed without any ids and classes - we have
different reasons behind that, for example, embeddings with ids and
classes can lose some semantic value. Also, more tokens = more expensive
LLM call
-  cleaned to_html, created to_text for OntologyElement
2024-10-31 01:30:53 +00:00
ryannikolaidis
d0be1151a1
chore: pin unstructured-ingest (#3764) 2024-10-30 23:25:47 +00:00
Tracy Shen
340a07f18b
[Merge] release to 0.16.3 (#3755)
- bump version to 0.16.3 based on Pluto's fix on layout parsing
- update unstructured-inference version to 0.8.1 in
0.16.3
2024-10-25 13:23:41 -07:00
Pluto
5a91f0cda9
Fix layout parsing (#3754) 2024-10-25 14:42:06 +00:00
Pluto
2417f8ed84
Fix when parent id is none for first element in v2 notion: (#3752) 2024-10-25 09:43:36 +00:00
Marianna
9835fe4d5b
set version 0.16.2 (#3748) 0.16.2 2024-10-24 16:29:43 +00:00
Marianna
aa5935b357
Ml 384/whitespaces in cct (#3747)
This ticket ensures that CCT metric will not be sensitive to differences
in whitespace (including newline).
All whitespaces in string are changed to single space `" "` in both GT
and PRED before the metric is computed.

Additional changes in CHANGELOG due to auto-formatting.
2024-10-24 13:02:34 +00:00
Pawel Kmiecik
bdfcc14e3d
fix: fix partition_via_api retry mechanism when the default SDK's retry config is empty. (#3746) 2024-10-24 09:37:22 +00:00