1418 Commits

Author SHA1 Message Date
John
8eee14d589
paragraph grouper type hint (#3002)
Fix type hint for paragraph_grouper param. `paragraph_grouper` can be
set to `False`, but the type hint did not not reflect this previously.
2024-05-10 19:37:07 +00:00
John
593aa47802
fix: ppt parameters include_page_breaks and include_slide_notes (#2996)
Pass the parameters `include_slide_notes` and `include_page_breaks` to
`partition_pptx` from `partition_ppt`.

Also update the .ppt example doc we use for testing so it has slide
notes and a PageBreak (and second page)
2024-05-10 17:57:36 +00:00
John
293a4a1152
remove unused links param (#3001)
The `links` param in `partition_pdf` was never used by the partitioner,
but added when that metadata element was created. This removes the
unused parameter since `links` are extracted during partitioning.
2024-05-10 16:55:17 +00:00
Michał Martyniak
f8c119a777
Quickfix: re-apply skip accuracy calculation enhancement (#3000)
Changes introduced [in this
PR](https://github.com/Unstructured-IO/unstructured/pull/2977) have been
overwritten by mistake after merging
https://github.com/Unstructured-IO/unstructured/pull/2973 - git did not
detect a conflict there due to redesign in evaluate.py
2024-05-10 11:13:33 +00:00
John
d829b669e6
Add starting_page_num param to partition_image (#2987)
Add missing starting_page_num param to partition_image

Closes #2985
2024-05-09 21:31:35 +00:00
Michał Martyniak
2f25d8f79e
Support for concurrent processing of documents during evaluation (#2973)
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and It
appears that a max of 1 cpu core is currently utilized. This implies
there could be a large speedup by running eval across multiple docs
concurrently (probably with multiprocessing).

Things done in this PR:
- [x] concurrent.futures.ProcessPoolExecutor instead of sequential
for-loop
- [x] refactor/reorganization of redundant pieces of code without
changing the inner logic too much. Without that we'd have 3 places where
documents are being processed. Take a look at `BaseMetricsCalculator`
class and classes that inherit from it.
- [x] string paths manipulation is now reworked and relies on
`pathlib.Path()`
2024-05-09 21:25:47 +00:00
amadeusz-ds
648ec33b44
feat(evaluation): skip accuracy calculation (#2977)
Skip accuracy calculation for files for which output and ground truth
sizes differ greatly.

~10% speed up on local machine, keeping the same metrics.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2024-05-09 08:01:08 +00:00
John
e15adb418b
fix param types for partition_image and _pdf (#2988)
Make the `filename` and `file` params for `partition_image` and
`partition_pdf` match the other partitioners
2024-05-09 03:02:43 +00:00
Christine Straub
b64a48440d
chore: bump unstructured-inference 0.7.31 (#2981) 0.13.7 2024-05-08 16:26:58 +00:00
John
ef47d530f6
feat: add chunking to partition_tsv (#2982)
Closes #2980
2024-05-07 23:09:27 +00:00
Yao You
668dd0122f
remove unnecessary warning log (#2978)
The warning log about default model not set is no longer needed. This PR
removes this log to reduce confusion.
2024-05-07 17:04:44 +00:00
Pluto
4397dd6a10
Add calculation of table related metrics based on table_as_cells (#2898)
This pull request add metrics that are calculated based on
table_as_cells instead of text_as_html. This change is required for
comprehensive metrics calculation, as previously every colspan or
rowspan predicted was considered to be an incorrect predicted (even if
it was correct prediction)

This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892 which
introduces table_as_cells field
2024-05-07 13:57:38 +00:00
Christine Straub
0cd07d78f9
feat: parition_pdf() add ability to get cid ratio (#2970)
This PR adds the ability to get the ratio of `cid` characters in
embedded text extracted by `pdfminer`. This PR is the second part of
moving `cid` related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/342.
2024-05-04 05:21:27 +00:00
Steve Canny
cb55245f70
rfctr: extract OCRAgent.get_agent() out of PDF subtree (#2965)
**Summary**
File-types other than PDF need to use OCR on extracted images. Extract
`OCRAgent.get_agent()` such that any file-type partitioner can use it
without risking dependency on PDF-only extras.
2024-05-03 19:39:22 +00:00
Steve Canny
17c2d075a8
rfctr improve partitioner typing (#2963)
**Summary**
Remedy the persistent type errors when importing `unstructured`. Give
the partitioner type annotations a general scrubbing while we're at it.
2024-05-03 16:11:55 +00:00
Steve Canny
39b74a2370
fix(test): Remedy macOS-only test failure not triggered by CI (#2957)
**Summary**
A crude and OS-specific mechanism was used to detect when a path
represented a temp-file. Change that to be robust across operating
systems and localized configurations. The specific problem was for DOC
files but this PR fixes it for PPT too which was prone to the same
problem.
2024-05-02 18:21:18 +00:00
Steve Canny
7dea2fa4a1
rfctr: tidy up ppt+doc tests (#2956)
**Summary**
Make tests for DOC and PPT formats more concise and readable in
preparation for adding one or two.
2024-05-02 16:00:00 +00:00
Steve Canny
601594d373
fix(docx): fix short-row DOCX table (#2943)
**Summary**
The DOCX format allows a table row to start late and/or end early,
meaning cells at the beginning or end of a row can be omitted. While
there are legitimate uses for this capability, using it in practice is
relatively rare. However, it can happen unintentionally when adjusting
cell borders with the mouse. Accommodate this case and generate accurate
`.text` and `.metadata.text_as_html` for these tables.
2024-05-02 00:45:52 +00:00
Steve Canny
eff84afe24
chore: update python-docx version dependency (#2952)
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.

Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).

`python-docx` requires `lxml` although other file formats require it as
well.
2024-05-01 21:36:31 +00:00
Yuming Long
542d442699
chore CORE-4775: remove html page number metadata field (#2942)
### Summary

Rip off page_number metadata fields until we have page counting for all
kinds of html files (not just limited to news articles with multiple
`<article>` tag)

### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
removed since they relay on the `page_number` fields from the SEC html
file - now test moved to mock test for chunk_by_title -> revisit those
tests when we find test file for this

Also changed the element ids from partition outputs for html files -
element id change due to page number change (in element id hashing) ->
todo ticket: update other deterministic element id tests per crag's
comment

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2024-04-30 15:20:26 +00:00
Marco Lüthy
0d80886578
fix: parse URL response Content-Type according to RFC 9110 (#2950)
Currently, `file_and_type_from_url()` does not correctly handle the
`Content-Type` header. Specifically, it assumes that the header contains
only the mime-type (e.g. `text/html`), however, [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
for additional directives — specifically the `charset` — to be returned
in the header. This leads to a `ValueError` when loading a URL with a
response Content-Type header such as `text/html; charset=UTF-8`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

Which will result in the following exception:

```python
{
	"name": "ValueError",
	"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 from unstructured.partition.auto import partition
      3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)

File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
    539 else:
    540     msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541     raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
    543 for element in elements:
    544     element.metadata.url = url

ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the
`Content-Type` header string.


Closes #2257
0.13.6
2024-04-29 22:53:44 -07:00
Michał Martyniak
7720e72424
Fix: avoid elements sharing the same memory address (#2940)
This PR attempts to fix a memory issue, which resulted in errors like
this: https://github.com/Unstructured-IO/unstructured/issues/2931
The root cause seems to be in how ListItems are being combined, not in
how hashes or parent IDs are updated.

When `assign_and_map_hash_ids()` is called and elements (or elements'
metadata) do not have unique memory addresses, then updating the
parent_id of one element will also overwrite the parent_id of some other
element.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
0.13.5
2024-04-28 19:15:17 -07:00
Pluto
fa767d6706
chore: Bump unstructured inference 0.29 (#2932)
Co-authored-by: cragwolfe <crag@unstructured.io>
2024-04-27 19:49:22 +00:00
cragwolfe
9e46ed016c
fix: reqs arm64 friendly again. release 0.13.4 (#2935)
Cut a release.

Run pip-compile on mac to avoid `nvidia-*` requirements creeping into
`requirements/extra-pdf-image.txt`. This should fix arm64 image builds
that have been breaking on main.
0.13.4
2024-04-26 08:15:13 +00:00
Christine Straub
fcdfbabe8f
feat: partition_pdf() add an environment variable to control the capture of embedded links (#2934)
This PR aims to add an environment variable to control the capture of
embedded links in `partition_pdf()` for `fast` strategy.

Related PR: https://github.com/Unstructured-IO/unstructured/pull/2537
2024-04-25 21:00:21 +00:00
David Potter
00f544f100
fix: improve doc code (#2920)
Improves the documentation code.
Standardizes unstructured api key
Replaces misc hard coded values
Replaces `azureunstructured1` with a generic value
2024-04-25 17:55:15 +00:00
Pluto
df1f7bcd0e
Save table prediction in cells format (#2892)
This pull request allows to return predictions in raw cell
representation from table transformer. It will be later used to save
prediction in a cells format for simpler metrics calculation.

This PR has to be merged, after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
2024-04-25 11:14:48 +00:00
John
3843af666e
feat: Enable remote chunking via unstructured-ingest (#2905)
Update: The cli shell script works when sending documents to the free
api, but the paid api is down, so waiting to test against it.

- The first commit adds docstrings and fixes type hints.
- The second commit reorganizes `test_unstructured_ingest` so it matches
the structure of `unstructured/ingest`.
- The third commit contains the primary changes for this PR.
- The `.chunk()` method responsible for sending elements to the correct
method is moved from `ChunkingConfig` to `Chunker` so that
`ChunkingConfig` acts as a config object instead of containing
implementation logic. `Chunker.chunk()` also now takes a json file
instead of a list of elements. This is done to avoid redundant
serialization if the file is to be sent to the api for chunking.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-04-25 00:24:58 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Dimitri Lozeve
abb0174181
Integration with the Google Cloud Vision API (#2902)
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).

It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-04-23 21:11:39 +00:00
Steve Canny
05ff975081
fix: remove unused ElementMetadata.section (#2921)
**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
2024-04-22 23:58:17 +00:00
Steve Canny
305247b4e1
chore: bump unstructured-inference pin (#2913)
**Summary**
Update dependencies to use the new version of `unstructured-inference`
released yesterday. Remedy a few small problems with `make pip-compile`
that stood in the way.
0.13.3
2024-04-21 03:08:20 +00:00
Roman Isecke
9ad2993fe3
bug: fix pip-compile (#2885)
### Description
Currently wasn't compiling `base.in` first, which is required because
others use the generated `.txt` file as a constraint.
2024-04-19 21:39:25 +00:00
Steve Canny
4dc8327149
rfctr(pptx): make PptxPartitionerOptions public (#2901)
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.

The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
2024-04-19 04:50:06 +00:00
Christine Straub
ac5048bf30
enhancement: remove duplicate embedded images (#2897)
This PR aims to remove duplicate embedded images taken by `PDFminer`.

### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganzie `clean_pdfminer_inner_elements()`
2024-04-18 23:07:47 +00:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Steve Canny
f752849c41
rfctr: improve typing in OCR modules (#2893)
**Summary**
In preparation for using OCR for partitioners other than PDF, clean up
typing in the OCR module.
2024-04-16 03:55:35 +00:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
2024-04-15 21:03:42 +00:00
Christine Straub
ba3f374268
Fix: ingest test fixtures update pr (#2881)
This PR aims to update "Ingest Test Fixtures Update PR" CI to update the
ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
2024-04-15 17:47:22 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Steve Canny
3e643c4cb3
feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880)
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
2024-04-12 06:00:01 +00:00
Steve Canny
2cba949f18
feat(pptx): partition_pptx() accepts strategy arg (#2879)
**Summary**
As we move to adding pluggable sub-partitioners, `partition_pptx()` will
need to become sensitive to the `strategy` argument, in particular when
it is set to "hi_res". Up until now there were no expensive operations
(inference, OCR, etc.) incurred while partitioning PPTX so this argument
was ignored.

After this PR, `partition_pptx()` still won't do anything with that
value, other than pass it along to `_PptxPartitionerOptions` for
safe-keeping, but now its ready for use by a `PicturePartitioner` (to
come in a subsequent PR).
2024-04-11 22:36:16 +00:00
Ahmet Melek
6fd29ea77c
fix: collection deletion for AstraDB test (#2869)
This PR:
- Fixes occasional collection deletion failures for AstraDB via putting
collection deletion statements inside a trap statement. It uses click
commands to do this.

Testing:
- Run ingest astradb destination test
2024-04-10 23:08:24 +00:00
Christine Straub
23edc4ad71
build(ci): skip python 3.11 in CI ingest jobs (#2877)
CI fails every time on test_ingest_src (3.11) and test_ingest_dst (3.11)
on what looks like a pip-install problem `(ModuleNotFoundError: No
module named 'click')`. The error is exactly the same place every time.
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8622028071/job/23632669423
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8623541446
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8623056382
...

This PR skips the Python `3.11` ingest tests since the most important
one is `3.10` anyway.
2024-04-10 15:16:49 -07:00
Christine Straub
4656b8cbe5
Fix: partition_html() partially extracts text (#2852)
Closes #2362.

Previously, when an HTML contained a `div` with a nested tag e.g. a
`<b>` or `<span>`, the element created from the `div` contained only the
text up to the inline element. This PR adds support for extracting text
from tag tails in HTML.

### Testing
```
html_text = """
<html>
<body>
    <div>
        the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text
    </div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print(''.join([str(el).strip() for el in elements]))
```

**Expected behavior**
```
the Company issues shares at $5.22per share. There is more text
```
2024-04-08 19:18:55 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
2024-04-08 19:01:03 +00:00
Christine Straub
a9b6506724
Fix: partition_html() fails parsing simple html (#2849)
Closes #2520.

Previously, `partition_html()` did not extract text from `<b>` tags
inside container tags (like `<div>`, `<pre>`). This PR provides support
for extracting text from `<b>` tags inside container tags.

### Testing
```
html_text = """
<!DOCTYPE html>
<html>
<head>
 <title>A page</title>
</head>
<body>
<div>
    <h1>Header 1</h1>
    <p>Text </p>
    <h2>Header 2</h2>
    <pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A
    <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre>
</div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```

**Expected behavior**
```
Header 1

Text

Header 2

Param1 = Y

Param2 = 1

Param3 = 2

Param4 = A

Param5 = A,B,C,D,E

Param6 = 7

Param7 = Five
```
2024-04-08 18:09:41 +00:00
Roman Isecke
4185a1a15a
feat: Remove constraint on unstructured client from .in file (#2862)
### Description
Don't limit the version of the unstructured client for all users of the
repo
2024-04-08 16:50:56 +00:00
cragwolfe
1621a70755
fix: Brings back missing word list files (#2857)
Fixes https://github.com/Unstructured-IO/unstructured/issues/2855
0.13.2
2024-04-04 23:38:15 -07:00
David Potter
57c7c7afc8
fix: Add mongodb env variables to ingest-test-fixtures-update-pr.yaml (#2851)
ingest-test-fixtures-update-pr.yaml was missing mongodb vars. And the
workflow was failing.
2024-04-04 23:38:21 +00:00