It's pretty basic change, just literally moved the category field to
Element class. Can't think of other changes that are needed here,
because I think pretty much everything expected the category to be
directly in elements list.
For local testing, IDE's and linters should see difference in that
`category` is now in Element.
**Summary**
Some partitioner test modules are placed in directories by themselves or
with one other test module. This unnecessarily obscures where to find
the test module corresponding to a partitiner.
Move partitioner test modules to mirror the directory structure of
`unstructured/partition`.
**Summary**
I preparation for adding DOCX pluggable image extraction, organize a few
of the DOCX tests to be parallel to very similar tests for the DOC and
ODT partitioners.
This PR adds the ability to fill inferred elements text from embedded
text (`pdfminer`) without depending on `unstructured-inference` library.
This PR is the second part of moving embedded text related code from
`unstructured-inference` to `unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
### Summary
A `partition_via_api` test that only runs on `main` was
[failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959)
with the following output, likely due to the change in the default
behavior for `skip_infer_table_types`. This PR explicitly sets the
`skip_infer_table_types` param to avoid the failure..
```python
=========================== short test summary info ============================
FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®'
+ where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text
+ and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text
= 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) =
make: *** [Makefile:302: test] Error 1
```
### Testing
After temporarily removing the "skip if not on `main`" `pytest` mark,
the [unit tests
pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O)
on the feature branch.
### Summary
Closes#2959. Updates the dependency and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
This PR aims to pass `kwargs` through `fast` strategy pipeline, which
was missing as part of the previous PR -
https://github.com/Unstructured-IO/unstructured/pull/3030.
I also did some code refactoring in this PR, so I recommend reviewing
this PR commit by commit.
### Summary
- pass `kwargs` through `fast` strategy pipeline, which will allow users
to specify additional params like `sort_mode`
- refactor: code reorganization
- cut a release for `0.14.0`
### Testing
CI should pass
This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR
controlling where temporary files are stored during partition flow, via
tempfile.tempdir.
#### Edit:
Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_
#### Edit 2:
Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_
### Summary
Closes#3021 . Turns table extraction for PDFs and images off by
default. The default behavior originally changed in #2588 . The reason
for reversion is that some users did not realize turning off table
extraction was an option and experience long processing times for PDFs
and images with the new default behavior.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
**Summary**
Because DOCX now supports the `strategy` argument to control aspects of
image extraction, `partition_doc()` and `partition_odt()` will need to
support it to because they delegate partitioning to `partition_docx()`.
This will allow image extraction to work the same way for those two
additional document-types.
Allows introduction of form extraction in the future - sets up the
FormKeysValues element & format, puts in an empty function call in the
partition_pdf_or_image pipeline.
This PR aims to skip element sorting when determining whether embedded
text can be extracted. The extracted elements in this step are returned
as final elements only for the `fast` strategy pipeline and are never
used for other strategy pipelines (`hi_res`, `ocr`).
Removing element sorting in this step and adding it to the `fast`
strategy pipeline later will improve performance and reduce execution
time.
### Summary
- skip element sorting when determining whether embedded text can be
extracted.
- add `_partition_pdf_with_pdfparser()` function for fast` strategy
pipeline
### Testing
CI should pass.
**Summary**
In preparation for adding more tests related to image extraction,
improve the `partition_odt()` test suite:
- Add type annotations to type-check clean on strict mode.
- Improve test names.
- Simplify tests where possible.
- Remove a couple duplicated tests
### Summary
Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.
### Testing
Run `make docker-run-dev` and then `python3.11` once you're in. And that
point, you can try:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```
Stop the container once you're done.
**Summary**
The behavior of an image sub-partitioner can be partially determined by
the partitioning strategy, for example whether it is "hi_res" or "fast".
Add this parameter to `partition_docx()` so it can pass it along to
`DocxPartitionerOptions` which will make it available to any image
sub-partitioners.
**Summary**
In preparation for adding more tests related to image extraction,
improve the `partition_doc()` test suite:
- Remove redundant DOCX -> DOC file conversions on most tests.
- Add type annotations to type-check clean on strict mode.
- Improve test names.
- Simplify tests where possible.
- Remove one duplicated test
Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.
**Reviewers:** Probably easier to review first and second commits
separately as the first one adds all the new code and tests (without
installing it), and the second one installs it into the partitioner
along with the required changes to code and tests.
**Summary**
Enable communication of partitioning options to sub-partitioners, in
particular to the pluggable `PicturePartitioner` coming in a closely
subsequent PR to implement image-extraction and OCR for DOCX, DOC, and
ODT formats.
**Additional Context**
In general, validation of partitioning options as well as assigning
default values and computing derived partitioning settings can be
extracted from partitioners into a neatly encapsulated separate object.
This simplifies the core partitioning code by removing the noise
associated with computing metadata values and deciding how to access the
source document, etc.
However, better factoring aside, having the partition-time "settings"
available in a single object allows partitioning of certain document
features, for example images, to be readily _delegated_ to a
sub-partitioner while still giving it access to all the relevant
partitioning settings for the current document. This is particularly
important when a sub-partitioner is "pluggable" at runtime and must rely
on a clearly-defined (and simple as possible) interface to operate
smoothly.
**Summary**
Organize DOC tests into related groups with markers. This makes it
easier to assess coverage and find tests related to particular
behaviors.
This is in preparation for adding tests related to DOC image extraction.
No code changes, purely line-block moves.
- Move module-level fixtures to the bottom.
- Organize tests into related groups with markers.
**Summary**
Noisy but trivial changes to `partition_docx()` environs and tests in
preparation for DOCX image extraction. These changes are extracted here
so they don't distract on the changes of substance to follow in the next
PR.
No code changes, strictly this single block move.
Move `Describe_DocxPartitioner` unit-test class to bottom so
`DescribeDocxPartitionerOptions` unit-test to follow in subsequent
commit will be together with it. Integration tests first, then unit
tests, for consistency with other test modules e.g. test_pptx.
I added `Describe_DocxPartitioner` soon after I arrived, before we
adopted the convention of placing unit-tests after integration tests.
Move this so we can maintain that consistency with the block of tests to
follow in a closely subsequent PR.
**Summary**
The CSV delimiter-sniffer requires whole lines to properly detect the
delimiter character. Limiting bytes read produced partial lines when
lines were very long. Limit bytes but read whole lines.
Fixes#2643.
Pass the parameters `include_slide_notes` and `include_page_breaks` to
`partition_pptx` from `partition_ppt`.
Also update the .ppt example doc we use for testing so it has slide
notes and a PageBreak (and second page)
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and It
appears that a max of 1 cpu core is currently utilized. This implies
there could be a large speedup by running eval across multiple docs
concurrently (probably with multiprocessing).
Things done in this PR:
- [x] concurrent.futures.ProcessPoolExecutor instead of sequential
for-loop
- [x] refactor/reorganization of redundant pieces of code without
changing the inner logic too much. Without that we'd have 3 places where
documents are being processed. Take a look at `BaseMetricsCalculator`
class and classes that inherit from it.
- [x] string paths manipulation is now reworked and relies on
`pathlib.Path()`
This pull request add metrics that are calculated based on
table_as_cells instead of text_as_html. This change is required for
comprehensive metrics calculation, as previously every colspan or
rowspan predicted was considered to be an incorrect predicted (even if
it was correct prediction)
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892 which
introduces table_as_cells field
This PR adds the ability to get the ratio of `cid` characters in
embedded text extracted by `pdfminer`. This PR is the second part of
moving `cid` related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/342.
**Summary**
File-types other than PDF need to use OCR on extracted images. Extract
`OCRAgent.get_agent()` such that any file-type partitioner can use it
without risking dependency on PDF-only extras.
**Summary**
A crude and OS-specific mechanism was used to detect when a path
represented a temp-file. Change that to be robust across operating
systems and localized configurations. The specific problem was for DOC
files but this PR fixes it for PPT too which was prone to the same
problem.
**Summary**
The DOCX format allows a table row to start late and/or end early,
meaning cells at the beginning or end of a row can be omitted. While
there are legitimate uses for this capability, using it in practice is
relatively rare. However, it can happen unintentionally when adjusting
cell borders with the mouse. Accommodate this case and generate accurate
`.text` and `.metadata.text_as_html` for these tables.
### Summary
Rip off page_number metadata fields until we have page counting for all
kinds of html files (not just limited to news articles with multiple
`<article>` tag)
### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
removed since they relay on the `page_number` fields from the SEC html
file - now test moved to mock test for chunk_by_title -> revisit those
tests when we find test file for this
Also changed the element ids from partition outputs for html files -
element id change due to page number change (in element id hashing) ->
todo ticket: update other deterministic element id tests per crag's
comment
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Currently, `file_and_type_from_url()` does not correctly handle the
`Content-Type` header. Specifically, it assumes that the header contains
only the mime-type (e.g. `text/html`), however, [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
for additional directives — specifically the `charset` — to be returned
in the header. This leads to a `ValueError` when loading a URL with a
response Content-Type header such as `text/html; charset=UTF-8`.
To reproduce the issue:
```python
from unstructured.partition.auto import partition
url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```
Which will result in the following exception:
```python
{
"name": "ValueError",
"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
"stack": "---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 4
1 from unstructured.partition.auto import partition
3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)
File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
539 else:
540 msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
543 for element in elements:
544 element.metadata.url = url
ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```
This PR fixes the issue by parsing the mime-type out of the
`Content-Type` header string.
Closes#2257
This PR attempts to fix a memory issue, which resulted in errors like
this: https://github.com/Unstructured-IO/unstructured/issues/2931
The root cause seems to be in how ListItems are being combined, not in
how hashes or parent IDs are updated.
When `assign_and_map_hash_ids()` is called and elements (or elements'
metadata) do not have unique memory addresses, then updating the
parent_id of one element will also overwrite the parent_id of some other
element.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
This pull request allows to return predictions in raw cell
representation from table transformer. It will be later used to save
prediction in a cells format for simpler metrics calculation.
This PR has to be merged, after
https://github.com/Unstructured-IO/unstructured-inference/pull/335
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).
It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
The `.section` field in `ElementMetadata` is dead code, possibly a
remainder from a prior iteration of `partition_epub()`. In any case, it
is not populated by any partitioner. Remove it and any code that uses
it.
**Summary**
A few additional small, mechanical odds and ends required for PPTX image
extraction.
The big one is removing the leading underscore from
`PptxPartitionerOptions` because now client code that implements a
custom Picture-shape sub-partitioner will need to reference this class.
This PR aims to remove duplicate embedded images taken by `PDFminer`.
### Summary
- add `clean_pdfminer_duplicate_image_elements()` to remove embedded
images with similar `bboxes` and the same `text`
- add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the
bounding boxes of two embedded images as the same region
- refactor: reorganzie `clean_pdfminer_inner_elements()`
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461
It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
>
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.
Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.
Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
add support for start_index in html links extraction (closes#2625)
Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
**Summary**
As we move to adding pluggable sub-partitioners, `partition_pptx()` will
need to become sensitive to the `strategy` argument, in particular when
it is set to "hi_res". Up until now there were no expensive operations
(inference, OCR, etc.) incurred while partitioning PPTX so this argument
was ignored.
After this PR, `partition_pptx()` still won't do anything with that
value, other than pass it along to `_PptxPartitionerOptions` for
safe-keeping, but now its ready for use by a `PicturePartitioner` (to
come in a subsequent PR).
Closes#2362.
Previously, when an HTML contained a `div` with a nested tag e.g. a
`<b>` or `<span>`, the element created from the `div` contained only the
text up to the inline element. This PR adds support for extracting text
from tag tails in HTML.
### Testing
```
html_text = """
<html>
<body>
<div>
the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text
</div>
</body>
</html>
"""
elements = partition_html(text=html_text)
print(''.join([str(el).strip() for el in elements]))
```
**Expected behavior**
```
the Company issues shares at $5.22per share. There is more text
```
**Reviewers:** Likely quicker to review commit-by-commit.
**Summary**
In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitioningOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.
This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in ever sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.