1282 Commits

Author SHA1 Message Date
Steve Canny
f752849c41
rfctr: improve typing in OCR modules (#2893)
**Summary**
In preparation for using OCR for partitioners other than PDF, clean up
typing in the OCR module.
2024-04-16 03:55:35 +00:00
Michał Martyniak
cb1e91058e
Introduce start_page argument to partitioning functions that assign element.metadata.page_number (#2884)
This small change will be useful for users who partition only fragments
of their PDF documents.
It's a small step towards addressing this issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

Related PRs:
* https://github.com/Unstructured-IO/unstructured/pull/2842
* https://github.com/Unstructured-IO/unstructured/pull/2673
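
A minimal sketch of the numbering this argument enables, assuming pages are numbered consecutively from `start_page`. `number_pages` is a hypothetical helper for illustration, not a function in the library:

```python
from typing import Iterable, Iterator, Tuple


def number_pages(pages: Iterable[str], start_page: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, page_text) pairs, counting from start_page."""
    # enumerate() applies the offset, so a fragment that begins at page 5 of
    # the original document reports 5, 6, ... rather than 1, 2, ...
    return enumerate(pages, start=start_page)


fragment = ["text of page five", "text of page six"]
numbered = list(number_pages(fragment, start_page=5))
```

With the default `start_page=1` the numbering is unchanged, so existing callers would see no difference.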
2024-04-15 21:03:42 +00:00
Christine Straub
ba3f374268
Fix: ingest test fixtures update pr (#2881)
This PR aims to update "Ingest Test Fixtures Update PR" CI to update the
ingest test fixtures only if the OVERWRITE_FIXTURES variable is not
`false` and the OUTPUT_DIR directory is not empty.
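
The real check lives in the CI workflow; the following is only a Python rendering of the condition described above. The function name and the treatment of a missing directory as empty are assumptions:

```python
from pathlib import Path


def should_update_fixtures(output_dir: str, overwrite_fixtures: str = "true") -> bool:
    """Return True only when overwriting is enabled and output_dir has content."""
    # Any value other than "false" counts as enabled.
    if overwrite_fixtures.strip().lower() == "false":
        return False
    path = Path(output_dir)
    # Assumption: a missing directory is treated the same as an empty one.
    return path.is_dir() and any(path.iterdir())
```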
2024-04-15 17:47:22 +00:00
MiXiBo
0506aff788
add support for start_index in html links extraction (#2600)
add support for start_index in html links extraction (closes #2625)

Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json


html_text = """<html>
        <p>Hello there I am a <a href="/link">very important link!</a></p>
        <p>Here is a list of my favorite things</p>
        <ul>
            <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
            <li>Dogs</li>
        </ul>
        <a href="/loner">A lone link!</a>
    </html>"""

elements = partition_html(text=html_text)
print(elements_to_json(elements))
```

---------

Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-04-12 06:14:20 +00:00
Steve Canny
3e643c4cb3
feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880)
**Summary**
Delegate partitioning of PPTX Picture (image, to a first approximation)
shapes to a distinct sub-partitioner and allow the default picture
sub-partitioner to be replaced at run-time by one of the user's
choosing.
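
One way such run-time replacement could look, as a sketch only. Every class and method name here is hypothetical and chosen for illustration; the PR's actual interface may differ:

```python
from typing import Iterator, Protocol


class PicturePartitioner(Protocol):
    """Anything that can turn picture bytes into zero or more elements."""

    def iter_elements(self, picture: bytes) -> Iterator[str]: ...


class NullPicturePartitioner:
    """Default sub-partitioner: produces no elements for a picture."""

    def iter_elements(self, picture: bytes) -> Iterator[str]:
        return iter(())


class SizeReportingPartitioner:
    """Stand-in for a user-supplied sub-partitioner (e.g. one that runs OCR)."""

    def iter_elements(self, picture: bytes) -> Iterator[str]:
        yield f"Image ({len(picture)} bytes)"


class SketchPptxOptions:
    # Class-level slot so a replacement applies to every later partitioning run.
    _picture_partitioner: PicturePartitioner = NullPicturePartitioner()

    @classmethod
    def register_picture_partitioner(cls, partitioner: PicturePartitioner) -> None:
        cls._picture_partitioner = partitioner

    @property
    def picture_partitioner(self) -> PicturePartitioner:
        return self._picture_partitioner


# Replace the default at run-time, then delegate a picture to it.
SketchPptxOptions.register_picture_partitioner(SizeReportingPartitioner())
elements = list(SketchPptxOptions().picture_partitioner.iter_elements(b"abc"))
```

The main partitioner only ever calls `iter_elements()`, so it does not need to know which sub-partitioner is installed.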
2024-04-12 06:00:01 +00:00
Steve Canny
2cba949f18
feat(pptx): partition_pptx() accepts strategy arg (#2879)
**Summary**
As we move to adding pluggable sub-partitioners, `partition_pptx()` will
need to become sensitive to the `strategy` argument, in particular when
it is set to "hi_res". Up until now there were no expensive operations
(inference, OCR, etc.) incurred while partitioning PPTX so this argument
was ignored.

After this PR, `partition_pptx()` still won't do anything with that
value, other than pass it along to `_PptxPartitionerOptions` for
safe-keeping, but now it's ready for use by a `PicturePartitioner` (to
come in a subsequent PR).
2024-04-11 22:36:16 +00:00
Ahmet Melek
6fd29ea77c
fix: collection deletion for AstraDB test (#2869)
This PR:
- Fixes occasional collection deletion failures for AstraDB by putting
collection deletion statements inside a trap statement. It uses click
commands to do this.

Testing:
- Run ingest astradb destination test
2024-04-10 23:08:24 +00:00
Christine Straub
23edc4ad71
build(ci): skip python 3.11 in CI ingest jobs (#2877)
CI fails every time on test_ingest_src (3.11) and test_ingest_dst (3.11)
on what looks like a pip-install problem `(ModuleNotFoundError: No
module named 'click')`. The error occurs in exactly the same place every time.
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8622028071/job/23632669423
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8623541446
-
https://github.com/Unstructured-IO/unstructured/actions/runs/8623056382
...

This PR skips the Python `3.11` ingest tests since the most important
one is `3.10` anyway.
2024-04-10 15:16:49 -07:00
Christine Straub
4656b8cbe5
Fix: partition_html() partially extracts text (#2852)
Closes #2362.

Previously, when an HTML document contained a `div` with a nested tag, e.g. a
`<b>` or `<span>`, the element created from the `div` contained only the
text up to the inline element. This PR adds support for extracting text
from tag tails in HTML.
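
For readers unfamiliar with "tails": in ElementTree-style APIs, the text that follows a child's closing tag is stored on that child's `.tail`, not on the parent's `.text`. The sketch below uses the standard library's `xml.etree.ElementTree` purely to illustrate the concept; it is not the library's parsing code:

```python
import xml.etree.ElementTree as ET

markup = "<div>shares at $<span>5.22</span> per share.</div>"
div = ET.fromstring(markup)
span = div[0]

leading_text = div.text   # text before the first child element
inline_text = span.text   # text inside the inline element
tail_text = span.tail     # text after the inline element's closing tag

# itertext() walks text and tails in document order, so nothing is dropped.
full_text = "".join(div.itertext())
```

Reading only `div.text` stops at the inline element, which matches the truncation described above.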

### Testing
```
from unstructured.partition.html import partition_html

html_text = """
<html>
<body>
    <div>
        the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text
    </div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print(''.join([str(el).strip() for el in elements]))
```

**Expected behavior**
```
the Company issues shares at $5.22per share. There is more text
```
2024-04-08 19:18:55 +00:00
Steve Canny
2c7e0289aa
rfctr(pptx): extract _PptxPartitionerOptions (#2853)
**Reviewers:** Likely quicker to review commit-by-commit.

**Summary**

In preparation for adding a PPTX `Picture` shape _sub-partitioner_,
extract management of PPTX partitioning-run options to a separate
`_PptxPartitionerOptions` object similar to those used in chunking and
XLSX partitioning. This provides several benefits:
- Extract code dealing with applying defaults and computing derived
values from the main partitioning code, leaving it less cluttered and
focused on the partitioning algorithm itself.
- Allow the options set to be passed to helper objects, prominently
including sub-partitioners, without requiring a long list of parameters
or requiring the caller to couple itself to the particular option values
the helper object requires.
- Allow options behaviors to be thoroughly and efficiently tested in
isolation.
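
A rough sketch of the options-object pattern these bullets describe. The class name, the `"fast"` default, and the helper function are assumptions made for illustration, not the PR's code:

```python
from functools import cached_property
from typing import Optional


class SketchPartitionerOptions:
    """Hypothetical options object: applies defaults and computes derived values."""

    def __init__(self, *, include_page_breaks: bool = False, strategy: Optional[str] = None):
        self._include_page_breaks = include_page_breaks
        self._strategy = strategy

    @cached_property
    def include_page_breaks(self) -> bool:
        return self._include_page_breaks

    @cached_property
    def strategy(self) -> str:
        # Default handling lives here, not in the partitioning algorithm.
        # The "fast" default is an assumption for this sketch.
        return "fast" if self._strategy is None else self._strategy

    @cached_property
    def is_hi_res(self) -> bool:
        # Derived value computed once from other options.
        return self.strategy == "hi_res"


def describe_picture_handling(opts: SketchPartitionerOptions) -> str:
    """A helper takes the whole options object instead of a long parameter list."""
    return "run inference" if opts.is_hi_res else "skip inference"


handling = describe_picture_handling(SketchPartitionerOptions(strategy="hi_res"))
```

Because the options object has no dependency on a document, each property can be unit-tested on its own.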
2024-04-08 19:01:03 +00:00
Christine Straub
a9b6506724
Fix: partition_html() fails parsing simple html (#2849)
Closes #2520.

Previously, `partition_html()` did not extract text from `<b>` tags
inside container tags (like `<div>`, `<pre>`). This PR provides support
for extracting text from `<b>` tags inside container tags.

### Testing
```
from unstructured.partition.html import partition_html

html_text = """
<!DOCTYPE html>
<html>
<head>
 <title>A page</title>
</head>
<body>
<div>
    <h1>Header 1</h1>
    <p>Text </p>
    <h2>Header 2</h2>
    <pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A
    <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre>
</div>
</body>
</html>
"""

elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```

**Expected behavior**
```
Header 1

Text

Header 2

Param1 = Y

Param2 = 1

Param3 = 2

Param4 = A

Param5 = A,B,C,D,E

Param6 = 7

Param7 = Five
```
2024-04-08 18:09:41 +00:00
Roman Isecke
4185a1a15a
feat: Remove constraint on unstructured client from .in file (#2862)
### Description
Don't limit the version of the unstructured client for all users of the
repo
2024-04-08 16:50:56 +00:00
cragwolfe
1621a70755
fix: Brings back missing word list files (#2857)
Fixes https://github.com/Unstructured-IO/unstructured/issues/2855
0.13.2
2024-04-04 23:38:15 -07:00
David Potter
57c7c7afc8
fix: Add mongodb env variables to ingest-test-fixtures-update-pr.yaml (#2851)
ingest-test-fixtures-update-pr.yaml was missing mongodb vars, so the
workflow was failing.
2024-04-04 23:38:21 +00:00
ryannikolaidis
d80436a602
build(release): release commit for 0.13.1 (#2850)
0.13.1
2024-04-04 22:17:53 +00:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constraints were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch; we just need the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.

A couple of optimizations:
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* Makefile updated since it was originally written to avoid the
`base.in` and `constraints.in` files
2024-04-04 19:58:23 +00:00
David Potter
ae315869d4
bug: Add options to SFTP (#2843)
Noticed authentication errors when connecting to a non-localhost SFTP
server: it errored out when looking for SSH keys.

This adds the option to not look for those, which is correct when we
are giving it user/password.
2024-04-04 14:36:41 +00:00
Pawel Kmiecik
63fc2a1061
feat: element types extension (#2700)
This PR adds some new element types that can be used especially by
pdf/image partition.
2024-04-04 07:49:55 +00:00
Steve Canny
1ce60f2bba
rfctr(xlsx): extract _XlsxPartitionerOptions (#2838)
**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
    
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.

This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in every sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.
2024-04-03 23:27:33 +00:00
Christine Straub
e49c35933d
Fix: partition_html() swallows some paragraphs (#2837)
Closes #2836.

`partition_html()` only considers elements with limited depth when
determining if an HTML tag (`etree`) element contains text, to avoid
becoming the text representation of a giant div. This PR increases the
limit value.
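
To show why the limit matters, here is a sketch of a depth-limited text check. It is an illustration of the idea under stated assumptions, not the library's implementation: `has_text_within_depth` and its `max_depth` parameter are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for the real parser:

```python
import xml.etree.ElementTree as ET


def has_text_within_depth(elem: ET.Element, max_depth: int) -> bool:
    """True if elem, or a descendant at most max_depth levels down, carries text."""
    if (elem.text or "").strip():
        return True
    if max_depth <= 0:
        return False
    # A child's tail is text that sits directly inside elem, so it counts too.
    return any(
        (child.tail or "").strip() or has_text_within_depth(child, max_depth - 1)
        for child in elem
    )


nested = ET.fromstring("<div><div><div><p>deep paragraph</p></div></div></div>")
found_with_small_limit = has_text_within_depth(nested, max_depth=2)
found_with_larger_limit = has_text_within_depth(nested, max_depth=3)
```

In this sketch the paragraph sits three levels down, so it is missed at a limit of 2 and found at a limit of 3.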
2024-04-03 05:06:37 +00:00
Klaijan
8a239b346c
feat: add cleanup fixtures for test_evaluate (#2701)
This PR adds `@pytest.mark.usefixtures("_cleanup_after_test")` to
`test_evaluate` on tests that do not have it.
2024-04-02 15:10:59 +00:00
Ahmet Melek
32e3789ed1
build(release): release commit for 0.13.0 (#2732)
0.13.0
2024-03-29 20:28:44 +00:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
Christine Straub
887e6c9094
refactor: use env_config instead of SUBREGION_THRESHOLD_FOR_OCR constant (#2697)
The purpose of this PR is to introduce a new env_config for the
subregion threshold for OCR.

### Testing
CI should pass.
2024-03-28 20:28:35 +00:00
David Potter
c8cf8f31ac
bug CORE-4225: mongodb url bug (#2662)
The mongodb redact method was created because we wanted part of the url
to be exposed to the user during logging. Thus it did not use the
dataclass `enhanced_field(sensitive=True)` solution.

This changes it to use our standard redacted solution. This also
minimizes the amount of work to be done in platform.
2024-03-28 18:38:50 +00:00
Steve Canny
9ae838e50a
feat: add --include-orig-elements option to Ingest CLI (#2687)
**Summary**
Add an `--include-orig-elements` option to the Ingest CLI to allow users
to specify the corresponding new chunking parameter.

**Reviewer** A lot of this is cleanup; the second commit is where the
option is actually added. The first commit fixes a number of
inaccuracies in the documentation and does some other clean-up.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-27 06:35:01 +00:00
Christine Straub
08fafc564f
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
https://github.com/Unstructured-IO/unstructured-inference/pull/331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](https://github.com/Unstructured-IO/unstructured-inference/pull/331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
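
The difference between strict containment and an "almost subregion" test can be sketched as follows. The method names `is_in` and `is_almost_subregion_of` come from the summary above, but the `Rect` class, the method bodies, and the threshold values are assumptions for illustration, not the `unstructured-inference` implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection_area(self, other: "Rect") -> float:
        width = min(self.x2, other.x2) - max(self.x1, other.x1)
        height = min(self.y2, other.y2) - max(self.y1, other.y1)
        return max(0.0, width) * max(0.0, height)

    def is_in(self, other: "Rect") -> bool:
        """Strict containment: every edge must lie inside other."""
        return (
            other.x1 <= self.x1
            and other.y1 <= self.y1
            and self.x2 <= other.x2
            and self.y2 <= other.y2
        )

    def is_almost_subregion_of(self, other: "Rect", threshold: float) -> bool:
        """Tolerant containment: enough of self's area overlaps other."""
        if self.area == 0:
            return False
        return self.intersection_area(other) / self.area > threshold


# An embedded-text box that pokes one unit outside the inferred region.
embedded = Rect(9, 10, 50, 20)
inferred = Rect(10, 10, 100, 40)
strictly_inside = embedded.is_in(inferred)
almost_inside = embedded.is_almost_subregion_of(inferred, threshold=0.9)
```

A box that overhangs by a small margin fails the strict test but passes the tolerant one, which is the case the summary describes for embedded elements already merged into inferred ones.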

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-03-23 03:59:23 +00:00
Steve Canny
56fbaaed10
feat(chunking): add metadata.orig_elements serde (#2680)
**Summary**
This final PR in the "orig_elements" series adds the needful such that
`.metadata.orig_elements`, when present on a chunk (element), is
serialized to JSON when the chunk is serialized, for instance, to be
used in an HTTP response payload.

It also provides for deserializing such a JSON payload into chunks that
contain the `.orig_elements` metadata.

**Additional Context**
Note that `.metadata.orig_elements` is always `Optional[list[Element]]`
when in memory. However, those original elements are serialized as
Base64-encoded gzipped JSON and are in that form (str) when present as
JSON or as "element-dicts" which is an intermediate
serialization/deserialization format. That is, serialization is `Element
-> dict -> JSON` and deserialization is `JSON -> dict -> Element` and
`.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms.
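
The encoding described above can be sketched with the standard library alone. The function names are hypothetical, and the use of the `gzip` module is an assumption; the library may use a different compression call:

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def encode_orig_elements(element_dicts: List[Dict[str, Any]]) -> str:
    """element-dicts -> JSON -> gzip -> Base64 text, safe to embed in JSON."""
    json_bytes = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(json_bytes)).decode("ascii")


def decode_orig_elements(encoded: str) -> List[Dict[str, Any]]:
    """Reverse the steps: Base64 text -> gzip -> JSON -> element-dicts."""
    json_bytes = gzip.decompress(base64.b64decode(encoded))
    return json.loads(json_bytes.decode("utf-8"))


original = [{"type": "Title", "text": "Lorem ipsum"}, {"type": "NarrativeText", "text": "Dolor sit."}]
payload = encode_orig_elements(original)
restored = decode_orig_elements(payload)
```

Because the payload is a plain string, it can sit in the `dict` form and the JSON form without further conversion.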

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-22 21:53:26 +00:00
Klaijan
fd8b682194
fix: mean group add param (#2684)
2024-03-22 15:16:23 +00:00
Filip Knefel
bdfd975115
chore: change table extraction defaults (#2588)
Change default values for table extraction - works in pair with
[this](https://github.com/Unstructured-IO/unstructured-api/pull/370)
`unstructured-api` PR

We want to move away from `pdf_infer_table_structure` parameter, in this
PR:
- We change how it's treated wrt `skip_infer_table_types` parameter.
Whether to extract tables from pdf now follows from the rule:
`pdf_infer_table_structure && "pdf" not in skip_infer_table_types`
- We set it to `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]` by default
- We remove it from the examples in documentation
- We describe it as deprecated in favor of `skip_infer_table_types` in
documentation

More detailed description of how we want parameters to interact
- if `pdf_infer_table_structure` is False tables will never be extracted
from pdf
- if `pdf_infer_table_structure` is True tables will be extracted from
pdf unless it's skipped via `skip_infer_table_types`
- by default `pdf_infer_table_structure=True` and
`skip_infer_table_types=[]`
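
The interaction rule stated above, written out as a small function. The function name is hypothetical; the parameter names and defaults are those given in the description:

```python
from typing import Optional, Sequence


def should_infer_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[Sequence[str]] = None,
) -> bool:
    """Apply: pdf_infer_table_structure && "pdf" not in skip_infer_table_types."""
    # None stands in for the default empty list to avoid a mutable default.
    skip = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skip
```

So `skip_infer_table_types` can only turn PDF table extraction off; it cannot turn it back on once `pdf_infer_table_structure` is `False`.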

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-22 10:08:49 +00:00
Roman Isecke
4ff6a5b78e
Roman/bugfix support bedrock embeddings (#2650)
### Description
This PR resolved the following open issue:
[bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).
To do so, the following changes were made:
* All aws configs were added as input parameters to the CLI
* These were mapped to the bedrock embedder when an embedder is
generated via `get_embedder`
* An ingest test was added to call the aws bedrock service
* Requirements for boto were bumped because the bedrock runtime, which is
required to hit the bedrock service, was first introduced in version
`1.34.63`, which was ahead of the pinned version of boto.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-03-21 18:21:04 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai we have a new destination connector!

This PR intends to add Clarifai as an ingest destination connector.

Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
2024-03-21 16:36:21 +00:00
Klaijan
469f878d14
refactor: get_mean_grouping command takes in export_name (#2677)
The `get_mean_grouping_command` currently does not take `export_name` as
a param. This PR adds the param to support better naming.
2024-03-21 00:09:02 +00:00
Steve Canny
31bef433ad
rfctr: prepare to add orig_elements serde (#2668)
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.

**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as all files-changed at once.
2024-03-20 21:27:59 +00:00
Matt Robinson
6abfb8b2b3
docs: add morph badge (#2666)
Adds the Morph badge to the README. Supersedes #2663. The badge renders
correctly on the branch, as seen below.

<img width="924" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3dce2e6f-ce9d-452c-a0a7-4077ec7d66ce">
2024-03-19 19:55:17 +00:00
John
9ac4445e74
refactor title.py (#2657)
Minor refactor after conversation with @scanny

Updates docstring and how chunking options are accessed.
`self._kwargs.get()` should only be used in the `lazyproperty`
definition of an instance's attribute. Other calls should use
`self.<attribute>`
2024-03-19 17:48:23 +00:00
Yao You
2eb0b25e0d
Feat: single table structure eval metric (#2655)
Creates a composite metric to represent the table structure score. It is
an average of the existing row and column index and content scores.

This PR adds a new property to
`unstructured.metrics.table_eval.TableEvaluation`:
`composite_structure_acc`, which is computed from the element level row
and column index and content accuracy scores. This new metric is meant
to offer a single number to represent the performance of table structure
extraction model/algorithms.
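
As a sketch of the arithmetic, assuming the composite is an unweighted mean of four element-level scores. The argument names are illustrative, and both the equal weighting and the exact set of inputs are assumptions based on the wording above:

```python
from statistics import mean


def composite_structure_acc(
    row_index_acc: float,
    col_index_acc: float,
    row_content_acc: float,
    col_content_acc: float,
) -> float:
    """Collapse four table-structure scores into a single number."""
    return mean([row_index_acc, col_index_acc, row_content_acc, col_content_acc])


score = composite_structure_acc(1.0, 0.5, 0.5, 1.0)
```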

This PR also refactors the eval computation logic so it uses a constant
`table_eval_metrics` instead of hard coding the name of the metrics in
multiple places in the code.

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2024-03-19 15:15:32 +00:00
Steve Canny
1af41d5f90
feat(chunking): add .orig_elements behavior to chunking (#2656)
**Summary**
Add the actual behavior to populate `.metadata.orig_elements` during
chunking, when so instructed by the `include_orig_elements` option.

**Additional Context**
The underlying structures to support this, namely the
`.metadata.orig_elements` field and the `include_orig_elements` chunking
option, were added in immediately preceding PRs. This PR adds the behavior to
actually populate that metadata field during chunking when the option is
set.
2024-03-18 19:27:39 +00:00
Roman Isecke
c02cfb89d3
bug/unstructured-ingest produces ModuleNotFoundError: No module named 'unstructured.txtgest (#2661)
Quick fix for issue
https://github.com/Unstructured-IO/unstructured/issues/2658
2024-03-18 19:08:29 +00:00
Filip Knefel
6af6604057
feat: introduce date_from_file_object parameter to partitions (#2563)
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and a file is provided via the `file` parameter, partition
will attempt to infer the last modified date from `file`'s contents;
otherwise, last modified metadata will be set to `None`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-18 01:09:44 +00:00
Klaijan
ccda40f750
feat: grouping eval takes list of filenames (#2635)
Add features to `get_mean_grouping` to allow input as a list of
filenames, given either as a list of strings or as a txt file.

---------

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-17 17:19:55 +00:00
Steve Canny
137ea67336
feat(chunking): add include_orig_elements chunking option (#2649)
**Summary**
Add `include_orig_elements: bool = True` as a new chunking option. This
PR does not implement _adding_ original elements to chunks, only
accepting this parameter as a chunking option and assigning `True` to it
as a default value when it is omitted as a keyword argument.

Note this will need to be added in other repositories as well in order
to fully support this new option by all access methods. In particular it
will need to be added in `unstructured-api` in order to become available
via the SDKs.
2024-03-15 18:48:07 +00:00
Matt Robinson
a63e8a9719
docs: add wintersports example (#2653)
### Summary

Adds the Winter Sports in Switzerland example for a video we're working
on with a partner.
2024-03-15 16:58:45 +00:00
Mason Brothers
ea67be5665
Doc: Change Python comment string to JavaScript comment string. (#2596)
JavaScript uses `//` for comments instead of `#`

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-15 09:56:35 -05:00
David Potter
5b92e0bb6b
bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638)
Fixes the Onedrive bug the same way Ryan fixed the Sharepoint error (both
are Microsoft products).
https://github.com/Unstructured-IO/unstructured/pull/2591
https://github.com/Unstructured-IO/unstructured/pull/2592/files

We are seeing occurrences of inconsistency in the timestamps returned by
Onedrive when fetching created and modified dates. Furthermore, in
future versions of this library, a datetime object will be returned
rather than a string.

Changes
This adds logic to guarantee Onedrive dates will be properly formatted
as ISO, regardless of the format provided by the onedrive library.
Bumps timestamp format output to include timezone offset (as we do with
others)
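
A sketch of this kind of normalization, accepting either a string or a `datetime` and always returning ISO text with a timezone offset. `to_iso_with_offset` is a hypothetical name, and treating naive timestamps as UTC is an assumption of this sketch:

```python
from datetime import datetime, timezone
from typing import Union


def to_iso_with_offset(value: Union[str, datetime]) -> str:
    """Return an ISO 8601 string with an explicit UTC offset."""
    if isinstance(value, str):
        # Older Python versions of fromisoformat() reject a trailing "Z".
        value = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: timestamps without timezone information are UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.isoformat()


from_string = to_iso_with_offset("2024-03-15T14:01:05Z")
from_datetime = to_iso_with_offset(datetime(2024, 3, 15, 14, 1, 5))
```

Handling both input types in one place means the caller is unaffected if the upstream library switches from strings to `datetime` objects.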

Adds unit tests for isoformat.

json_to_dict already unit tested here:

https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py

Adds small change for AstraDB to allow them to see what source called
their api
2024-03-15 14:01:05 +00:00
Steve Canny
94535e353c
rfctr: prepare for adding metadata.orig_elements field (#2647)
**Summary**
Some typing modernization in `elements.py` which will get changes to add
the `orig_elements` metadata field.

Also some additions to `unit_util.py` to enable simplified mocking that
will be required in the next PR.
2024-03-14 21:31:58 +00:00
Ronny H
d9e557459c
Update link_urls metadata (#2646)
Update the metadata `link_urls` in the News-of-the-Day notebook example.
2024-03-14 17:48:42 +00:00
Steve Canny
45e3c00120
rfctr(chunking): simplify chunking opts construction (#2645)
**Summary**
Use omnibus `kwargs` dict for `ChunkingOptions` state rather than
explicit option parameters.

**Additional Context**
While articulating explicit options for `ChunkingOptions` and its (now
several) sub-classes provides some type-safety, it induces a large
amount of redundancy which complicates updates to the base class and
especially patches to the base class from client code that adds custom
chunkers.

In particular, it makes custom chunkers brittle to any new attributes
added to `ChunkingOptions` (the base class).

Use a single omnibus `kwargs` argument to `ChunkingOptions` and its
subclasses allowing each to pull out the options it is interested in and
happily ignore the rest.

The type safety provided by explicit parameters and types is only
afforded to the single place the options item is called from which is
the custom chunker itself. Because this is "internal" code and not part
of the public interface, this is a manageably small loss in type-safety.
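
The shape of the omnibus-`kwargs` approach, as a minimal sketch. Class names and the `separator` option are hypothetical; `include_orig_elements` and its `True` default are taken from a later PR in this log:

```python
from functools import cached_property
from typing import Any


class SketchChunkingOptions:
    """Base class keeps every option in one dict and exposes the ones it knows."""

    def __init__(self, **kwargs: Any):
        self._kwargs = kwargs

    @cached_property
    def include_orig_elements(self) -> bool:
        # Defaults are applied at the point of access, once per instance.
        value = self._kwargs.get("include_orig_elements")
        return True if value is None else bool(value)


class CustomChunkerOptions(SketchChunkingOptions):
    """A custom chunker adds its own option without redeclaring the base ones."""

    @cached_property
    def separator(self) -> str:
        return self._kwargs.get("separator", " ")


# Unknown keys are simply ignored, so new base-class options cannot break this.
opts = CustomChunkerOptions(include_orig_elements=False, separator=" | ", unrelated=1)
```

A subclass written this way keeps working when the base class later gains another option, which is the brittleness the description is concerned with.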
2024-03-14 06:00:51 +00:00
John
fe300fe56d
fix: teardown fixture for tests and update pre-commit-config (#2565)
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The updated decorator
removes the created directory and its files after the tests run.

Testing
On the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch, no files are created.
2024-03-12 22:16:39 +00:00
Steve Canny
8ea203adf7
feat(chunking): composite text gets is_continuation (#2639)
**Summary**
Add `metadata.is_continuation = True` to metadata of second-and-later
text-split chunks formed from an oversized non-table element. Previously
this metadata was only present on text-split `TableChunk` elements.

This enables downstream filtering of intentionally redundant metadata on
chunk elements that may not be desired for all purposes.
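
A simplified sketch of the flagging behavior. The `Chunk` class and `split_oversized` function are hypothetical, and fixed-width slicing stands in for whatever text-splitting the chunker actually performs:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    is_continuation: Optional[bool] = None


def split_oversized(text: str, max_characters: int) -> List[Chunk]:
    """Split text into pieces, flagging every piece after the first."""
    pieces = [text[i : i + max_characters] for i in range(0, len(text), max_characters)]
    return [
        # The first chunk leaves the field unset; later ones are continuations.
        Chunk(text=piece, is_continuation=True if idx > 0 else None)
        for idx, piece in enumerate(pieces)
    ]


chunks = split_oversized("abcdefghij", max_characters=4)
# Downstream filtering: keep only the chunk that starts each element.
first_chunks = [c for c in chunks if not c.is_continuation]
```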

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-03-12 19:44:41 +00:00