unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-27 23:24:27 +00:00

Author	SHA1	Message	Date
Steve Canny	f752849c41	rfctr: improve typing in OCR modules (#2893 ) Summary In preparation for using OCR for partitioners other than PDF, clean up typing in the OCR module.	2024-04-16 03:55:35 +00:00
Michał Martyniak	cb1e91058e	Introduce `start_page` argument to partitioning functions that assign `element.metadata.page_number` (#2884 ) This small change will be useful for users who partition only fragments of their PDF documents. It's a small step towards addressing this issue: https://github.com/Unstructured-IO/unstructured/issues/2461 Related PRs: * https://github.com/Unstructured-IO/unstructured/pull/2842 * https://github.com/Unstructured-IO/unstructured/pull/2673	2024-04-15 21:03:42 +00:00
Christine Straub	ba3f374268	Fix: ingest test fixtures update pr (#2881 ) This PR aims to update "Ingest Test Fixtures Update PR" CI to update the ingest test fixtures only if the OVERWRITE_FIXTURES variable is not `false` and the OUTPUT_DIR directory is not empty.	2024-04-15 17:47:22 +00:00
MiXiBo	0506aff788	add support for `start_index` in `html` links extraction (#2600 ) add support for start_index in html links extraction (closes #2625) Testing ``` from unstructured.partition.html import partition_html from unstructured.staging.base import elements_to_json html_text = """<html> <p>Hello there I am a <a href="/link">very important link!</a></p> <p>Here is a list of my favorite things</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li> <li>Dogs</li> </ul> <a href="/loner">A lone link!</a> </html>""" elements = partition_html(text=html_text) print(elements_to_json(elements)) ``` --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-04-12 06:14:20 +00:00
Steve Canny	3e643c4cb3	feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 ) Summary Delegate partitioning of PPTX Picture (image, to a first approximation) shapes to a distinct sub-partitioner and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.	2024-04-12 06:00:01 +00:00
Steve Canny	2cba949f18	feat(pptx): partition_pptx() accepts strategy arg (#2879 ) Summary As we move to adding pluggable sub-partitioners, `partition_pptx()` will need to become sensitive to the `strategy` argument, in particular when it is set to "hi_res". Up until now there were no expensive operations (inference, OCR, etc.) incurred while partitioning PPTX so this argument was ignored. After this PR, `partition_pptx()` still won't do anything with that value, other than pass it along to `_PptxPartitionerOptions` for safe-keeping, but now it's ready for use by a `PicturePartitioner` (to come in a subsequent PR).	2024-04-11 22:36:16 +00:00
Ahmet Melek	6fd29ea77c	fix: collection deletion for AstraDB test (#2869 ) This PR: - Fixes occasional collection deletion failures for AstraDB via putting collection deletion statements inside a trap statement. It uses click commands to do this. Testing: - Run ingest astradb destination test	2024-04-10 23:08:24 +00:00
Christine Straub	23edc4ad71	build(ci): skip `python 3.11` in CI ingest jobs (#2877 ) CI fails every time on test_ingest_src (3.11) and test_ingest_dst (3.11) on what looks like a pip-install problem `(ModuleNotFoundError: No module named 'click')`. The error is exactly the same place every time. - https://github.com/Unstructured-IO/unstructured/actions/runs/8622028071/job/23632669423 - https://github.com/Unstructured-IO/unstructured/actions/runs/8623541446 - https://github.com/Unstructured-IO/unstructured/actions/runs/8623056382 ... This PR skips the Python `3.11` ingest tests since the most important one is `3.10` anyway.	2024-04-10 15:16:49 -07:00
Christine Straub	4656b8cbe5	Fix: `partition_html()` partially extracts text (#2852 ) Closes #2362. Previously, when an HTML contained a `div` with a nested tag e.g. a `<b>` or `<span>`, the element created from the `div` contained only the text up to the inline element. This PR adds support for extracting text from tag tails in HTML. ### Testing ``` html_text = """ <html> <body> <div> the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text </div> </body> </html> """ elements = partition_html(text=html_text) print(''.join([str(el).strip() for el in elements])) ``` Expected behavior ``` the Company issues shares at $5.22per share. There is more text ```	2024-04-08 19:18:55 +00:00
Steve Canny	2c7e0289aa	rfctr(pptx): extract _PptxPartitionerOptions (#2853 ) Reviewers: Likely quicker to review commit-by-commit. Summary In preparation for adding a PPTX `Picture` shape _sub-partitioner_, extract management of PPTX partitioning-run options to a separate `_PptxPartitionerOptions` object similar to those used in chunking and XLSX partitioning. This provides several benefits: - Extract code dealing with applying defaults and computing derived values from the main partitioning code, leaving it less cluttered and focused on the partitioning algorithm itself. - Allow the options set to be passed to helper objects, prominently including sub-partitioners, without requiring a long list of parameters or requiring the caller to couple itself to the particular option values the helper object requires. - Allow options behaviors to be thoroughly and efficiently tested in isolation.	2024-04-08 19:01:03 +00:00
Christine Straub	a9b6506724	Fix: `partition_html()` fails parsing simple html (#2849 ) Closes #2520. Previously, `partition_html()` did not extract text from `<b>` tags inside container tags (like `<div>`, `<pre>`). This PR provides support for extracting text from `<b>` tags inside container tags. ### Testing ``` html_text = """ <!DOCTYPE html> <html> <head> <title>A page</title> </head> <body> <div> <h1>Header 1</h1> <p>Text </p> <h2>Header 2</h2> <pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre> </div> </body> </html> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ``` Expected behavior ``` Header 1 Text Header 2 Param1 = Y Param2 = 1 Param3 = 2 Param4 = A Param5 = A,B,C,D,E Param6 = 7 Param7 = Five ```	2024-04-08 18:09:41 +00:00
Roman Isecke	4185a1a15a	feat: Remove constraint on unstructured client from .in file (#2862 ) ### Description Don't limit the version of the unstructured client for all users of the repo	2024-04-08 16:50:56 +00:00
cragwolfe	1621a70755	fix: Brings back missing word list files (#2857 ) Fixes https://github.com/Unstructured-IO/unstructured/issues/2855 0.13.2	2024-04-04 23:38:15 -07:00
David Potter	57c7c7afc8	fix: Add mongodb env variables to ingest-test-fixtures-update-pr.yaml (#2851 ) ingest-test-fixtures-update-pr.yaml was missing mongodb vars. And the workflow was failing.	2024-04-04 23:38:21 +00:00
ryannikolaidis	d80436a602	build(release): release commit for 0.13.1 (#2850 ) 0.13.1	2024-04-04 22:17:53 +00:00
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841 ) ### Description * The `consistent-deps.sh` was fixed to take into account the ingest dependencies, causing some errors to show up. New constraints were added to make that script pass. * Update all requirements without constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict but there's a fix on their `main` branch; we just need the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple of optimizations: * `constraints.in` renamed to `constraints.txt` since the whole point is all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * make file updated since it was originally written to avoid the `base.in` and `constraints.in` file	2024-04-04 19:58:23 +00:00
David Potter	ae315869d4	bug: Add options to SFTP (#2843 ) Noticed authentication errors when connected to a non localhost SFTP. It errored out when looking for ssh keys. This gives us the option to not look for those. Which is correct if we are giving it user/password.	2024-04-04 14:36:41 +00:00
Pawel Kmiecik	63fc2a1061	feat: element types extension (#2700 ) This PR adds some new element types that can be used especially by pdf/image partition.	2024-04-04 07:49:55 +00:00
Steve Canny	1ce60f2bba	rfctr(xlsx): extract _XlsxPartitionerOptions (#2838 ) Summary As an initial step in reducing the complexity of the monolithic `partition_xlsx()` function, extract all argument-handling to a separate `_XlsxPartitionerOptions` object which can be fully covered by isolated unit tests. Additional Context This code was from a prior XLSX bug-fix branch that did not get committed because of time constraints. I wanted to revisit it here because I need the benefits of this as part of some new work on PPTX that will require a separate options object that can be passed to delegate objects. This approach was incubated in the chunking context and has produced a lot of opportunities there to decompose the logic into smaller components that are more understandable and isolated-test-able, without having to pass an extended list of option values in every sub-call. As well as decluttering the code, this removes coupling where the caller needs to know which options a subroutine might need to reference.	2024-04-03 23:27:33 +00:00
Christine Straub	e49c35933d	Fix: `partition_html()` swallows some paragraphs (#2837 ) Closes #2836. The `partition_html()` only considers elements with limited depth when determining if an HTML tag (`etree`) element contains text, to avoid becoming the text representation of a giant div. This PR increases the limit value.	2024-04-03 05:06:37 +00:00
Klaijan	8a239b346c	feat: add cleanup fixtures for test_evaluate (#2701 ) This PR adds `@pytest.mark.usefixtures("_cleanup_after_test")` to the tests in `test_evaluate` that do not already have it.	2024-04-02 15:10:59 +00:00
Ahmet Melek	32e3789ed1	build(release): release commit for 0.13.0 (#2732 ) 0.13.0	2024-03-29 20:28:44 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00
Christine Straub	887e6c9094	refactor: use env_config instead of `SUBREGION_THRESHOLD_FOR_OCR` constant (#2697 ) The purpose of this PR is to introduce a new env_config for the subregion threshold for OCR. ### Testing CI should pass.	2024-03-28 20:28:35 +00:00
David Potter	c8cf8f31ac	bug CORE-4225: mongodb url bug (#2662 ) The mongodb redact method was created because we wanted part of the url to be exposed to the user during logging. Thus it did not use the dataclass `enhanced_field(sensitive=True)` solution. This changes it to use our standard redacted solution. This also minimizes the amount of work to be done in platform.	2024-03-28 18:38:50 +00:00
Steve Canny	9ae838e50a	feat: add --include-orig-elements option to Ingest CLI (#2687 ) Summary Add an `--include-orig-elements` option to the Ingest CLI to allow users to specify the corresponding new chunking parameter. Reviewer A lot of this is cleanup; the second commit is where the option is actually added. The first commit fixes a number of inaccuracies in the documentation and does some other clean-up. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-27 06:35:01 +00:00
Christine Straub	08fafc564f	Fix: embedded text not getting merged with inferred elements (#2679 ) This PR is the second part of fixing "embedded text not getting merged with inferred elements", the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/331. ### Summary - replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements - use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](https://github.com/Unstructured-IO/unstructured-inference/pull/331) when removing pdfminer (embedded) elements that were merged with inferred elements - bump `unstructured-inference` to 0.7.25 ### Testing PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf) ``` $ pip uninstall unstructured-inference -y $ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference $ pip install -e . ``` ``` elements = partition_pdf( filename="pwc-financial-statements-p114.pdf", strategy="hi_res", infer_table_structure=True, extract_image_block_types=["Image"], ) table_elements = [el for el in elements if el.category == "Table"] print(table_elements[0].text) ``` --------- Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-03-23 03:59:23 +00:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
Klaijan	fd8b682194	fix: mean group add param (#2684 )	2024-03-22 15:16:23 +00:00
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - by default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
Roman Isecke	4ff6a5b78e	Roman/bugfix support bedrock embeddings (#2650 ) ### Description This PR resolved the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319). To do so, the following changes were made: * All aws configs were added as input parameters to the CLI * These were mapped to the bedrock embedder when an embedder is generated via `get_embedder` * An ingest test was added to call the aws bedrock service * Requirements for boto were bumped because the first version to introduce the bedrock runtime, which is required to hit the bedrock service, was introduced in version `1.34.63`, which was ahead of the version of boto pinned. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-03-21 18:21:04 +00:00
David Potter	9177aa20a8	feature CORE-3985: add Clarifai destination connector (#2633 ) Thanks to @mogith-pn from Clarifai we have a new destination connector! This PR intends to add Clarifai as an ingest destination connector. Access via CLI and programmatic. Documentation and Examples. Integration test script.	2024-03-21 16:36:21 +00:00
Klaijan	469f878d14	refactor: get_mean_grouping command takes in export_name (#2677 ) The `get_mean_grouping_command` currently does not take `export_name` as param. Add the param for better naming use case.	2024-03-21 00:09:02 +00:00
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
Matt Robinson	6abfb8b2b3	docs: add morph badge (#2666 ) Adds the Morph badge to the README. Supersedes #2663. The badge renders correctly on the branch, as seen below. <img width="924" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3dce2e6f-ce9d-452c-a0a7-4077ec7d66ce">	2024-03-19 19:55:17 +00:00
John	9ac4445e74	refactor title.py (#2657 ) Minor refactor after conversation with @scanny Updates docstring and how chunking options are accessed. `self._kwargs.get()` should only be used in the `lazyproperty` definition of an instance's attribute. Other calls should use `self.<attribute>`	2024-03-19 17:48:23 +00:00
Yao You	2eb0b25e0d	Feat: single table structure eval metric (#2655 ) Creates a compounding metric to represent table structure score. It is an average of existing row and col index and content score. This PR adds a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores. This new metric is meant to offer a single number to represent the performance of table structure extraction model/algorithms. This PR also refactors the eval computation logic so it uses a constant `table_eval_metrics` instead of hard coding the name of the metrics in multiple places in the code. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2024-03-19 15:15:32 +00:00
Steve Canny	1af41d5f90	feat(chunking): add .orig_elements behavior to chunking (#2656 ) Summary Add the actual behavior to populate `.metadata.orig_elements` during chunking, when so instructed by the `include_orig_elements` option. Additional Context The underlying structures to support this, namely the `.metadata.orig_elements` field and the `include_orig_elements` chunking option, were added in closely prior PRs. This PR adds the behavior to actually populate that metadata field during chunking when the option is set.	2024-03-18 19:27:39 +00:00
Roman Isecke	c02cfb89d3	bug/unstructured-ingest produces ModuleNotFoundError: No module named 'unstructured.txtgest (#2661 ) Quick fix for issue https://github.com/Unstructured-IO/unstructured/issues/2658	2024-03-18 19:08:29 +00:00
Filip Knefel	6af6604057	feat: introduce `date_from_file_object` parameter to partitions (#2563 ) Introduce `date_from_file_object` to `partition*` functions, by default set to `False`. If set to `True` and the file is provided via the `file` parameter, partition will attempt to infer the last modified date from `file`'s contents; otherwise, last modified metadata will be set to `None`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-18 01:09:44 +00:00
Klaijan	ccda40f750	feat: grouping eval takes list of filenames (#2635 ) Add features to `get_mean_grouping` to allow input as a list of filenames in the format of List of strings or txt file. --------- Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-17 17:19:55 +00:00
Steve Canny	137ea67336	feat(chunking): add include_orig_elements chunking option (#2649 ) Summary Add `include_orig_elements: bool = True` as a new chunking option. This PR does not implement _adding_ original elements to chunks, only accepting this parameter as a chunking option and assigning `True` to it as a default value when it is omitted as a keyword argument. Note this will need to be added in other repositories as well in order to fully support this new option by all access methods. In particular it will need to be added in `unstructured-api` in order to become available via the SDKs.	2024-03-15 18:48:07 +00:00
Matt Robinson	a63e8a9719	docs: add wintersports example (#2653 ) ### Summary Adds the Winter Sports in Switzerland example for a video we're working on with a partner.	2024-03-15 16:58:45 +00:00
Mason Brothers	ea67be5665	Doc: Change Python comment string to JavaScript comment string. (#2596 ) JavaScript uses `//` for comments instead of `#` Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-15 09:56:35 -05:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are microsoft products) https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others) Adds unit tests for isoformat. json_to_dict already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds small change for AstraDB to allow them to see what source called their api	2024-03-15 14:01:05 +00:00
Steve Canny	94535e353c	rfctr: prepare for adding metadata.orig_elements field (#2647 ) Summary Some typing modernization in `elements.py` which will get changes to add the `orig_elements` metadata field. Also some additions to `unit_util.py` to enable simplified mocking that will be required in the next PR.	2024-03-14 21:31:58 +00:00
Ronny H	d9e557459c	Update link_urls metadata (#2646 ) Update the metadata `link_urls` in the News-of-the-Day notebook example.	2024-03-14 17:48:42 +00:00
Steve Canny	45e3c00120	rfctr(chunking): simplify chunking opts construction (#2645 ) Summary Use omnibus `kwargs` dict for `ChunkingOptions` state rather than explicit option parameters. Additional Context While articulating explicit options for `ChunkingOptions` and its (now several) sub-classes provides some type-safety, it induces a large amount of redundancy which complicates updates to the base class and especially patches to the base class from client code that adds custom chunkers. In particular, it makes custom chunkers brittle to any new attributes added to `ChunkingOptions` (the base class). Use a single omnibus `kwargs` argument to `ChunkingOptions` and its subclasses allowing each to pull out the options it is interested in and happily ignore the rest. The type safety provided by explicit parameters and types is only afforded to the single place the options item is called from which is the custom chunker itself. Because this is "internal" code and not part of the public interface, this is a manageably small loss in type-safety.	2024-03-14 06:00:51 +00:00
John	fe300fe56d	fix: teardown fixture for tests and update pre-commit-config (#2565 ) Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The updated decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2024-03-12 22:16:39 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
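
The `partition_html()` fix in #2852 above turns on the distinction between an element's leading text and the "tail" text that follows each nested tag. The sketch below illustrates that distinction using the standard-library `xml.etree.ElementTree`, which exposes the same `.text`/`.tail` model as `lxml`. It is not the library's implementation: the fragment is a simplified version of the one in the commit's test, and the helper names `leading_text` and `text_with_tails` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modelled on the #2852 test case (an assumption, not the exact input).
FRAGMENT = (
    "<div>the Company issues shares at $"
    "<div><span>5.22</span></div>"
    " per share.</div>"
)


def leading_text(element):
    """Return only the text that precedes the first child element."""
    return element.text or ""


def text_with_tails(element):
    """Join an element's text, its descendants' text, and every child's tail, in document order."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_tails(child))
        # The tail is the text that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
print(leading_text(root))     # text up to the nested tag only
print(text_with_tails(root))  # full text, including the tail after the nested tag
```

Reading only `.text` stops at the first nested tag, which matches the truncated output the commit describes; including each child's `.tail` recovers the remainder.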


1282 Commits
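
Commit 56fbaaed10 (#2680) above describes `.metadata.orig_elements` as being carried in the `dict` and JSON forms as Base64-encoded gzipped JSON. A minimal sketch of that round trip, using only the Python standard library, might look like the following. The function names `encode_orig_elements` and `decode_orig_elements` are hypothetical, plain dicts stand in for serialized elements, and the library's own helpers may differ in compression details.

```python
import base64
import gzip
import json


def encode_orig_elements(element_dicts):
    """Hypothetical helper: list of element dicts -> JSON -> gzip -> Base64 string."""
    raw = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_orig_elements(encoded):
    """Hypothetical helper: Base64 string -> gunzip -> JSON -> list of element dicts."""
    raw = gzip.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))


# Stand-in element dicts; real ones would come from the library's own serialization.
originals = [
    {"type": "Title", "text": "Header 1"},
    {"type": "NarrativeText", "text": "Some body text."},
]

encoded = encode_orig_elements(originals)
restored = decode_orig_elements(encoded)
print(type(encoded).__name__, restored == originals)
```

Encoding to a string is what allows the field to sit inside an ordinary JSON payload, at the cost of being opaque until it is decoded.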