unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-29 16:49:54 +00:00

Author	SHA1	Message	Date
Pawel Kmiecik	63fc2a1061	feat: element types extension (#2700 ) This PR adds some new element types that can be used especially by pdf/image partition.	2024-04-04 07:49:55 +00:00
Steve Canny	1ce60f2bba	rfctr(xlsx): extract _XlsxPartitionerOptions (#2838 ) Summary As an initial step in reducing the complexity of the monolithic `partition_xlsx()` function, extract all argument-handling to a separate `_XlsxPartitionerOptions` object which can be fully covered by isolated unit tests. Additional Context This code was from a prior XLSX bug-fix branch that did not get committed because of time constraints. I wanted to revisit it here because I need the benefits of this as part of some new work on PPTX that will require a separate options object that can be passed to delegate objects. This approach was incubated in the chunking context and has produced a lot of opportunities there to decompose the logic into smaller components that are more understandable and isolated-test-able, without having to pass an extended list of option values in every sub-call. As well as decluttering the code, this removes coupling where the caller needs to know which options a subroutine might need to reference.	2024-04-03 23:27:33 +00:00
Christine Straub	e49c35933d	Fix: `partition_html()` swallows some paragraphs (#2837 ) Closes #2836. The `partition_html()` only considers elements with limited depth when determining if an HTML tag (`etree`) element contains text, to avoid becoming the text representation of a giant div. This PR increases the limit value.	2024-04-03 05:06:37 +00:00
Klaijan	8a239b346c	feat: add cleanup fixtures for test_evaluate (#2701 ) This PR adds `@pytest.mark.usefixtures("_cleanup_after_test")` to the tests in `test_evaluate` that do not already have it.	2024-04-02 15:10:59 +00:00
Ahmet Melek	32e3789ed1	build(release): release commit for 0.13.0 (#2732 ) 0.13.0	2024-03-29 20:28:44 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00
Christine Straub	887e6c9094	refactor: use env_config instead of `SUBREGION_THRESHOLD_FOR_OCR` constant (#2697 ) The purpose of this PR is to introduce a new env_config for the subregion threshold for OCR. ### Testing CI should pass.	2024-03-28 20:28:35 +00:00
David Potter	c8cf8f31ac	bug CORE-4225: mongodb url bug (#2662 ) The mongodb redact method was created because we wanted part of the url to be exposed to the user during logging. Thus it did not use the dataclass `enhanced_field(sensitive=True)` solution. This changes it to use our standard redacted solution. This also minimizes the amount of work to be done in platform.	2024-03-28 18:38:50 +00:00
Steve Canny	9ae838e50a	feat: add --include-orig-elements option to Ingest CLI (#2687 ) Summary Add an `--include-orig-elements` option to the Ingest CLI to allow users to specify the corresponding new chunking parameter. Reviewer A lot of this is cleanup; the second commit is where the option is actually added. The first commit fixes a number of inaccuracies in the documentation and does some other clean-up. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-27 06:35:01 +00:00
Christine Straub	08fafc564f	Fix: embedded text not getting merged with inferred elements (#2679 ) This PR is the second part of fixing "embedded text not getting merged with inferred elements", the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/331. ### Summary - replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements - use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](https://github.com/Unstructured-IO/unstructured-inference/pull/331) when removing pdfminer (embedded) elements that were merged with inferred elements - bump `unstructured-inference` to 0.7.25 ### Testing PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf) ``` $ pip uninstall unstructured-inference -y $ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference $ pip install -e . ``` ``` elements = partition_pdf( filename="pwc-financial-statements-p114.pdf", strategy="hi_res", infer_table_structure=True, extract_image_block_types=["Image"], ) table_elements = [el for el in elements if el.category == "Table"] print(table_elements[0].text) ``` --------- Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-03-23 03:59:23 +00:00
Steve Canny	56fbaaed10	feat(chunking): add metadata.orig_elements serde (#2680 ) Summary This final PR in the "orig_elements" series adds the needful such that `.metadata.orig_elements`, when present on a chunk (element), is serialized to JSON when the chunk is serialized, for instance, to be used in an HTTP response payload. It also provides for deserializing such a JSON payload into chunks that contain the `.orig_elements` metadata. Additional Context Note that `.metadata.orig_elements` is always `Optional[list[Element]]` when in memory. However, those original elements are serialized as Base64-encoded gzipped JSON and are in that form (str) when present as JSON or as "element-dicts" which is an intermediate serialization/deserialization format. That is, serialization is `Element -> dict -> JSON` and deserialization is `JSON -> dict -> Element` and `.orig_elements` are Base64-encoded in both the `dict` and `JSON` forms. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-22 21:53:26 +00:00
Klaijan	fd8b682194	fix: mean group add param (#2684 )	2024-03-22 15:16:23 +00:00
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - by default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
Roman Isecke	4ff6a5b78e	Roman/bugfix support bedrock embeddings (#2650 ) ### Description This PR resolved the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319). To do so, the following changes were made: * All aws configs were added as input parameters to the CLI * These were mapped to the bedrock embedder when an embedder is generated via `get_embedder` * An ingest test was added to call the aws bedrock service * Requirements for boto were bumped because the first version to introduce the bedrock runtime, which is required to hit the bedrock service, was introduced in version `1.34.63`, which was ahead of the version of boto pinned. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-03-21 18:21:04 +00:00
David Potter	9177aa20a8	feature CORE-3985: add Clarifai destination connector (#2633 ) Thanks to @mogith-pn from Clarifai we have a new destination connector! This PR intends to add Clarifai as an ingest destination connector. Access via CLI and programmatic. Documentation and Examples. Integration test script.	2024-03-21 16:36:21 +00:00
Klaijan	469f878d14	refactor: get_mean_grouping command takes in export_name (#2677 ) The `get_mean_grouping_command` currently does not take `export_name` as param. Add the param for better naming use case.	2024-03-21 00:09:02 +00:00
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
Matt Robinson	6abfb8b2b3	docs: add morph badge (#2666 ) Adds the Morph badge to the README. Supersedes #2663. The badge renders correctly on the branch.	2024-03-19 19:55:17 +00:00
John	9ac4445e74	refactor title.py (#2657 ) Minor refactor after conversation with @scanny Updates docstring and how chunking options are accessed. `self._kwargs.get()` should only be used in the `lazyproperty` definition of an instance's attribute. Other calls should use `self.<attribute>`	2024-03-19 17:48:23 +00:00
Yao You	2eb0b25e0d	Feat: single table structure eval metric (#2655 ) Creates a compounding metric to represent table structure score. It is an average of existing row and col index and content score. This PR adds a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element level row and column index and content accuracy scores. This new metric is meant to offer a single number to represent the performance of table structure extraction model/algorithms. This PR also refactors the eval computation logic so it uses a constant `table_eval_metrics` instead of hard coding the name of the metrics in multiple places in the code. --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2024-03-19 15:15:32 +00:00
Steve Canny	1af41d5f90	feat(chunking): add .orig_elements behavior to chunking (#2656 ) Summary Add the actual behavior to populate `.metadata.orig_elements` during chunking, when so instructed by the `include_orig_elements` option. Additional Context The underlying structures to support this, namely the `.metadata.orig_elements` field and the `include_orig_elements` chunking option, were added in closely prior PRs. This PR adds the behavior to actually populate that metadata field during chunking when the option is set.	2024-03-18 19:27:39 +00:00
Roman Isecke	c02cfb89d3	bug/unstructured-ingest produces ModuleNotFoundError: No module named 'unstructured.txtgest (#2661 ) Quick fix for issue https://github.com/Unstructured-IO/unstructured/issues/2658	2024-03-18 19:08:29 +00:00
Filip Knefel	6af6604057	feat: introduce `date_from_file_object` parameter to partitions (#2563 ) Introduce `date_from_file_object` to `partition*` functions, by default set to `False`. If set to `True` and file is provided via `file` parameter, partition will attempt to infer last modified date from `file`'s contents otherwise last modified metadata will be set to `None`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-18 01:09:44 +00:00
Klaijan	ccda40f750	feat: grouping eval takes list of filenames (#2635 ) Add features to `get_mean_grouping` to allow input as a list of filenames in the format of List of strings or txt file. --------- Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-17 17:19:55 +00:00
Steve Canny	137ea67336	feat(chunking): add include_orig_elements chunking option (#2649 ) Summary Add `include_orig_elements: bool = True` as a new chunking option. This PR does not implement _adding_ original elements to chunks, only accepting this parameter as a chunking option and assigning `True` to it as a default value when it is omitted as a keyword argument. Note this will need to be added in other repositories as well in order to fully support this new option by all access methods. In particular it will need to be added in `unstructured-api` in order to become available via the SDKs.	2024-03-15 18:48:07 +00:00
Matt Robinson	a63e8a9719	docs: add wintersports example (#2653 ) ### Summary Adds the Winter Sports in Switzerland example for a video we're working on with a partner.	2024-03-15 16:58:45 +00:00
Mason Brothers	ea67be5665	Doc: Change Python comment string to JavaScript comment string. (#2596 ) JavaScript uses `//` for comments instead of `#` Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-15 09:56:35 -05:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are microsoft products) https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others) Adds unit tests for isoformat. json_to_dict already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds small change for AstraDB to allow them to see what source called their api	2024-03-15 14:01:05 +00:00
Steve Canny	94535e353c	rfctr: prepare for adding metadata.orig_elements field (#2647 ) Summary Some typing modernization in `elements.py` which will get changes to add the `orig_elements` metadata field. Also some additions to `unit_util.py` to enable simplified mocking that will be required in the next PR.	2024-03-14 21:31:58 +00:00
Ronny H	d9e557459c	Update link_urls metadata (#2646 ) Update the metadata `link_urls` in the News-of-the-Day notebook example.	2024-03-14 17:48:42 +00:00
Steve Canny	45e3c00120	rfctr(chunking): simplify chunking opts construction (#2645 ) Summary Use omnibus `kwargs` dict for `ChunkingOptions` state rather than explicit option parameters. Additional Context While articulating explicit options for `ChunkingOptions` and its (now several) sub-classes provides some type-safety, it induces a large amount of redundancy which complicates updates to the base class and especially patches to the base class from client code that adds custom chunkers. In particular, it makes custom chunkers brittle to any new attributes added to `ChunkingOptions` (the base class). Use a single omnibus `kwargs` argument to `ChunkingOptions` and its subclasses allowing each to pull out the options it is interested in and happily ignore the rest. The type safety provided by explicit parameters and types is only afforded to the single place the options item is called from which is the custom chunker itself. Because this is "internal" code and not part of the public interface, this is a manageably small loss in type-safety.	2024-03-14 06:00:51 +00:00
John	fe300fe56d	fix: teardown fixture for tests and update pre-commit-config (#2565 ) Files were being created as a side effect from running tests in `test_unstructured/metrics/test_evaluate.py`. The updated decorator removes the created directory and its files after the tests run. Testing on the main branch, run `make test` or `pytest test_unstructured/metrics/test_evaluate.py` and files will be created. On this branch no files are created	2024-03-12 22:16:39 +00:00
Steve Canny	8ea203adf7	feat(chunking): composite text gets is_continuation (#2639 ) Summary Add `metadata.is_continuation = True` to metadata of second-and-later text-split chunks formed from an oversized non-table element. Previously this metadata was only present on text-split `TableChunk` elements. This enables downstream filtering of intentionally redundant metadata on chunk elements that may not be desired for all purposes. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-03-12 19:44:41 +00:00
Ronny H	9cbede37bd	Update requirements for GCS IAM Role for Platform Source & Destination Connectors (#2637 ) To test: > cd docs && make html Changelogs: * added a note to have Storage Object Viewer IAM Role for the GCS source connector. * added a note to have Storage Object Creator IAM Role for the GCS destination connector.	2024-03-11 22:14:20 +00:00
Ronny H	e5fab217be	Unstructured v0.12.6 release (#2626 ) ## 0.12.6 ### Enhancements * Improve ability to capture embedded links in `partition_pdf()` for `fast` strategy Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. This allows users to specify the threshold value for better capturing. * Refactor `add_chunking_strategy` decorator to dispatch by name. Add `chunk()` function to be used by the `add_chunking_strategy` decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers. ### Features * Added Unstructured Platform Documentation The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors. ### Fixes * Partitioning raises on file-like object with `.name` not a local file path. When partitioning a file using the `file=` argument, and `file` is a file-like object (e.g. io.BytesIO) having a `.name` attribute, and the value of `file.name` is not a valid path to a file present on the local filesystem, `FileNotFoundError` is raised. This prevents use of the `file.name` attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP. * Fix SharePoint dates with inconsistent formatting Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. * Include warnings about the potential risk of installing a version of `pandoc` which does not support RTF files + instructions that will help resolve that issue. * Incorporate the `install-pandoc` Makefile recipe into relevant stages of CI workflow, ensuring it is a version that supports RTF input files. * Fix Google Drive source key Allow passing string for source connector key. * Fix table structure evaluations calculations Replaced special value `-1.0` with `np.nan` and corrected rows filtering of files metrics basing on that. * Fix Sharepoint-with-permissions test Ignore permissions metadata, update test. * Fix table structure evaluations for edge case Fixes the issue when the prediction does not contain any table - no longer errors in such case. 0.12.6	2024-03-08 17:54:13 +00:00
Yao You	911f9983c1	feat: redefine table level acc (#2620 ) This PR redefines the `table_level_acc` metric as follows: - for each predicted table use sequence matching ratio as its accuracy - as a prerequisite for the sequence matching we sort the table cells by row then column for both predicted and ground truth to ensure they are ordered the same - average all predicted table accuracy - any prediction without a matching ground truth (false positive) would decrease the score - prediction that splits ground truth into smaller tables would also have low score with perfectly equal splits having lowest score This new definition makes the new metric a value between 0 and 1 per file. This replaces the existing definition where the metric is defined as (the number of predicted tables that have a match to ground truth) to (the number of ground truth tables). This existing metric actually gives higher values for predictions that split tables and can be higher than 1. The new definition prefers predictions that do not split ground truth tables.	2024-03-08 17:00:57 +00:00
ryannikolaidis	3853840d52	fix: docker-publish build test missing key error (#2623 ) The docker-publish github actions workflow builds amd and arm images of the repository and tests them before publishing. These tests have been failing since [this commit](`ee8b0f93dc`) with an error `UNS_API_KEY environment variable not set`. The issue is that [this line](`b27ad9b6aa/.github/workflows/docker-publish.yml (L62)`) in the workflow is actually blowing away the value assigned to the file in the previous line ## Changes * Update line that was overwriting the assignment of UNS_API_KEY to the uns_test_env_file in the docker-publish workflow to leverage the `>>` operator so that UNSTRUCTURED_HF_TOKEN assignment is only appended. * [bonus]: arithmetic expansion in version-sync.sh to keep shell-check happy ## Testing To validate, I edited the docker-publish workflow to trigger on push (and to run the test but not publish the workflow) in [this commit](`0f04f5f0f7`). The successful test results can be reviewed [here](https://github.com/Unstructured-IO/unstructured/actions/runs/8199826803).	2024-03-08 14:55:04 +00:00
Klaijan	30b6a09bc3	fix: declare -i [SC2324 shellcheck] (#2624 ) Fix SC2324 shellcheck warning by adding -i to indicate var type of integer and tidy up the formatting.	2024-03-08 10:09:55 +00:00
Steve Canny	b27ad9b6aa	fix: raises on file-like object with .name not a valid path (#2614 ) Summary Fixes: #2308 Additional context Through a somewhat deep call-chain, partitioning a file-like object (e.g. io.BytesIO) having its `.name` attribute set to a path not pointing to an actual file on the local filesystem would raise `FileNotFoundError` when the last-modified date was being computed for the document. This scenario is a legitimate partitioning call, where `file.name` is used downstream to describe the source of, for example, a bytes payload downloaded from the network. Fix - explicitly check for the existence of a file at the given path before accessing it to get its modified date. Return `None` (already a legitimate return value) when no such file exists. - Generally clean up the implementations. - Add unit tests that exercise all cases. --------- Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>	2024-03-07 19:02:04 +00:00
Pawel Kmiecik	e35306cfc7	fix: table evaluation metrics fix calculations when no tables found in predictions (#2619 ) The current way table structure metrics are computed does not cover cases when no table is found and all stats are empty. This PR fixes this + adds some hardening tests for table eval processor. --------- Co-authored-by: Yao You <theyaoyou@gmail.com>	2024-03-07 18:39:19 +00:00
Roman Isecke	9866f1b52b	BUG: name override not passed through in recursive _asdict call (#2613 ) ### Description The only real change here is adding this line: ```python apply_name_overload=apply_name_overload, ``` Everything else is from running `make tidy`	2024-03-07 17:17:22 +00:00
Steve Canny	b59e4b69ce	rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617 ) Summary Improve typing and other mechanical refactoring in preparation for fix to issue 2308.	2024-03-06 23:46:54 +00:00
MiXiBo	79552ff70b	Refactor threshold to annotation_threshold and make it an optional parameter (#2537 ) This makes the annotation threshold for links configurable as an optional parameter. The reason for the change is that we ran into issues extracting simple text links from PDF documents that were created with MS Word. The sample PDF from unstructured worked with a default value of 0.9, and the PDF generated with Word resulted in a threshold of approx 0.67. We use unstructured together with langchain within an automated container deployment, and being able to access the 'annotation_threshold' setting (refactored from 'threshold') by default can be very helpful. --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-03-06 23:08:49 +00:00
John	b6c1882cc3	chore: add tests and small fixes in utils.py (#2554 ) Linting and typing fixes, and add tests to improve test coverage in utils.py On the main branch, run `coverage run -m pytest test_unstructured/test_utils.py` and then `coverage report -m unstructured/utils.py` to see test coverage for `utils.py`. Check out to this branch and do the same. The percent coverage should increase to 88% --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-03-06 21:58:10 +00:00
Ronny H	2afd347e6b	Create Enterprise Platform Documentation (#2486 ) To test: > cd docs && make html Structures: * Getting Started with Platform (User Account Management) * Set Up workflow automation * Job Scheduling * Platform Source Connectors: * Azure Blob Storage, * Amazon S3 * Salesforce * Sharepoint * Google Cloud Storage * Google Drive * One Drive * Elasticsearch * SFTP Storage * Platform Destination Connectors: (i) * Amazon S3 * Azure Cognitive Search * Google Cloud Storage * Pinecone * Elasticsearch * Weaviate * MongoDB * AWS OpenSearch * Databricks --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-03-06 19:16:08 +00:00
Roman Isecke	9c1c41f493	BUGFIX: fix dependencies in setup.py (#2605 ) ### Description Currently the requirements associated with an extra in the `setup.py` is being dynamically generated using the `load_requirements()` method in the same file. This is being passed in all the `.in` files which then get read line by line to generate the requirements associated with an extra. Unless the `.in` file itself has a version pin, this will never respect the `.txt` files being generated by `pip-compile`. This fix updates all the inputs to `load_requirements()` to use the `.txt` files themselves.	2024-03-06 18:59:08 +00:00
David Potter	1ca90d209a	bug: update sharepoint-with-permissions test to fix CI (#2589 ) Adding `metadata.data_source.permissions_data` to sharepoint-with-permissions.sh --metadata-exclude to prevent sharepoint deprecation warning from ruining test. Updating expected-structured-output As per Ahmet's comment. We do want to check sharepoint permissions metadata at some point. But that will take a separate type of test. A file diff test is too unstable. Permissions checking will be later down the road.	2024-03-06 17:15:36 +00:00
Pawel Kmiecik	dc376053dd	feat(eval): Correct table metrics evaluations (#2615 ) This PR: - replaces `-1.0` value in table metrics with `nan`s - corrects rows filtering based on the above	2024-03-06 15:37:32 +00:00
Steve Canny	4096a38371	rfctr(chunking): extract chunking-strategy dispatch (#2545 ) Summary This is the final step in adding pluggable chunking-strategies. It introduces the `chunk()` function to replace calls to strategy-specific chunkers in the `@add_chunking_strategy` decorator. The `chunk()` function then uses a mapping of chunking-strategy names (e.g. "by_title", "basic") to chunking functions (chunkers) to dispatch the chunking call. This allows other chunkers to be added at runtime rather than requiring a code change, which is what "pluggable" chunkers is. Additional Information - Move the `@add_chunking_strategy` to the new `chunking.dispatch` module since it coheres strongly with that operation, but publish it from `chunking(.__init__)` (as it was before) so users don't couple to the way we organize the chunking sub-package. Also remove the third level of nesting as it's unrequired in this case. - Add unit tests for the `@add_chunking_strategy` decorator which was previously uncovered by any direct test.	2024-03-05 23:19:29 +00:00
Klaijan	3ff6de4f50	refactor: refactor var name for consistency (#2609 ) refactor variable name for consistency.	2024-03-05 09:08:25 +00:00
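
Note on 56fbaaed10 (#2680): that commit message describes `.metadata.orig_elements` as being carried in the `dict` and `JSON` forms as Base64-encoded gzipped JSON. The round trip might look roughly like the sketch below. The helper names `encode_orig_elements` and `decode_orig_elements` are hypothetical and are not taken from `unstructured.staging.base`. Plain dicts stand in for `Element` objects, and only the Python standard library is used.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def encode_orig_elements(element_dicts: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: element dicts -> JSON -> gzip -> Base64 text."""
    json_bytes = json.dumps(element_dicts).encode("utf-8")
    return base64.b64encode(gzip.compress(json_bytes)).decode("ascii")


def decode_orig_elements(encoded: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: Base64 text -> gunzip -> JSON -> element dicts."""
    json_bytes = gzip.decompress(base64.b64decode(encoded))
    return json.loads(json_bytes.decode("utf-8"))


# Illustrative data only; real element dicts carry more fields.
originals = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "Some body text."},
]
payload = encode_orig_elements(originals)
restored = decode_orig_elements(payload)
```

The encoded payload is an ASCII string, so it can sit inside a JSON document as an ordinary string value, which matches the `Element -> dict -> JSON` flow the commit message describes.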


1265 Commits
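
Note on bdfd975115 (#2588): the rule stated in that commit message, `pdf_infer_table_structure && "pdf" not in skip_infer_table_types`, can be sketched as a small predicate. The function name `should_extract_pdf_tables` is hypothetical and is not part of the library. The defaults mirror the ones named in the message (`pdf_infer_table_structure=True`, `skip_infer_table_types=[]`).

```python
from typing import Optional, Sequence


def should_extract_pdf_tables(
    pdf_infer_table_structure: bool = True,
    skip_infer_table_types: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical predicate mirroring the rule quoted in the commit message."""
    # Treat None as the empty default list named in the commit message.
    skipped = [] if skip_infer_table_types is None else skip_infer_table_types
    return pdf_infer_table_structure and "pdf" not in skipped
```

Under this reading, `False` for `pdf_infer_table_structure` always wins, and `skip_infer_table_types` only matters when the flag is `True`.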