# Description:
Passing the `max_pages` argument allows rejecting PDF files that exceed
this page-count limit when the `hi_res` strategy is chosen. By default,
PDF files with any number of pages are parsed.
# Testing:
```python
from unstructured.partition.auto import partition
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError
```
### Description
Any time `unstructured.ingest` is imported, this deprecation warning gets
emitted:
```
DeprecationWarning: unstructured.ingest will be removed in a future version
```
**Summary**
A DOC, PPT, or XLS file sent to `partition()` as a file-like object is
misidentified as an MSG file and raises an exception in `python-oxmsg`
(which is used to process MSG files).
**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, a.k.a. Compound
File Binary Format (CFBF). These can be reliably distinguished by
inspecting magic bytes in certain locations. `libmagic` is unreliable at
this, or doesn't try, reporting the generic `"application/x-ole-storage"`
which corresponds to the "container" CFBF format (vaguely like a
Microsoft ZIP format) in which all these document types are stored.
Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
Fixes #3364
**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.
Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
Similar to https://github.com/Unstructured-IO/unstructured/pull/3433.
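For context, confirming OOXML subtypes is straightforward because DOCX, PPTX, and XLSX are all ZIP archives distinguished by their top-level part names. A minimal sketch of that general technique (not the library's exact code):
```python
import zipfile
from typing import Optional

# Top-level archive directories that identify each OOXML subtype.
_OOXML_DIRS = {"word/": "docx", "ppt/": "pptx", "xl/": "xlsx"}

def ooxml_subtype(path: str) -> Optional[str]:
    """Return "docx", "pptx", or "xlsx", or None when unrecognized."""
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            for prefix, subtype in _OOXML_DIRS.items():
                if name.startswith(prefix):
                    return subtype
    return None
```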
### Summary
This PR aims to update `HuggingFaceEmbeddingEncoder` to use
`HuggingFaceEmbeddings` from the `langchain_huggingface` package instead
of the deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
### Testing
```
from unstructured.documents.elements import Text
from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder
embedding_encoder = HuggingFaceEmbeddingEncoder(
    config=HuggingFaceEmbeddingConfig()
)
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)
[print(e.embeddings, e) for e in elements]
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
```
**Expected behavior**
No deprecation warning should be displayed. The code should use the
updated `HuggingFaceEmbeddings` class from the `langchain_huggingface`
package.
Closes https://github.com/Unstructured-IO/unstructured/issues/3378.
### Summary
This PR aims to update `OpenAIEmbeddingEncoder` to use
`OpenAIEmbeddings` from the `langchain-openai` package instead of the
deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.
**Additional Context**
Factor the type-recognition process into three distinct strategies
attempted in sequence. Tried in order of preference, type-recognition
falls through to the next strategy when the one before it is not
applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
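A minimal sketch of the fall-through arrangement (the strategy names and `ctx` shape are illustrative, not the module's actual API):
```python
from typing import Optional

# Each strategy returns a file-type string, or None to abstain.
def by_content_type(ctx: dict) -> Optional[str]:
    return ctx.get("content_type")  # caller-asserted MIME-type

def by_guessed_mime(ctx: dict) -> Optional[str]:
    return ctx.get("guessed_mime")  # sniffed from file content

def by_extension(ctx: dict) -> Optional[str]:
    return ctx.get("extension")  # filename-extension fallback

def detect_file_type(ctx: dict) -> Optional[str]:
    # Fall through to the next strategy when the previous one abstains.
    for strategy in (by_content_type, by_guessed_mime, by_extension):
        file_type = strategy(ctx)
        if file_type is not None:
            return file_type
    return None
```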
Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.
Several bugs were uncovered in the process. The small ones were fixed
outright; the bigger ones will be remedied in follow-on PRs.
The `pinecone` Python package moved the import location of
`PineconeApiException`.
A `sleep` was added to the Chroma connector because, even though there is
a `wait`, there is still some sort of timing issue.
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.
`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.
`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module-lifetime scope for efficiency.
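A minimal sketch of how such a loader can work, assuming the `FileType` properties named above (illustrative, not the actual implementation):
```python
import importlib

class _PartitionerLoader:
    """Loads partitioners on demand, caching them for the module lifetime."""

    _cache: dict = {}

    def get(self, file_type):
        # Verify every required package is importable before dispatch.
        for package in file_type.importable_package_dependencies:
            try:
                importlib.import_module(package)
            except ImportError:
                raise ImportError(
                    f"partitioning a {file_type.name} file requires extra"
                    f' dependencies; run: pip install "unstructured[{file_type.extra_name}]"'
                )
        # Load the partitioner function once and cache it.
        if file_type not in self._cache:
            module = importlib.import_module(file_type.partitioner_module_qname)
            self._cache[file_type] = getattr(
                module, file_type.partitioner_function_name
            )
        return self._cache[file_type]
```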
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
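The broadened clearing step amounts to treating the quoted-printable soft line break as ending in either LF or CRLF; a minimal sketch:
```python
import re

def clear_soft_line_breaks(text: str) -> str:
    # Remove quoted-printable soft line breaks, "=\n" or "=\r\n".
    return re.sub(r"=\r?\n", "", text)
```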
### Testing
```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"
elements = partition_email(filename=filename)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())

spooled_temp_file.seek(0)
elements = partition_email(file=spooled_temp_file)
print(f"From spooled_temp_file: {elements[3].text}")
```
**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
### Description
If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.
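A minimal sketch of the guarded assignment, assuming fsspec's `.info()` dict (the in-memory filesystem is used here purely for illustration):
```python
import fsspec

fs = fsspec.filesystem("memory")
with fs.open("/example.txt", "wb") as f:
    f.write(b"hello")

stats = fs.info("/example.txt")  # a plain dict; "id" may or may not be present
metadata = {}
if stats.get("id") is not None:
    metadata["file_id"] = stats["id"]
```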
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.
### Summary
- Created two new subdirectories in the `example-docs` folder:
- `pdf/`: for all PDF example files
- `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.
## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
### Description
At times, the Google Drive response doesn't have some of the metadata
we're grabbing to populate the `FileData` metadata. This is fine, but
without the added safeguards, this can cause a `KeyError`.
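The safeguard amounts to using `dict.get()` (or an explicit membership check) rather than indexing; the field names below are hypothetical:
```python
file_info = {"id": "abc123"}  # response sometimes lacks e.g. "modifiedTime"

# file_info["modifiedTime"] would raise KeyError; .get() degrades to None.
date_modified = file_info.get("modifiedTime")
```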
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.
In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.
**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
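An illustrative sketch of the pattern (the member values and attribute names here are examples, not the enum's actual definition):
```python
import enum
from typing import List, Optional

class FileType(enum.Enum):
    DOCX = (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        [".docx"],
    )
    PDF = ("application/pdf", [".pdf"])

    def __init__(self, canonical_mime_type: str, extensions: List[str]):
        # Enum member values unpack into per-member attributes, so every
        # file-type fact lives in one and only one place.
        self.canonical_mime_type = canonical_mime_type
        self.extensions = extensions

    @classmethod
    def from_extension(cls, ext: str) -> Optional["FileType"]:
        return next((ft for ft in cls if ext in ft.extensions), None)

assert FileType.from_extension(".pdf") is FileType.PDF
```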
**Summary**
In preparation for further work on auto file-type detection, improve
`filetype.py` and related modules:
- improve docstrings
- improve type annotations
- extract domain model to `.model` module
Closes #3159.
This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.
### Testing
```
import os

from unstructured.partition.pdf import partition_pdf

os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
)

elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,  # e.g. "hi_res", which runs OCR
    languages=["chi_sim"],  # Chinese (Simplified)
    infer_table_structure=True,
)
```
Just moving the embedding dimension from the base connector object to
the write object, since it's not needed for an Astra source connector,
only for a destination connector.
**Summary**
Improve file-detection tests in preparation for additional work and bug
fixes.
**Additional Context**
- Add type annotations.
- Use mocks instead of `monkeypatch` in most cases and verify calls to
the mock (see the sketch after this list). This revealed a dozen broken
tests, broken in that the mocks weren't being called, so a code path
other than the intended one was being exercised.
- Use `example_doc_path()` instead of hard-coded paths.
- Add actual test files for cases where they were being constructed in
temporary directories.
- Make test names consistent and more descriptive of behavior under
test.
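A generic, hypothetical illustration of why verifying the call matters: a patched fake that is never invoked passes silently, while a `Mock` lets the test assert the intended code path was actually exercised.
```python
from unittest.mock import Mock

def detect(path: str, sniffer) -> str:
    # Toy function standing in for the code under test.
    return sniffer(path)

def test_detect_uses_sniffer():
    sniffer = Mock(return_value="application/pdf")
    assert detect("a.pdf", sniffer) == "application/pdf"
    sniffer.assert_called_once_with("a.pdf")  # fails if the path isn't taken
```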
**Summary**
The implementation of `tempfile.NamedTemporaryFile` on Windows Python is
problematic in certain situations.
In particular, it raises `PermissionError` when attempting to access the
temporary file by name rather than just by the file-descriptor returned
by the context-manager.
Remedy this situation by using `tempfile.TemporaryDirectory` instead and
using a file name of our choosing. The temporary directory is deleted
with all its contents when the context manager closes so the effect is
the same and does not produce the error on Windows.
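A minimal sketch of the substitution (the file name and payload are arbitrary):
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "scratch.bin")  # a name of our choosing
    with open(tmp_path, "wb") as f:
        f.write(b"payload")
    # Reopening by name works on Windows too; the NamedTemporaryFile
    # pattern can raise PermissionError here.
    with open(tmp_path, "rb") as f:
        data = f.read()
# The directory and its contents are deleted when the context manager exits.
```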
**Summary**
Place dispatch to file-type specific partitioners in alphabetical order
by file-type to ease scanning and speed up location by scrolling.
No code changes, only block line moves.
### Summary
- Bump unstructured.paddleocr to `2.8.0.1` which removed `lmdb`
dependency due to license issue.
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
### Summary
Adds a CI check to ensure that packages added as dependencies are
appropriately licensed. All of the `.txt` files in the `requirements`
directory are checked with the exception of:
- `constraints.txt`, since those are not installed and are instead
conditions on the other dependency files
- `dev.txt`, since those are for local development and not shipped as
part of the `unstructured` package
- `extra-pdf-image.txt`: the `extra-pdf-image.in` file is checked
instead, since checking `extra-pdf-image.txt` pulls in NVIDIA GPU-related
packages with an `Other/Proprietary` license type, and there's no good
way to exclude those without adding `Other/Proprietary` to the
allowed-licenses list.
### Testing
The new `check-licenses` job should pass in CI.
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.
**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.
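A minimal sketch of that `lxml` mechanism, with illustrative class and tag registrations rather than the parser's actual ones:
```python
from lxml import etree

class BlockItem(etree.ElementBase):
    """Element that starts a new document element, e.g. <p>, <div>."""
    is_phrasing = False

class Phrasing(etree.ElementBase):
    """Inline element contributing text to the current paragraph."""
    is_phrasing = True

# Register element classes by tag so lxml instantiates them during parsing.
lookup = etree.ElementNamespaceClassLookup()
namespace = lookup.get_namespace(None)
namespace["p"] = BlockItem
namespace["b"] = Phrasing

parser = etree.HTMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<p>foo <b>bar</b></p>", parser)
print(type(root.find(".//p")).__name__)  # BlockItem
print(type(root.find(".//b")).__name__)  # Phrasing
```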
The new parser solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.
The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behavior is based on the HTML Standard
and reliably produces proper and explainable results even for novel
cases.
Fixes #2325, fixes #2562, fixes #2675, fixes #3168, fixes #3227, fixes #3228, fixes #3230, fixes #3237, fixes #3245, fixes #3247, fixes #3255, fixes #3309
### BEHAVIOR DIFFERENCES
#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
```html
<p>foo <b>bar <i>baz</i> bada</b> bing</p>
```
produces:
```json
{
"emphasized_text_contents": ["bar", "baz", "bada"],
"emphasized_text_tags": ["b", "bi", "b"]
}
```
whereas previously it would have produced:
```json
{
"emphasized_text_contents": ["bar baz bada", "baz"],
"emphasized_text_tags": ["b", "i"]
}
```
#### `<pre>` text is preserved as it appears in the HTML
Except that a leading newline is removed if present (it must be at
position 0 of the text). Likewise, a trailing newline is stripped, but
only if it appears in the very last position ([-1]) of the `<pre>` text.
The old parser stripped all leading and trailing whitespace.
Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and ` ` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
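For non-`<pre>` text, the normalization rule amounts to this (a sketch, not the parser's actual function):
```python
import re

def normalize(text: str) -> str:
    # Collapse runs of whitespace (including \t and \n) to a single space
    # and trim both ends, as a browser does outside <pre>.
    return re.sub(r"\s+", " ", text).strip()

assert normalize("  foo\t\n bar  ") == "foo bar"
```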
#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial code complexity and reducing readability and
maintainability.
#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
- Related: _No paragraph-break on `<br/><br/>`_
#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Add object detection (OD) metrics for CI.
---------
Co-authored-by: Paweł Kmiecik <pawel.kmiecik@deepsense.ai>
Co-authored-by: Michał Martyniak <64484917+micmarty-deepsense@users.noreply.github.com>
### Description
Migrate the SharePoint connector to v2 and, in the process, refactor the
majority of the connector. It now pulls in much more content from the
SDK at index time, including permissions data if the relevant parameters
are passed in. HTML content generated from a SitePage is isolated to the
HTML in the `CanvasContent1` and `LayoutWebpartsContent` fields returned
by the SDK.
Some TODOs were left in there for future iterations. Currently only
document and site-page content is being pulled in from SharePoint, but
SharePoint has more content types than that, such as lists; a note was
left in the code about supporting the other types.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: vangheem <vangheem@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.
**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.
The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.
Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter,
and it partitions that HTML consistently with how it would be rendered
in the browser.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download`. `nltk.download` is removed in favor of a `.tgz` file
containing the appropriate NLTK data files, with a SHA256 hash check to
validate the download. An error is now raised if `nltk.download` is
invoked.
The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.
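A minimal sketch of the hardened-download approach, where the URL and hash are placeholders rather than the values the library actually uses:
```python
import hashlib
import io
import tarfile
import urllib.request

def download_nltk_data(url: str, expected_sha256: str, dest_dir: str) -> None:
    payload = urllib.request.urlopen(url).read()
    # Refuse to extract anything whose hash does not match the pin.
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("SHA256 mismatch; refusing to extract NLTK data")
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        tar.extractall(dest_dir)
```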
### Testing
1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages
download_nltk_packages()
```
3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded
data.
---------
Co-authored-by: Steve Canny <stcanny@gmail.com>
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
The table metrics considering spans are not used and they mess with the
output, so I have removed that code. I have, however, left
`table_as_cells` in the source code, as it may still be useful for users.
This pull request adds table detection metrics.
One case I considered:
Case: two predicted tables are matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one table?
There are two subcases:
- the table was predicted by OD as two sub-tables (split in half, giving
two non-overlapping sub-tables) -> in my opinion both are correct
- it is a false positive from the table-matching script in
`get_table_level_alignment` -> 1 good, 1 wrong
As we don't have bounding boxes, I followed the notebook's calculation
script and assumed the pessimistic, second-subcase interpretation.
### Description
Allow users to pass in a reference to a custom-defined stager via the
CLI. Checks are run on the passed-in instance to verify it is a subclass
of the `UploadStager` interface.
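A sketch of that check, written generically so it is self-contained; the real code verifies against the `UploadStager` interface exported by the ingest package:
```python
import importlib

def load_custom_class(dotted_path: str, interface: type):
    # dotted_path is supplied on the CLI, e.g. "my_pkg.stagers.MyStager"
    # (a hypothetical path used only for illustration).
    module_name, _, class_name = dotted_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    if not (isinstance(cls, type) and issubclass(cls, interface)):
        raise TypeError(f"{dotted_path} is not a subclass of {interface.__name__}")
    return cls
```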
Change the `unstructured-client` pin to set a minimum version instead of
a maximum version, and run `make pip-compile`.
Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).