unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-07 23:57:32 +00:00

Author	SHA1	Message	Date
John	09f86f28fb	Update partition_pdf docstring (#2292) add `extract_element_types` to partition_pdf docstring and reword other parameter descriptions	2023-12-19 04:52:15 +00:00
Steve Canny	0c7f64ecaa	rfctr(chunking): generalize PreChunkBuilder (#2283) To implement inter-pre-chunk overlap, we need a context that sees every pre-chunk both before and after it is accumulated (from elements). - We need access to the pre-chunk when it is completed so we can extract the "tail" overlap to be applied to the next chunk. - We need access to the as-yet-unpopulated pre-chunk so we can add the prior tail to it as a prefix. This "visibility" is split between `PreChunkBuilder` and the pre-chunker itself, which handles `TablePreChunk`s without the builder. Move `Table` element and `TablePreChunk` formation into `PreChunkBuilder` such that _all_ element types (adding `Table` elements in particular) pass through it. Then `PreChunkBuilder` becomes the context we require. The actual overlap harvesting and application will come in a subsequent commit.	2023-12-18 22:21:34 +00:00
cragwolfe	9efc22c0fc	build: release commit for 0.11.5 (#2285 ) Also fix broken link in docs. 0.11.5	2023-12-16 18:25:55 -08:00
Steve Canny	36e81c3367	rfctr(chunking): extract general-purpose objects to base (#2281 ) Many of the classes defined in `unstructured.chunking.title` are applicable to any chunking strategy and will shortly be used for the "by-character" chunking strategy as well. Move these and their tests to `unstructured.chunking.base`. Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because it will be generalized in a subsequent PR to also take `Table` elements such that inter-pre-chunk overlap can be implemented. Otherwise, no logic changes, just moves.	2023-12-16 17:28:15 +00:00
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Steve Canny	70cf141036	rfctr: extract ChunkingOptions (#2266 ) Chunking options for things like chunk-size are largely independent of chunking strategy. Further, validating the args and applying defaults based on call arguments is sophisticated to make its use easy for the caller. These details distract from what the chunker is actually doing and would need to be repeated for every chunking strategy if left where they are. Extract these settings and the rules governing chunking behavior based on options into its own immutable object that can be passed to any component that is subject to optional behavior (pretty much all of them).	2023-12-15 19:51:02 +00:00
cragwolfe	8ba1bedfca	build: release commit for 0.11.4 (#2275 ) also cleans up the CHANGELOG to reflect the previous release of 0.11.2 0.11.4	2023-12-15 00:06:22 +00:00
Steve Canny	aa7794a566	rfctr(chunking): move add_chunking_strategy() decorator up (#2265 ) The chunking subpackage `unstructured.chunking` currently contains only the `title` module and the `@add_chunking_strategy()` decorator is located in that module even though it has no special relationship to the `by_title` chunking strategy. Move it to the `__init__.py` module such that it is exported from `unstructured.chunking`. Adjust all references, pretty much one per partitioner, to import it from there. This prepares the way for further separation of the chunking package into modules, including a new `character` module for the `by_character` chunking strategy.	2023-12-14 19:16:16 +00:00
John	7895d4e0a7	pdf rfctr (#2260 ) Refactor `_process_pdfminer_pages` by extracting logic into helper functions. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-14 08:16:38 +00:00
Yao You	5f5ff6319f	fix: consider text in cid code as invalid in hi_res (#2259) This PR addresses [CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969) - pdfminer sometimes fails to decode text in a pdf file and returns cid codes as text - now that text will be considered invalid and be replaced with ocr results in `hi_res` mode ## test This PR adds unit tests for the utility functions. In addition, the file below would return elements with text in cid code on main but proper ascii text with this PR: [005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf) This change improves both cct accuracy and %missing scores: before: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.681 0.267 0.266 105 cct-%missing 0.086 0.159 0.159 105 ``` after: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.697 0.251 0.250 105 cct-%missing 0.071 0.123 0.122 105 ``` [CORE-2969]: https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-14 06:49:23 +00:00
Austin Walker	d594c06a3e	fix: handle delimiter bug in partition_csv (#2224 ) Closes #2218. When a csv has commas in its content, and the delimiter is something else, Pandas may throw an error. We can sniff the csv and get the correct delimiter to pass to Pandas. To verify, try partitioning the file in the linked bug.	2023-12-13 23:57:46 +00:00
Steve Canny	cbeaed21ef	rfctr: rename pre chunk (#2261) The original naming for the precursor to a chunk in `chunk_by_title()` was conflated with the idea of how these element subsequences were bounded (by document-section) for that strategy. I mistakenly picked that up as a universal concept but in fact no notion of section arises in the `by_character` or other chunking strategies. Fix this misconception by using the name `pre-chunk` for this concept throughout.	2023-12-13 23:13:57 +00:00
Steve Canny	74d089d942	rfctr: skip CheckBox elements during chunking (#2253 ) `CheckBox` elements get special treatment during chunking. `CheckBox` does not derive from `Text` and can contribute no text to a chunk. It is considered "non-combinable" and so is emitted as-is as a chunk of its own. A consequence of this is it breaks an otherwise contiguous chunk into two wherever it occurs. This is problematic, but becomes much more so when overlap is introduced. Each chunk accepts a "tail" text fragment from its preceding element and contributes its own tail fragment to the next chunk. These tails represent the "overlap" between chunks. However, a non-text chunk can neither accept nor provide a tail-fragment and so interrupts the overlap. None of the possible solutions are terrific. Give `Element` a `.text` attribute such that _all_ elements have a `.text` attribute, even though its value is the empty-string for element-types such as CheckBox and PageBreak which inherently have no text. As a consequence, several `cast()` wrappers are no longer required to satisfy strict type-checking. This also allows a `CheckBox` element to be combined with `Text` subtypes during chunking, essentially the same way `PageBreak` is, contributing no text to the chunk. Also, remove the `_NonTextSection` object which previously wrapped a `CheckBox` element during pre-chunking as it is no longer required.	2023-12-13 20:22:25 +00:00
Yao You	36e4639e05	fix: image may be scaled too large for tesseract (#2252) This PR addresses [CORE-2965](https://unstructured-ai.atlassian.net/browse/CORE-2965) by limiting zoom factor so that the scaled image can still be processed by tesseract. - tesseract has a 2^31 byte limit on image data - occasionally an image may be scaled too much and larger than that size - fix limits the scaling factor so that we never scale an image larger than what tesseract can handle ## test A unit test is added in this PR to test an unlikely case where we'd scale an image a few thousand times and massively exceed the limit without the fix. Unstructured reviewers can also use the document in the ticket to test. [CORE-2965]: https://unstructured-ai.atlassian.net/browse/CORE-2965?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ	2023-12-13 19:35:05 +00:00
John	d3a404cfb5	pdfminer bug (#2244) Closes #2212. ### Summary This PR implements logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the `hi_res` pipeline (discussed in [this slack channel](https://unstructuredw-kbe4326.slack.com/archives/C057R3F8F7A/p1701807299018929)). ### Testing PDF: [NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13554149/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf) ``` elements = partition_pdf( filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf", strategy="hi_res", ) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-13 00:51:38 +00:00
Steve Canny	21bc67f52f	rfctr: improve element typing (#2247 ) In preparation for work on generalized chunking including `chunk_by_character()` and overlap, get `elements` module and tests passing strict type-checking.	2023-12-12 23:12:23 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Yuming Long	529d1f6edb	Chore: put tesseract multiple languages splitter "+" in constant (#2226 ) ^^^	2023-12-11 22:20:37 +00:00
Roman Isecke	ac302689a0	chore: update sphinx ingest docs with new connectors (#2245 ) Replacing https://github.com/Unstructured-IO/unstructured/pull/2243	2023-12-11 21:29:41 +00:00
Christine Straub	da7ac625b1	Feat: save tables in PDF's as images (#2229 ) closes #2222. ### Summary The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This filename is presented in the `image_path` metadata field for the Table element. The default would be to not do this. ### Testing PDF: [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` elements = partition_pdf( filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf", strategy="hi_res", infer_table_structure=True, extract_element_types=['Table'], ) ```	2023-12-11 19:14:41 +00:00
Roman Isecke	cc05e948ff	chore: sensitive info connector audit (#2227 ) ### Description All other connectors that were not included in https://github.com/Unstructured-IO/unstructured/pull/2194 are now updated to follow the new pattern and mark any variables as sensitive where it makes sense. Core changes: * All connectors now support an `AccessConfig` to mark data that's needed for auth (i.e. username, password) and those that are sensitive are designated appropriately using the new enhanced field. * All cli configs on the cli definition now inherit from the base config in the connector file to reuse the variables set on that dataclass * The base writer class was updated to better generalize the new approach given better use of dataclasses * The base cli classes were refactored to also take into account the need for a connector and write config when creating the respective runner/writer classes. * Any mismatch between the cli field name and the dataclass field name were updated on the dataclass side to not impact the user but maintain consistency * Add custom redaction logic for mongodb URIs since the password is expected to be a part of it. Now this: `"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"` -> `"mongodb+srv://ingest-test-user:*REDACTED@ingest-test.hgaig.mongodb.net/"` in the logs Bundle all fsspec based files into their own packages. * Refactor custom `_decode_dataclass` used for enhanced json mixin by using a monkey-patch approach. The original approach was breaking on optional nested dataclasses when serializing since the other methods in `dataclasses_json_core` weren't using the new method. By monkey-patching the original method with a new one, all other methods in that library would use the new one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-11 17:37:49 +00:00
Roman Isecke	bd5b3535ac	feat: support glob filters for fsspec source connectors (#2204 ) ### Description The local source connector already supports filters on a list of glob filters optionally provided. This same code was leveraged in the base fsspec connector to filter on the full file paths returned from the underlying fs library.	2023-12-08 16:08:40 +00:00
Christine Straub	4ad01efe23	feat: improve reading order (#2219 ) Closes GH Issue #2208.	2023-12-07 23:21:10 -08:00
Klaijan	46cb3060ac	fix: ignore connector extract if no connector folder available (#2198) When no connector is provided in the folder structure, the code gets the filename as the connector instead. Fix the code so that if the folder structure has no connector subfolder, it leaves the connector field blank or None.	2023-12-07 20:41:17 +00:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
qued	231f04eb84	chore: update api key link (#2182 ) Update to API key link per Slack convo. There may be more that's needed, @ron-unstructured please run with this PR if it's the right solution, close it if not, or make changes if needed. I haven't looked for other places where the link might need to be changed. The link appears to work, but any further investigation or testing is appreciated. Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-12-06 23:46:32 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writers to use class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. `googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
Christine Straub	ed76b11b1a	Refactor: support image extraction (#2201 ) ### Summary This PR is the second part of the "image extraction" refactor to move it from unstructured-inference repo to unstructured repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/299. This PR adds logic to support extracting images. ### Testing `git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../` ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", extract_images_in_pdf=True, ) print("\n\n".join([str(el) for el in elements])) ```	2023-12-05 18:22:29 +00:00
Roman Isecke	c5cb216ac8	chore: lint for print statements in ingest code (#2215) ### Description Given the filtering in the ingest logger, anything going to console should go through that. This adds a linter that only checks for `print()` statements in the ingest code and ignores them elsewhere for now.	2023-12-05 16:42:23 +00:00
John	8fa5cbf036	build(ci): rm unneeded call to get_api_key in test (#2199 ) Follow-up PR to [https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195). Removes unnecessary calls to `get_api_key()`. That helper function is supposed to only be used for tests decorated by @pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI") (which are skipped because those tests are partitioning pdf/jpg files). These tests are partitioning emails and rely on the MockResponse at the top of the file, so they don't need to call `get_api_key()` and it can simply be removed from them.	2023-12-03 21:28:05 -08:00
rvztz	ce905dd098	feat: Weaviate destination connector (#1963 ) Closes #1781. - Adds a Weaviate destination connector - The connector receives a host for the weaviate instance and a weaviate class name. - Defines a weaviate schema for json elements. - Defines the pre-processing to conform unstructured's schema to the proposed weaviate schema.	2023-12-01 22:27:41 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
John	e5bdf7fb43	chore: unstructured python client (#2195 ) ### Summary Closes #2033 Updates `partition_via_api` to use `UnstructuredClient` for api calls instead of `requests`. Updates associated tests. Note: This PR does not update `partition_multiple_via_api` as documentation in `unstructured-python-client` indicates it does not support multiple files. A new issue should be opened to add that functionality to `unstructured-python-client`. --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-01 18:49:59 +00:00
Roman Isecke	f8aea71f3a	chore: refactor ingest unit test to run in its own CI step (#2190) ### Description Currently the ingest unit tests are running in both the source and destination ingest steps. This PR moves them out into their own step that can run on a more basic runner since it doesn't require multiple cores/cpus for parallelization. Further optimizations: * Added the changelog step as a dependency for the ingest tests to avoid running them (fail fast) if the pipeline has already failed due to the changelog not being updated. * The cache was never actually being saved if it needed to be recreated, a save job was added at the end of each custom action	2023-11-30 20:57:00 +00:00
cragwolfe	039ae17fdd	build(release): release commit for 0.11.2 (#2191 ) 0.11.2	2023-11-29 20:30:19 -08:00
Ronny H	d80abf0714	Reorganized the Examples section in Documentation & add Databricks example (#1855) To test: > cd docs && make html Change logs: * Examples are reorganized to have their own page * Removed two old examples, i.e. "file-utils" & "sentiment analysis". * Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" & "Multi-Files Processing with S3 Connector and API" * Reorganized and added detailed API documentation: (i) usage, (ii) SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and validation errors	2023-11-30 01:24:43 +00:00
Ahmet Melek	ed08773de7	feat: add pinecone destination connector (#1774 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1414 Closes #2039 This PR: - Uses Pinecone python cli to implement a destination connector for Pinecone and provides the ingest readme requirements [(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist) for the connector - Updates documentation for the s3 destination connector - Alphabetically sorts setup.py contents - Updates logs for the chunking node in ingest pipeline - Adds a baseline session handle implementation for destination connectors, to be able to parallelize their operations - For the [bug](https://github.com/Unstructured-IO/unstructured/issues/1892) related to persisting element data to ingest embedding nodes; this PR tests the [solution](https://github.com/Unstructured-IO/unstructured/pull/1893) with its ingest test - Solves a bug on ingest chunking params with [bugfix on chunking params and implementing related test](`69e1949a6f`) --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-11-29 22:37:32 +00:00
pravin-unstructured	341f0f428c	Add coco staging brick to unstructured base (#2180 ) 0.11.1	2023-11-29 20:55:23 +00:00
Roman Isecke	c028a14ebf	chore: enable azure destination CI tests (#2172 ) Add AZURE_DEST_CONNECTION_STR to list of env vars in ci ingest test	2023-11-29 20:44:18 +00:00
Yuming Long	92dae8cd1a	Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137) ### Summary Add a procedure to repair PDF when the PDF structure is invalid for `PDFminer` to process. This PR handles two cases of `PSSyntaxError Invalid dictionary construct: ...`: * PDFminer opens the entire document and creates the pages generator on `PDFPage.get_pages(fp)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack) * PDFminer's interpreter processes a single page on `interpreter.process_page(page)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue) Additional tech details: * Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`, which is used for repairing PDF. * Add new dependency `pypdf` in `requirements/extra-pdf-image.in`, which is used to find the error page from entire document by reading the PDF file again (can't find a way to split pdf in PDFminer). * Refactor the `is null` check for `get_uris_from_annots`, since the root cause is that `get_uris` passed a None `annots` to `get_uris_from_annots`, so the Null check should happen in `get_uris`. * Add more type protection in `get_uris_from_annots` when using any `PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This should fix: * https://github.com/Unstructured-IO/unstructured/issues/1922 where `annotation_dict` is a `PDFObjRef` * https://github.com/Unstructured-IO/unstructured/issues/1921 where `rect` is a `PDFObjRef` ### Test Added three test files (both are larger than 500 KB) for unittests to test: * Repair entire doc * Repair one page * Reprocess failure after repairing one page (just return the elements before error page in this case). * Also seems like splitting the document into smaller pages could fix this problem, but not sure why. For example, I saw error from reprocess in the whole [cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf) doc, but no error when i split the pdf by error page.... * tested if i can repair the entire doc again in this case, saw other error which means repairing is not helping imo * PDFminer can process the whole doc after pikepdf only repaired the entire doc in the first place, but we can't repair by pages in this way --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-29 19:00:15 +00:00
Klaijan	2d450c48e7	fix: skipped file not found error (#2188 ) Create the file outside the if-clause.	2023-11-29 18:31:59 +00:00
Roman Isecke	7ad8e88a95	feat: leverage logger to hide sensitive data in ingest logs (#2175) ### Description Modify the logger being used by ingest to leverage a new class that inherits from `logging.Formatter` and adds in some middleware to update the message being logged to omit any sensitive content. It does this by dynamically pulling out any valid json from the string being logged and running that through a `hide_sensitive_fields` method which updates any values that are considered sensitive. Replaces the original json strings with the `json.dumps` version of the new dictionary.	2023-11-29 18:16:23 +00:00
qued	1576e0b891	docs: update docker image link (#2186 ) Updated docker image link in documentation to be consistent with README.	2023-11-29 11:40:09 -06:00
Roman Isecke	b951d73a9b	feat: add logging to ingest CLI for tests being skipped at the end (#2174 ) ### Description Often times there are tests being skipped either due to missing env vars or explicitly defined in the base script but these get lost in the logs. This PR updates the scripts to leverage a custom error code if being skipped due to missing env vars and this custom error code is being caught by the base script and logs all files being skipped to a file. At the end of the script, this file gets logged in the CI output.	2023-11-29 13:41:19 +00:00
Klaijan	0aae1faa54	feat: add visualize param to command and add test (#2178 ) - Add `visualize` parameter to the click command -- now callable using `--visualize` flag to show the progress bar. - Refactor the name.	2023-11-29 01:05:55 +00:00
rvztz	50b1431c9e	rvztz/hubspot ingest connector (#1760) Closes #1843 Ingest connector for HubSpot. Supports: - Calls: Logs from calls related to contacts, companies and tickets - Communications: Logs from SMS/Whatsapp related to contacts, companies and tickets - Notes: Notes related to CRM notes - Products: CRM products - Emails: Logs from emails sent to CRM objects. - Tasks: CRM tasks From each record, `body`/`description` information is grabbed. When a title property is available, this is registered at the beginning of the output file. The CLI receives three params: - `api-token`: [Private app](https://developers.hubspot.com/docs/api/private-apps) token. - `object-types`: One or more of the supported objects listed above, as a comma separated list: `calls,products,tasks` - `custom-properties`: Custom properties to grab information from. Must be in the form `<object_type>:<custom_property_id>,<object_type>:<custom_property_id>` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rvztz <rvztz@users.noreply.github.com>	2023-11-28 23:07:57 +00:00
Roman Isecke	30cbc420a0	bug: fsspec output filepath including base directory (#2146) ### Description When passing in a remote path for fsspec-based source connectors, the base directory was always being included in the output path itself. This was updated to exclude the base directory and only include any child directories relative to the base one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-28 14:19:42 +00:00
Roman Isecke	2bb463d006	feat: support both single and batch ingest docs (#2105) ### Description There are some source ingest connectors that would be more efficient to read the content in batches rather than use an entire process per document. For example, reading from ElasticSearch. Given an index with possibly hundreds of documents, reading each one individually is not as optimal as reading in batches. To try and maintain as much of the ingest doc paradigm already being supported, a new class `BaseIngestDocBatch` was added to handle reading in batches. It produces a list of `BaseSingleIngestDoc` which is what all current implementations were renamed to. This list is generated after it runs its `get_files` method. Past the source node, all other steps in the pipeline should not be affected, this is just an optimization for the read step. Additional Changes: * Removed use of jq and instead converted this into a fields filter on the content to let the database handle the filtering and limit the amount of data being pulled in. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-27 19:25:30 +00:00
Klaijan	877a30aed3	fix: fix eval ci to skip the overwrite if none exists (#2159) Currently the `check-diff-evaluation-metrics` only runs when there is a file to perform evaluation on. Add the checking condition to skip the action when there is none. Additionally, more refactoring is done and a `visualize` option is added for both evaluation calculation functions.	2023-11-25 15:46:05 +00:00
Yuming Long	6c08c136ae	ci: fix broken API unit test for using unsupported `fast` strategy for images (#2144 ) ### Summary This should fix the broken unit test on main CI * change the strategy in `test_partition_multiple_via_api_valid_request_data_kwargs` from `fast` to `auto`, since the test was using `fast` for images, and we don't support it.	2023-11-22 17:35:04 -08:00
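
The delimiter fix described in #2224 above (sniff the CSV, then pass the detected delimiter to Pandas) can be sketched with the standard library alone. This is only an illustration of the idea, not the code from that commit: `detect_delimiter` is a hypothetical helper, the candidate delimiter set is an assumption, and the sample data is made up.

```python
import csv
import io


def detect_delimiter(text: str, candidates: str = ",;|\t") -> str:
    """Guess the delimiter of CSV text (hypothetical helper, not unstructured's API)."""
    # csv.Sniffer inspects the sample and returns a Dialect; restricting it
    # to a few candidate delimiters keeps the guess from drifting to
    # characters that merely occur often in the cell contents.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter


# Made-up semicolon-delimited data whose cells also contain commas.
sample = "name;note\nAda;likes tea, coffee\nBob;reads a, b, c\n"

delimiter = detect_delimiter(sample)
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
```

The detected value could then be handed to a reader that accepts an explicit delimiter, such as the `sep` argument of `pandas.read_csv`.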

1282 Commits