unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-11 15:12:53 +00:00

Author	SHA1	Message	Date
Matt Robinson	09d84bc46b	build(deps): version bumps for 2024-08-26 (#3567 ) ### Summary Version bumps for 2024-08-26.	2024-08-26 15:15:25 -04:00
Christine Straub	ac10ba4fc1	build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561 ) ### Summary - Bump `unstructured.paddleocr` to 2.8.1.0 - Remove `opencv-python` and `opencv-contrib-python` constraint pins - Fix `0.15.7` changelog	2024-08-23 14:17:29 -07:00
Steve Canny	32bb77aafb	fix(file): no default OLE subtype (#3516 ) Summary Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case. Additional Context DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip format which "contains" a filesystem of discrete files and directories. An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The `filetype` module does a good job of this but it is not perfect and does not identify MSG files. Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that `filetype` is not completely reliable at detecting these types. Change the behavior to remove the assumption of MSG format. `_OleFileDifferentiator` returns `None` in this case and filetype detection falls back to use filename-extension. Note a file with no filename and no metadata_filename or an incorrect extension will not be correctly identified in this case, however we're assuming for now that will be rare in practice.	2024-08-22 19:16:53 +00:00
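
A minimal sketch of the container check and the extension fallback described in the commit above. It is not the project's `_OleFileDifferentiator`: the function names and the extension table are hypothetical, and the content-based step that tells DOC, PPT, XLS, and MSG apart is left out. The 8-byte signature is the standard Compound File Binary Format header.

```python
from pathlib import Path
from typing import Optional

# Signature that opens every OLE / Compound File Binary Format container.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table used only for the fallback step.
OLE_EXTENSIONS = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(head: bytes) -> bool:
    """True when the first 8 bytes match the OLE container signature."""
    return head[:8] == OLE_MAGIC


def guess_ole_subtype(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Fall back to the filename extension; return None when nothing is known."""
    if not is_ole_container(head):
        return None
    if filename is None:
        # No filename and no extension: the subtype stays undetermined.
        return None
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

With no filename the sketch returns `None` rather than assuming MSG, which mirrors the behavior change the commit describes.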
John	b4a6aa5559	chore: remove fsspec pin (#3554 ) remove fsspec pin	2024-08-21 21:57:42 +00:00
Steve Canny	03e0ed3519	rfctr(docx): DOCX emits std minified .text_as_html (#3545 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_docx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - nested tables appear as their extracted text in the parent cell (no nested `<table>` elements in `.text_as_html`). - DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements).	2024-08-21 18:54:21 +00:00
John	f135344738	chore: remove scipy and packaging pins (#3550 ) Remove scipy and packaging constraint pins	2024-08-21 16:05:19 +00:00
John	604cadfb7e	chore: remove ipython pin (#3548 ) This PR is stacked on https://github.com/Unstructured-IO/unstructured/pull/3538 and https://github.com/Unstructured-IO/unstructured/pull/3547. It removes dependency pins for IPython, anyio, and pyparsing. It also updates the label-studio-sdk import statement so we don't have to keep that pinned, and makes some minor type-hinting edits. Label Studio had a breaking change in their 1.13.0 [release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)	2024-08-21 00:06:31 +00:00
Christine Straub	01dbc7b473	fix: `nltk` data download path to prevent redundant nested directories (#3546 ) Closes #3543. ### Summary This PR addresses an issue with the NLTK data download process. Previously, when downloading NLTK data, a nested "nltk_data" directory was created within the parent "nltk_data" directory if the parent directory already existed. This redundant directory structure led to two significant problems: - errors in checking if data had already been downloaded, potentially causing redundant downloads in subsequent calls. - failures in loading models from the downloaded NLTK data due to incorrect path resolution. This fix modifies the NLTK data download logic to prevent creation of unnecessary nested directories. If the download path ends with "nltk_data" and that directory already exists, we now use the existing directory instead of creating a new nested one. ### Testing CI should pass. 0.15.7	2024-08-20 18:56:59 +00:00
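
A rough sketch of the path rule described in the commit above: reuse a directory that is already named `nltk_data` instead of nesting another one inside it. The function name `resolve_nltk_dir` is hypothetical, and the `else` branch (appending `nltk_data` to any other path) is an assumption, since the commit message only describes the first case.

```python
from pathlib import Path


def resolve_nltk_dir(download_dir: str) -> Path:
    """Return the directory NLTK data should live in, without nesting."""
    path = Path(download_dir)
    if path.name == "nltk_data" and path.is_dir():
        # Existing "nltk_data" directory: use it as-is.
        return path
    # Assumption: any other path gets an "nltk_data" child directory.
    return path / "nltk_data"
```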
Matt Robinson	1f8030dd0e	fix(CVE-2024-39705): bump to `nltk` 3.9.1; correct model download issues (#3541 ) ### Summary Bumps to `nltk==3.9.1` and resolves [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An NLTK version bump was originally introduced in #3512 and rolled back in #3527 because `nltk==3.8.2` was yanked from PyPI, and also because we observed significant slowdowns in processing time after bumping to `nltk==3.8.2`. The processing time regression does not appear in `nltk==3.9.1`. ### Testing After the bump, CI should pass. Additionally we verified locally that files processing takes around the amount of time we would expect for a long `.docx` file. ```python In [1]: from unstructured.partition.auto import partition In [2]: filename = "test-doc.docx" In [3]: %timeit partition(filename=filename) 3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` 0.15.6	2024-08-19 20:59:36 +00:00
Steve Canny	a861ed8fe7	feat(chunk): split tables on even row boundaries (#3504 ) Summary Use more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. Additional Context Table splitting now has the following characteristics: - `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree. - `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized". - The table is divided at a whole-row boundary whenever possible. - A row is broken at an even-cell boundary when a single row is larger than the chunking window. - A cell is broken at an even-word boundary when a single cell is larger than the chunking window. - `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.	2024-08-19 18:56:53 +00:00
Christine Straub	99f72d65ba	ci: fix ingest test fixtures update (#3532 )	2024-08-16 16:37:33 -07:00
Christine Straub	fc26426310	feat: replace `pytesseract` with `unstructured.pytesseract` fork (#3528 ) This PR reverts `pytesseract` dependency to `unstructured.pytesseract` fork due to the unavailability of some recent release versions of `pytesseract` on PyPI. This PR also addresses an issue encountered during the publication of `unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI does not allow direct dependencies from Version Control System URLs like GitHub in the `install_requires` or `extras_require` sections of the `setup.py` file. 0.15.5	2024-08-16 10:34:22 -04:00
Matt Robinson	e64e09507a	build: update to latest base image (#3524 ) ### Summary Updates to the latest `wolfi-base` base image to pull in more recent package versions. A notable update is that upgrading to `libreoffice==24.2.5.2` resolves several CVEs. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-08-15 22:27:41 -07:00
Christine Straub	d0211cc41f	build: downgrade `nltk` version (#3527 ) This PR rolls `nltk` back to `3.8.1`. It was bumped to `3.8.2` in https://github.com/Unstructured-IO/unstructured/pull/3512, but `3.8.2` is no longer available on PyPI due to some issues (https://github.com/nltk/nltk/issues/3301).	2024-08-15 16:35:21 -07:00
Christine Straub	9b778e270d	fix: `pytesseract>=0.3.12` installation error while installing `pdf` extra (#3522 ) Closes #3521. This PR resolves an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`. ### Testing Run following command in main branch and this PR ``` pip uninstall -y pytesseract && pip install ".[pdf]" ``` Results - `main` branch ``` INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while. ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10) ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf" ``` - this `PR` `pytesseract-0.3.13` should be installed successfully. 0.15.4	2024-08-14 16:15:40 -05:00
Christine Straub	d6a84bdfbb	build(deps): update extra-paddleocr requirements (#3515 ) This PR removes custom index URL for `paddlepaddle` installation in `extra-paddleocr.in`, resolving `setup.py` configuration error. Now uses `paddlepaddle==3.0.0b1` directly from PyPI, simplifying installation process. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io> 0.15.3	2024-08-14 12:19:20 -05:00
Matt Robinson	7437f0a084	fix(CVE-2024-39705): update to latest `nltk` version (#3512 ) ### Summary Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by updating to `nltk==3.8.2` and closes #3511. This CVE had previously been mitigated in #3361. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> 0.15.2	2024-08-13 09:39:29 -04:00
Christine Straub	1158d8f695	Refactor image block extraction in `pdf` partitioning (#3514 ) Closes [#3503](https://github.com/Unstructured-IO/unstructured/issues/3503). ### Summary This PR prevents creation of `figures` directory for saving image blocks (`Image`, `Table`) when `extract_image_block_to_payload` parameter is set to True ### Testing ``` elements = partition_image( filename="example-docs/img/embedded-images-tables.jpg", strategy="hi_res", extract_image_block_types=["Image", "Table"], extract_image_block_to_payload=True, ) ``` Results: - `Main` Branch: `figures` directory is created. - `PR`: `figures` directory is not created.	2024-08-13 06:11:10 +00:00
Steve Canny	cbe1b35621	rfctr(chunk): prep for adding TableSplitter (#3510 ) Summary Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.	2024-08-12 18:04:49 +00:00
Christine Straub	d99b39923d	build(deps): Remove unstructured.paddlepaddle fork (#3506 ) This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we used `unstructured.paddlepaddle` fork to support `unstructured.paddleocr` on arm64 architecture. But currently, `unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work on `arm64` architecture. Also, `unstructured.paddleocr` with the latest version of the original `paddlepaddle` works on both `amd64` and `arm64` architectures. ### Testing ``` os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" elements = partition_pdf( filename=<file_path>, strategy="hi_res", infer_table_structure=True, ) ```	2024-08-09 22:04:22 +00:00
John	a2ae2ed646	chore: remove matplotlib constraint (#3505 )	2024-08-09 19:31:19 +00:00
Jake Zerrer	051be5aead	Remove unstructured.pytesseract fork (#3454 ) A second attempt at https://github.com/Unstructured-IO/unstructured/pull/3360, this PR removes unstructured's dependency on its own fork of `pytesseract`. (The original reason for the fork, the addition of `run_and_get_multiple_output`, was removed [here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).) --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-08-09 04:28:48 +00:00
John	2373eaa829	fix typo: pipline>pipeline (#3498 ) fix typo: pipline>pipeline	2024-08-08 18:53:47 +00:00
John	43ae0befa7	chore: bump botocore pin (#3493 ) bump botocore pin to match aiobotocore/s pin: `eae97439b3`	2024-08-07 21:41:53 +00:00
John	696155e614	chore: update importlib-metadata pin (#3491 )	2024-08-07 18:17:53 +00:00
John	6545f16e57	chore: remove cryptography pin and update test (#3482 ) remove cryptography pin, pin tenacity, and update test_unstructured_ingest/unit/connector/test_salesforce_connector.py	2024-08-07 15:25:23 +00:00
Pawel Kmiecik	eba12daeb2	feat: correct object detection metrics (#3490 ) This PR: - fixes an issue that made it impossible to compute OD metrics - adds per-class object detection metrics	2024-08-07 14:14:02 +00:00
John	24a1f298e5	chore: small edits (#3480 ) Add comments and fix decorators on some tests.	2024-08-06 19:21:43 +00:00
Steve Canny	73bef27ef1	fix(pptx): accommodate invalid image/jpg MIME-type (#3475 ) As described in #3381, some clients, perhaps including Adobe PDF Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior to v1.0.0, `python-pptx` would not load these images, which caused image extraction to fail. Update the `python-pptx` dependency to `v1.0.1` or above to ensure this upstream fix is always available. Fixes: #3381	2024-08-06 18:48:15 +00:00
Steve Canny	a468b2de3b	rfctr(csv): accommodate single column CSV files (#3483 ) Summary Improve factoring, type-annotation, and tests for `partition_csv()` and accommodate single-column CSV files. Fixes: #2616	2024-08-06 00:48:37 +00:00
David Potter	59ec64235b	chore: rename astra to astradb (#3458 ) DataStax wanted all references to be astradb instead of astra. As per @erichare We'll also have to do the same in unstructured-ingest :)	2024-08-05 20:41:02 +00:00
Austin Walker	7e887442c4	chore: Cut the 0.15.1 release (#3481 ) 0.15.1	2024-08-05 16:16:13 +00:00
Maciej Kurzawa	b749b891a7	fix: disabled checking max pages for images (#3473 ) Added fix related to https://github.com/Unstructured-IO/unstructured/pull/3431, which disables checking max pages for images	2024-08-02 14:25:08 +00:00
John	147514f6b5	feat: msg and email metadata (#3444 ) Update partition_eml and partition_msg to capture cc, bcc, and message id fields. Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files Testing ``` from unstructured.partition.email import partition_email from test_unstructured.unit_utils import example_doc_path elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True) print(elements) elements[0].metadata.to_dict() ``` Note to reviewers: Tests in `test_unstructured/partition/test_email.py` were refactored and rearranged to group similar tests together, so it will be easiest to review those changes commit by commit. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2024-08-01 19:24:17 +00:00
Christine Straub	0f057188c6	Improve pdfminer embedded image extraction in `pdf` partitioning (#3456 ) ### Summary This PR addresses an issue in `pdfminer` library's embedded image extraction process. Previously, some extracted "images" were incorrect, including embedded text elements, resulting in oversized bounding boxes. This update refines the extraction process to focus on actual images with more accurate, smaller bounding boxes. ### Testing PDF: [test_pdfminer_text_extraction.pdf](https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf) ``` elements = partition_pdf( filename="test_pdfminer_text_extraction", strategy=strategy, languages=["chi_sim"], analysis=True, ) ``` Results - this `PR` ![page1_layout_pdfminer](https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0) ![page1_layout_final](https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958) - `main` branch ![page1_layout_pdfminer](https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d) ![page1_layout_final](https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f)	2024-08-01 16:47:08 +00:00
Maciej Kurzawa	8fd216cc9f	feat/pdf-page-limit-in-hi-res (#3431 ) # Description: Passing the `max_pages` argument rejects PDF files that exceed this page limit when the `hi_res` strategy is chosen. By default, PDF files with any number of pages are parsed. # Testing: ```python from unstructured.partition.auto import partition elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError ```	2024-07-30 16:52:17 +00:00
Roman Isecke	482f093afb	feat: Add deprecation warning on import of any ingest code (#3443 ) ### Description Any time `unstructured.ingest` is imported, this deprecation warning gets emitted: ``` DeprecationWarning: unstructured.ingest will be removed in a future version ```	2024-07-30 15:06:21 +00:00
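
One common way to emit such a warning is a `warnings.warn` call at package import time. The sketch below wraps the call in a hypothetical helper, `warn_ingest_deprecated`, so it can be exercised on its own; where the project actually places the call is not stated in the commit message.

```python
import warnings


def warn_ingest_deprecated() -> None:
    """Emit the deprecation notice quoted in the commit message."""
    warnings.warn(
        "unstructured.ingest will be removed in a future version",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Python hides `DeprecationWarning` by default outside `__main__` and test runners, so a caller may need `-W default` or `warnings.simplefilter("default")` to see it.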
Steve Canny	4e61acc1c6	fix(file): fix OLE-based file-type auto-detection (#3437 ) Summary A DOC, PPT, or XLS file sent to partition() as a file-like object is misidentified as a MSG file and raises an exception in python-oxmsg (which is used to process MSG files). Fix DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound File Binary Format (CFBF). These can be reliably distinguished by inspecting magic bytes in certain locations. `libmagic` is unreliable at this or doesn't try, reporting the generic `"application/x-ole-storage"` which corresponds to the "container" CFBF format (vaguely like a Microsoft Zip format) that all these document types are stored in. Unconditionally use `filetype.guess_mime()` provided by the `filetype` package that is part of the base unstructured install. Unlike `libmagic`, this package reliably detects the distinguished MIME-type (e.g. `"application/msword"`) for OLE file subtypes. Fixes #3364	2024-07-25 17:25:41 +00:00
Steve Canny	432d209c36	fix(file): confirm or correct asserted DOCX, PPTX, and XLSX content types (#3434 ) Summary The `content_type` argument received by `partition()` from the API is sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed is that it gets the MS-Office bit right but falls down on distinguishing PPTX from DOCX or XLSX. Confirmation of these types is simple, fast, and reliable. Confirm all MS-Office `content_type` argument values asserted by callers of `detect_filetype()` and correct swapped values.	2024-07-24 20:32:58 +00:00
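
A sketch of how an asserted MS-Office 2007+ type could be confirmed, assuming the check looks inside the Zip archive for the conventional main-part name of each format. The part names are the usual ones written by Office, but the function name `confirm_ooxml_type` and this exact approach are assumptions, not the project's implementation.

```python
import io
import zipfile
from typing import Optional

# Conventional main-part paths inside each Office Open XML package.
_MAIN_PARTS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def confirm_ooxml_type(data: bytes) -> Optional[str]:
    """Return DOCX, PPTX, or XLSX based on archive contents, else None."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    for part, file_type in _MAIN_PARTS.items():
        if part in names:
            return file_type
    return None
```

A caller that was handed a DOCX `content_type` for a file containing `ppt/presentation.xml` would get `"PPTX"` back and could correct the swapped value.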
Christine Straub	560cc0e975	fix: update HuggingFaceEmbeddingEncoder to use `langchain_huggingface` instead of `langchain-community` (#3436 ) Similar to https://github.com/Unstructured-IO/unstructured/pull/3433. ### Summary This PR aims to update `HuggingFaceEmbeddingEncoder` to use `HuggingFaceEmbeddings` from `langchain_huggingface` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain. ### Testing ``` from unstructured.documents.elements import Text from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder embedding_encoder = HuggingFaceEmbeddingEncoder( config=HuggingFaceEmbeddingConfig() ) elements = embedding_encoder.embed_documents( elements=[Text("This is sentence 1"), Text("This is sentence 2")], ) query = "This is the query" query_embedding = embedding_encoder.embed_query(query=query) [print(e.embeddings, e) for e in elements] print(query_embedding, query) print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions()) ``` Expected behavior No deprecation warning should be displayed. The code should use the updated `HuggingFaceEmbeddings` class from the `langchain_huggingface` package.	2024-07-24 18:57:31 +00:00
Christine Straub	798dcc096c	fix: update OpenAIEmbeddingEncoder to use `langchain-openai` instead of `langchain-community` (#3433 ) Closes https://github.com/Unstructured-IO/unstructured/issues/3378. ### Summary This PR aims to update `OpenAIEmbeddingEncoder` to use `OpenAIEmbeddings` from `langchain-openai` package instead of the deprecated version from `langchain-community`. This resolves the deprecation warning and ensures compatibility with future versions of langchain.	2024-07-24 16:52:34 +00:00
Steve Canny	3fe5c094fa	rfctr(file): refactor detect_filetype() (#3429 ) Summary In preparation for fixing a cluster of bugs with automatic file-type detection and paving the way for some reliability improvements, refactor `unstructured.file_utils.filetype` module and improve thoroughness of tests. Additional Context Factor type-recognition process into three distinct strategies that are attempted in sequence. Attempted in order of preference, type-recognition falls to the next strategy when the one before it is not applicable or cannot determine the file-type. This provides a clear basis for organizing the code and tests at the top level. Consolidate the existing tests around these strategies, adding additional cases to achieve better coverage. Several bugs were uncovered in the process. Small ones were just fixed, bigger ones will be remedied in following PRs.	2024-07-23 23:18:48 +00:00
David Potter	441b3393b1	bugfix [OSS-67]: update import of pinecone exception (#3432 ) The pinecone python package moved their importing of PineconeApiException. Chroma `sleep` added because even though there is a `wait`, there is still some sort of timing issue.	2024-07-23 19:48:55 +00:00
Matt Robinson	b2f0620f2c	build(deps): version bumps for 2024-07-22 (#3427 ) ### Summary Weekly dependency bumps.	2024-07-22 20:49:40 +00:00
Steve Canny	49c4bd34be	rfctr(auto): add _PartitionerLoader (#3418 ) Summary Replace conditional explicit import of partitioner modules in `.partition.auto` with the new `_PartitionerLoader` class. This avoids unbound variable warnings and is much less noisy. `_PartitionerLoader` makes use of the new `FileType` property `.importable_package_dependencies` to determine whether all required packages are importable before dispatching the file to its partitioner. It uses `FileType.extra_name` to form a helpful error message when a dependency is not installed, so the caller knows which `pip install` extra to specify to remedy the error. `_PartitionerLoader` uses the `FileType` properties `.partitioner_module_qname` and `.partitioner_function_name` to load the partitioner once its dependencies are verified. Loaded partitioners are cached with module lifetime scope for efficiency.	2024-07-22 06:03:55 +00:00
Christine Straub	ec59abfabc	enhancement: improve text clearing process in `email` partitioning (#3422 ) ### Summary Currently, the email partitioner removes only `=\n` characters during the clearing process. However, email content sometimes contains `=\r\n` characters, especially when read from file-like objects such as `SpooledTemporaryFile` (the file type used in our API). This PR updates the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. ### Testing ``` filename = "example-docs/eml/family-day.eml" elements = partition_email( filename=filename, ) print(f"From filename: {elements[3].text}") with open(filename, "rb") as test_file: spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) elements = partition_email(file=spooled_temp_file) print(f"From spooled_temp_file: {elements[3].text}") ``` Results: - on `main` ``` From filename: Make sure to RSVP! From spooled_temp_file: Make sure to = RSVP! ``` - on `PR` ``` From filename: Make sure to RSVP! From spooled_temp_file: Make sure to RSVP! ``` 0.15.0	2024-07-19 18:18:02 +00:00
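
The `=\n` and `=\r\n` sequences mentioned above are quoted-printable soft line breaks. A minimal sketch of removing both forms with one regular expression follows; the helper name `remove_soft_line_breaks` is hypothetical and this is not the partitioner's actual cleaning code.

```python
import re

# "=" followed by LF or CRLF marks a soft line break in quoted-printable text.
_SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Join lines that were split by quoted-printable soft line breaks."""
    return _SOFT_BREAK.sub("", text)
```

For example, `remove_soft_line_breaks("Make sure to =\r\nRSVP!")` yields `"Make sure to RSVP!"`, matching the result reported for the PR branch.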
Roman Isecke	1df7908f03	feat: save file id for all fsspec connectors if present (#3405 ) ### Description If the id value exists in the stats response from fsspec, save it as a `file_id` field in the metadata being persisted on each element. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-07-19 13:30:21 +00:00
Christine Straub	0eb461acc2	refactor: restructure PDF/Image example document organization (#3410 ) This PR aims to improve the organization and readability of our example documents used in unit tests, specifically focusing on PDF and image files. ### Summary - Created two new subdirectories in the `example-docs` folder: - `pdf/`: for all PDF example files - `img/`: for all image example files - Moved relevant PDF files from `example-docs/` to `example-docs/pdf/` - Moved relevant image files from `example-docs/` to `example-docs/img/` - Updated file paths in affected unit & ingest tests to reflect the new directory structure ### Testing All unit & ingest tests should be updated and verified to work with the new file structure. ## Notes Other file types (e.g., office documents, HTML files) remain in the root of `example-docs/` for now. ## Next Steps Consider similar reorganization for other file types if this structure proves to be beneficial. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-07-18 22:21:32 +00:00
Roman Isecke	5d387030eb	bugfix: google drive connector metadata safegaurds (#3407 ) ### Description At times, the Google Drive response doesn't have some of the metadata we're grabbing to populate the `FileData` metadata. This is fine, but without the added safeguards, this can cause a `KeyError`.	2024-07-18 16:09:19 +00:00
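
A sketch of the kind of safeguard described above, assuming the response is a plain dictionary: `dict.get` returns `None` for an absent key where indexing would raise `KeyError`. The field names and the function `extract_metadata` are hypothetical and do not reflect the connector's real schema.

```python
from typing import Any, Dict, Optional


def extract_metadata(response: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Collect optional fields without failing when one is missing."""
    return {
        # Hypothetical field names; .get() yields None instead of KeyError.
        "name": response.get("name"),
        "modified": response.get("modifiedTime"),
        "version": response.get("version"),
    }
```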
Steve Canny	e99e5a8abd	rfctr(file): make FileType enum a file-type descriptor (#3411 ) Summary Elaborate the `FileType` enum to be a complete descriptor of file-types. Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and `FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant and noisy declarations. In the process, fix some lingering file-type identification and `.metadata.filetype` errors that had been skipped in the tests. Additional Context Gathering the various attributes of a file-type into the `FileType` enum eliminates the duplication inherent in the separate `STR_TO_FILETYPE` etc. mappings and makes access to those values convenient for callers. These attributes include what MIME-type a file-type should record in metadata and what MIME-types and extensions map to that file-type. These values and others are made available as methods and properties directly on the `FileType` class and members. Because all attributes are defined in the `FileType` enum there is no risk of inconsistency across multiple locations and any changes happen in one and only one place. Further attributes and methods will be added in later commits to support other file-type related operations like mapping to a partitioner and verifying its dependencies are installed.	2024-07-18 02:05:33 +00:00
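
To illustrate the descriptor idea in the commit above, here is a small sketch of an enum whose members carry their own extensions and MIME type, so that separate lookup tables are not needed. It uses the standard `Enum.__new__` pattern; the two members, the attribute names, and `from_extension` are illustrative assumptions, not the project's actual `FileType` definition.

```python
import enum
from typing import Optional, Tuple


class FileType(enum.Enum):
    """Each member bundles the attributes that describe one file type."""

    def __new__(cls, value: str, extensions: Tuple[str, ...], mime_type: str):
        obj = object.__new__(cls)
        obj._value_ = value
        obj.extensions = extensions
        obj.mime_type = mime_type
        return obj

    CSV = ("csv", (".csv",), "text/csv")
    DOCX = (
        "docx",
        (".docx",),
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )

    @classmethod
    def from_extension(cls, extension: str) -> Optional["FileType"]:
        """Map a filename extension to a member, or None when unknown."""
        ext = extension.lower()
        for member in cls:
            if ext in member.extensions:
                return member
        return None
```

Because every attribute lives on the member, adding a file type means adding one line rather than touching several mappings.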

1540 Commits