1540 Commits

Author SHA1 Message Date
Matt Robinson
09d84bc46b
build(deps): version bumps for 2024-08-26 (#3567)
### Summary

Version bumps for 2024-08-26.
2024-08-26 15:15:25 -04:00
Christine Straub
ac10ba4fc1
build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561)
### Summary
- Bump `unstructured.paddleocr` to 2.8.1.0
- Remove `opencv-python` and `opencv-contrib-python` constraint pins
- Fix `0.15.7` changelog
2024-08-23 14:17:29 -07:00
Steve Canny
32bb77aafb
fix(file): no default OLE subtype (#3516)
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.

**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it is not perfect and does not identify MSG files.

Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.

Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.

Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
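
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the library's `_OleFileDifferentiator`: the function names, the `subtype` argument (standing in for whatever a content-based check reported), and the extension map are all hypothetical. Only the 8-byte OLE signature is a known constant.

```python
from pathlib import Path
from typing import Optional

# The 8-byte signature that opens every OLE (CFBF) container file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, used only for this illustration.
EXT_TO_TYPE = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(head: bytes) -> bool:
    """True when the first 8 bytes match the OLE container signature."""
    return head[:8] == OLE_MAGIC


def identify_ole_file(
    head: bytes, subtype: Optional[str], filename: Optional[str]
) -> Optional[str]:
    """Prefer a detected subtype; otherwise fall back to the extension.

    `subtype` is None when content inspection could not tell DOC, PPT,
    XLS, and MSG apart. No subtype is assumed in that case.
    """
    if not is_ole_container(head):
        return None
    if subtype is not None:
        return subtype
    if filename is None:
        return None  # no name to fall back on, so the type stays unknown
    return EXT_TO_TYPE.get(Path(filename).suffix.lower())
```

With no filename, or with a wrong extension, the sketch returns `None` or the wrong type, which is the limitation the note above accepts.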
2024-08-22 19:16:53 +00:00
John
b4a6aa5559
chore: remove fsspec pin (#3554)
remove fsspec pin
2024-08-21 21:57:42 +00:00
Steve Canny
03e0ed3519
rfctr(docx): DOCX emits std minified .text_as_html (#3545)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
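
To make "minified" concrete, here is a small sketch of the kind of transformation described. It is an assumption-laden illustration using regular expressions; `minify_table_html` is a hypothetical helper, not the code `partition_docx()` uses.

```python
import re


def minify_table_html(html: str) -> str:
    """Drop thead/tbody/tfoot wrappers and whitespace between tags."""
    # Remove the wrapper tags themselves but keep the rows inside them.
    html = re.sub(r"</?(?:thead|tbody|tfoot)\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Remove whitespace that sits only between two tags.
    html = re.sub(r">\s+<", "><", html)
    return html.strip()


pretty = """
<table>
  <tbody>
    <tr><td>a</td> <td>b</td></tr>
  </tbody>
</table>
"""
minified = minify_table_html(pretty)
```

Text inside cells is left untouched; only markup-level whitespace and the optional grouping elements are removed.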
2024-08-21 18:54:21 +00:00
John
f135344738
chore: remove scipy and packaging pins (#3550)
Remove scipy and packaging constraint pins
2024-08-21 16:05:19 +00:00
John
604cadfb7e
chore: remove ipython pin (#3548)
This PR is stacked on
https://github.com/Unstructured-IO/unstructured/pull/3538 and
https://github.com/Unstructured-IO/unstructured/pull/3547

This PR removes dependency pins for IPython, anyio, and pyparsing. It
also updates the label-studio-sdk import statement so we don't have to
keep that package pinned, and makes some minor type-hinting edits. Label
Studio had a breaking change in their 1.13.0
[release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0).
2024-08-21 00:06:31 +00:00
Christine Straub
01dbc7b473
fix: nltk data download path to prevent redundant nested directories (#3546)
Closes #3543.

### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.

This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
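
The rule in the paragraph above can be sketched as a small path helper. This is only an illustration of the stated condition; `resolve_nltk_data_dir` is a hypothetical name and the actual download logic in the PR may differ.

```python
import os
import tempfile


def resolve_nltk_data_dir(path: str) -> str:
    """Return the directory to download into without nesting `nltk_data`."""
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) == "nltk_data" and os.path.isdir(normalized):
        return normalized  # reuse the existing directory as-is
    return os.path.join(normalized, "nltk_data")


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    existing = os.path.join(root, "nltk_data")
    os.mkdir(existing)
    reused = resolve_nltk_data_dir(existing)  # path already ends in nltk_data
    appended = resolve_nltk_data_dir(root)    # parent path gets nltk_data appended
```

Both calls resolve to the same single `nltk_data` directory, so later existence checks and model loads look in one place.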

### Testing
CI should pass.
0.15.7
2024-08-20 18:56:59 +00:00
Matt Robinson
1f8030dd0e
fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally, we verified locally that
file processing takes about as long as we would expect for a long
`.docx` file.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
0.15.6
2024-08-19 20:59:36 +00:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
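
A simplified sketch of the first two splitting rules (whole-row boundaries first, then cell boundaries for an oversized row) is shown below. It is not the `TableSplitter` implementation: rows are modeled as plain lists of cell strings, all names are hypothetical, and word-level splitting of an oversized cell is omitted.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def row_len(row: Row) -> int:
    """Text length of a row when its cells are joined by single spaces."""
    return len(" ".join(row))


def split_rows(rows: List[Row], max_chars: int) -> Iterator[List[Row]]:
    """Greedily pack whole rows; break an oversized row at cell boundaries."""
    chunk: List[Row] = []
    size = 0
    for row in rows:
        if row_len(row) > max_chars:
            if chunk:  # flush pending rows before handling the big row
                yield chunk
                chunk, size = [], 0
            part: Row = []
            for cell in row:
                if part and row_len(part + [cell]) > max_chars:
                    yield [part]
                    part = []
                part.append(cell)
            if part:
                yield [part]
            continue
        added = row_len(row) + (1 if chunk else 0)  # +1 for the row separator
        if chunk and size + added > max_chars:
            yield chunk
            chunk, size = [], 0
            added = row_len(row)
        chunk.append(row)
        size += added
    if chunk:
        yield chunk


def render(chunk: List[Row]) -> Tuple[str, str]:
    """Derive text and minified HTML from the same rows so they stay in sync."""
    text = "\n".join(" ".join(row) for row in chunk)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in chunk
    )
    return text, f"<table>{body}</table>"


chunks = list(split_rows([["a", "b"], ["c", "d"], ["eeee", "ffff", "gggg"]], max_chars=8))
```

Because `render()` builds both outputs from the same row fragment, every chunk's HTML is a complete `<table>` and its text always matches that HTML.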
2024-08-19 18:56:53 +00:00
Christine Straub
99f72d65ba
ci: fix ingest test fixtures update (#3532)
2024-08-16 16:37:33 -07:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts the `pytesseract` dependency to the
`unstructured.pytesseract` fork because some recent release versions of
`pytesseract` are unavailable on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
0.15.5
2024-08-16 10:34:22 -04:00
Matt Robinson
e64e09507a
build: update to latest base image (#3524)
### Summary

Updates to the latest `wolfi-base` base image to pull in more recent
package versions. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-08-15 22:27:41 -07:00
Christine Straub
d0211cc41f
build: downgrade nltk version (#3527)
This PR rolls `nltk` back to `3.8.1`, which was bumped to `3.8.2` in
https://github.com/Unstructured-IO/unstructured/pull/3512 because
`3.8.2` is no longer available on PyPI due to some issues
(https://github.com/nltk/nltk/issues/3301).
2024-08-15 16:35:21 -07:00
Christine Straub
9b778e270d
fix: pytesseract>=0.3.12 installation error while installing pdf extra (#3522)
Closes #3521.

This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.

### Testing
**Run the following command on the `main` branch and on this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`

`pytesseract-0.3.13` should be installed successfully.
0.15.4
2024-08-14 16:15:40 -05:00
Christine Straub
d6a84bdfbb
build(deps): update extra-paddleocr requirements (#3515)
This PR removes the custom index URL for `paddlepaddle` installation in
`extra-paddleocr.in`, resolving a `setup.py` configuration error. It now
uses `paddlepaddle==3.0.0b1` directly from PyPI, simplifying the
installation process.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
0.15.3
2024-08-14 12:19:20 -05:00
Matt Robinson
7437f0a084
fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
0.15.2
2024-08-13 09:39:29 -04:00
Christine Straub
1158d8f695
Refactor image block extraction in pdf partitioning (#3514)
Closes
[#3503](https://github.com/Unstructured-IO/unstructured/issues/3503).

### Summary
This PR prevents creation of the `figures` directory for saving image
blocks (`Image`, `Table`) when the `extract_image_block_to_payload`
parameter is set to `True`.

### Testing

```
from unstructured.partition.image import partition_image

elements = partition_image(
    filename="example-docs/img/embedded-images-tables.jpg",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
)
```
**Results:**
- `Main` Branch: `figures` directory is created.
- `PR`: `figures` directory is not created.
2024-08-13 06:11:10 +00:00
Steve Canny
cbe1b35621
rfctr(chunk): prep for adding TableSplitter (#3510)
**Summary**
Mechanical refactoring in preparation for adding (pre-chunk)
`TableSplitter` in a PR stacked on this one.
2024-08-12 18:04:49 +00:00
Christine Straub
d99b39923d
build(deps): Remove unstructured.paddlepaddle fork (#3506)
This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we
used `unstructured.paddlepaddle` fork to support
`unstructured.paddleocr` on arm64 architecture. But currently,
`unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work
on `arm64` architecture. Also, `unstructured.paddleocr` with the latest
version of the original `paddlepaddle` works on both `amd64` and `arm64`
architectures.

### Testing
```
import os

from unstructured.partition.pdf import partition_pdf

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

elements = partition_pdf(
    filename=<file_path>,
    strategy="hi_res",
    infer_table_structure=True,
)
```
2024-08-09 22:04:22 +00:00
John
a2ae2ed646
chore: remove matplotlib constraint (#3505)
2024-08-09 19:31:19 +00:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
John
2373eaa829
fix typo: pipline>pipeline (#3498)
fix typo: pipline>pipeline
2024-08-08 18:53:47 +00:00
John
43ae0befa7
chore: bump botocore pin (#3493)
bump botocore pin to match aiobotocore's pin:

eae97439b3
2024-08-07 21:41:53 +00:00
John
696155e614
chore: update importlib-metadata pin (#3491)
2024-08-07 18:17:53 +00:00
John
6545f16e57
chore: remove cryptography pin and update test (#3482)
remove cryptography pin, pin tenacity, and update
test_unstructured_ingest/unit/connector/test_salesforce_connector.py
2024-08-07 15:25:23 +00:00
Pawel Kmiecik
eba12daeb2
feat: correct object detection metrics (#3490)
This PR:
- fixes an issue that made it impossible to compute OD metrics
- adds per-class object detection metrics
2024-08-07 14:14:02 +00:00
John
24a1f298e5
chore: small edits (#3480)
Add comments and fix decorators on some tests.
2024-08-06 19:21:43 +00:00
Steve Canny
73bef27ef1
fix(pptx): accommodate invalid image/jpg MIME-type (#3475)
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.

Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.

Fixes: #3381
2024-08-06 18:48:15 +00:00
Steve Canny
a468b2de3b
rfctr(csv): accommodate single column CSV files (#3483)
**Summary**
Improve factoring, type-annotation, and tests for `partition_csv()` and
accommodate single-column CSV files.

Fixes: #2616
2024-08-06 00:48:37 +00:00
David Potter
59ec64235b
chore: rename astra to astradb (#3458)
DataStax wanted all references to be astradb instead of astra. As per
@erichare

We'll also have to do the same in unstructured-ingest :)
2024-08-05 20:41:02 +00:00
Austin Walker
7e887442c4
chore: Cut the 0.15.1 release (#3481)
0.15.1
2024-08-05 16:16:13 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Christine Straub
0f057188c6
Improve pdfminer embedded image extraction in pdf partitioning (#3456)
### Summary
This PR addresses an issue in `pdfminer` library's embedded image
extraction process. Previously, some extracted "images" were incorrect,
including embedded text elements, resulting in oversized bounding boxes.
This update refines the extraction process to focus on actual images
with more accurate, smaller bounding boxes.

### Testing
PDF:
[test_pdfminer_text_extraction.pdf](https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf)

```
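from unstructured.partition.pdf import partition_pdf

strategy = "hi_res"  # assumed value; the original snippet leaves `strategy` undefined
elements = partition_pdf(
    filename="test_pdfminer_text_extraction.pdf",
    strategy=strategy,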
    languages=["chi_sim"],
    analysis=True,
)
```
**Results**
- this `PR`

![page1_layout_pdfminer](https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0)

![page1_layout_final](https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958)
- `main` branch

![page1_layout_pdfminer](https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d)

![page1_layout_final](https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f)
2024-08-01 16:47:08 +00:00
Maciej Kurzawa
8fd216cc9f
feat/pdf-page-limit-in-hi-res (#3431)
# Description:
Passing the `max_pages` argument rejects PDF files that exceed this page
limit when the `hi_res` strategy is chosen. By default, PDF files with
any number of pages are parsed.

# Testing:
```python
from unstructured.partition.auto import partition

elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res')  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4)  # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2)  # should raise PdfMaxPagesExceededError
```
2024-07-30 16:52:17 +00:00
Roman Isecke
482f093afb
feat: Add deprecation warning on import of any ingest code (#3443)
### Description
Any time `unstructured.ingest` is imported, this deprecation warning gets
emitted:
```
DeprecationWarning: unstructured.ingest will be removed in a future version
```
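
A warning like this is typically emitted with the standard `warnings` module from the package's `__init__.py`. The sketch below shows the general pattern; `warn_ingest_deprecated` is a hypothetical helper and the exact placement in the PR may differ.

```python
import warnings


def warn_ingest_deprecated() -> None:
    """Emit the deprecation notice, attributed to the importing caller."""
    warnings.warn(
        "unstructured.ingest will be removed in a future version",
        DeprecationWarning,
        stacklevel=2,
    )


# Capture the warning to show what a caller would observe.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_ingest_deprecated()
```

Note that Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so users may need `-W default` or a filter to see it.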
2024-07-30 15:06:21 +00:00
Steve Canny
4e61acc1c6
fix(file): fix OLE-based file-type auto-detection (#3437)
**Summary**
A DOC, PPT, or XLS file sent to partition() as a file-like object is
misidentified as a MSG file and raises an exception in python-oxmsg
(which is used to process MSG files).

**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound
File Binary Format (CFBF). These can be reliably distinguished by
inspecting magic bytes in certain locations. `libmagic` is unreliable at
this or doesn't try, reporting the generic `"application/x-ole-storage"`
which corresponds to the "container" CFBF format (vaguely like a
Microsoft Zip format) that all these document types are stored in.

Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.

Fixes #3364
2024-07-25 17:25:41 +00:00
Steve Canny
432d209c36
fix(file): confirm or correct asserted DOCX, PPTX, and XLSX content types (#3434)
**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.

Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
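
One way such a confirmation could work is to look inside the Zip package for the conventional main part of each format. The sketch below is an assumption, not the check `detect_filetype()` performs: the function names are hypothetical, and the part paths are the conventional ones rather than something the package format strictly guarantees.

```python
import io
import zipfile
from typing import Optional

# Main-part paths conventionally found in each Office Open XML package.
MAIN_PARTS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def confirm_office_type(data: bytes, asserted: Optional[str]) -> Optional[str]:
    """Return the type implied by package contents, else the asserted type."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return asserted
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    for part, file_type in MAIN_PARTS.items():
        if part in names:
            return file_type
    return asserted


def build_package(part_name: str) -> bytes:
    """Build a tiny in-memory zip with one named part (demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(part_name, "<xml/>")
    return buffer.getvalue()


# A caller asserts DOCX, but the package contents say PPTX.
corrected = confirm_office_type(build_package("ppt/presentation.xml"), "DOCX")
```

Reading only the archive's name list keeps the check cheap, which fits the "simple, fast, and reliable" description above.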
2024-07-24 20:32:58 +00:00
Christine Straub
560cc0e975
fix: update HuggingFaceEmbeddingEncoder to use langchain_huggingface instead of langchain-community (#3436)
Similar to https://github.com/Unstructured-IO/unstructured/pull/3433.

### Summary
This PR aims to update `HuggingFaceEmbeddingEncoder` to use
`HuggingFaceEmbeddings` from `langchain_huggingface` package instead of
the deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.

### Testing
```
from unstructured.documents.elements import Text
from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder

embedding_encoder = HuggingFaceEmbeddingEncoder(
    config=HuggingFaceEmbeddingConfig()
)
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
```
**Expected behavior**
No deprecation warning should be displayed. The code should use the
updated `HuggingFaceEmbeddings` class from the `langchain_huggingface`
package.
2024-07-24 18:57:31 +00:00
Christine Straub
798dcc096c
fix: update OpenAIEmbeddingEncoder to use langchain-openai instead of langchain-community (#3433)
Closes https://github.com/Unstructured-IO/unstructured/issues/3378.

### Summary
This PR aims to update `OpenAIEmbeddingEncoder` to use
`OpenAIEmbeddings` from `langchain-openai` package instead of the
deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
2024-07-24 16:52:34 +00:00
Steve Canny
3fe5c094fa
rfctr(file): refactor detect_filetype() (#3429)
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.

**Additional Context**
Factor type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference,
type-recognition falls to the next strategy when the one before it is
not applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.

Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.

Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
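
The "strategies attempted in sequence" structure can be sketched as a simple chain. The commit message does not name the three strategies, so the three used here (asserted content type, content sniffing, filename extension), along with every identifier and lookup table, are assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Sequence

Context = Dict[str, Optional[str]]
Strategy = Callable[[Context], Optional[str]]

# Tiny illustrative lookup tables, not the library's real mappings.
KNOWN_MIME = {"application/pdf": "PDF", "text/csv": "CSV"}
KNOWN_EXT = {".pdf": "PDF", ".csv": "CSV"}


def by_asserted_content_type(ctx: Context) -> Optional[str]:
    return KNOWN_MIME.get(ctx.get("content_type") or "")


def by_content_sniffing(ctx: Context) -> Optional[str]:
    head = ctx.get("head") or ""
    return "PDF" if head.startswith("%PDF-") else None


def by_extension(ctx: Context) -> Optional[str]:
    name = ctx.get("filename") or ""
    dot = name.rfind(".")
    return KNOWN_EXT.get(name[dot:].lower()) if dot != -1 else None


STRATEGIES: Sequence[Strategy] = (
    by_asserted_content_type,
    by_content_sniffing,
    by_extension,
)


def detect(ctx: Context, strategies: Sequence[Strategy] = STRATEGIES) -> Optional[str]:
    """Try each strategy in order; the first non-None answer wins."""
    for strategy in strategies:
        result = strategy(ctx)
        if result is not None:
            return result
    return None
```

Each strategy returns `None` when it does not apply or cannot decide, so tests can target one strategy at a time.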
2024-07-23 23:18:48 +00:00
David Potter
441b3393b1
bugfix [OSS-67]: update import of pinecone exception (#3432)
The pinecone Python package moved the import location of
`PineconeApiException`.

Chroma `sleep` added because even though there is a `wait`, there is
still some sort of timing issue.
2024-07-23 19:48:55 +00:00
Matt Robinson
b2f0620f2c
build(deps): version bumps for 2024-07-22 (#3427)
### Summary

Weekly dependency bumps.
2024-07-22 20:49:40 +00:00
Steve Canny
49c4bd34be
rfctr(auto): add _PartitionerLoader (#3418)
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.

`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.

`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module lifetime scope for efficiency.
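
The load-on-demand pattern described here can be sketched with `importlib` and `functools.lru_cache`. This is not the `_PartitionerLoader` class itself: the function names are hypothetical and the parameters merely stand in for the `FileType` properties mentioned above.

```python
import importlib
import importlib.util
from functools import lru_cache
from typing import Callable, Tuple


def dependencies_importable(packages: Tuple[str, ...]) -> bool:
    """True when every top-level package name can be found without importing it."""
    return all(importlib.util.find_spec(name) is not None for name in packages)


@lru_cache(maxsize=None)
def load_callable(module_qname: str, function_name: str) -> Callable:
    """Import the module once and cache the resolved function."""
    module = importlib.import_module(module_qname)
    return getattr(module, function_name)


def get_partitioner(
    module_qname: str,
    function_name: str,
    packages: Tuple[str, ...],
    extra_name: str,
) -> Callable:
    """Verify dependencies first, then load (or reuse) the partitioner."""
    if not dependencies_importable(packages):
        raise ImportError(
            f"{function_name}() is not available. "
            f'Install its dependencies with: pip install "unstructured[{extra_name}]"'
        )
    return load_callable(module_qname, function_name)


# Stand-in example using a stdlib module in place of a partitioner module.
dumps = get_partitioner("json", "dumps", ("json",), "all-docs")
```

Checking with `find_spec()` before importing means a missing optional dependency produces an actionable message instead of a bare `ModuleNotFoundError`.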
2024-07-22 06:03:55 +00:00
Christine Straub
ec59abfabc
enhancement: improve text clearing process in email partitioning (#3422)
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
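
The cleanup amounts to removing quoted-printable soft line breaks with either line ending. A minimal sketch, assuming a regex-based approach (the helper name is hypothetical and the PR's actual code may differ):

```python
import re


def remove_soft_line_breaks(text: str) -> str:
    """Remove `=` followed by LF or CRLF, joining the wrapped line."""
    return re.sub(r"=\r?\n", "", text)


unix_style = remove_soft_line_breaks("Make sure to =\nRSVP!")
windows_style = remove_soft_line_breaks("Make sure to =\r\nRSVP!")
```

Making the `\r` optional covers both cases with a single pattern.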

### Testing

```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

elements = partition_email(
    filename=filename,
)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```

**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
0.15.0
2024-07-19 18:18:02 +00:00
Roman Isecke
1df7908f03
feat: save file id for all fsspec connectors if present (#3405)
### Description

If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.
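
In outline, the behavior is a conditional copy from the stat result into the metadata. The sketch below assumes the stat response is a plain dict, as fsspec's `info()` results generally are; the helper name is hypothetical.

```python
from typing import Any, Dict


def with_file_id(metadata: Dict[str, Any], stat_info: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` with `file_id` set when the stat has an id."""
    file_id = stat_info.get("id")
    if file_id is None:
        return dict(metadata)
    return {**metadata, "file_id": file_id}


with_id = with_file_id({"filename": "a.pdf"}, {"id": "abc123", "size": 10})
without_id = with_file_id({"filename": "a.pdf"}, {"size": 10})
```

Backends whose stat results carry no `id` simply produce metadata without the `file_id` field.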

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-07-19 13:30:21 +00:00
Christine Straub
0eb461acc2
refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.

## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-18 22:21:32 +00:00
Roman Isecke
5d387030eb
bugfix: google drive connector metadata safegaurds (#3407)
### Description

At times, the Google Drive response doesn't have some of the metadata
we're grabbing to populate the `FileData` metadata. This is fine, but
without the added safeguards, this can cause a `KeyError`.
2024-07-18 16:09:19 +00:00
Steve Canny
e99e5a8abd
rfctr(file): make FileType enum a file-type descriptor (#3411)
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.

In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.

**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
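
The "enum as descriptor" idea can be sketched with a standard `enum.Enum` whose members carry their own attributes. The class below is deliberately named `FileKind` to make clear it is an illustration, not the library's `FileType`; its members, attributes, and methods are assumptions.

```python
import enum
from typing import Optional, Tuple


class FileKind(enum.Enum):
    """Each member bundles its label, extensions, and canonical MIME type."""

    CSV = ("csv", (".csv",), "text/csv")
    PDF = ("pdf", (".pdf",), "application/pdf")
    TXT = ("txt", (".txt", ".text"), "text/plain")

    def __init__(self, label: str, extensions: Tuple[str, ...], mime_type: str) -> None:
        self.label = label
        self.extensions = extensions
        self.mime_type = mime_type

    @classmethod
    def from_extension(cls, extension: str) -> Optional["FileKind"]:
        """Look up a member by file extension (case-insensitive)."""
        extension = extension.lower()
        for member in cls:
            if extension in member.extensions:
                return member
        return None

    @classmethod
    def from_mime_type(cls, mime_type: str) -> Optional["FileKind"]:
        """Look up a member by its canonical MIME type."""
        for member in cls:
            if member.mime_type == mime_type:
                return member
        return None
```

Since each member defines its extensions and MIME type in one place, lookups in either direction cannot drift apart the way separate mapping dicts can.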
2024-07-18 02:05:33 +00:00