### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are run:
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination is now
pulled out as its own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress support into a pipeline step, since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.
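As a rough illustration of why async suits the IO-bound steps, here is a minimal sketch that uses only `asyncio`. The `download` coroutine and its simulated latency are hypothetical stand-ins and not the actual step interface of the ingest pipeline.

```python
import asyncio


async def download(doc_id: str, delay: float = 0.05) -> str:
    """Hypothetical IO-bound step; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{doc_id}.downloaded"


async def run_downloads(doc_ids: list[str]) -> list[str]:
    # gather() runs all coroutines concurrently and preserves input order.
    return await asyncio.gather(*(download(d) for d in doc_ids))


results = asyncio.run(run_downloads(["a", "b", "c"]))
```

While one download is waiting, the event loop moves on to the others, so total wall time stays close to the slowest single download rather than the sum of all of them.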
### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.
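To make the first option concrete (one shared retry config wrapped around each step), here is a minimal sketch. `RetryConfig`, `with_retry`, and `flaky_upload` are hypothetical names for illustration and are not the existing ingest retry configuration.

```python
import functools
import time
from dataclasses import dataclass


@dataclass
class RetryConfig:
    """Hypothetical shared retry settings (an assumption, not the real config)."""

    max_attempts: int = 3
    backoff_seconds: float = 0.0


def with_retry(config: RetryConfig):
    """Wrap a step so ConnectionError is retried up to max_attempts times."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, config.max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == config.max_attempts:
                        raise  # out of attempts, surface the error
                    time.sleep(config.backoff_seconds * attempt)

        return wrapper

    return decorator


shared = RetryConfig(max_attempts=3)
calls = {"count": 0}


@with_retry(shared)
def flaky_upload() -> str:
    # Fails twice with a simulated network error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "uploaded"


outcome = flaky_upload()
```

With this shape, every step that talks to an API could reuse the same `RetryConfig` instance, which keeps the input params from growing per step.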
### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
airtable
azure
biomed
box
confluence
delta-table
discord
dropbox
elasticsearch
fsspec
gcs
github
gitlab
google-drive
hubspot
jira
local v2
mongodb
notion
onedrive
opensearch
outlook
reddit
s3 v2
salesforce
sftp
sharepoint
slack
wikipedia
```
You can run any of the local or s3 specific ingest tests and these
should now work.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
A `partition_via_api` test that only runs on `main` was
[failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959)
with the following output, likely due to the change in the default
behavior for `skip_infer_table_types`. This PR explicitly sets the
`skip_infer_table_types` param to avoid the failure.
```text
=========================== short test summary info ============================
FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®'
+ where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text
+ and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text
= 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) =
make: *** [Makefile:302: test] Error 1
```
### Testing
After temporarily removing the "skip if not on `main`" `pytest` mark,
the [unit tests
pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O)
on the feature branch.
This minor change updates the URL of the [Weaviate Docker
image](https://weaviate.io/developers/weaviate/installation/docker-compose).
Instead of the standard Docker registry, Weaviate now makes use of a
custom registry running at `cr.weaviate.io`.
Thanks in advance for merging.
🤖 beep boop, the Weaviate bot
PS:
Please note that the Weaviate Bot automates this PR; apologies if PR
formatting is missing. If you have questions, feel free to reach out via
our [forum](https://forum.weaviate.io) or
[Slack](https://weaviate.io/slack).
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Removes this warning:
> Warning: you have pip-installed dependencies in your environment file,
but you do not list pip itself as one of your conda dependencies. Conda
may not use the correct pip to install your packages, and they may end
up in the wrong place. Please add an explicit pip dependency. I'm adding
one for you, but still nagging you.
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
This PR adds `py.typed`, which will fix type-checking issues for code
that imports `unstructured` (the screenshot attached to the original PR
did not upload).
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
### Summary
Closes #2959. Updates the dependencies and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
referenced below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
This PR aims to pass `kwargs` through `fast` strategy pipeline, which
was missing as part of the previous PR -
https://github.com/Unstructured-IO/unstructured/pull/3030.
I also did some code refactoring in this PR, so I recommend reviewing
this PR commit by commit.
### Summary
- pass `kwargs` through `fast` strategy pipeline, which will allow users
to specify additional params like `sort_mode`
- refactor: code reorganization
- cut a release for `0.14.0`
### Testing
CI should pass
This PR introduces `GLOBAL_WORKING_DIR` and `GLOBAL_WORKING_PROCESS_DIR`,
which control where temporary files are stored during the partition flow,
via `tempfile.tempdir`.
#### Edit:
Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_
#### Edit 2:
Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_
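For context on the mechanism, here is a minimal sketch of how assigning `tempfile.tempdir` redirects the standard library's temporary files. The directory used here is a throwaway created for the demo; how the settings above are read and mapped onto `tempfile.tempdir` is not shown and is left as an assumption.

```python
import os
import tempfile

# Hypothetical working directory; in practice this would come from configuration.
working_dir = tempfile.mkdtemp(prefix="working-dir-")

previous = tempfile.tempdir
tempfile.tempdir = working_dir  # module-level default used when dir is not given
try:
    with tempfile.NamedTemporaryFile() as tmp:
        created_in = os.path.dirname(tmp.name)
finally:
    tempfile.tempdir = previous  # restore the earlier default

os.rmdir(working_dir)  # the temp file is already gone, so the directory is empty
```

Any code that calls `tempfile` functions without an explicit `dir` argument picks up the new location, which is why a single assignment is enough to move temporary files.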
### Summary
Closes #3021. Turns table extraction for PDFs and images off by
default. The default behavior originally changed in #2588. The reason
for the reversion is that some users did not realize turning off table
extraction was an option and experienced long processing times for PDFs
and images with the new default behavior.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Thanks to @erichare from AstraDB
Adds support for specifying the indexing options for various columns in
Astra DB, allowing users to avoid a situation where long text columns
are by-default indexed.
The changes to `test_unstructured_ingest/python/test-ingest-astra-output.py`
are forward-looking changes from AstraDB.
**Summary**
Because DOCX now supports the `strategy` argument to control aspects of
image extraction, `partition_doc()` and `partition_odt()` will need to
support it too because they delegate partitioning to `partition_docx()`.
This will allow image extraction to work the same way for those two
additional document-types.
Remedy disk-space leak where `partition_odt()` would leave an on-disk
copy of each `.odt` file passed as a file-like object.
`partition_odt()` creates a temporary file in which it writes each
source-document provided as a file-like object. This file is not deleted
and disk consumption grows without bound.
The `convert_and_partition_docx()` function used to convert ODT->DOCX
uses `pandoc` (a command-line program) to do the conversion. Because
this command-line program operates in a different memory space, the
source file cannot be passed as an in-memory object and needs to be on
the filesystem. When the ODT source-document is passed as a file-like
object, it is written to disk so the conversion program has access to
it. It is not deleted afterward.
Fix this by writing the temporary source ODT file in a
`TemporaryDirectory` and also use that location to write the
conversion-target DOCX file. That directory is automatically removed
when `partition_odt()` completes.
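The pattern described above can be sketched roughly as follows. `fake_convert` is a hypothetical stand-in for the external conversion program (it only copies bytes), and `convert_file_like` is an illustrative name, not the function added in this PR.

```python
import io
import os
import shutil
import tempfile
from typing import IO


def fake_convert(source_path: str, target_path: str) -> None:
    """Stand-in for the external converter; it only copies the file."""
    shutil.copyfile(source_path, target_path)


def convert_file_like(file: IO[bytes]) -> tuple[bytes, str]:
    """Write `file` into a temporary directory, convert it, return the result.

    The directory path is returned only so the demo can show it was removed.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        source_path = os.path.join(tmp_dir, "source.odt")
        target_path = os.path.join(tmp_dir, "source.docx")
        with open(source_path, "wb") as f:
            f.write(file.read())
        fake_convert(source_path, target_path)
        with open(target_path, "rb") as f:
            converted = f.read()
    # Leaving the `with` block removes the directory and both files in it.
    return converted, tmp_dir


converted, used_dir = convert_file_like(io.BytesIO(b"odt bytes"))
```

Because both the source copy and the conversion target live in the same `TemporaryDirectory`, a single context manager cleans up everything, even if the conversion raises.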
While we're in there, improve the factoring of `partition_odt()`.
- Extract `convert_and_partition_docx()` from `partition.docx` (used
only by `partition_odt()`) to `_convert_odt_to_docx()` in
`partition.odt` where it is used. Decouple file conversion from calling
`partition_docx()` with the converted file as the `partition_docx()`
call is `partition_odt()`'s natural responsibility.
- Improve docstrings, typing, and comments.
- All tests pass both before and after.
**Summary**
Avoid `SyntaxWarning` and/or `SyntaxError` messages when importing
`unstructured.nlp.patterns` by using raw strings (`"r"` prefix) for
regex patterns which may contain `\x` character sequences not recognized
by the Python parser for normal strings.
Fixes: #2495
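A small illustration of the difference, using a hypothetical pattern rather than one taken from `unstructured.nlp.patterns`:

```python
import re

# In a normal string, "\d" is an invalid escape sequence that newer Python
# versions warn about when the module is compiled; a raw string avoids that.
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")  # hypothetical pattern

# The raw string and the doubled-backslash spelling contain the same characters.
same = r"\d+\.\d+" == "\\d+\\.\\d+"

match = ZIP_CODE.search("Send it to 12345-6789 today")
```

The raw-string form keeps the regex readable and leaves backslash handling entirely to the `re` module.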
Allows introduction of form extraction in the future - sets up the
FormKeysValues element & format, puts in an empty function call in the
partition_pdf_or_image pipeline.
This PR aims to skip element sorting when determining whether embedded
text can be extracted. The extracted elements in this step are returned
as final elements only for the `fast` strategy pipeline and are never
used for other strategy pipelines (`hi_res`, `ocr`).
Removing element sorting in this step and adding it to the `fast`
strategy pipeline later will improve performance and reduce execution
time.
### Summary
- skip element sorting when determining whether embedded text can be
extracted.
- add `_partition_pdf_with_pdfparser()` function for `fast` strategy
pipeline
### Testing
CI should pass.
**Summary**
In preparation for adding more tests related to image extraction,
improve the `partition_odt()` test suite:
- Add type annotations to type-check clean on strict mode.
- Improve test names.
- Simplify tests where possible.
- Remove a couple of duplicated tests.
### Summary
Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.
### Testing
Run `make docker-run-dev` and then `python3.11` once you're in. At that
point, you can try:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```
Stop the container once you're done.
**Summary**
The behavior of an image sub-partitioner can be partially determined by
the partitioning strategy, for example whether it is "hi_res" or "fast".
Add this parameter to `partition_docx()` so it can pass it along to
`DocxPartitionerOptions` which will make it available to any image
sub-partitioners.
**Summary**
In preparation for adding more tests related to image extraction,
improve the `partition_doc()` test suite:
- Remove redundant DOCX -> DOC file conversions on most tests.
- Add type annotations to type-check clean on strict mode.
- Improve test names.
- Simplify tests where possible.
- Remove one duplicated test
Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.
**Summary**
Remedy disk-space leak where `partition_doc()` would leave a copy of
each `.doc` file passed as a file-like object on disk.
**Additional Context**
`partition_doc()` creates a temporary file in which it writes each
source-document provided as a file-like object. This file is not deleted
and disk consumption grows without bound.
The `convert_office_doc()` function used to convert DOC->DOCX uses a
command-line program provided with LibreOffice to do the
conversion. Because this command-line program operates in a different
memory space, the source file cannot be passed as an in-memory object
and needs to be on the filesystem. When the DOC file is passed as a
file-like object, it is written to disk so the conversion program has
access to it. It is not deleted afterward.
Fix this by writing the temporary source DOC file in the
TemporaryDirectory already being used to write the conversion-target
DOCX file. That directory is automatically removed when
`partition_doc()` completes.
**Reviewers:** Probably easier to review first and second commits
separately as the first one adds all the new code and tests (without
installing it), and the second one installs it into the partitioner
along with the required changes to code and tests.
**Summary**
Enable communication of partitioning options to sub-partitioners, in
particular to the pluggable `PicturePartitioner` coming in a closely
subsequent PR to implement image-extraction and OCR for DOCX, DOC, and
ODT formats.
**Additional Context**
In general, validation of partitioning options as well as assigning
default values and computing derived partitioning settings can be
extracted from partitioners into a neatly encapsulated separate object.
This simplifies the core partitioning code by removing the noise
associated with computing metadata values and deciding how to access the
source document, etc.
However, better factoring aside, having the partition-time "settings"
available in a single object allows partitioning of certain document
features, for example images, to be readily _delegated_ to a
sub-partitioner while still giving it access to all the relevant
partitioning settings for the current document. This is particularly
important when a sub-partitioner is "pluggable" at runtime and must rely
on a clearly-defined (and as simple as possible) interface to operate
smoothly.
**Summary**
Organize DOC tests into related groups with markers. This makes it
easier to assess coverage and find tests related to particular
behaviors.
This is in preparation for adding tests related to DOC image extraction.
No code changes, purely line-block moves.
- Move module-level fixtures to the bottom.
- Organize tests into related groups with markers.
**Summary**
Noisy but trivial changes to `partition_docx()` environs and tests in
preparation for DOCX image extraction. These changes are extracted here
so they don't distract from the changes of substance to follow in the next
PR.
No code changes, strictly this single block move.
Move `Describe_DocxPartitioner` unit-test class to bottom so
`DescribeDocxPartitionerOptions` unit-test to follow in subsequent
commit will be together with it. Integration tests first, then unit
tests, for consistency with other test modules e.g. test_pptx.
I added `Describe_DocxPartitioner` soon after I arrived, before we
adopted the convention of placing unit-tests after integration tests.
Move this so we can maintain that consistency with the block of tests to
follow in a closely subsequent PR.
**Summary**
The CSV delimiter-sniffer requires whole lines to properly detect the
delimiter character. Limiting bytes read produced partial lines when
lines were very long. Limit bytes but read whole lines.
Fixes #2643.
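One way to limit the bytes read while still handing the sniffer whole lines is `readlines()` with a size hint. This is a minimal sketch under that assumption; `sniff_delimiter`, the candidate delimiter set, and the sample data are illustrative and not necessarily what the fix uses.

```python
import csv
import io
from typing import IO


def sniff_delimiter(file: IO[str], num_bytes: int = 1024) -> str:
    """Detect the delimiter from roughly `num_bytes` worth of whole lines."""
    # readlines(hint) stops after the line that pushes the total past `hint`,
    # so the sample never ends in the middle of a line.
    sample = "".join(file.readlines(num_bytes))
    return csv.Sniffer().sniff(sample, delimiters=",;|\t").delimiter


# Lines longer than 200 characters, with a 300-character budget.
rows = ["name;comment;city"] + [f"user{i};{'x' * 200};town{i}" for i in range(20)]
stream = io.StringIO("\n".join(rows) + "\n")
delimiter = sniff_delimiter(stream, num_bytes=300)
```

A plain `file.read(300)` on the same data would cut the third line partway through, which is the partial-line situation described above.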
Pass the parameters `include_slide_notes` and `include_page_breaks` to
`partition_pptx` from `partition_ppt`.
Also update the .ppt example doc we use for testing so it has slide
notes and a PageBreak (and second page)
The `links` param in `partition_pdf` was never used by the partitioner,
but added when that metadata element was created. This removes the
unused parameter since `links` are extracted during partitioning.
Currently, CCT eval takes a long time for any of the test_metrics CI
runs. Documents in an eval set are evaluated sequentially, and it
appears that a max of 1 CPU core is currently utilized. This implies
there could be a large speedup by running eval across multiple docs
concurrently (probably with multiprocessing).
Things done in this PR:
- [x] concurrent.futures.ProcessPoolExecutor instead of sequential
for-loop
- [x] refactor/reorganization of redundant pieces of code without
changing the inner logic too much. Without that we'd have 3 places where
documents are being processed. Take a look at `BaseMetricsCalculator`
class and classes that inherit from it.
- [x] string paths manipulation is now reworked and relies on
`pathlib.Path()`
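The move from a sequential loop to a process pool can be sketched as below. `score_document` is a hypothetical toy metric and `evaluate_all` an illustrative helper; neither reflects the actual `BaseMetricsCalculator` interface.

```python
import concurrent.futures as cf
from typing import Callable, Iterable


def score_document(doc: tuple[str, str]) -> float:
    """Hypothetical per-document metric: share of matching characters."""
    output, ground_truth = doc
    if not ground_truth:
        return 0.0
    matches = sum(a == b for a, b in zip(output, ground_truth))
    return matches / len(ground_truth)


def evaluate_all(
    docs: Iterable[tuple[str, str]],
    executor_factory: Callable[[], cf.Executor] = cf.ProcessPoolExecutor,
) -> list[float]:
    """Score every document concurrently; results keep the input order."""
    with executor_factory() as pool:
        # map() distributes documents across workers instead of a for-loop.
        return list(pool.map(score_document, docs))
```

When the default process pool is used, the call should sit under an `if __name__ == "__main__":` guard, and the scoring function must be defined at module level so it can be pickled. Passing `cf.ThreadPoolExecutor` as `executor_factory` is a convenient way to exercise the same code path in tests.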
Skip accuracy calculation for files for which output and ground truth
sizes differ greatly.
~10% speed up on local machine, keeping the same metrics.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
This pull request adds metrics that are calculated based on
`table_as_cells` instead of `text_as_html`. This change is required for
comprehensive metrics calculation, as previously every predicted colspan
or rowspan was counted as an incorrect prediction (even when it was
correct).
This change has to be merged after
https://github.com/Unstructured-IO/unstructured/pull/2892, which
introduces the `table_as_cells` field.
This PR adds the ability to get the ratio of `cid` characters in
embedded text extracted by `pdfminer`. This PR is the second part of
moving `cid` related code from `unstructured-inference` to
`unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/342.
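As a rough sketch of what such a ratio could look like: `pdfminer` renders glyphs it cannot map to Unicode as `(cid:<number>)` placeholders. The `cid_ratio` function and its exact definition (each placeholder counted as one character) are assumptions for illustration, not the implementation in this PR.

```python
import re

# pdfminer emits "(cid:<number>)" for glyphs it cannot map to Unicode.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Return the share of characters that are unmapped `cid` placeholders.

    Each "(cid:N)" placeholder counts as a single character (an assumption).
    """
    cid_count = len(CID_PATTERN.findall(text))
    other_chars = len(CID_PATTERN.sub("", text))
    total = cid_count + other_chars
    return cid_count / total if total else 0.0
```

A high ratio suggests the embedded text layer is not usable and another extraction route, such as OCR, may be the better choice.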
**Summary**
File-types other than PDF need to use OCR on extracted images. Extract
`OCRAgent.get_agent()` such that any file-type partitioner can use it
without risking dependency on PDF-only extras.
**Summary**
Remedy the persistent type errors when importing `unstructured`. Give
the partitioner type annotations a general scrubbing while we're at it.
**Summary**
A crude and OS-specific mechanism was used to detect when a path
represented a temp-file. Change that to be robust across operating
systems and localized configurations. The specific problem was for DOC
files but this PR fixes it for PPT too which was prone to the same
problem.
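One portable way to make this kind of check is to compare against the directory reported by `tempfile.gettempdir()` instead of matching a hard-coded path prefix. The sketch below is an assumption about the general approach; `is_temp_file_path` is a hypothetical helper and may differ from what the PR implements.

```python
import tempfile
from pathlib import Path


def is_temp_file_path(path: str) -> bool:
    """Return True when `path` is located under the platform's temp directory."""
    temp_dir = Path(tempfile.gettempdir()).resolve()
    candidate = Path(path).resolve()
    # `parents` holds every ancestor directory of the candidate path.
    return temp_dir in candidate.parents
```

Resolving both paths first avoids false negatives on systems where the temp directory is reached through a symlink.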
**Summary**
The DOCX format allows a table row to start late and/or end early,
meaning cells at the beginning or end of a row can be omitted. While
there are legitimate uses for this capability, using it in practice is
relatively rare. However, it can happen unintentionally when adjusting
cell borders with the mouse. Accommodate this case and generate accurate
`.text` and `.metadata.text_as_html` for these tables.
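To picture the case, here is a minimal sketch that pads the omitted positions with empty cells so every row spans the full grid. The `Row` model and `table_to_html` are hypothetical and do not reflect how `partition_docx()` represents tables internally.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Row:
    """Hypothetical row model: cell texts plus counts of omitted grid columns."""

    cells: list[str]
    omitted_before: int = 0
    omitted_after: int = 0

    def padded(self) -> list[str]:
        # Fill omitted positions with empty strings so rows line up on the grid.
        return [""] * self.omitted_before + self.cells + [""] * self.omitted_after


def table_to_html(rows: list[Row]) -> str:
    """Render rows as a minimal HTML table, one <td> per grid column."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row.padded()) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


rows = [Row(["a", "b", "c"]), Row(["d"], omitted_before=1, omitted_after=1)]
table_html = table_to_html(rows)
```

Here the second row starts one column late and ends one column early, yet still renders with three cells, so its content stays aligned with the column above it.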