1447 Commits

Author SHA1 Message Date
vangheem
a3ab69f3e5 another 2024-06-27 18:08:24 -04:00
vangheem
6e21dabcce Fix blocking async interfaces for v2 ingest framework 2024-06-27 18:05:38 -04:00
Roman Isecke
137b149be8
Bugfix/ingest pipeline check (#3303)
### Description
Using a `isinstance` on the destination registry mapping breaks when
inheritance is used for the associated uploader types. This adds a
connector type field to all uploaders so that the entry can be
deterministically fetched when running check for associated stager in
pipeline.
2024-06-27 16:35:37 +00:00
Steve Canny
087adb218f
feat(docx): differentiate no-file from not-ZIP (#3306)
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).

This ambiguity is unwelcome when diagnosing the error as the two
possible conditions generally indicate a different course of action to
resolve the error.

Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.

**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
2024-06-27 00:18:56 +00:00
Roman Isecke
54ec311c55
feat/migrate onedrive src (#3295)
### Description
Migrate the onedrive source connector to v2, adding in more rich content
pulled from the response of the SDK to add further metadata to the
FIleData produced by the indexer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-26 23:59:51 +00:00
Matt Robinson
6939bff49e
build(deps): bump langchain-community version (#3305)
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-06-26 22:42:32 +00:00
Roman Isecke
5179b739fa
feat/more conservative ingest logging (#3301)
### Description
Isolate all log statements that happen per record and make them debug
level to avoid bloating the console output.
2024-06-26 16:54:08 +00:00
Pawel Kmiecik
575957b2d2
feat: enhance analysis options with od model dump and better vis (#3234)
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, ocr and final) + OD model output dump as a json
file for better analysis.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
2024-06-26 13:14:55 +00:00
Steve Canny
f2fee0c32f
fix(auto): partition() passes strategy to DOC,ODT (#3278)
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_doc()` or `partition_odt()` and so was not
making its way to `partition_docx()`.
2024-06-26 00:29:47 +00:00
qued
0665e94b96
build: move numpy pin to packaging (#3296)
Moved numpy pin to `base.in` where it will be picked up by packaging.

Side note:
`constraints.txt` (formerly `constraints.in`) is a really useful
pattern: you put a constraint there, add that file as a `-c` requirement
in other files, and the constraint will be applied when pip-compiling
*only when needed* because the library is required by something else.
Neat! However, unfortunately, in my searches I've never found a similar
pattern for packaging, so any pins we want to propagate to user installs
need to be explicitly placed in the `.in` files.

So what is `constraints.txt` really doing for us? Well in the past I
think there have been instances where something is temporarily broken in
an upstream dependency but we expect it to be patched soon, but in the
meantime we want things to work in our CI builds and development
installs, so it's not worth pinning everywhere it's used. Having said
that, I'm coming to the conclusion that `constraints.txt` causes more
harm than good in the confusion it causes WRT packaging -- maybe we
should remove that pattern at some point.
2024-06-25 21:08:25 +00:00
Yao You
c32aeaac44
fix: wait to run soffice until there is no other soffice process running (#3287)
## Summary

This PR addresses an issue where the code could attempt to run `soffice`
in multiple processes and closes #3284
The fix is to add a wait mechanism when there is another `soffice`
process running in already.

## Diagnosis of issue

- `soffice` can only have one process running when using the command
`soffice` as is.
- on main branch the function `partition.common.convert_office_doc`
simply spawns a subprocess to run `soffice` command to convert a `doc`
or `ppt` file into `docx` or `pptx` format.
- if there are multiple partition calls to process `doc` or `ppt` files
and they all want to spawn `soffice` subprocesses only one will succeed
while other processes will simply fail and return 1 from the subprocess
- in downstream this will lead to errors like `PackageNotFoundError:
Package not found at '/tmp/tmpac6lcu4w/document.docx'`

## solution

While there are
[ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/)
to circumvent the limit of `soffice` by setting a tmp file as user
installation env, these kind of solutions rely on the internals of
`soffice` and adds maintenance cost to track its changes.

This PR solves this problem by adding a wait mechanism: 
- we first spawning a subprocess to run `soffice` 
- if the `stdout` is empty and we still have wait time budget left the
function first checks if there is another `soffice` running
  * If yes then the function waits for 0.01s before checking again; 
* if no the functions spawns a subprocess to run `soffice` and return to
beginning of this step
* we need to return the the beginning to check if `stdout` is empty
because we could have another collision right after `soffice` becomes
available.

## test

This PR adds two unit tests.
Additionally this can be tested by running partition of `.doc` files
locally with multiprocessing.
2024-06-25 18:49:27 +00:00
Roman Isecke
a7a53f6fcb
feat/migrate astra db (#3294)
### Description
Move astradb destination connector over to the new v2 ingest framework
2024-06-25 18:00:47 +00:00
Roman Isecke
3f581e6b7d
feat/migrate gdrive source connector (#3239)
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream and now the modified date is being set to not
reprocess if the ingest pipeline already has the file cached
* timing information is set on the file created when downloaded based on
the last modified data retrieved from google drive

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-25 12:55:28 +00:00
Roman Isecke
e0f4374386
Roman/bugfix conflicting event loop ingest (#3264)
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
2024-06-24 18:47:37 +00:00
Christine Straub
ab88e20575
chore: bump unstructured-inference 0.7.36 (#3275)
### Summary
- bump unstructured-inference to `0.7.35` which fixed `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
0.14.8
2024-06-24 13:07:22 +00:00
Nathan Van Gheem
ce591e2e5d
Fix missing sensitive fields for embedders (#3263)
A few embedder types are missing sensitive field annotations.
2024-06-24 12:47:50 +00:00
David Potter
88b08a734d
rfctr: chroma destination migrated to V2 (#3214)
Moving Chroma destination to the V2 version of connectors.
2024-06-23 00:19:29 +00:00
David Potter
8610bd3ab9
feat: Kafka source and destination connector (#3176)
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
2024-06-22 23:26:23 +00:00
Matt Robinson
2d965fd65e
build: switch arm64 image to wolfi-base (#3268)
### Summary

Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since
there are now upstream base images for `wolfi-base` for both
architectures. The legacy `rockylinux-9.4` is now stashed in a
subdirectory the `docker` subdirectory and is no longer built in CI, but
is available is users would like to build it themselves.

Additionally, this PR includes a fix to symlink `python3` to
`python3.11`, which had caused a CI failure
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755).

BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`,
or `.xls` because we do not yet have a `libreoffice` `apk` built for
`wolfi-base`. We intend to address that as a follow on. All other
filetypes work.

### Testing

Successfully docker builds, tests, and smoke tests for
[amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268)
and
[arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268)
on the feature branch (with publish disabled).
2024-06-22 05:10:29 +00:00
Yao You
edddf9f6ee
Feat/pass down strategy to partition ppt as well (#3274)
Following the same pattern of
https://github.com/Unstructured-IO/unstructured/pull/3273 and pass down
`strategy` parameter to `partition_ppt` as well.
2024-06-22 02:23:58 +00:00
Steve Canny
16df6944dd
fix(auto): partition() passes strategy to PPTX,DOCX (#3273)
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_pptx()` or `partition_docx()`.
2024-06-22 00:16:39 +00:00
Steve Canny
6fe1c9980e
rfctr(html): prepare for new html parser (#3257)
**Summary**
Extract as much mechanical refactoring from the HTML parser change-over
into the PR as possible. This leaves the next PR focused on installing
the new parser and the ingest-test impact.

**Reviewers:** Commits are well groomed and reviewing commit-by-commit
is probably easier.

**Additional Context**
This PR introduces the rewritten HTML parser. Its general design is
recursive, consistent with the recursive structure of HTML (tree of
elements). It also adds the unit tests for that parser but it does not
_install_ the parser. So the behavior of `partition_html()` is unchanged
by this PR. The next PR in this series will do that and handle the
ingest and other unit test changes required to reflect the dozen or so
bug-fixes the new parser provides.
2024-06-21 20:59:48 +00:00
Matt Robinson
e1b75539f7
build: fix amd64 image hash (#3272)
### Summary

Sets to latest hash in quay.
2024-06-21 19:57:31 +00:00
Christine Straub
14f149d43c
fix: update base image SHA for amd64 wolfi (#3270)
This PR aims to fix a `test_dockerfile` job
[failure](https://github.com/Unstructured-IO/unstructured/actions/runs/9613636416/job/26517074221?pr=3234)
in CI after `base-images` repo update.
2024-06-21 18:30:33 +00:00
Matt Robinson
80abbcd5a8
build: version bump for release 0.14.7 (#3259)
### Summary

Release PR for `0.14.7`.
0.14.7
2024-06-20 14:10:33 -04:00
Austin Walker
0b73978b92
fix: fix IndexError when partioning a pdf with starting_page_number (#3246)
The Issue:

When extracting images from pdfs, we use the metadata page number to
index into a list of the images. However, the metadata page number can
now be changed via `starting_page_number`. To get the true page index,
we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an
IndexError. On this branch, it will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-06-19 18:20:54 +00:00
Pawel Kmiecik
c3af03d5ac
feat: expose converters deckerd -> html and back (#3233)
This PR exposes functions in evaluation module for easy conversion
between tables in Deckerd and HTML formats, which are useful in
evalution experiments.
2024-06-19 07:03:38 +00:00
Christine Straub
f23d180d34
fix: docker image publishing error (#3238)
This PR aims to fix a docker image publishing error caused by user
changes when pulling the `amd64` image from the `unstructured`
`wolfi-base` image.
(https://github.com/Unstructured-IO/unstructured/pull/3213).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-06-18 21:01:42 +00:00
Roman Isecke
fd98cf9ea5
Roman/migrate es dest (#3224)
### Description
Migrate elasticsearch destination connector to new v2 ingest framework
2024-06-18 14:20:49 +00:00
Christine Straub
b47e6e9fdc
refactor: remove download packages step (#3225)
This PR aims to remove the download packages step since all of that gets
installed in the base images. This PR also updates the base `wolfi`
image because the original base image can not be found anymore:
https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945
2024-06-18 12:15:44 +00:00
Steve Canny
77a9e1b54d
rfctr(html): drop convert_and_partition_html() (#3215)
**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.

**Additional Context**

Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` represents an additional brokering
layer with the entailed complexities of an additional site for default
parameter values to be (mis-)applied and/or dropped and is an additional
location for new parameters to be added.
2024-06-17 19:43:18 +00:00
Roman Isecke
d876a386ed
Roman/fix ingest async connectors (#3210)
### Description
Choosing to use async needs to be very careful because if a connector is
set to use async, the pipeline will not fan out the inputs via
multiprocessing but instead it will be limited to run in a single
process under the assumption it has more benefit from async due to heavy
network traffic. This means the exact same code that is not optimized
for async and is blocking will force the pipeline to perform worse than
simply never marking the connector to use async since the pipeline will
fan that out using multiprocessing.

All connectors and processes in the pipeline we revisited to make sure
this criteria was met and updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* all underlying embedders don't use async to embedder step must be
multiprocessing for now. TODO left to update underlying embedder classes
to optionally support async.
* Chunking parameters were not accurately being passed through from cli
to chunker params, this was fixed

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-17 16:55:19 +00:00
Frederic Marvin Abraham
6220633d3f
enhancement: make tempfiles windows friendly (#3108)
### Summary

Updates handling of tempfiles so that they work on Windows systems.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2024-06-17 13:28:48 -04:00
Matt Robinson
2815226b54
build(deps): version bumps for 2024-06-17 (#3220)
### Summary

Version bumps for the week of 2024-06-17. There is a now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
2024-06-17 14:04:29 +00:00
Steve Canny
9fae0111d9
rfctr(html): drop HTML-specific elements (#3207)
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desireable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned from partitioning HTML but also the seven other file-formats
that broker partitioning to HTML (convert-to-HTML and partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
2024-06-15 00:14:22 +00:00
Matt Robinson
08383a27de
build: pull from wolfi base image (#3213)
### Summary

Updates the `wolfi` image to pull from the upstream `wolfi-base` base
image to avoid maintaining the base layers in both locations. Closes
#3105 by pulling in the fix from upstream.

### Testing

`test_dockerfile` should continue to pass with the changes.
2024-06-14 20:41:27 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
0.14.6
2024-06-14 18:11:38 +00:00
Roman Isecke
a6c09ec621
Roman/dry ingest pipeline step (#3203)
### Description
The main goal of this was to reduce the duplicate code that was being
written for each ingest pipeline step to support async and not async
functionality.

Additional bug fixes found and fixed:
* each logger for ingest wasn't being instantiated correctly. This was
fixed to instantiate in the beginning of a pipeline run as soon as the
verbosity level can be determined.
* The `requires_dependencies` wrapper wasn't wrapping async functions
correctly. This was fixed so that `asyncio.iscoroutinefunction()` gets
trigger correctly.
2024-06-14 13:46:44 +00:00
Pawel Kmiecik
29e64eb281
feat: table evaluations for fixed html table generation (#3196)
Update to the evaluation script to handle correct HTML syntax for
tables.
See https://github.com/Unstructured-IO/unstructured-inference/pull/355
for details.

This change:
- modifies transforming HTML tables to evaluation internal `cells`
format
- fixes the indexing of the output (internal format cells) when HTML
cells use spans
2024-06-14 09:03:27 +00:00
Roman Isecke
dadc9c6d0b
feat/tqdm ingest support (#3199)
### Description
Add in tqdm support to show progress bar of status of each job when
being run. Supported for each mode (serial, async, multiprocess). Also
small timing wrapper around jobs to print out how long it took in total.
2024-06-13 18:41:54 +00:00
Steve Canny
f5ebb209a4
rfctr(html): drop page concept (#3184)
**Summary**
Pagination of HTML documents is currently unused. The `Page` class and
concept were deeply embedding in the legacy organization of HTML
partitioning code due to the legacy `Document` (= pages of elements)
domain model. Remove this concept from the code such that elements are
available directly from the partitioner.

**Additional Context**
- Pagination can be re-added later if we decide we want it again. A
re-implementation would be much simpler and much lower impact to the
structure of the code and introduce much less additional complexity,
similar to the approach we take in `partition_docx()`.
2024-06-13 18:19:42 +00:00
ryannikolaidis
da3492b529
fix: dropbox source connector file path bugs (#3189)
The Dropbox source connector currently raises exceptions when indexing
files due to two issues: a path formatting idiosyncrasy of the Dropbox
library and a divergence in the definition of the Dropbox libraries
fs.info method, expecting a 'url' parameter rather than 'path'.

## Changes

* add a `/` prefix to file path used by DropboxIndexer
* override the fsspec sterilize_info method in DropboxIndexer to call
`self.fs.info` with `url` rather than `path`; to accommodate for the
fact that `dropboxdrivefs` diverges with this signature
* remove `dropbox.sh` from ignored source tests
* update test fixtures (now that the dropbox connector has been fixed
and not skipped)

## Testing
`dropbox.sh` source ingest test now succeeds (and is no longer ignored)

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-06-13 18:06:41 +00:00
Roman Isecke
f7b0a37c86
Feat/migrate elasticsearch src connector (#3174)
### Description
Migrate elasticsearch connector with support for what used to be batch
ingest docs but not it support for the download step to generate
additional file data.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-13 17:57:59 +00:00
Matt Robinson
ad69bdcd4e
build(deps): deltalake bump to 0.18.x (#3197)
### Summary

Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table
connector and bumps the `deltalake` version. Per [this
PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake`
repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`.
Users can specify `schema_mode="merge"` to obtain the same behavior.

- `schema_mode="merge"` is equivalent to `overwrite_schema=False`
- `schema_mode="overwrite"` is equivalent to `overwrite_schema=True`

Also adds an `engine` parameter that you can use to set `"rust"` or
`"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and
`schema_mode` defaults to `None`, which is consistent with the behavior
in `deltalake` documented
[here](https://delta-io.github.io/delta-rs/api/delta_writer/).

### Testing

The Delta Table ingest tests should pass on this PR.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-06-13 15:59:34 +00:00
Steve Canny
5f582f1716
ci: update to Node 20 actions (#3200)
**Summary**
Silence the long list of warnings we get in CI from using Node 16
actions by updating to Node 20 versions.
2024-06-13 03:43:26 +00:00
ryannikolaidis
17bc55e7be
fix: relative path / permissions issues with v2 fsspec connectors (#3186)
When the v2 fsspec connectors currently generate the relative path, they
may introduce a path with a leading slash (this happens in the case of
the Box connector, which is a subclass of fsspec). When this happens
this results in the paths unintentionally being treated as absolute
paths. As a result, the ingest pipeline attempts to write files to
directories at root level, which in turn raises permission issues.

Note: Box expected results needed to update now that it's no longer
failing.

Aside: found that our tests were unintentionally skipping `box.sh` tests
because we were intending to skip `dropbox.sh` and we use regex to match
if a given test is in skip tests. This adds changes to force an exact
match.

## Changes

* Strip leading slashes during the creating of relative paths in fsspec
connectors
* Add expected results for Box connector
* (bonus): `make tidy` altered an unrelated file by removing an
unnecessary call of `pass`
* (bonus): check exact match for skipped ingest tests which fixes Box
tests getting skipped

## Testing


[Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085)
for the Box connector was failing. It was accidentally getting skipped
(see changes above). It is now no longer skipped and passing.
2024-06-12 03:39:35 +00:00
Filip Knefel
c2065db716
fix API-297: List parameters incorrectly passed to API requests (#3154)
In two places parameters passed to the python client when using either
Ingest workflow and `partition_via_api` function directly we parse the
parameters with list values to strings e.g.
```python
extract_image_block_types=["image"] -> extract_image_block_types='["image"]'
```
as of now these parameters are parsed incorrectly when given as strings
and correctly when given as lists.

This PR removes parsing from `PartitionConfig` and `partition_via_api`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2024-06-11 21:00:41 +00:00
Steve Canny
2f0400f279
rfctr(html): break coupling to DocumentLayout (#3180)
**Summary**
Remove use of `partition.common.document_to_element_list()` by
`HTMLDocument`. The transitive coupling with layout-inference through
this shared function have been the source of frustration and a drain on
engineering time and there's no compelling reason for the two to share
this code.

**Additional Context**
`partition_html()` uses `partition.common.document_to_element_list()` to
get finalized elements from `HTMLDocument` (pages). This gives rise to a
very nasty coupling between `DocumentLayout`, used by
`unstructured_inference`, and `HTMLDocument`.
`document_to_element_list()` has evolved to work for both callers, but
they share very few common characteristics with each other.

This coupling is bad news for us and also, importantly, for the
inference and page layout folks working on PDF and images.

Break that coupling so those inference-related functions can evolve
whatever way they need to without being dragged down by legacy
`HTMLDocument` connections.

The initial step is to extract a `document_to_element_list()` function
of our own, getting rid of the coordinates and other
`DocumentLayout`-related bits we don't need. As you'll see in the next
few PRs, all of this `document_to_element_list()` code will end up
either going away or being relocated closer to where it's used in
`HTMLDocument`.
2024-06-11 20:54:11 +00:00
Steve Canny
e39ee16161
rfctr(html): promote HTMLDoc candidate methods (#3177)
**Summary**
Make `._find_articles()` and `._find_main` into `._articles` and
`._main` properties on HTMLDocument, respectively.

**Additional Context**
After prior refactorings, these two functions now each require only
`self` and can become `@lazyproperty`s on `HTMLDocument`. This ensures
they are computed at most once. In addition, their close relationship to
`HTMLDocument` is indicated by their membership as methods rather than
"loose" functions.
2024-06-10 22:07:21 +00:00
Matt Robinson
c822e3fd10
build(deps): weekly dependency bumps (6/10/2024) (#3170)
### Summary

Weekly dependency bumps for the week of 6/10/2024.

The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0`
seemed to break the connector test, per [this
test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005).
Opened #3173 to address.
2024-06-10 16:20:22 +00:00