1447 Commits

Author SHA1 Message Date
Roman Isecke
8ba9fadf8a
feat: improve dataclass use for encoders (#2318)
### Description
Leverage a similar pattern to what is used for connectors, where there
is a nested config dataclass as a field, along with cached content for
things like the client and sample embedding for each. This required an
update on the embeddings config in ingest and I left a TODO in there
because the current approach breaks on other encoders such as bedrock
because the parameters in that config don't map to all encoders. But
this keeps the existing functionality working.

This update makes sure all variables associated with the dataclass exist
when it's instantiated rather than being added in the `__post_init__()`
method or the `initialize()`, allowing other libraries like pydantic to
appropriately generate schemas from it. It also now follows the pattern
of the connectors in that each class has a nested config class used to
instantiate the client itself as well as a field/property approach used
to cache the client.
2023-12-26 22:33:19 +00:00
Roman Isecke
bfef183f77
feat: update encoders to be dataclasses (#2313)
### Description
Convert all encoders to be based off dataclasses. Purpose: this will
allow encoders to be used in a generic way amongst other dataclasses.
Otherwise, it'll break validation in those parent dataclasses.
2023-12-26 14:48:00 +00:00
Steve Canny
eb1b022ff8
feat(chunking): add overlap on chunk-splits (#2305)
There are two distinct overlap operations with completely different
implementations. This is "intra-chunk" overlap, applying overlap to
chunks resulting from text-splitting an oversized element.

So if an oversized element had text "abcd efgh ijkl mnop qrst" and was
split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl"
and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and
applied at the beginning of the string (before "abcd") is handled in a
separate operation in the next PR.
2023-12-22 20:35:18 +00:00
John
5c0043aa7d
chore: add hi_res_model_name kwarg (#2289)
Closes #2160 

Explicitly adds `hi_res_model_name` as kwarg to relevant functions and
notes that `model_name` is to be deprecated.

Testing:
```
from unstructured.partition.auto import partition
filename = "example-docs/DA-1p.pdf"
elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Steve Canny <stcanny@gmail.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-12-22 15:06:54 +00:00
Steve Canny
093a11d058
rfctr(chunking): split oversized chunks on word boundary (#2297)
The text of an oversized chunk is split on an arbitrary character
boundary (mid-word). The `chunk_by_character()` strategy introduces the
idea of allowing the user to specify a separator to use for
chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " ";
blank-line, newline, or word boundaries respectively.

Even if the user is allowed to specify a separator, we must provide
fall-back for when a chunk contains no such character. This can be done
incrementally, like blank-line is preferable to newline, newline is
preferable to word, and word is preferable to arbitrary character.

Further, there is nothing particular to `chunk_by_character()` in
providing such a fall-back text-splitting strategy. It would be
preferable for all strategies to split oversized chunks on even-word
boundaries for example.

Note that while a "blank-line" ("\n\n") may be common in plain text, it
is unlikely to appear in the text of an element because it would have
been interpreted as an element boundary during partitioning.

Add _TextSplitter with basic separator preferences and fall-back and
apply it to chunk-splitting for all strategies. The `by_character`
chunking strategy may enhance this behavior by adding the option for a
user to specify a particular separator suited to their use case.
2023-12-21 05:45:36 +00:00
cragwolfe
4533bdac98
build: release commit for 0.11.6 (#2304) 0.11.6 2023-12-20 13:36:38 -08:00
Ronny H
ac380ce989
Added AWS Marketplace docs and improved Azure Marketplace docs (#2248)
To test:
> cd docs && make HTML

Change logs:
- Added AWS Marketplace documentation
- Improved Azure Marketplace documentation - Networking section
2023-12-20 20:13:47 +00:00
John
04f4c3ab16
create teardown fixture for tests (#2269)
Closes #2263 
Files were being created as a side effect from running tests in
`test_unstructured/metrics/test_evaluate.py`. The added decorator
removes the created directory and its files after the tests run.

Testing
on the main branch, run `make test` or `pytest
test_unstructured/metrics/test_evaluate.py` and files will be created.
On this branch no files are created
2023-12-20 17:50:12 +00:00
Andy Li
4ae49419c9
feat: support base64-encoded text in partition_email (#2277)
closes #816 
## Description
Added functionality for `partition_email` to automatically decode base64
text before passing it to `partition_text` or `partition_html`.
Also adds base64 encoded email text test cases.
2023-12-19 23:37:17 -08:00
Steve Canny
82714cad98
rfctr(chunking): extract BasePreChunker (#2294)
The `_split_elements_by_title_and_table()` function fulfills the
pre-chunker role for `chunk_by_title()`, but most of its operation is
not strategy-specific and can be reused by other chunking strategies.

Extract `BasePreChunker` and use it as the base class for
`_ByTitlePreChunker` which now only needs to provide the boundary
predicates specific to that strategy.
2023-12-20 06:30:21 +00:00
Ahmet Melek
fd293b3e78
feat: add elasticsearch destination connector (#2152)
Closes https://github.com/Unstructured-IO/unstructured/issues/1842
Closes https://github.com/Unstructured-IO/unstructured/issues/2202
Closes https://github.com/Unstructured-IO/unstructured/issues/2203

This PR:
- Adds Elasticsearch destination connector to be able to ingest
documents from any supported source, embed them and write the embeddings
/ documents into Elasticsearch.
- Defines an example unstructured elements schema for users to be able
to setup their unstructured elasticsearch indexes easily.
- Includes parallelized upload and lazy processing for elasticsearch
destination connector.
- Rearranges elasticsearch test helpers to source, destination, and
common folders.
- Adds util functions to be able to batch iterables in a lazy way for
uploads
- Fixes a bug where removing the optional parameter `--fields` broke the
connector due to an integer processing error.
- Fixes a bug where using an [elasticsearch
config](8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35))
for a destination connector resulted in a serialization issue when
optional parameter `--fields` was not provided.
2023-12-20 01:26:58 +00:00
Steve Canny
4e2ba2c9b2
rfctr(chunking): extract boundary predicates (#2284)
`chunk_by_title()` respects certain semantic boundaries while chunking.
Those are sections introduced by a `Title` element, sections introduced
by a `metadata.section` value change, and optionally page-breaks.
"Respecting" in this context means that elements on opposite sides of a
semantic boundary never appear in the same chunk.

The `metadata_differs()` function used for this purpose is clumsy to use
requiring the caller to maintain state (prior element). It also combines
what are independent predicates such that they cannot be individually
reused.

Introduce the `BoundaryPredicate` type which takes an element and
returns bool, indicating whether the element introduces a new semantic
boundary. These can be reused by any chunking strategy that needs them
and allows the pre-chunking operation to be generalized for use by any
chunking strategy, which it will be in the following PR.
2023-12-19 18:20:05 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Christine Straub
096d23bc28
Refactor: support layout analysis (#2273)
### Summary
This PR is the second part of the "layout analysis" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/305. This
PR adds logic to support annotating `inferred` and `extracted` elements.

### Testing

```
PYTHONPATH=. python examples/layout-analysis/visualization.py <file_path> <strategy> <document_type>
```
e.g.
```
PYTHONPATH=. python examples/layout-analysis/visualization.py example-docs/layout-parser-paper-fast.pdf hi_res pdf
```
2023-12-19 06:21:56 +00:00
John
09f86f28fb
Update parition_pdf docstring (#2292)
add `extract_element_types` to partition_pdf docstring and reword other
parameter descriptions
2023-12-19 04:52:15 +00:00
Steve Canny
0c7f64ecaa
rfctr(chunking): generalize PreChunkBuilder (#2283)
To implement inter-pre-chunk overlap, we need a context that sees every
pre-chunk both before and after it is accumulated (from elements).

- We need access to the pre-chunk when it is completed so we can extract
the "tail" overlap to be applied to the next chunk.
- We need access to the as-yet-unpopulated pre-chunk so we can add the
prior tail to it as a prefix.

This "visibility" is split between `PreChunkBuilder` and the pre-chunker
itself, which handles `TablePreChunk`s without the builder.

Move `Table` element and TablePreChunk` formation into `PreChunkBuilder`
such that _all_ element types (adding `Table` elements in particular)
pass through it. Then `PreChunkBuilder` becomes the context we require.

The actual overlap harvesting and application will come in a subsequent
commit.
2023-12-18 22:21:34 +00:00
cragwolfe
9efc22c0fc
build: release commit for 0.11.5 (#2285)
Also fix broken link in docs.
0.11.5
2023-12-16 18:25:55 -08:00
Steve Canny
36e81c3367
rfctr(chunking): extract general-purpose objects to base (#2281)
Many of the classes defined in `unstructured.chunking.title` are
applicable to any chunking strategy and will shortly be used for the
"by-character" chunking strategy as well.

Move these and their tests to `unstructured.chunking.base`.

Along the way, rename `TextPreChunkBuilder` to `PreChunkBuilder` because
it will be generalized in a subsequent PR to also take `Table` elements
such that inter-pre-chunk overlap can be implemented.

Otherwise, no logic changes, just moves.
2023-12-16 17:28:15 +00:00
Christine Straub
a7c3f5f570
Refactor: importation consistency for partition_pdf() and partition_image() (#2282)
Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned
in issue #2280.
2023-12-15 22:29:58 +00:00
Steve Canny
70cf141036
rfctr: extract ChunkingOptions (#2266)
Chunking options for things like chunk-size are largely independent of
chunking strategy. Further, validating the args and applying defaults
based on call arguments is sophisticated to make its use easy for the
caller. These details distract from what the chunker is actually doing
and would need to be repeated for every chunking strategy if left where
they are.

Extract these settings and the rules governing chunking behavior based
on options into its own immutable object that can be passed to any
component that is subject to optional behavior (pretty much all of
them).
2023-12-15 19:51:02 +00:00
cragwolfe
8ba1bedfca
build: release commit for 0.11.4 (#2275)
also cleans up the CHANGELOG to reflect the previous release of 0.11.2
0.11.4
2023-12-15 00:06:22 +00:00
Steve Canny
aa7794a566
rfctr(chunking): move add_chunking_strategy() decorator up (#2265)
The chunking subpackage `unstructured.chunking` currently contains only
the `title` module and the `@add_chunking_strategy()` decorator is
located in that module even though it has no special relationship to the
`by_title` chunking strategy.

Move it to the `__init__.py` module such that it is exported from
`unstructured.chunking`. Adjust all references, pretty much one per
partitioner, to import it from there.

This prepares the way for further separation of the chunking package
into modules, including a new `character` module for the `by_character`
chunking strategy.
2023-12-14 19:16:16 +00:00
John
7895d4e0a7
pdf rfctr (#2260)
Refactor `_process_pdfminer_pages` by extracting logic into helper
functions.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-12-14 08:16:38 +00:00
Yao You
5f5ff6319f
fix: consider text in cid code as invalid in hi_res (#2259)
This PR addresses
[CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969)
- pdfminer sometimes fail to decode text in an pdf file and returns cid
codes as text
- now those text will be considered invalid and be replaced with ocr
results in `hi_res` mode

## test

This PR adds unit test for the utility functions. In addition the file
below would return elements with text in cid code on main but proper
ascii text with this PR:


[005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf)

This change improves both cct accuracy and %missing scores:

**before:**
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.681   0.267     0.266         105
cct-%missing 0.086   0.159     0.159         105
```

**after:**
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.697   0.251     0.250         105
cct-%missing 0.071   0.123     0.122         105
```

[CORE-2969]:
https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-12-14 06:49:23 +00:00
Austin Walker
d594c06a3e
fix: handle delimiter bug in partition_csv (#2224)
Closes #2218. When a csv has commas in its content, and the delimiter is
something else, Pandas may throw an error. We can sniff the csv and get
the correct delimiter to pass to Pandas. To verify, try partitioning the
file in the linked bug.
2023-12-13 23:57:46 +00:00
Steve Canny
cbeaed21ef
rfctr: rename pre chunk (#2261)
The original naming for the pre-cursor to a chunk in `chunk_by_title()`
was conflated with the idea of how these element subsequences were
bounded (by document-section) for that strategy. I mistakenly picked
that up as a universal concept but in fact no notion of section arises
in the `by_character` or other chunking strategies.

Fix this misconception by using the name `pre-chunk` for this concept
throughout.
2023-12-13 23:13:57 +00:00
Steve Canny
74d089d942
rfctr: skip CheckBox elements during chunking (#2253)
`CheckBox` elements get special treatment during chunking. `CheckBox`
does not derive from `Text` and can contribute no text to a chunk. It is
considered "non-combinable" and so is emitted as-is as a chunk of its
own. A consequence of this is it breaks an otherwise contiguous chunk
into two wherever it occurs.

This is problematic, but becomes much more so when overlap is
introduced. Each chunk accepts a "tail" text fragment from its preceding
element and contributes its own tail fragment to the next chunk. These
tails represent the "overlap" between chunks. However, a non-text chunk
can neither accept nor provide a tail-fragment and so interrupts the
overlap. None of the possible solutions are terrific.

Give `Element` a `.text` attribute such that _all_ elements have a
`.text` attribute, even though its value is the empty-string for
element-types such as CheckBox and PageBreak which inherently have no
text. As a consequence, several `cast()` wrappers are no longer required
to satisfy strict type-checking.

This also allows a `CheckBox` element to be combined with `Text`
subtypes during chunking, essentially the same way `PageBreak` is,
contributing no text to the chunk.

Also, remove the `_NonTextSection` object which previously wrapped a
`CheckBox` element during pre-chunking as it is no longer required.
2023-12-13 20:22:25 +00:00
Yao You
36e4639e05
fix: image may be scaled too large for tesseract (#2252)
This PR addresses
[CORE-2965](https://unstructured-ai.atlassian.net/browse/CORE-2965) by
limiting zoom factor so that the scaled image can still be processed by
tesseract.

- tesseract has a 2^31 byte limit on image data
- occasionally an image may be scaled too much and larger than that size
- fix limits the scaling factor so that we never scale an image larger
than what tesseract can handle

## test

A unit test is added in this PR to test a unlikely case where we'd scale
an image a few thousand times and massively exceed the limit without the
fix.

Unstructured reviewers can also use the document in the ticket to test.


[CORE-2965]:
https://unstructured-ai.atlassian.net/browse/CORE-2965?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
2023-12-13 19:35:05 +00:00
John
d3a404cfb5
pdfminer bug (#2244)
Closes #2212.

### Summary
This PR implements logic to fall back to the "inferred_layout + OCR" if
pdfminer fails in the `hi_res` pipeline (discussed in[ this slack
channel](https://unstructuredw-kbe4326.slack.com/archives/C057R3F8F7A/p1701807299018929).

### Testing
PDF:
[NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13554149/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf)

```
elements = partition_pdf(
    filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf",
    strategy="hi_res",
)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-12-13 00:51:38 +00:00
Steve Canny
21bc67f52f
rfctr: improve element typing (#2247)
In preparation for work on generalized chunking including
`chunk_by_character()` and overlap, get `elements` module and tests
passing strict type-checking.
2023-12-12 23:12:23 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
Yuming Long
529d1f6edb
Chore: put tesseract multiple languages splitter "+" in constant (#2226)
^^^
2023-12-11 22:20:37 +00:00
Roman Isecke
ac302689a0
chore: update sphinx ingest docs with new connectors (#2245)
Replacing https://github.com/Unstructured-IO/unstructured/pull/2243
2023-12-11 21:29:41 +00:00
Christine Straub
da7ac625b1
Feat: save tables in PDF's as images (#2229)
closes #2222.

### Summary
The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This
filename is presented in the `image_path` metadata field for the Table
element. The default would be to not do this.

### Testing
PDF: [124_PDFsam_Basel III - Finalising post-crisis
reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf)

```
elements = partition_pdf(
    filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_element_types=['Table'],
)
```
2023-12-11 19:14:41 +00:00
Roman Isecke
cc05e948ff
chore: sensitive info connector audit (#2227)
### Description
All other connectors that were not included in
https://github.com/Unstructured-IO/unstructured/pull/2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` to mark data that's
needed for auth (i.e. username, password) and those that are sensitive
are designated appropriately using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatch between the cli field name and the dataclass field name
were updated on the dataclass side to not impact the user but maintain
consistency
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-11 17:37:49 +00:00
Roman Isecke
bd5b3535ac
feat: support glob filters for fsspec source connectors (#2204)
### Description
The local source connector already supports filters on a list of glob
filters optionally provided. This same code was leveraged in the base
fsspec connector to filter on the full file paths returned from the
underlying fs library.
2023-12-08 16:08:40 +00:00
Christine Straub
4ad01efe23
feat: improve reading order (#2219)
Closes GH Issue #2208.
2023-12-07 23:21:10 -08:00
Klaijan
46cb3060ac
fix: ignore connector extract if no connector folder available (#2198)
When no connector is provided in the folder structure, the code get
filename as connector instead. Fix the code so that if folder structure
has no connector subfolder, it leaves blank or None for connector field.
2023-12-07 20:41:17 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
qued
231f04eb84
chore: update api key link (#2182)
Update to API key link per Slack convo.

There may be more that's needed, @ron-unstructured please run with this
PR if it's the right solution, close it if not, or make changes if
needed. I haven't looked for other places where the link might need to
be changed. The link appears to work, but any further investigation or
testing is appreciated.

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-12-06 23:46:32 +00:00
Roman Isecke
f193d3d43b
feat: improve sensitive data handling by fsspec connectors (#2194)
### Description
Building off of PR
https://github.com/Unstructured-IO/unstructured/pull/2179, updating
fsspec based connectors to use better authentication field handling.
This PR adds in the following changes:

* Update the base classes to inherit from the enhanced json mixin
* Add in a new access config dataclass that should be used as a nest
dataclass in the connector configs
* Update the code extracting configs out of the cli options dictionary
to support the nested access config if it exists on the parent config
* Update all fsspec connectors with explicit access configs given what
each one's SDKs support
* Update the json mixin and enhanced field to support a name override
when serializing/deserializing from json/dicts. This allows a different
name to be used for the CLI option than what the name of the field is on
the dataclass.
* Update all the writes to use class-based approach and share the same
structure of the runner classes
* Above update allowed for better code to be used in the base source and
destination CLI commands
* Add in utility code around paring a flat dictionary (coming from the
click based options) into dataclass-based configs with potentially
nested dataclasses.

**Slightly unrelated changes:**
* session handle removed from pinecone connector as this was breaking
the serialization of the write config and didn't have any benefit as a
connection was never being shared, the index used simply makes a new
http call each time it's invoked.
* Dedicated write configs were created for all destination connectors to
better support serialization
* Refactor of Elasticsearch connector included, with update to ingest
test to use auth

**TODOs**
* Left a `#TODO` in the code but the way session handler is implemented
right now, it breaks serialization since it adds a generic variable
based on the library being used for a connector (i.e.
`googleapiclient.discovery.Resource`) which is not serializable. This
will need to be updated to omit that from serialization but still
support the current workflow.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-05 20:55:19 +00:00
Christine Straub
ed76b11b1a
Refactor: support image extraction (#2201)
### Summary
This PR is the second part of the "image extraction" refactor to move it
from unstructured-inference repo to unstructured repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.

### Testing

`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`

```
elements = partition_pdf(
        filename="example-docs/embedded-images.pdf",
        strategy="hi_res",
        extract_images_in_pdf=True,
    )

print("\n\n".join([str(el) for el in elements]))
```
2023-12-05 18:22:29 +00:00
Roman Isecke
c5cb216ac8
chore: lint for print statements in ingest code (#2215)
### Description
Given the filtering in the ingest logger, anything going to console
should go through that. This adds a linter that only checks for
`print()` statements in the ingest code and ignored it elsewhere for
now.
2023-12-05 16:42:23 +00:00
John
8fa5cbf036
build(ci): rm unneeded call to get_api_key in test (#2199)
Follow-up PR to
[https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195).

Removes unnecessary calls to `get_api_key()`. That helper function is
supposed to only be used for tests decorated by
@pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside
of CI") (which are skipped because those tests are partitioning pdf/jpg
files).
These tests are partitioning emails and rely on the MockResponse at the
top of the file, so they don't need to call `get_api_key()` and it can
simply be removed from them.
2023-12-03 21:28:05 -08:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Christine Straub
69d0ee1aea
Refactor: support merging extracted layout with inferred layout (#2158)
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
* if is_image is `True`, return the passed
inferred_layout(DocumentLayout)
  * if is_image is `False`:
* get extracted_layout (TextRegions) from the passed
document(data/filename) by pdfminer
* merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
* return the `inferred_layout `(DocumentLayout) with updated elements
(all merged LayoutElements) as merged_layout (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
2023-12-01 20:56:31 +00:00
John
e5bdf7fb43
chore: unstructured python client (#2195)
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.

Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-01 18:49:59 +00:00
Roman Isecke
f8aea71f3a
chore: refactor ingest unit test to run in it's own CI step (#2190)
### Description
Currently the ingest unit tests are running in both the source and
destination ingest steps. This PR moved it out into it's own step that
can run on a more basic runner since it doesn't require multiple
cores/cpus for parallelization.

Further optimizations:
* Added the changelog step as a dependency for the ingest tests to avoid
running them (fail fast) of the pipeline has already failed due to the
changelog not being updated.
* The cache was never actually being saved if it needed to be recreated,
a save job was added at the end of each custom action
2023-11-30 20:57:00 +00:00
cragwolfe
039ae17fdd
build(release): release commit for 0.11.2 (#2191) 0.11.2 2023-11-29 20:30:19 -08:00