1447 Commits

Author SHA1 Message Date
Tracy Shen
d82a34519e
[Merge request] bug fix on table structure metric (#3089)
**Summary**
This fix improves the `matched_idx` logic used when calculating the table
structure metric so that the resulting accuracy (acc) is calculated more precisely.
**Additional Context**

- this fix has passed CI run in Draft PR #3025 initially
- therefore, this time we would like to merge into main branch
- this commit has merged the latest change from main after the Draft PR
2024-06-10 15:14:32 +00:00
Duda Nogueira
657a949a00
chore: Weaviate pyv4 example (#3151)
Update Unstructured example for Weaviate, now using latest python v4
client.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-06-10 10:08:46 -04:00
Steve Canny
a66661a7bf
rfctr(html): drop now dead XMLDocument and Document (#3165)
**Summary**
`HTMLDocument` is the class handling the core of HTML parsing. This is
critical code because 8 of the 20 file-type partitioners end up using
this code (`partition_html()` + 7 brokering partitioners like EPUB, MD,
and RST).

For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in
turn subclassed `Document`, both of which are no longer relevant and
unnecessarily complicate reasoning about `HTMLDocument` behavior.

Remove that inheritance and dependency and drop both `XMLDocument` and
`Document` modules which become dead code after no longer being used by
`HTMLDocument`.
2024-06-08 07:36:18 +00:00
Matt Robinson
b4876f1b18
build: 0.14.5 release (#3164)
### Summary

Update changelog and version for `0.14.5` release.
0.14.5
2024-06-07 17:20:30 +00:00
Roman Isecke
0fe0f15f30
feat: migrate weaviate connector to new framework (#3160)
### Description
Add weaviate output connector to those supported in the new v2 ingest
framework. Some fixes were needed to the upload stager step as this was
the first connector moved over that leverages this part of the pipeline.
2024-06-06 23:18:55 +00:00
Steve Canny
a883fc9df2
rfctr(html): improve SNR in HTMLDocument (#3162)
**Summary**
Remove dead code and organize helpers of HTMLDocument in preparation for
improvements and bug-fixes to follow
2024-06-06 21:21:33 +00:00
Steve Canny
8378ddaa3b
rfctr(html): organize and improve HTMLDocument tests (#3161)
**Summary**
In preparation for further work on HTMLDocument, organize the organic
growth in `documents/tests_html.py` and improve typing and expression.

**Reviewers:** Commits are groomed and review is probably eased by going
commit-by-commit
2024-06-06 18:16:02 +00:00
Steve Canny
f1cab248ce
rfctr(msg): remove temporary new_msg.py (#3157)
**Summary**
Remove temporary `new_msg.py` module.

**Additional Context**
The rewrite of `partition_msg()` was placed in a separate file
`new_msg.py` to avoid a messy diff for code-review. This PR makes that
`new_msg.py` the new `msg.py`.

No code changes were made in the process.
2024-06-06 08:31:56 +00:00
Steve Canny
ddbe90f6bb
rfctr(html): clean html tests in prep for PRs to follow (#3156)
**Summary**
Clean `tests_unstructured/partition/test_html.py` in preparation for
broader refactor of HTML partitioner to follow. That refactor will
address a cluster of bugs.

Temporarily remove blank lines in tests so reordering tests in following
commit is easier to follow. Those will go back in after that.
2024-06-05 23:11:58 +00:00
Steve Canny
e4158deaff
fix(msg): use python-oxmsg for MSG email parsing (#3142)
**Summary**
`partition_msg()` previously used the `msg_parser` library for parsing
Outlook MSG email files (.msg files). The `msg_parser` library is
unmaintained and has several major shortcomings such as not being able
to parse MSG files with 8-bit encoded strings and not reliably
extracting attachments.

Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**
For reviewability purposes, this PR temporarily places the new
`partition_msg()` implementation in `new_msg.py` and references that
implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py`
in a closely following PR. This avoids a very messy interleaving of
hunks in a diff between the old and re-written `partition_msg()`
implementation.

Fixes #2481 
Fixes #3006
2024-06-05 21:12:27 +00:00
Roman Isecke
b777864296
feat: Migrate over fsspec connectors (#3066)
### Description
Move over all fsspec connectors to the new framework

Variety of bug fixes found and fixed in this PR as well:
* custom json mixin being used for the enhanced dataclass would break if
typing was quoted. That was fixed. A check was also added to the
enhanced dataclass to prevent `InitVar` from being used in the root
dataclass since this breaks serialization.
* hashing for partitioner was using the filename of the raw file being
partitioned rather than the file name of the file data generated from
indexing. This means that multiple files could result in the same
partition hash when recursive flag is passed in. This was updated to use
the file data file name instead.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-05 19:12:06 +00:00
Matt Robinson
0e16bf4bf0
enhancement: apply tar filters when using python 3.12 or above (#3124)
### Summary

Applies tar filters when using Python 3.12 or above. This was added to
the [Python `tarfile` library in
3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters)
and guards against malicious content being extracted from `.tar.gz`
files.
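
A minimal sketch of the idea, not the code from this PR: `safe_extract` is a hypothetical helper that passes `filter="data"` to `TarFile.extractall()` when the interpreter is Python 3.12 or above and falls back to the plain call otherwise.

```python
import sys
import tarfile


def safe_extract(archive_path: str, dest_dir: str) -> None:
    """Extract a tar archive, applying the "data" filter when available.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        if sys.version_info >= (3, 12):
            # The "data" filter refuses members that would be written outside
            # the destination, links that point outside it, and device files.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older interpreters may not accept the `filter` argument.
            tar.extractall(path=dest_dir)
```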

### Testing

Added smoke test. If this passes for all Python versions, we're good.
2024-06-05 18:28:59 +00:00
Yao You
fdb27378cb
chore: use python3 consistently in makefile (#3152)
This PR changes two `python` commands in `Makefile` to use `python3` to
be consistent with other make commands. This makes it more explicit on
which python to use when the makefile is used outside of a controlled
virtualenv where only one python exists.
2024-06-05 00:05:57 +00:00
Matt Robinson
5203390a4a
build(deps): weekly pip version bump (#3147)
### Summary

Weekly PR to bump dependency versions.
2024-06-04 20:47:04 +00:00
Christine Straub
1dede5029d
fix: parsing pdf error - new_cells as str has no "copy" (#3130)
Closes #3119.

### Testing
Parsing the provided PDF should be successful.


[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)
```
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
print("\n\n".join([str(el) for el in elements]))
```
0.14.4
2024-06-03 18:49:38 +00:00
Matt Robinson
1b43102762
fix: remove root handlers when they exist (#3128)
### Summary

In some environments, such as Google Colab, loggers have a root handler
that does not mask sensitive values. As a result, secrets such as API
keys appeared in the logs. The PR removes root handlers when they exist
to ensure sensitive values are handled properly.
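
As a rough illustration of the approach (the actual change in the PR may differ in detail), root handlers can be detached with the standard `logging` API. `remove_root_handlers` is a hypothetical name used only for this sketch.

```python
import logging


def remove_root_handlers() -> int:
    """Detach every handler from the root logger and return how many were removed.

    Hypothetical helper for illustration only.
    """
    root = logging.getLogger()
    removed = 0
    # Iterate over a copy because removeHandler() mutates root.handlers.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        removed += 1
    return removed
```

After this runs, records are emitted only by handlers attached to the named loggers, so a masking handler on `unstructured.ingest` would be the only one producing output.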

### Testing

Run the following in a Colab notebook. You should see two log outputs,
one with the API key masked and one with it exposed.

```
!pip install unstructured
```

```python
import logging
import json

from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

partition_config = PartitionConfig(
        partition_by_api=True,
        api_key="super secret",

    )

from unstructured.ingest.logger import ingest_log_streaming_init
ingest_log_streaming_init(logging.INFO)

logger = logging.getLogger("unstructured.ingest")
logger.setLevel(logging.INFO)

logger.info(
 f"Running partition node to extract content from json files. "
 f"Config: {partition_config.to_json()}, "
)
```

Now replace the first cell with the following and rerun the Python code.
Only the masked logging output should remain.

```
!git clone https://github.com/Unstructured-IO/unstructured.git && cd unstructured && git checkout fix/rm-log-dupes && pip install -e .
```
2024-05-31 22:07:38 +00:00
Matt Robinson
54c1e4e57f
ci: remove jira issue workflow (#3129)
### Summary

Removes the workflow for creating Jira tickets.
2024-05-31 22:00:40 +00:00
Matt Robinson
6005abce79
feat: configure googlevisionapi (#3126)
### Summary

Includes changes from #3117. Merged into a feature branch to run the
full test suite.

Original PR description:

The Google Vision API allows for [configuration of the API
endpoint](https://cloud.google.com/vision/docs/ocr#regionalization), to
select if the data should be sent to the US or the EU. This PR adds an
environment variable (`GOOGLEVISION_API_ENDPOINT`) to configure it.

---------

Co-authored-by: JIAQIA <jqq1716@gmail.com>
Co-authored-by: Dimitri Lozeve <dimitri@lozeve.com>
2024-05-31 18:41:04 +00:00
Yuming Long
4a96d54906
chore: move logger error to debug when pdfminer extract fails (#3028)
### Summary

We are seeing the logger error `Invalid dictionary construct` in hosted
APIs. This PR moves that log message to debug level. Partitioning still
continues when pdfminer text extraction fails, as before; the failure is
just no longer logged as an error.

### Test
I was able to reproduce the logger error with an internal-only file
(please DM me if needed), and the error trace looks like:
```
 File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 709, in _process_pdfminer_pages
    annotation_list = get_uris(page.annots, height, coordinate_system, page_number)
  File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 1049, in get_uris
    resolved_annots = annots.resolve()
...
```
We also won't be able to repair the PDF structure in `get_uris` (it is not
page-level), so this exception is moved to debug level.
2024-05-31 17:58:36 +00:00
Matt Robinson
865ef496e6
ci: update pinecone test to use serverless (#3127)
### Summary

Closes #3068. Updates the Pinecone connector tests to use serverless
indexes, per the documentation
[here](https://docs.pinecone.io/reference/api/control-plane/create_index).
Also updates the CHANGELOG to mention serverless. Turns out we already
supported it with the client version bump, but it hadn't been tested
yet.

### Testing

See [this CI
job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127)
that passed, running only the Pinecone test.
2024-05-31 15:24:41 +00:00
ryannikolaidis
1f8768750c
chore: add auth to s3 destination test (#3122)
We should be validating the S3 Destination with authenticated requests,
with credentials from a limited test user.

## Changes

- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention

## Testing
Validated by changes to S3 destination ingest test
2024-05-31 07:05:09 +00:00
Matt Robinson
23e570fc8a
docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to
https://docs.unstructured.io and cleans up a few sections of the README.
Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is
only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python
3.12.
2024-05-30 16:22:54 +00:00
qued
293901e144
build: pin python-docx (#3110)
Since we incorporate a newer feature from `python-docx`
[here](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L521),
we should make the version of `python-docx` that first supports that
method an explicit requirement.

I didn't pip recompile since our generated dependencies already have
`python-docx==1.1.2`, but I can do that if someone thinks it's
necessary.
2024-05-30 15:08:10 +00:00
Matt Robinson
9acf26ec2e
docs: explicitly replace all old pages with link to new docs (#3118)
### Summary

Explicitly replaces all old docs pages with a link to the new docs. This
was required because 404 redirects didn't work for pages that previously
existed, though they worked for paths that never existed.
2024-05-30 13:01:33 +00:00
Matt Robinson
8415db5112
docs: make 404 pages same as index (#3114)
### Summary

Makes a custom 404 page that's the same as `index.html`, so any path
shows the URL for the new docs.
2024-05-30 07:46:38 -04:00
Steve Canny
f2e67539b1
rfctr: clean MSG partitioner and tests as prep (#3107)
**Summary**
Fix type errors and generally prepare `partition_msg()` and its tests
for refactoring to use `python-oxmsg` library instead of the problematic
`msg_parser` library for partitioning Outlook MSG files.
2024-05-29 21:36:05 +00:00
Matt Robinson
2ecaf5e38c
fix: remove 404 from docs (#3112)
### Summary

Removes 404 from the docs build to avoid rate limiting behavior.
2024-05-29 20:41:32 +00:00
ryannikolaidis
6b5d8a9785
fix: revert dropping of filename extension for some connectors (#3109)
V2 refactor of ingest code introduces the removal of original file
extensions. Since the upgrade of connectors is incomplete this means
that some connectors will remove the original file extension and some
will not. Still TBD whether this is actually something we want at all.

This PR reverts specifically that change in the V2 ingest code so that
original file extension is preserved downstream.

## Testing
CI is passing with filenames updated via `Ingest Test Fixtures Update`
workflow.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2024-05-29 19:14:22 +00:00
Christine Straub
f4457249a7
fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.
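
The behavior described above can be sketched as follows. This is an assumption-laden illustration: `replace_tabs_and_strip_control` is a hypothetical name, and the set of other control characters being dropped (Unicode category `Cc`) is a guess at the surrounding cleaning logic, not taken from the PR.

```python
import unicodedata


def replace_tabs_and_strip_control(text: str) -> str:
    """Turn tabs into spaces, then drop the remaining control characters.

    Hypothetical helper for illustration only.
    """
    # Replace the tab first so it survives as a word separator.
    text = text.replace("\t", " ")
    # Category "Cc" covers the C0/C1 control characters (newlines included).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
```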

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

print(str(elements[20]))
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
0.14.3
2024-05-29 04:53:17 +00:00
Matt Robinson
3158169585
fix: uninstall bson for mongo connector (#3104)
### Summary

Closes #3049. Reenables the MongoDB connector test, which was disabled
previously in #3047 due to incompatibility between the `pymongo` and the
`bson` package from `pip`, which is a dependency for the Astra
connector. Per the `pymongo` docs below, `pymongo` ships with its own
version of `bson` and installing `bson` from `pip` breaks `pymongo`.

- https://pymongo.readthedocs.io/en/stable/installation.html

### Testing

Ingest tests ran successfully for the [source
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315)
and the [destination
connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).
2024-05-28 17:45:18 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Yao You
32df4ee1c6
fix: disable table_as_cells output by default (#3093)
This PR changes the output of table elements: now by default the table
elements' `metadata.table_as_cells` is `None`. The data will only be
populated when the env `EXTRACT_TABLE_AS_CELLS` is set to `true`.

The original purpose of `table_as_cells` is to evaluate table
extraction performance. The format itself is not as readable as the
`table_as_html` metadata for human or RAG consumption. Therefore by
default this data is not needed.

Since this output is meant for evaluation use, this PR uses an
environment variable to control whether it is present in the
partitioned results. This approach avoids adding parameters to the
`partition` function call. Adding a new parameter to the `partition`
interface increases the complexity of the interface and adds
maintenance cost, since the parameter would have to be passed down a
long chain of function calls to where it is needed.
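
A minimal sketch of gating output on an environment variable, assuming hypothetical helper names (`extract_table_as_cells_enabled`, `table_as_cells_or_none`) and a case-insensitive comparison that the PR does not necessarily use:

```python
import os


def extract_table_as_cells_enabled() -> bool:
    """Return True when EXTRACT_TABLE_AS_CELLS is set to "true".

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    value = os.environ.get("EXTRACT_TABLE_AS_CELLS", "false")
    return value.strip().lower() == "true"


def table_as_cells_or_none(cells):
    """Keep the cell data only when the environment variable opts in."""
    return cells if extract_table_as_cells_enabled() else None
```

Because the flag is read where the data is produced, no caller in the chain between `partition` and the table code needs a new argument.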

## test

running the following code snippet on main vs. this PR

```python
from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]
```

on main branch `table_cells` contains cell structured data but on this
branch it is a list of `None`

However if we first set in terminal:

```bash
export EXTRACT_TABLE_AS_CELLS=true
```

then run the same code again with this PR the `table_cells` would
contain actual data, the same as on main branch.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-05-24 16:41:25 +00:00
Yao You
809c7e515a
chore: reduce excessive logging (#3095)
- change some info level logging for per page processing into detail
level logging on trace logger
- replace the try block in `document_to_element_list` to use `getattr`
instead and add comment on the reason why sometimes `type` attribute may
not exist for an element
2024-05-24 14:58:47 +00:00
Steve Canny
26d403d7a7
fix: add missing params to ElementMetadata (#3092)
A couple of parameters needed for DOCX image extraction were not added
as parameters to the `ElementMetadata` constructor when they were added
as known fields.

Also repair a couple of gaps in alphabetical ordering caused by recent
additions.
2024-05-23 21:30:55 +00:00
Christine Straub
35ec21ecd0
fix: decide table extraction (#3090)
This PR adds backward compatibility for the deprecated
`pdf_infer_table_structure` parameter. This was a missing part of turning
table extraction for PDFs and images off by default in
https://github.com/Unstructured-IO/unstructured/pull/3035; it had been
turned on in https://github.com/Unstructured-IO/unstructured/pull/2588.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-05-23 20:37:15 +00:00
David Potter
31a53c8a28
Fix: Chroma Upsert instead of Add (#3086)
Thanks to @0xjgv we have upserting instead of adding in Chroma. This
will prevent duplicate embeddings.

Also including a huggingface example. We had examples for all the other
embedders.
2024-05-23 19:56:19 +00:00
Steve Canny
47d28612f7
feat(docx): add pluggable picture sub-partitioner (#3081)
**Summary**
Allow registration of a custom sub-partitioner that extracts images from
a DOCX paragraph.

**Additional Context**
- A custom image sub-partitioner must implement the
`PicturePartitionerT` interface defined in this PR. Basically have an
`.iter_elements()` classmethod that takes the paragraph and generates
zero or more `Image` elements from it.
- The custom image sub-partitioner must be registered by passing the
class to `register_picture_partitioner()`.
- The default image sub-partitioner is `_NullPicturePartitioner` that
does nothing.
- The registered picture partitioner is called once for each paragraph.
2024-05-23 18:46:30 +00:00
Matt Robinson
171b5df09f
fix: set resolve_entities=False in partition_xml (#3088)
### Summary

Closes #3078. Sets `resolve_entities=False` for parsing XML with `lxml`
in `partition_xml` to avoid text being dynamically injected into the
document.

### Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with
the update.
2024-05-23 18:38:11 +00:00
Jan Kanty Milczek
9b83330b5a
fix: added the missing function argument (#3085)
2024-05-23 14:30:26 +00:00
Hubert Rutkowski
b8d894f963
feat/Move the category field to Element (#3056)
It's a pretty basic change: the category field was literally just moved
to the `Element` class. I can't think of other changes that are needed here,
because I think pretty much everything expected the category to be
directly in the elements list.

For local testing, IDEs and linters should see the difference:
`category` is now in `Element`.
2024-05-23 10:43:26 +00:00
Matt Robinson
c9976760c5
fix: revert back to old requirements file for sphinx docs (#3077)
### Summary

As seen in [this
job](https://github.com/Unstructured-IO/unstructured/actions/runs/9182534479/job/25251583102),
the build job for sphinx docs is failing, and has been failing for quite
some time. This PR reverts the requirements file back to a [previous
good
commit](91b892c79d)
for that job, and also moves the `build.in` file so the requirements
file doesn't get updated on `make pip-compile`. This is fine since those
requirements don't get installed as part of the package, and we've
deprecated the `sphinx` docs in favor of https://docs.unstructured.io
anyway.

### Testing

Build was
[successful](https://github.com/Unstructured-IO/unstructured/actions/runs/9198605026/job/25301670934?pr=3077)
on the feature branch.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-05-23 03:32:06 +00:00
Steve Canny
b4ee019170
rfctr: flatten test_unstructured/partition (#3073)
**Summary**
Some partitioner test modules are placed in directories by themselves or
with one other test module. This unnecessarily obscures where to find
the test module corresponding to a partitioner.

Move partitioner test modules to mirror the directory structure of
`unstructured/partition`.
2024-05-23 00:51:08 +00:00
Christine Straub
18428f24ab
chore: bump unstructured-inference 0.7.33 (#3074)
Summary:
- bump unstructured-inference to `0.7.33`
- cut a release for `0.14.2`
- add some dependencies that previously came through from the
layoutparser extras.
0.14.2
2024-05-22 22:35:00 +00:00
Steve Canny
30e5a0cd4e
rfctr(docx): organize docx tests (#3070)
**Summary**
In preparation for adding DOCX pluggable image extraction, organize a few
of the DOCX tests to be parallel to very similar tests for the DOC and
ODT partitioners.
0.14.1
2024-05-21 22:11:46 +00:00
Matt Robinson
7832dfc723
feat: add attribution for pinecone (#3067)
### Summary

- Updates the `pinecone-client` from v2 to v4 using the [client
migration
guide](https://canyon-quilt-082.notion.site/Pinecone-Python-SDK-v3-0-0-Migration-Guide-056d3897d7634bf7be399676a4757c7b#932ad98a2d33432cac4229e1df34d3d5).
Version bump was required to [add
attribution](https://pinecone-2-partner-integration-guide.mintlify.app/integrations/build-integration/attribute-api-activity)
and will also enable us to support [serverless
indexes](https://docs.pinecone.io/reference/pinecone-clients#initialize)
- Adds `"unstructured.{version}"` as the source tag for the connector

### Testing

Destination connection tests
[pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9180305080/job/25244484432?pr=3067)
with the updates.
2024-05-21 20:56:08 +00:00
Christine Straub
b0d8a779da
feat: partition_pdf() set inferred elements text (#3061)
This PR adds the ability to fill inferred elements text from embedded
text (`pdfminer`) without depending on `unstructured-inference` library.
This PR is the second part of moving embedded text related code from
`unstructured-inference` to `unstructured` and works together with
https://github.com/Unstructured-IO/unstructured-inference/pull/349.
2024-05-21 19:43:38 +00:00
Matt Robinson
059fc64bd9
build: apk add libreoffice24 (#3065)
### Summary

Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.

### Testing

`test_dockerfile` passes with the updates.
2024-05-21 18:54:16 +00:00
Roman Isecke
3eaf65a8c1
feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are run:
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination has been
pulled out as its own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress support to a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.
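
To illustrate why async suits the IO-bound steps mentioned in the first bullet, here is a small sketch using only `asyncio`. The `download` coroutine is a stand-in that sleeps instead of touching the network, and none of these names come from the v2 ingest framework.

```python
import asyncio
from typing import List


async def download(name: str) -> str:
    # Stand-in for an IO-bound step; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return f"downloaded:{name}"


async def download_all(names: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    results = await asyncio.gather(*(download(name) for name in names))
    return list(results)


def run_downloads(names: List[str]) -> List[str]:
    # Entry point for synchronous callers.
    return asyncio.run(download_all(names))
```

While one coroutine waits on IO the others can proceed, which is the kind of parallelization that bullet describes.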

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-05-21 17:01:49 +00:00
Matt Robinson
73739b38cc
docs: redirect to docs.unstructured.io on github pages (#3054)
### Summary

Updates GitHub pages to redirect to the new https://docs.unstructured.io
page. This will appear on GitHub pages after the next tag.

### Testing

1. From the docs directory, run `make html`. You should not see any
errors or warnings
2. Open `unstructured/docs/build/html/index.html`. It should look like
the following:
<img width="1512" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/077626a5-d88a-467e-9e37-273a92e75d30">
3. Open `unstructured/docs/build/html/404.html`. It should redirect back
to `index.html`. Per the [GitHub pages
docs](https://docs.github.com/en/pages/getting-started-with-github-pages/creating-a-custom-404-page-for-your-github-pages-site),
that page will get served for 404 errors, meaning any links to old docs
pages will redirect to `index.html`, which points users to the new docs
page.
2024-05-21 09:38:32 -04:00
Matt Robinson
acda4d0707
fix: set skip_infer_tables explicitly in test_partition_via_api_with_no_strategy (#3057)
### Summary

A `partition_via_api` test that only runs on `main` was
[failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959)
with the following output, likely due to the change in the default
behavior for `skip_infer_table_types`. This PR explicitly sets the
`skip_infer_table_types` param to avoid the failure.

```python
=========================== short test summary info ============================
FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®'
 +  where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text
 +  and   'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text
= 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) =
make: *** [Makefile:302: test] Error 1
```

### Testing

After temporarily removing the "skip if not on `main`" `pytest` mark,
the [unit tests
pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O)
on the feature branch.
2024-05-20 19:05:13 -04:00