1418 Commits

Author SHA1 Message Date
Christine Straub
b47e6e9fdc
refactor: remove download packages step (#3225)
This PR aims to remove the download packages step since all of that gets
installed in the base images. This PR also updates the base `wolfi`
image because the original base image can not be found anymore:
https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945
2024-06-18 12:15:44 +00:00
Steve Canny
77a9e1b54d
rfctr(html): drop convert_and_partition_html() (#3215)
**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.

**Additional Context**

Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` represents an additional brokering
layer with the entailed complexities of an additional site for default
parameter values to be (mis-)applied and/or dropped and is an additional
location for new parameters to be added.
2024-06-17 19:43:18 +00:00
Roman Isecke
d876a386ed
Roman/fix ingest async connectors (#3210)
### Description
Opting a connector into async must be done carefully: if a connector is
set to use async, the pipeline will not fan out the inputs via
multiprocessing but will instead run in a single process, on the
assumption that it benefits more from async due to heavy network
traffic. This means that code which is marked async but is actually
blocking will make the pipeline perform worse than if the connector had
never been marked async, since in that case the pipeline would fan the
work out using multiprocessing.

All connectors and processes in the pipeline were revisited to make sure
this criterion was met and updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* none of the underlying embedders use async, so the embedder step must
use multiprocessing for now. A TODO was left to update the underlying
embedder classes to optionally support async.
* Chunking parameters were not accurately being passed through from cli
to chunker params, this was fixed
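
The cost of marking blocking code as async can be illustrated with a minimal sketch. The job functions below are hypothetical stand-ins, not connector code: one blocks the event loop with `time.sleep()`, the other yields with `asyncio.sleep()`.

```python
import asyncio
import time


async def blocking_job(delay: float) -> None:
    # time.sleep() never yields to the event loop, so jobs run one after another.
    time.sleep(delay)


async def cooperative_job(delay: float) -> None:
    # asyncio.sleep() yields, letting the event loop overlap the waits.
    await asyncio.sleep(delay)


async def run_all(job, count: int, delay: float) -> float:
    """Run `count` jobs with gather() and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(count)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(run_all(blocking_job, count=5, delay=0.05))
cooperative_elapsed = asyncio.run(run_all(cooperative_job, count=5, delay=0.05))
print(f"blocking: {blocking_elapsed:.2f}s, cooperative: {cooperative_elapsed:.2f}s")
```

The blocking variant takes roughly the sum of all delays in a single process, which is the situation where multiprocessing would have been the better choice.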

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-17 16:55:19 +00:00
Frederic Marvin Abraham
6220633d3f
enhancement: make tempfiles windows friendly (#3108)
### Summary

Updates handling of tempfiles so that they work on Windows systems.
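
One common pattern for Windows-friendly temp files is sketched below. It is an assumption about the general technique, not a description of what this PR changed: the file is created with `delete=False`, closed, and only then reopened by name, because on Windows a `NamedTemporaryFile` generally cannot be opened a second time while the first handle is still open. The helper name `write_then_read` is hypothetical.

```python
import os
import tempfile


def write_then_read(data: bytes) -> bytes:
    """Round-trip bytes through a temp file in a way that also works on Windows."""
    # delete=False lets us close the handle before opening the file again.
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
    try:
        tmp.write(data)
        tmp.close()
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        # close() is safe to call twice; unlink() does the cleanup that
        # delete=True would otherwise have done.
        tmp.close()
        os.unlink(tmp.name)
```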

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2024-06-17 13:28:48 -04:00
Matt Robinson
2815226b54
build(deps): version bumps for 2024-06-17 (#3220)
### Summary

Version bumps for the week of 2024-06-17. There is now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
2024-06-17 14:04:29 +00:00
Steve Canny
9fae0111d9
rfctr(html): drop HTML-specific elements (#3207)
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desirable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned from partitioning HTML but also the seven other file-formats
that broker partitioning to HTML (convert-to-HTML and partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
2024-06-15 00:14:22 +00:00
Matt Robinson
08383a27de
build: pull from wolfi base image (#3213)
### Summary

Updates the `wolfi` image to pull from the upstream `wolfi-base` base
image to avoid maintaining the base layers in both locations. Closes
#3105 by pulling in the fix from upstream.

### Testing

`test_dockerfile` should continue to pass with the changes.
2024-06-14 20:41:27 +00:00
Christine Straub
9552fbbfbf
chore: bump unstructured-inference 0.7.35 (#3205)
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
0.14.6
2024-06-14 18:11:38 +00:00
Roman Isecke
a6c09ec621
Roman/dry ingest pipeline step (#3203)
### Description
The main goal of this was to reduce the duplicate code that was being
written for each ingest pipeline step to support async and not async
functionality.

Additional bug fixes found and fixed:
* each logger for ingest wasn't being instantiated correctly. This was
fixed to instantiate in the beginning of a pipeline run as soon as the
verbosity level can be determined.
* The `requires_dependencies` wrapper wasn't wrapping async functions
correctly. This was fixed so that `asyncio.iscoroutinefunction()` gets
triggered correctly.
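
A minimal sketch of a decorator that preserves coroutine functions follows. The name `requires_modules` and its body are hypothetical and are not the project's `requires_dependencies` implementation; the sketch uses `inspect.iscoroutinefunction()`, the standard-library check that performs the same detection.

```python
import asyncio
import functools
import importlib.util
import inspect


def requires_modules(*names: str):
    """Hypothetical decorator: fail early if a top-level module is not importable."""

    def check() -> None:
        missing = [n for n in names if importlib.util.find_spec(n) is None]
        if missing:
            raise ImportError(f"missing dependencies: {', '.join(missing)}")

    def decorator(func):
        if inspect.iscoroutinefunction(func):
            # Wrap with an async function so the result is still a coroutine function.
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                check()
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            check()
            return func(*args, **kwargs)

        return sync_wrapper

    return decorator


@requires_modules("json")
async def fetch() -> str:
    return "ok"


@requires_modules("json")
def load() -> str:
    return "ok"
```

Wrapping an async function in a plain `def` would hide it from coroutine detection, which is why the two branches are kept separate.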
2024-06-14 13:46:44 +00:00
Pawel Kmiecik
29e64eb281
feat: table evaluations for fixed html table generation (#3196)
Update to the evaluation script to handle correct HTML syntax for
tables.
See https://github.com/Unstructured-IO/unstructured-inference/pull/355
for details.

This change:
- modifies transforming HTML tables to evaluation internal `cells`
format
- fixes the indexing of the output (internal format cells) when HTML
cells use spans
2024-06-14 09:03:27 +00:00
Roman Isecke
dadc9c6d0b
feat/tqdm ingest support (#3199)
### Description
Add tqdm support to show a progress bar with the status of each job as
it runs. Supported for each mode (serial, async, multiprocess). Also
adds a small timing wrapper around jobs to print out how long it took in
total.
2024-06-13 18:41:54 +00:00
Steve Canny
f5ebb209a4
rfctr(html): drop page concept (#3184)
**Summary**
Pagination of HTML documents is currently unused. The `Page` class and
concept were deeply embedded in the legacy organization of HTML
partitioning code due to the legacy `Document` (= pages of elements)
domain model. Remove this concept from the code such that elements are
available directly from the partitioner.

**Additional Context**
- Pagination can be re-added later if we decide we want it again. A
re-implementation would be much simpler and much lower impact to the
structure of the code and introduce much less additional complexity,
similar to the approach we take in `partition_docx()`.
2024-06-13 18:19:42 +00:00
ryannikolaidis
da3492b529
fix: dropbox source connector file path bugs (#3189)
The Dropbox source connector currently raises exceptions when indexing
files due to two issues: a path formatting idiosyncrasy of the Dropbox
library and a divergence in the signature of the Dropbox library's
`fs.info` method, which expects a `url` parameter rather than `path`.

## Changes

* add a `/` prefix to file path used by DropboxIndexer
* override the fsspec sterilize_info method in DropboxIndexer to call
`self.fs.info` with `url` rather than `path`; to accommodate for the
fact that `dropboxdrivefs` diverges with this signature
* remove `dropbox.sh` from ignored source tests
* update test fixtures (now that the dropbox connector has been fixed
and not skipped)

## Testing
`dropbox.sh` source ingest test now succeeds (and is no longer ignored)

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-06-13 18:06:41 +00:00
Roman Isecke
f7b0a37c86
Feat/migrate elasticsearch src connector (#3174)
### Description
Migrate the elasticsearch connector, with support for what used to be
batch ingest docs, but now with support for the download step to
generate additional file data.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-13 17:57:59 +00:00
Matt Robinson
ad69bdcd4e
build(deps): deltalake bump to 0.18.x (#3197)
### Summary

Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table
connector and bumps the `deltalake` version. Per [this
PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake`
repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`.
Users can specify `schema_mode="merge"` to obtain the same behavior.

- `schema_mode="merge"` is equivalent to `overwrite_schema=False`
- `schema_mode="overwrite"` is equivalent to `overwrite_schema=True`

Also adds an `engine` parameter that you can use to set `"rust"` or
`"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and
`schema_mode` defaults to `None`, which is consistent with the behavior
in `deltalake` documented
[here](https://delta-io.github.io/delta-rs/api/delta_writer/).
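
The mapping between the removed flag and the new setting, as listed above, can be sketched as a small helper. The function name is hypothetical and the helper is not part of the connector.

```python
from typing import Optional


def schema_mode_from_overwrite(overwrite_schema: Optional[bool]) -> Optional[str]:
    """Translate the removed `overwrite_schema` flag into a `schema_mode` value."""
    if overwrite_schema is None:
        # No preference given: keep the default of `None` described above.
        return None
    return "overwrite" if overwrite_schema else "merge"
```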

### Testing

The Delta Table ingest tests should pass on this PR.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-06-13 15:59:34 +00:00
Steve Canny
5f582f1716
ci: update to Node 20 actions (#3200)
**Summary**
Silence the long list of warnings we get in CI from using Node 16
actions by updating to Node 20 versions.
2024-06-13 03:43:26 +00:00
ryannikolaidis
17bc55e7be
fix: relative path / permissions issues with v2 fsspec connectors (#3186)
When the v2 fsspec connectors currently generate the relative path, they
may introduce a path with a leading slash (this happens in the case of
the Box connector, which is a subclass of fsspec). When this happens,
the paths are unintentionally treated as absolute
paths. As a result, the ingest pipeline attempts to write files to
directories at root level, which in turn raises permission issues.
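
The failure mode can be reproduced with the standard library alone. This is a minimal sketch; `safe_join` and the example paths are hypothetical, not the connector's code.

```python
from pathlib import PurePosixPath


def safe_join(output_dir: str, rel_path: str) -> PurePosixPath:
    """Join `rel_path` under `output_dir`, even if it starts with a slash."""
    return PurePosixPath(output_dir) / rel_path.lstrip("/")


# Joining an absolute path discards everything to its left.
naive = PurePosixPath("/tmp/output") / "/folder/file.txt"
fixed = safe_join("/tmp/output", "/folder/file.txt")
print(naive)  # /folder/file.txt
print(fixed)  # /tmp/output/folder/file.txt
```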

Note: the Box expected results needed to be updated now that it's no
longer failing.

Aside: found that our tests were unintentionally skipping `box.sh` tests
because we were intending to skip `dropbox.sh` and we use regex to match
if a given test is in skip tests. This adds changes to force an exact
match.
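
The difference between pattern matching and exact matching is sketched here in Python as an analogue; the real check lives in the test scripts, and the function names are hypothetical.

```python
import re

skipped_tests = ["dropbox.sh"]


def is_skipped_loose(test_name: str) -> bool:
    # Pattern search: "box.sh" is found inside "dropbox.sh", so Box is skipped too.
    return any(re.search(test_name, skipped) for skipped in skipped_tests)


def is_skipped_exact(test_name: str) -> bool:
    # Exact comparison: only the listed test is skipped.
    return test_name in skipped_tests
```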

## Changes

* Strip leading slashes during the creation of relative paths in fsspec
connectors
* Add expected results for Box connector
* (bonus): `make tidy` altered an unrelated file by removing an
unnecessary call of `pass`
* (bonus): check exact match for skipped ingest tests which fixes Box
tests getting skipped

## Testing


[Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085)
for the Box connector were failing. They were accidentally getting
skipped (see changes above). They are now no longer skipped and are
passing.
2024-06-12 03:39:35 +00:00
Filip Knefel
c2065db716
fix API-297: List parameters incorrectly passed to API requests (#3154)
In two places, parameters with list values that are passed to the Python
client (when using either the Ingest workflow or the `partition_via_api`
function directly) are converted to strings, e.g.
```python
extract_image_block_types=["image"] -> extract_image_block_types='["image"]'
```
As of now, these parameters are parsed incorrectly when given as strings
and correctly when given as lists.

This PR removes parsing from `PartitionConfig` and `partition_via_api`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2024-06-11 21:00:41 +00:00
Steve Canny
2f0400f279
rfctr(html): break coupling to DocumentLayout (#3180)
**Summary**
Remove use of `partition.common.document_to_element_list()` by
`HTMLDocument`. The transitive coupling with layout-inference through
this shared function has been the source of frustration and a drain on
engineering time and there's no compelling reason for the two to share
this code.

**Additional Context**
`partition_html()` uses `partition.common.document_to_element_list()` to
get finalized elements from `HTMLDocument` (pages). This gives rise to a
very nasty coupling between `DocumentLayout`, used by
`unstructured_inference`, and `HTMLDocument`.
`document_to_element_list()` has evolved to work for both callers, but
they share very few common characteristics with each other.

This coupling is bad news for us and also, importantly, for the
inference and page layout folks working on PDF and images.

Break that coupling so those inference-related functions can evolve
whatever way they need to without being dragged down by legacy
`HTMLDocument` connections.

The initial step is to extract a `document_to_element_list()` function
of our own, getting rid of the coordinates and other
`DocumentLayout`-related bits we don't need. As you'll see in the next
few PRs, all of this `document_to_element_list()` code will end up
either going away or being relocated closer to where it's used in
`HTMLDocument`.
2024-06-11 20:54:11 +00:00
Steve Canny
e39ee16161
rfctr(html): promote HTMLDoc candidate methods (#3177)
**Summary**
Make `._find_articles()` and `._find_main` into `._articles` and
`._main` properties on HTMLDocument, respectively.
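
Promoting a method that needs only `self` to a property computed at most once can be sketched as follows. The sketch uses the standard library's `functools.cached_property` as a stand-in for the project's `lazyproperty`; the `HtmlDoc` class and its naive string splitting are hypothetical and are not the real `HTMLDocument` logic.

```python
from functools import cached_property
from typing import List


class HtmlDoc:
    """Hypothetical stand-in for a document class with a derived attribute."""

    def __init__(self, html: str) -> None:
        self._html = html
        self.compute_count = 0

    @cached_property
    def _articles(self) -> List[str]:
        # Runs on first access only; the result is then stored on the instance.
        self.compute_count += 1
        return self._html.split("<article>")[1:]


doc = HtmlDoc("<article>a</article><article>b</article>")
first = doc._articles
second = doc._articles
```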

**Additional Context**
After prior refactorings, these two functions now each require only
`self` and can become `@lazyproperty`s on `HTMLDocument`. This ensures
they are computed at most once. In addition, their close relationship to
`HTMLDocument` is indicated by their membership as methods rather than
"loose" functions.
2024-06-10 22:07:21 +00:00
Matt Robinson
c822e3fd10
build(deps): weekly dependency bumps (6/10/2024) (#3170)
### Summary

Weekly dependency bumps for the week of 6/10/2024.

The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0`
seemed to break the connector test, per [this
test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005).
Opened #3173 to address.
2024-06-10 16:20:22 +00:00
Tracy Shen
d82a34519e
[Merge request] bug fix on table structure metric (#3089)
**Summary**
This fix improves the `matched_idx` logic used when calculating the
table structure metric, giving a more accurate calculation of the acc.
**Additional Context**

- this fix has passed CI run in Draft PR #3025 initially
- therefore, this time we would like to merge into main branch
- this commit has merged the latest change from main after the Draft PR
2024-06-10 15:14:32 +00:00
Duda Nogueira
657a949a00
chore: Weaviate pyv4 example (#3151)
Update Unstructured example for Weaviate, now using latest python v4
client.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-06-10 10:08:46 -04:00
Steve Canny
a66661a7bf
rfctr(html): drop now dead XMLDocument and Document (#3165)
**Summary**
`HTMLDocument` is the class handling the core of HTML parsing. This is
critical code because 8 of the 20 file-type partitioners end up using
this code (`partition_html()` + 7 brokering partitioners like EPUB, MD,
and RST).

For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in
turn subclassed `Document`, both of which are no longer relevant and
unnecessarily complicate reasoning about `HTMLDocument` behavior.

Remove that inheritance and dependency and drop both `XMLDocument` and
`Document` modules which become dead code after no longer being used by
`HTMLDocument`.
2024-06-08 07:36:18 +00:00
Matt Robinson
b4876f1b18
build: 0.14.5 release (#3164)
### Summary

Update changelog and version for `0.14.5` release.
0.14.5
2024-06-07 17:20:30 +00:00
Roman Isecke
0fe0f15f30
feat: migrate weaviate connector to new framework (#3160)
### Description
Add weaviate output connector to those supported in the new v2 ingest
framework. Some fixes were needed to the upload stager step as this was
the first connector moved over that leverages this part of the pipeline.
2024-06-06 23:18:55 +00:00
Steve Canny
a883fc9df2
rfctr(html): improve SNR in HTMLDocument (#3162)
**Summary**
Remove dead code and organize helpers of HTMLDocument in preparation for
improvements and bug-fixes to follow
2024-06-06 21:21:33 +00:00
Steve Canny
8378ddaa3b
rfctr(html): organize and improve HTMLDocument tests (#3161)
**Summary**
In preparation for further work on HTMLDocument, organize the organic
growth in `documents/tests_html.py` and improve typing and expression.

**Reviewers:** Commits are groomed and review is probably eased by going
commit-by-commit
2024-06-06 18:16:02 +00:00
Steve Canny
f1cab248ce
rfctr(msg): remove temporary new_msg.py (#3157)
**Summary**
Remove temporary `new_msg.py` module.

**Additional Context**
The rewrite of `partition_msg()` was placed in a separate file
`new_msg.py` to avoid a messy diff for code-review. This PR makes that
`new_msg.py` the new `msg.py`.

No code changes were made in the process.
2024-06-06 08:31:56 +00:00
Steve Canny
ddbe90f6bb
rfctr(html): clean html tests in prep for PRs to follow (#3156)
**Summary**
Clean `tests_unstructured/partition/test_html.py` in preparation for
broader refactor of HTML partitioner to follow. That refactor will
address a cluster of bugs.

Temporarily remove blank lines in tests so reordering tests in following
commit is easier to follow. Those will go back in after that.
2024-06-05 23:11:58 +00:00
Steve Canny
e4158deaff
fix(msg): use python-oxmsg for MSG email parsing (#3142)
**Summary**
`partition_msg()` previously used the `msg_parser` library for parsing
Outlook MSG email files (.msg files). The `msg_parser` library is
unmaintained and has several major shortcomings such as not being able
to parse MSG files with 8-bit encoded strings and not reliably
extracting attachments.

Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**
For reviewability purposes, this PR temporarily places the new
`partition_msg()` implementation in `new_msg.py` and references that
implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py`
in a closely following PR. This avoids a very messy interleaving of
hunks in a diff between the old and re-written `partition_msg()`
implementation.

Fixes #2481 
Fixes #3006
2024-06-05 21:12:27 +00:00
Roman Isecke
b777864296
feat: Migrate over fsspec connectors (#3066)
### Description
Move over all fsspec connectors to the new framework

Variety of bug fixes found and fixed in this PR as well:
* custom json mixin being used for the enhanced dataclass would break if
typing was quoted. That was fixed. A check was also added to the
enhanced dataclass to prevent `InitVar` from being used in the root
dataclass since this breaks serialization.
* hashing for partitioner was using the filename of the raw file being
partitioned rather than the file name of the file data generated from
indexing. This means that multiple files could result in the same
partition hash when recursive flag is passed in. This was updated to use
the file data file name instead.
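
The kind of collision described in the second fix can be illustrated with a minimal sketch. The paths and the `short_hash` helper are hypothetical, and the inputs the pipeline actually hashes may differ; the point is only that a base name is not unique once directories are walked recursively.

```python
import hashlib
from pathlib import PurePosixPath


def short_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Two files found by a recursive listing that share a base name.
remote_paths = ["reports/2023/summary.pdf", "reports/2024/summary.pdf"]

# Hashing only the base name maps both files to the same value.
by_base_name = {short_hash(PurePosixPath(p).name) for p in remote_paths}
# Hashing an identifier that is unique per file keeps them apart.
by_unique_id = {short_hash(p) for p in remote_paths}
```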

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-05 19:12:06 +00:00
Matt Robinson
0e16bf4bf0
enhancement: apply tar filters when using python 3.12 or above (#3124)
### Summary

Applies tar filters when using Python 3.12 or above. This was added to
the [Python `tarfile` library in
3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters)
and guards against malicious content being extracted from `.tar.gz`
files.
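
A minimal sketch of version-gated extraction, assuming the built-in `"data"` filter is the one applied (the summary above does not say which filter the PR uses). The function name `extract_tar` is hypothetical.

```python
import sys
import tarfile


def extract_tar(archive_path: str, dest_dir: str) -> None:
    """Extract a tar archive, applying the "data" filter on Python 3.12+."""
    with tarfile.open(archive_path) as tar:
        if sys.version_info >= (3, 12):
            # The "data" filter rejects entries such as absolute paths or
            # links that point outside the destination directory.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
```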

### Testing

Added smoke test. If this passes for all Python versions, we're good.
2024-06-05 18:28:59 +00:00
Yao You
fdb27378cb
chore: use python3 consistently in makefile (#3152)
This PR changes two `python` commands in `Makefile` to use `python3` to
be consistent with other make commands. This makes it more explicit on
which python to use when the makefile is used outside of a controlled
virtualenv where only one python exists.
2024-06-05 00:05:57 +00:00
Matt Robinson
5203390a4a
build(deps): weekly pip version bump (#3147)
### Summary

Weekly PR to bump dependency versions.
2024-06-04 20:47:04 +00:00
Christine Straub
1dede5029d
fix: parsing pdf error - new_cells as str has no "copy" (#3130)
Closes #3119.

### Testing
Parsing the provided PDF should be successful.


[testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf)
```python
from unstructured.partition.pdf import partition_pdf

filename = "testing_brochure_2.pdf"
with open(filename, "rb") as pdf_content:
    elements = partition_pdf(
        file=pdf_content,
        infer_table_structure=True,
        extract_image_block_types=["Image", "Table"],
        chunking_strategy="by_title",
        max_characters=1000,
        new_after_n_chars=3000,
        combine_text_under_n_chars=1000,
    )
print("\n\n".join([str(el) for el in elements]))
```
0.14.4
2024-06-03 18:49:38 +00:00
Matt Robinson
1b43102762
fix: remove root handlers when they exist (#3128)
### Summary

In some environments, such as Google Colab, loggers have a root handler
that did not mask sensitive values. As a result, secrets such as API
keys appeared in the logs. The PR removes root handlers when they exist
to ensure sensitive values are handled properly.
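
Detaching handlers from the root logger can be sketched with the standard `logging` module. This is a minimal illustration under the assumption that all root handlers should go; the function name is hypothetical and this is not the project's implementation.

```python
import logging


def remove_root_handlers() -> int:
    """Detach every handler on the root logger; return how many were removed."""
    root = logging.getLogger()
    removed = 0
    # Iterate over a copy because removeHandler() mutates root.handlers.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        removed += 1
    return removed
```

With the root handlers gone, records are emitted only by handlers attached to the named logger, so a masking formatter on that logger sees every message.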

### Testing

Run the following in a Colab notebook. You should see two log outputs,
one with the API key masked and one with it exposed.

```
!pip install unstructured
```

```python
import logging
import json

from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

partition_config = PartitionConfig(
        partition_by_api=True,
        api_key="super secret",

    )

from unstructured.ingest.logger import ingest_log_streaming_init
ingest_log_streaming_init(logging.INFO)

logger = logging.getLogger("unstructured.ingest")
logger.setLevel(logging.INFO)

logger.info(
 f"Running partition node to extract content from json files. "
 f"Config: {partition_config.to_json()}, "
)
```

Now replace the first cell with the following and rerun the Python code.
Only the masked logging output should remain.

```
!git clone https://github.com/Unstructured-IO/unstructured.git && cd unstructured && git checkout fix/rm-log-dupes && pip install -e .
```
2024-05-31 22:07:38 +00:00
Matt Robinson
54c1e4e57f
ci: remove jira issue workflow (#3129)
### Summary

Removes the workflow for creating Jira tickets.
2024-05-31 22:00:40 +00:00
Matt Robinson
6005abce79
feat: configure googlevisionapi (#3126)
### Summary

Includes changes from #3117. Merged into a feature branch to run the
full test suite.

Original PR description:

The Google Vision API allows for [configuration of the API
endpoint](https://cloud.google.com/vision/docs/ocr#regionalization), to
select if the data should be sent to the US or the EU. This PR adds an
environment variable (`GOOGLEVISION_API_ENDPOINT`) to configure it.
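
Reading the variable can be sketched as below. The helper name is hypothetical, and how the value is consumed is an assumption: the sketch builds a dictionary with the `api_endpoint` key that Google client libraries accept as client options.

```python
import os
from typing import Dict, Mapping, Optional


def vision_client_options(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build client options from GOOGLEVISION_API_ENDPOINT, if it is set."""
    env = os.environ if env is None else env
    endpoint = env.get("GOOGLEVISION_API_ENDPOINT")
    # An empty dict leaves the client library's default endpoint in place.
    return {"api_endpoint": endpoint} if endpoint else {}
```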

---------

Co-authored-by: JIAQIA <jqq1716@gmail.com>
Co-authored-by: Dimitri Lozeve <dimitri@lozeve.com>
2024-05-31 18:41:04 +00:00
Yuming Long
4a96d54906
chore: move logger error to debug when pdfminer extract fails (#3028)
### Summary

We are seeing logger error `Invalid dictionary construct` for hosted
APIs, move this logger error to debug level - we still continue
partition when pdfminer text extraction fails as before (just don't
throw the log error anymore)

### Test
I was able to reproduce the logger error with an internal only file
(please DM me if needed) and the error trace looks like
```
 File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 709, in _process_pdfminer_pages
    annotation_list = get_uris(page.annots, height, coordinate_system, page_number)
  File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 1049, in get_uris
    resolved_annots = annots.resolve()
...
```
we also won't be able to repair pdf structure on `get_uris` (not a page
level) so move this exception to debug level.
2024-05-31 17:58:36 +00:00
Matt Robinson
865ef496e6
ci: update pinecone test to use serverless (#3127)
### Summary

Closes #3068. Updates the Pinecone connector tests to use serverless
indexes, per the documentation
[here](https://docs.pinecone.io/reference/api/control-plane/create_index).
Also updates the CHANGELOG to mention serverless. Turns out we already
supported it with the client version bump, but it hadn't been tested
yet.

### Testing

See [this CI
job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127)
that passed, running only the Pinecone test.
2024-05-31 15:24:41 +00:00
ryannikolaidis
1f8768750c
chore: add auth to s3 destination test (#3122)
We should be validating the S3 Destination with authenticated requests,
with credentials from a limited test user.

## Changes

- Updates s3 destination test to point to a bucket that requires
authentication.
- Adds authentication to the s3 destination test request
- Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination
- Bonus: fix S3ConnectionConfig never registered for s3 V2 destination
- Bonus: repair version and changelog version for consistency with -dev
convention

## Testing
Validated by changes to S3 destination ingest test
2024-05-31 07:05:09 +00:00
Matt Robinson
23e570fc8a
docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to
https://docs.unstructured.io and cleans up a few sections of the README.
Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is
only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python
3.12.
2024-05-30 16:22:54 +00:00
qued
293901e144
build: pin python-docx (#3110)
Since we incorporate a newer feature from `python-docx`
[here](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L521),
we should make the version of `python-docx` that first supports that
method an explicit requirement.

I didn't pip recompile since our generated dependencies already have
`python-docx==1.1.2`, but I can do that if someone thinks it's
necessary.
2024-05-30 15:08:10 +00:00
Matt Robinson
9acf26ec2e
docs: explicitly replace all old pages with link to new docs (#3118)
### Summary

Explicitly replaces all old docs pages with a link to the new docs. This
was required because 404 redirects didn't work for pages that previously
existed, though they worked for paths that never existed.
2024-05-30 13:01:33 +00:00
Matt Robinson
8415db5112
docs: make 404 pages same as index (#3114)
### Summary

Makes a custom 404 page that's the same as `index.html`, so any path
shows the URL for the new docs.
2024-05-30 07:46:38 -04:00
Steve Canny
f2e67539b1
rfctr: clean MSG partitioner and tests as prep (#3107)
**Summary**
Fix type errors and generally prepare `partition_msg()` and its tests
for refactoring to use `python-oxmsg` library instead of the problematic
`msg_parser` library for partitioning Outlook MSG files.
2024-05-29 21:36:05 +00:00
Matt Robinson
2ecaf5e38c
fix: remove 404 from docs (#3112)
### Summary

Removes 404 from the docs build to avoid rate limiting behavior.
2024-05-29 20:41:32 +00:00
ryannikolaidis
6b5d8a9785
fix: revert dropping of filename extension for some connectors (#3109)
V2 refactor of ingest code introduces the removal of original file
extensions. Since the upgrade of connectors is incomplete this means
that some connectors will remove the original file extension and some
will not. Still TBD whether this is actually something we want at all.

This PR reverts specifically that change in the V2 ingest code so that
original file extension is preserved downstream.

## Testing
CI is passing with filenames updated via `Ingest Test Fixtures Update`
workflow.

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2024-05-29 19:14:22 +00:00
Christine Straub
f4457249a7
fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.
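
The effect of the change can be sketched on a plain string. The sample text and function names are hypothetical; the real fix lives in the code that merges inferred and embedded elements.

```python
def drop_tabs(text: str) -> str:
    # Removing the control character fuses neighbouring words.
    return text.replace("\t", "")


def tabs_to_spaces(text: str) -> str:
    # Replacing it with a space keeps the word boundary.
    return text.replace("\t", " ")


sample = "New\tYork\tStock\tExchange"
print(drop_tabs(sample))       # NewYorkStockExchange
print(tabs_to_spaces(sample))  # New York Stock Exchange
```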

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

print(str(elements[20]))
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
0.14.3
2024-05-29 04:53:17 +00:00