1109 Commits

Author SHA1 Message Date
Benjamin Torres
5d193c8e5a
fix/bad formed formula (#1481)
@ron-unstructured reported that loading files with:

```
from unstructured.partition.pdf import partition_pdf

elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```

Throws an error. After debugging the execution, I found that an object
of class Formula is being created, but this class doesn't define an
__init__ method. This PR solves the issue by adding a constructor that
defaults the element's text to an empty string.
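
The failure mode can be sketched as follows. This is a minimal illustration, not the library's actual code: `Text` and `Formula` here are simplified stand-ins, and the inheritance layout is an assumption.

```python
class Text:
    """Simplified stand-in for a text element; not the library's real class."""

    def __init__(self, text: str):
        self.text = text


class Formula(Text):
    """Hypothetical fix: default to an empty string so Formula() can be built."""

    def __init__(self, text: str = ""):
        super().__init__(text)


# Without the constructor above, Formula() would raise a TypeError because
# the parent constructor requires a `text` argument.
formula = Formula()
```

With the default in place, code that instantiates `Formula` with no arguments no longer fails.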

The file can be found at:

https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing

After this PR is merged, this file is processed correctly.
2023-09-23 02:36:22 +00:00
ryannikolaidis
48c52365dd
build(test): disable airtable-large ingest test (#1509) 2023-09-23 02:00:01 +00:00
Trevor Bossert
961223da2a
Chore: Update readme to using new download location to track download metrics (#1507)
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
`docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest`

There should be no additional steps needed.
2023-09-22 17:30:37 -07:00
ryannikolaidis
ca01b30c07
ci: more reliable release version alerts (#1479) 2023-09-22 21:19:26 +00:00
Trevor Bossert
e8dfbfdbe5
Add notification that we will be utilizing scarf for docker and python downloads (#1503)
We've created a custom domain, downloads.unstructured.io that redirects
to quay.io
(using https://scarf.sh/). This custom domain allows us to swap the
underlying container registry without impacting users. It also provides
us with important metrics about container and package usage, without
surfacing PII
like IP addresses.

Python package follows the same pattern at packages.unstructured.io
2023-09-22 12:59:58 -07:00
ryannikolaidis
955efac935
fix: SharePoint connector fails if any document has an unsupported filetype (#1493) 2023-09-22 18:47:28 +00:00
Trevor Bossert
3e04110bab
Chore: Pin unstructured-inference in extra-pdf-image (#1474)
This is so users are able to upgrade it when the unstructured library is
updated.
2023-09-22 09:41:53 -07:00
Christine Straub
2d951722df
Feat/1332 save embedded images in pdf (#1371)
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing


```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
```
2023-09-22 09:16:03 +00:00
cragwolfe
92ad7698fb
build(test): ignore notion ingest test failures for now (#1496)
There is a fix in progress here:
https://github.com/Unstructured-IO/unstructured/pull/1492 , but let's
see proven stability of a few days before allowing notion ingest test
failures to block CI.
2023-09-22 07:19:21 +00:00
Roman Isecke
e88f7d9eab
chore: ingest test file cleanup (#1366) 2023-09-21 11:51:08 -07:00
Ahmet Melek
9e88929a8c
feat: document embeddings (#1368)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements EmbeddingEncoder classes which track embedding related data
- implements embed_documents method which receives a list of Elements,
obtains embeddings for the text within Elements, updates the Elements
with an attribute named embeddings , and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.

To test the changes, run `examples/embed/example.py`
2023-09-20 19:55:30 +00:00
ryannikolaidis
7a3828d292
chore: fix changelog (#1469)
Fix an earlier merge that left the Tesseract enhancement entry under a
duplicated 0.10.15 section.
2023-09-20 09:07:36 -07:00
rvztz
424852ab39
feat: adds data source properties to Sharepoint and Outlook (#1278) 2023-09-20 09:13:35 +00:00
Ryan Nikolaidis
8c1d03e5cf update slack invite 2023-09-20 00:02:03 -07:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Amanda Cameron
e359afafbe
fix: coordinates bug on pdf parsing (#1462)
Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460

Previously, invalid coordinates raised an error, which prevented us from
returning the element and continuing to parse the PDF. Now, instead of
raising the error, we return early.
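
A minimal sketch of the return-early pattern, assuming a hypothetical helper named `points_or_none` (the name, signature, and validity check are illustrative and not taken from the library):

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[Optional[float], Optional[float]]


def points_or_none(points: Optional[Sequence[Point]]) -> Optional[Tuple[Point, ...]]:
    """Return the points as a tuple, or None when they are missing or malformed.

    Hypothetical helper: returning early lets the caller keep the element
    (without coordinates) instead of aborting the whole document.
    """
    if points is None:
        return None
    for point in points:
        if len(point) != 2 or any(value is None for value in point):
            return None  # this kind of input is where an error used to be raised
    return tuple(points)
```

The caller can then attach coordinates only when the helper returns a value, and continue with the next element otherwise.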

To test:
```
from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
0.10.16
2023-09-19 19:25:31 -07:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
rvztz
9a3e24fcbb
Adds data source properties to elasticsearch, wikipedia and google-drive (#1282) 2023-09-19 20:25:38 +00:00
rvztz
92e18c3f58
feat: adds data source properties to airtable, confluence and discord (#1283) 2023-09-19 18:05:27 +00:00
Yuming Long
f962a1e57d
fix: fix ingest paddle hanging issue (#1441)
## Summary

Ingest tests are hitting a paddle OOM issue that causes the tests to
hang forever. The fix here is to remove paddle from CI and set both OCR
environment variables, `TABLE_OCR` and `ENTIRE_PAGE_OCR`, to
`tesseract`. (A follow-up PR will investigate why this is failing.)

## Test
Please check the ingest tests in CI.
2023-09-19 17:20:23 +00:00
shreyanid
eb8ce89137
chore: function to map between standard and Tesseract language codes (#1421)
### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.

### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.
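
The expansion described above can be sketched roughly as follows. The tables are small hand-written excerpts made up for illustration, and `to_tesseract_langs` is a hypothetical name; the actual change relies on the python-iso639 package rather than a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical excerpt: base language key -> Tesseract script/orthography variants.
VARIANTS: Dict[str, List[str]] = {
    "eng": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}
# Hypothetical aliases from two-letter codes to the base keys used above.
ALIASES: Dict[str, str] = {"en": "eng", "zh": "chi"}
KNOWN_TESSERACT_CODES = {code for codes in VARIANTS.values() for code in codes}


def to_tesseract_langs(languages: List[str]) -> str:
    """Build a '+'-joined Tesseract language string from mixed input codes."""
    resolved: List[str] = []
    for lang in languages:
        lang = lang.strip().lower()
        if lang in KNOWN_TESSERACT_CODES:
            candidates = [lang]  # already Tesseract-specific: keep as given
        else:
            base = ALIASES.get(lang, lang)
            candidates = VARIANTS.get(base, [])  # unknown codes are dropped here
        for code in candidates:
            if code not in resolved:
                resolved.append(code)
    return "+".join(resolved)
```

With these excerpt tables, `to_tesseract_langs(["en", "chi"])` returns `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert`, while a Tesseract-specific code such as `chi_sim` is passed through unchanged.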

### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any lang codes in the languages parameter (Tesseract or standard).

For example,
```
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
This should supply `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert` to Tesseract.
2023-09-18 08:42:02 -07:00
qued
3a07d1e6b4
chore: Fix typos in changelog (#1442) 2023-09-18 10:39:36 -05:00
Amanda Cameron
a9f18eddb8
chore: adding test case for odt tables (#1434)
ODT table extraction is happening! Just added to an existing example-doc
and an accompanying test case.
2023-09-16 22:29:44 -07:00
Yao You
b534b2a6cd
Chore: bump inference package version to 0.5.28 and new release (#1355)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.

---------

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
0.10.15
2023-09-15 18:26:15 -07:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like shown in ticket

Run the same process on your local machine (not inside docker) with same
test to verify the non aarch64 paddlepaddle got installed correctly

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00
cragwolfe
36d026cb1b
chore: update CHANGELOG.md bullets (#1436)
add "why does it matter" for a couple of bullets
2023-09-15 16:52:01 -07:00
John
6187dc0976
update links in integrations.rst (#1418)
A number of the links in integrations.rst don't seem to lead to the
intended section in the unstructured documentation.

For example:
```See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details```

It seems this link should direct to here instead: https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate
2023-09-15 16:50:55 -07:00
Roman Isecke
333558494e
roman/delta lake dest connector (#1385)
### Description
Add delta table downstream destination connector

Closes https://github.com/Unstructured-IO/unstructured/issues/1415
2023-09-15 22:13:39 +00:00
cragwolfe
98d3541909
Update CHANGELOG.md (#1435)
Update a bullet to reflect: What was the problem? What was fixed? Why
does it matter?
2023-09-15 15:26:49 -05:00
John
de4d496fcf
Fix bbox coordinates for ocr_only strategy (#1325)
### Summary
Duplicate PR of #1259 because of issues with checks
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` into new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (so as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes but that are not present in
`pytesseract.image_to_string`).

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
    print(i, element.metadata.coordinates)
    if element.metadata.coordinates:
        draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-09-15 15:11:16 -05:00
qued
0d61c98481
fix: Pass partition_image kwargs downstream (#1426)
`partition_pdf` allows for passing a `model_name` parameter. Given the
similarity between the image and PDF pipelines, the expected behavior is
that `partition_image` should support the same parameter, but
`partition_image` was unintentionally not passing along its `kwargs`.
This was corrected by adding the kwargs to the downstream call.
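
The shape of the fix can be illustrated with two stub functions. Both names are hypothetical stand-ins for the real partition functions, and the `"default"` fallback is an assumption made for the sketch.

```python
from typing import Any, Dict


def partition_pdf_or_image_stub(filename: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the shared downstream call."""
    return {"filename": filename, "model_name": kwargs.get("model_name", "default")}


def partition_image_stub(filename: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical wrapper: forwarding **kwargs is the whole fix.

    If the wrapper accepted **kwargs but did not pass them on, model_name
    would silently fall back to the default downstream.
    """
    return partition_pdf_or_image_stub(filename, **kwargs)
```

Forwarding the keyword arguments means options such as `model_name` reach the shared pipeline without the wrapper having to list each one.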

#### Testing:

```python
from unstructured.partition.image import partition_image

output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx")
output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox")

# These shouldn't be the same, since they were produced using different models.
assert output1 != output2

```
The assertion should fail on `main`, but pass on this branch.
2023-09-15 15:09:58 -05:00
Sebastian Laverde Alfonso
fe11ab4235
feat: improved mapping for missing chipper elements (#1431)
This PR updates
[TYPE_TO_TEXT_ELEMENT_MAP](bd33a52ee0/unstructured/documents/elements.py (L551))
with additional mapping for `chipper` elements:

```
"Threading": NarrativeText,
"Form": NarrativeText,
"FieldName": Title,
"Value": NarrativeText,
"Link": NarrativeText,
```
2023-09-15 20:05:40 +00:00
Amanda Cameron
50db2abd9f
fix: updating element types (#1394)
This PR adds an arg called `source_format` to the html partition flow.
If it is anything other than "html", we return non-HTML elements to
conform with the file type we received.

addresses: https://github.com/Unstructured-IO/unstructured/issues/726
2023-09-15 11:51:22 -05:00
Sebastian Laverde Alfonso
40b1d0d092
feat: improved chipper elements mapping and new category_depth metadata (#1308)
Two changes:
1. Improved mapping of `chipper` element types `Headline` (to `Title`),
`Subheadline` (to `Title`), and `Abstract` (to `NarrativeText`).
2. New element metadata `category_depth`: `None` unless the element is a
`Headline` (`category_depth=1`) or a `Subheadline` (`category_depth=2`).
`category_depth` is updated during the `normalize_layout_element`
transform.
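
The two changes can be summarized in a small lookup. This is a sketch only: `CHIPPER_MAP` and `normalize_chipper_type` are hypothetical names, element types are represented as plain strings, and the fallback for unknown types is an assumption.

```python
from typing import Dict, Optional, Tuple

# (mapped element type, category_depth) per chipper type, as described above.
CHIPPER_MAP: Dict[str, Tuple[str, Optional[int]]] = {
    "Headline": ("Title", 1),
    "Subheadline": ("Title", 2),
    "Abstract": ("NarrativeText", None),
}


def normalize_chipper_type(chipper_type: str) -> Tuple[str, Optional[int]]:
    """Hypothetical helper returning the mapped type and its depth.

    The fallback for unknown types is an assumption made for this sketch.
    """
    return CHIPPER_MAP.get(chipper_type, ("UncategorizedText", None))
```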

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjamin@unstructured.io>
2023-09-15 14:43:17 +00:00
ryannikolaidis
ad69d93d53
ci: add new release version alert (#1413) 2023-09-15 07:05:00 +00:00
rvztz
3be9f089b3
feat: adds data source properties to fsspec-based connectors (#1279) 2023-09-15 05:56:44 +00:00
Yao You
a5ca628f22
[CORE-1741] use forked pytesseract to reduce calls to tesseract (#1298)
This PR resolves
[CORE-1741](https://unstructured-ai.atlassian.net/browse/CORE-1741) by
using a new function `pytesseract.run_and_get_multiple_output`, see
forked repo for more details:
https://github.com/Unstructured-IO/unstructured.pytesseract/releases/tag/0.3.11-dev1

This halves the number of calls to `tesseract` per page of PDF/image
during partition, reducing the runtime by roughly 48%.

The new function lives in the forked `unstructured.pytesseract`. A PR
has been made to the upstream repo; once it is merged we should switch
to the upstream version. For now we add a new dependency:
`unstructured.pytesseract`.

## testing

Existing unit tests should serve as tests for the new function.

To demonstrate the changes in performance:
- checkout main
- run `./scripts/performance/profile.sh` and select `ocr_only` strategy,
using the 10th document (16 page layout paper in pdf format)
- examine the speedscope profile or time profile in flamegraph -> the
two dominant time spenders should be `pytesseract.image_to_string` and
`pytesseract.image_to_boxes`, with about the same total time each (see
attached first image)
- checkout this branch
- run the same `profile.sh` with the same options
- examine the profile again; this time you should notice that 1) total
runtime is reduced by more than 40%; 2)
`unstructured_pytesseract.run_and_get_multiple_output` is the only top
time spender, and its total time is about the same as either the
`pytesseract.image_to_string` or `pytesseract.image_to_boxes` time (see
second image below)

![Screenshot 2023-09-06 at 9 45 10
AM](https://github.com/Unstructured-IO/unstructured/assets/647930/fed6118b-a0dc-493d-bef8-85d73027c968)

![Screenshot 2023-09-06 at 9 46 37
AM](https://github.com/Unstructured-IO/unstructured/assets/647930/dd1d6369-cfba-43d4-b1c6-87a8a98b2e16)

[CORE-1741]:
https://unstructured-ai.atlassian.net/browse/CORE-1741?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-09-14 23:27:18 +00:00
cragwolfe
8f60784178
chore: update CHANGELOG.md (#1420)
Move to a new CHANGELOG.md convention to more fully describe changes.
Bullets should address: what was broken? what was fixed? why does it
matter?

To assist with scanning changes, the first sentence in each bullet is in
**bold**.

Note: it's also worth looking at the rendered markdown in the branch:
https://github.com/Unstructured-IO/unstructured/blob/crag/changelog-tweak/CHANGELOG.md
rather than just the git diff.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
2023-09-14 12:53:30 -07:00
qued
3655a752bc
docs: clearer message for sentence count skip (#1410)
Related to #744. In the `sentence_count` function, there is a parameter
that sets a threshold for a minimum word count for a sentence to be
"counted" as a sentence. When a sentence is skipped in the count because
it doesn't meet the minimum word count, the log message is potentially
misleading as it mentions "skipping" the sentence. Without the above
context, it could be interpreted that the sentence is being skipped in
the partitioning process, which is not the case.

This PR is to reword the log message to make the situation clearer.
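
To make the distinction concrete, here is a simplified sketch of a sentence counter with a minimum word threshold. It is not the library's implementation: `count_sentences`, `min_words`, and the regex-based splitting are assumptions chosen for illustration.

```python
import re
from typing import Optional


def count_sentences(text: str, min_words: Optional[int] = None) -> int:
    """Hypothetical simplification: count sentences with at least min_words words."""
    count = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        if min_words is not None and len(sentence.split()) < min_words:
            # Excluded from the count only; the text itself is left untouched.
            continue
        count += 1
    return count
```

A short sentence that falls below the threshold lowers the count but is never removed from the text being partitioned.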

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-14 12:33:10 -05:00
Newel H
cd704e873b
Feat: Create a naive hierarchy for elements (#1268)
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.

### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets an element's parent id if it has no existing parent id or if it
matches the ruleset

### How it works

Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:

1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
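
A naive version of this assignment can be sketched as below. `Node` and `assign_parents` are hypothetical names, and the sketch deliberately ignores the edge cases that the real algorithm handles.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical minimal element: an id, a category, and a parent id."""

    id: str
    category: str
    parent_id: Optional[str] = None


def assign_parents(nodes: List[Node], ruleset: Dict[str, List[str]]) -> List[Node]:
    """Set parent_id from the latest preceding node of a matching parent category.

    Rules are tried in ruleset order and the first match wins; nodes that
    already have a parent_id are left alone.
    """
    last_seen: Dict[str, Node] = {}  # most recent node per parent category
    for node in nodes:
        if node.parent_id is None:
            for parent_category, children in ruleset.items():
                parent = last_seen.get(parent_category)
                if parent is not None and node.category in children:
                    node.parent_id = parent.id
                    break
        if node.category in ruleset:
            last_seen[node.category] = node
    return nodes
```

Applied to the two elements above with the ruleset `{"title": ["text"]}`, the text node receives the title node's id as its `parent_id`, and the title node keeps `None`.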

### Schema Changes

```
@dataclass
class ElementMetadata:
      coordinates: Optional[CoordinatesMetadata] = None
      data_source: Optional[DataSourceMetadata] = None
      filename: Optional[str] = None
      file_directory: Optional[str] = None
      last_modified: Optional[str] = None
      filetype: Optional[str] = None
      attached_to_filename: Optional[str] = None
+     parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+     category_depth: Optional[int] = None

...
```

### Testing
```
from unstructured.partition.auto import partition

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

for element in elements:
    print(
        f"Category:  {getattr(element, 'category', '')}\n"
        f"Text:      {getattr(element, 'text', '')}\n"
        f"ID:        {element.id}\n"
        f"Parent ID: {element.metadata.parent_id}\n"
        f"Depth:     {element.metadata.category_depth}\n"
    )
```

### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.

I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
Ahmet Melek
bd33a52ee0
fix: coordinates metadata hinders chunking (#1374)
Closes https://github.com/Unstructured-IO/unstructured/issues/1373

This PR: 
- drops the `coordinates` metadata field in `chunk_by_title` to fix
https://github.com/Unstructured-IO/unstructured/issues/1373 (read issue
for the details)
- adds relevant test that checks the particular case
2023-09-14 10:10:03 +00:00
Ronny H
f1364594ad
Docs models (#1412)
This PR adds documentation of models supported by the `Unstructured`
tool. The changes reflect the tool's capabilities, usage examples, and
the process for integrating custom models.

Sections:
- Detailed the basic usage of the `Unstructured` partition with the
model name.
- Provided a list of available models in the `Unstructured` partition.
- Added instructions on using non-default models via three distinct
methods.
- Explained leveraging models from the LayoutParser's model zoo with
`UnstructuredDetectronModel`.
- Guided users in integrating their custom object detection models using
the `UnstructuredObjectDetectionModel` class.

Tested the docs build with:
> cd docs
> pip install -r requirements.txt
> make html
2023-09-13 23:37:31 -07:00
Yao You
12d7628b10
update constraints to pin weaviate during ci (#1408)
This PR ensures the version of `weaviate` is consistent in CI testing.
The latest version (3.24.1) is not compatible with our test needs; the
last version that ran successfully in CI is 3.23.2.
2023-09-13 23:19:20 +00:00
Klaijan
00181b88df
feat: pdf auto strategy groups broken numbered and bullet list items(#1393)
**Summary**
Adds logic to combine broken numbered list items for the pdf fast strategy.

**Details**
Previously, the numbered list items in the
`layout-parser-paper-fast.pdf` file were read as:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

Now it reads:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

The added logic leverages `ElementType` and `coordinates` to determine
whether the following line is part of the previously detected
`ListItem`.
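
As a rough illustration of the idea, the sketch below merges a line into the preceding list item when it has no list marker and its left edge is close to the item's. Everything here is an assumption made for the example: the `Line` type, the marker regex, the `tolerance` value, and the use of a single x coordinate in place of full coordinates metadata. The actual rule in the PR may differ.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Numbered ("1.") or bulleted ("-", "*", bullet character) list markers.
LIST_MARKER = re.compile(r"^\s*(\d+\.|[-*\u2022])\s+")


@dataclass
class Line:
    text: str
    x: float  # left edge of the line; hypothetical stand-in for coordinates


def merge_list_items(lines: List[Line], tolerance: float = 20.0) -> List[str]:
    """Join continuation lines onto the list item they follow."""
    merged: List[str] = []
    item_x: Optional[float] = None  # left edge of the list item being built
    for line in lines:
        if LIST_MARKER.match(line.text):
            merged.append(line.text)
            item_x = line.x
        elif item_x is not None and abs(line.x - item_x) <= tolerance:
            merged[-1] = f"{merged[-1]} {line.text}"
        else:
            merged.append(line.text)
            item_x = None
    return merged
```

Lines that start a new marker always begin a new item, and a line far from the item's left edge ends the current item.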

**Test**
Adds a test that checks that the number of elements is lower than in
the original version with the broken numbered list. The test also checks
whether the first detected numbered list item ends with the previously
broken line.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-13 21:30:06 +00:00
shreyanid
d87c83d7b6
chore: refactor languages parameter for text_type functions (#1399)
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings. This change continues
the refactor into the functions that use language checks to identify
potential element types such as NarrativeText and Title.

### Details
Replaces `language` with `languages` (a list of strings) as a parameter
to `is_possible_narrative_text` and `is_possible_title`.


### Test
Call `is_possible_narrative_text` and `is_possible_title` with text in a
variety of languages and different inputs for `languages`. The resulting
element classifications should be no different from the current outputs.

ex: see `test_text_type_handles_multi_language_examples` in
`test_unstructured/partition/test_text_type.py`.
2023-09-13 19:46:36 +00:00
shreyanid
1b7c99d878
chore: refactor languages parameter for auto partition (#1400)
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.

### Details
Follows the pattern established with PDFs in #1334. Adds languages (a
list of strings) as a parameter to partition in auto.py. Marks
ocr_languages for deprecation.
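
A common way to handle such a deprecation is sketched below. `resolve_languages` is a hypothetical helper, and both the precedence rule (an explicit `languages` wins) and the `["eng"]` default are assumptions rather than documented behavior.

```python
import warnings
from typing import List, Optional


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: prefer `languages`, accept deprecated `ocr_languages`."""
    if ocr_languages is not None:
        warnings.warn(
            "ocr_languages is deprecated; use languages instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if languages is None:
            # Tesseract-style strings join codes with "+", e.g. "eng+deu".
            languages = ocr_languages.split("+")
    # Assumed default for this sketch when nothing is supplied.
    return languages if languages is not None else ["eng"]
```

Callers that still pass `ocr_languages` keep working but see the warning, which gives them time to migrate.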

### Test
Call partition with a variety of filetypes (especially pdfs/images),
strategies, languages, or ocr_languages.
- passing ocr_languages as a parameter should display a deprecation
warning and may proceed with partitioning if there are no other conflicts
- the other valid call outputs should be no different from the current
outputs
2023-09-13 13:07:28 -04:00
shreyanid
2b571eb9a3
chore: refactor languages parameter for image partition functions (#1395)
Adds languages (a list of strings) as a parameter to `partition_image`. Marks ocr_languages for deprecation.
2023-09-13 04:11:58 +00:00
Amanda Cameron
7fd81dc7df
Table processing test for RTF (#1388)
This PR does two things:
1. Adds test cases (and alters sample docs) for rtf and epub files with
tables
2. Adds `xls/x` file extension to `skip_infer_table_types` default list

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-09-12 18:27:05 -07:00
shreyanid
791adf459d
stop printing all commands in version-sync script (#1390)
### Summary

Remove -x in version-sync script to stop printing all commands and
arguments and improve readability.

### Test

`make check` and `make check-version` no longer print all the commands
and arguments.

(unstructured) shreyanid@Shreyas-MBP-2 unstructured % make check-version 
scripts/version-sync.sh -c \
                -f "unstructured/__version__.py" semver
From github.com:Unstructured-IO/unstructured
 * branch              main       -> FETCH_HEAD
version sync would make no changes to unstructured/__version__.py.
2023-09-12 15:05:26 -07:00
qued
6595632a57
enhancement: backup text categorization (#1322)
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway either because it
was embedded in the PDF, or we found it using OCR.

This commit attempts to categorize these uncategorized elements using
our text-based classification function, `element_from_text`.
2023-09-12 20:32:48 +00:00