This refactor solves a problem or two, the big one being recursing into
group-shapes to get all shapes on the slide, but mostly lays the
groundwork to allow us to refine further aspects such as list-item
detection, off-slide shape detection, and image-capture going forward.
### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user-supplied or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Representing simplified vs. traditional Chinese scripts internally
requires a non-standard code; this will be revisited in a separate PR.
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization
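
The normalization step can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the lookup table `_TO_ISO_639_3` and the helper `to_iso639_3` are hypothetical names, and a real implementation would rely on a complete ISO 639 dataset rather than a hand-written table.

```python
from typing import Iterable, List

# Hypothetical, deliberately tiny lookup table (assumption: illustration only).
# Keys are ISO 639-1 or ISO 639-3 codes; values are ISO 639-3 codes.
_TO_ISO_639_3 = {
    "en": "eng", "eng": "eng",
    "pl": "pol", "pol": "pol",
    "cs": "ces", "ces": "ces",
    "sk": "slk", "slk": "slk",
}


def to_iso639_3(codes: Iterable[str]) -> List[str]:
    """Map language codes to ISO 639-3, dropping unknown codes and duplicates."""
    normalized: List[str] = []
    for code in codes:
        iso3 = _TO_ISO_639_3.get(code.strip().lower())
        if iso3 is not None and iso3 not in normalized:
            normalized.append(iso3)
    return normalized


print(to_iso639_3(["cs", "pl", "sk", "pol"]))  # ['ces', 'pol', 'slk']
```

Unknown codes are silently dropped here; whether to drop, warn, or raise is a design choice this sketch does not settle.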
### Test
You can test the `detect_languages` function on its own by importing it
and passing a text sample and, optionally, a language:
```
# Import path assumed from the package layout; adjust if it differs.
from unstructured.partition.lang import detect_languages

text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This updates the Docker image download URL to pass through the Scarf
gateway, which allows anonymous tracking of downloads.
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics
Testing:
`docker pull downloads.unstructured.io/unstructured-io/unstructured:latest`
Result:
Image should download
* Partitions Salesforce data as XML instead of text for improved detail and flexibility
* Partitions htmlbody instead of textbody for Salesforce emails
### Description
A new [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector has been added. It writes each JSON element from the
JSON files produced by partitioning to an index.
**Bonus bug fix:** The default version of Python used in the repo was
recently bumped from `3.8` to `3.10`, so `pip-compile` now resolves
dependencies against `3.10` rather than the lowest version we support,
which is still `3.8`. This breaks the setup for those lower versions
because some of the versions pulled in by `pip-compile` exist for `3.10`
but not `3.8`. `pip-compile` was updated to run as a script that first
checks the version of Python being used, which helps guarantee that all
dependencies meet the minimum Python version requirement.
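
A rough sketch of such a version guard is shown below. It is an illustration only: the helper name `is_expected_python` is hypothetical, and the sketch assumes the check requires an exact match with the lowest supported version (`3.8`), which may differ from what the actual script does.

```python
import sys
from typing import Tuple

# Assumption: dependencies should be pinned against the lowest supported version.
REQUIRED_VERSION: Tuple[int, int] = (3, 8)


def is_expected_python(
    current: Tuple[int, int], required: Tuple[int, int] = REQUIRED_VERSION
) -> bool:
    """Return True only when (major, minor) matches the version pins should target."""
    return tuple(current[:2]) == required


if __name__ == "__main__":
    current = (sys.version_info.major, sys.version_info.minor)
    if is_expected_python(current):
        print("Python version OK; safe to run pip-compile.")
    else:
        print(
            f"Refusing to compile: running {current[0]}.{current[1]}, "
            f"expected {REQUIRED_VERSION[0]}.{REQUIRED_VERSION[1]}."
        )
```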
Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
@ron-unstructured reported that loading files with:
```
from unstructured.partition.pdf import partition_pdf
elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```
throws an error. After debugging the execution, I found that an object of
class `Formula` is being created, but this class doesn't define an
`__init__` method. This PR fixes the issue by adding a constructor that
defaults the element's text to an empty string.
The file can be found at:
https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing
After this PR is merged, this file is processed correctly.
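
The shape of the fix can be sketched as below. The `Text` base class here is a hypothetical stand-in, not the library's actual element class; the point is only that the constructor supplies an empty-string default so the element can be created without text.

```python
class Text:
    """Hypothetical stand-in for a text element base class."""

    def __init__(self, text: str):
        self.text = text


class Formula(Text):
    """Element for formulas; may be created before any text is available."""

    def __init__(self, text: str = ""):
        # Default to an empty string so `Formula()` works without arguments.
        super().__init__(text)


formula = Formula()
print(repr(formula.text))  # ''
```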
We've created a custom domain, downloads.unstructured.io, that redirects
to quay.io (using https://scarf.sh/). This custom domain allows us to
swap the underlying container registry without impacting users. It also
provides us with important metrics about container and package usage,
without surfacing PII like IP addresses.
The Python package follows the same pattern at packages.unstructured.io.
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
    from unstructured.partition.pdf import partition_pdf

    f_path = "example-docs/embedded-images.pdf"
    strategy = "hi_res"  # assumption: image extraction needs the hi_res strategy

    # default image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
    )

    # specific image output directory
    elements = partition_pdf(
        f_path,
        strategy=strategy,
        extract_images_in_pdf=True,
        image_output_dir_path="<directory path>",
    )
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements `EmbeddingEncoder` classes, which track embedding-related data
- implements an `embed_documents` method, which receives a list of Elements,
obtains embeddings for the text within the Elements, updates the Elements
with an attribute named `embeddings`, and returns the updated Elements
- uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.
To test the changes, run `examples/embed/example.py`
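
For orientation, the `embed_documents` flow described above can be sketched without any external dependency. Everything here is hypothetical (`Element`, `ToyEmbeddingEncoder`, and `toy_embed` are illustrative names); the real module obtains embeddings through langchain, which this sketch replaces with a trivial placeholder function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    """Hypothetical minimal element: text plus an optional embedding vector."""

    text: str
    embeddings: Optional[List[float]] = None


class ToyEmbeddingEncoder:
    """Hypothetical encoder wrapping any text -> vector function."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        # Attach an embedding to each element and return the same elements.
        for element in elements:
            element.embeddings = self.embed_fn(element.text)
        return elements


def toy_embed(text: str) -> List[float]:
    # Placeholder "model": character count and word count as a 2-d vector.
    return [float(len(text)), float(len(text.split()))]


encoder = ToyEmbeddingEncoder(toy_embed)
elements = encoder.embed_documents([Element("hello world"), Element("hi")])
print([el.embeddings for el in elements])  # [[11.0, 2.0], [2.0, 1.0]]
```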
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.
This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
## Summary
Ingest tests are hitting a paddle OOM issue that causes the tests to hang
forever. The fix here is to remove paddle from CI and set both OCR
environment variables, `TABLE_OCR` and `ENTIRE_PAGE_OCR`, to `tesseract`.
(A follow-up PR will investigate why this is failing.)
## Test
Please check the ingest tests in CI.
### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.
### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.
### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any language codes in the `languages` parameter (Tesseract or standard).
For example:
```
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
This should supply `eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert` to Tesseract.
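
The expansion from standard codes to Tesseract variants can be sketched as below. This is an illustration under stated assumptions: the tables `TESSERACT_CODES` and `STANDARD_TO_TESSERACT` and the helper `to_tesseract_langs` are hypothetical and cover only the codes from the example, whereas the PR itself relies on `python-iso639` for matching.

```python
from typing import Dict, List

# Hypothetical lookup tables (assumption: illustration only).
# Codes Tesseract understands directly:
TESSERACT_CODES = {"eng", "chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"}
# Standard codes mapped to every Tesseract variant of that language:
STANDARD_TO_TESSERACT: Dict[str, List[str]] = {
    "en": ["eng"],
    "chi": ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"],
}


def to_tesseract_langs(languages: List[str]) -> str:
    """Build the '+'-joined language string passed to Tesseract."""
    resolved: List[str] = []
    for lang in languages:
        if lang in TESSERACT_CODES:
            variants = [lang]  # already Tesseract-specific: use as given
        else:
            variants = STANDARD_TO_TESSERACT.get(lang, [])  # expand to all variants
        for variant in variants:
            if variant not in resolved:
                resolved.append(variant)
    return "+".join(resolved)


print(to_tesseract_langs(["en", "chi"]))  # eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert
```

A code that is already Tesseract-specific, such as `chi_sim`, is passed through unchanged rather than expanded.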
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.
---------
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Testing instructions on Apple silicon:
```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR:
https://unstructured-ai.atlassian.net/browse/CORE-1269
You should get output like that shown in the ticket.
Run the same process on your local machine (not inside Docker) with the
same test to verify that the non-aarch64 paddlepaddle was installed correctly.
---------
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
### Summary
Duplicate PR of #1259 because of issues with checks
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` into new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (to avoid the `~` characters that
`pytesseract.image_to_boxes` includes but that are not present in the
output of `pytesseract.image_to_string`).
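
The first-character check described above can be sketched as follows. This is a simplified illustration, not the implementation in `_get_element_box`: the `Box` tuple layout and the helper `find_first_matching_box` are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical box layout: (character, x1, y1, x2, y2).
Box = Tuple[str, int, int, int, int]


def find_first_matching_box(boxes: List[Box], text: str, start: int = 0) -> Optional[int]:
    """Return the index of the first box at or after `start` whose character
    equals the first non-space character of `text`, or None if there is none."""
    stripped = text.strip()
    if not stripped:
        return None
    for index in range(start, len(boxes)):
        if boxes[index][0] == stripped[0]:
            return index
    return None


# A stray "~" box precedes the boxes that belong to the text "Hi".
boxes = [("~", 0, 0, 1, 1), ("H", 10, 10, 20, 30), ("i", 21, 10, 25, 30)]
print(find_first_matching_box(boxes, "Hi"))  # 1
```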
### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw
filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
    print(i, element.metadata.coordinates)
    if element.metadata.coordinates:
        draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
`partition_pdf` allows for passing a `model_name` parameter. Given the
similarity between the image and PDF pipelines, the expected behavior is
that `partition_image` should support the same parameter, but
`partition_image` was unintentionally not passing along its `kwargs`.
This was corrected by adding the kwargs to the downstream call.
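
The pattern behind the fix, sketched with hypothetical stub functions (these are not the library's real signatures):

```python
from typing import Any, Dict


def partition_pdf_or_image_stub(
    filename: str, is_image: bool = False, **kwargs: Any
) -> Dict[str, Any]:
    """Hypothetical downstream function; records which options it received."""
    return {
        "filename": filename,
        "is_image": is_image,
        "model_name": kwargs.get("model_name"),
    }


def partition_image_stub(filename: str, **kwargs: Any) -> Dict[str, Any]:
    # Forward **kwargs so options such as model_name reach the downstream call.
    return partition_pdf_or_image_stub(filename, is_image=True, **kwargs)


print(partition_image_stub("page.jpg", model_name="yolox")["model_name"])  # yolox
```

Without the `**kwargs` in the inner call, `model_name` would be accepted by the wrapper and then silently dropped, which matches the behavior reported here.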
#### Testing:
```python
from unstructured.partition.image import partition_image
output1 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="detectron2_onnx")
output2 = partition_image("example-docs/layout-parser-paper-fast.jpg", model_name="yolox")
# These shouldn't be the same, since they were produced using different models.
assert output1 != output2
```
The assertion should fail on `main`, but pass on this branch.
This PR adds an argument called `source_format` to the HTML partition
flow. If it is anything other than "html", we return non-HTML elements to
conform with the file type we received.
addresses: https://github.com/Unstructured-IO/unstructured/issues/726
Two changes:
1. Improved mapping of `chipper` element types `Headline` (to `Title`),
`Subheadline` (to `Title`), and `Abstract` (to `NarrativeText`).
2. New element metadata `category_depth`: `None` unless the element is a
`Headline` (`category_depth=1`) or a `Subheadline` (`category_depth=2`).
The update of `category_depth` happens during the transform
`normalize_layout_element`.
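
The two changes above amount to a small lookup, sketched here. The table and the helper `normalize_chipper_type` are hypothetical names; passing unmapped types through unchanged with a depth of `None` is an assumption of this sketch, not something stated in the PR.

```python
from typing import Dict, Optional, Tuple

# Mapping described above: chipper type -> (element type, category_depth).
CHIPPER_TYPE_MAP: Dict[str, Tuple[str, Optional[int]]] = {
    "Headline": ("Title", 1),
    "Subheadline": ("Title", 2),
    "Abstract": ("NarrativeText", None),
}


def normalize_chipper_type(chipper_type: str) -> Tuple[str, Optional[int]]:
    """Return (element type, category_depth) for a chipper element type.

    Assumption: types not in the table keep their name and get depth None.
    """
    return CHIPPER_TYPE_MAP.get(chipper_type, (chipper_type, None))


print(normalize_chipper_type("Subheadline"))  # ('Title', 2)
```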
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: LaverdeS <LaverdeS@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: Benjamin Torres <benjamin@unstructured.io>
This PR resolves
[CORE-1741](https://unstructured-ai.atlassian.net/browse/CORE-1741) by
using a new function `pytesseract.run_and_get_multiple_output`, see
forked repo for more details:
https://github.com/Unstructured-IO/unstructured.pytesseract/releases/tag/0.3.11-dev1
This halves the number of calls to `tesseract` per page of PDF/image
during partition, reducing the runtime by roughly 48%.
The new function is in the forked `unstructured.pytesseract`. A PR has been
made to the upstream repo, and once that is merged we should switch to
the upstream version. For now we add a new dependency:
`unstructured.pytesseract`.
## testing
Existing unit tests should serve as tests to the new function.
To demonstrate the changes in performance:
- checkout main
- run `./scripts/performance/profile.sh` and select `ocr_only` strategy,
using the 10th document (16 page layout paper in pdf format)
- examine the speedscope profile or the time profile in the flamegraph; the
two dominant time spenders should be `pytesseract.image_to_string` and
`pytesseract.image_to_boxes`, each with about the same total time (see
attached first image)
- checkout this branch
- run the same `profile.sh` with the same options
- examine the profile again; this time you should notice that 1) total runtime
is reduced by more than 40%; 2) only
`unstructured_pytesseract.run_and_get_multiple_output` is the top time
spender, and its total time is about the same as either the
`pytesseract.image_to_string` or `pytesseract.image_to_boxes` time (see
second image below)


[CORE-1741]:
https://unstructured-ai.atlassian.net/browse/CORE-1741?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Move to a new CHANGELOG.md convention to more fully describe changes.
Bullets should address: what was broken? what was fixed? why does it
matter?
To assist with scanning changes, the first sentence in each bullet is in
**bold**.
Note: it's also worth looking at the rendered markdown in the branch:
https://github.com/Unstructured-IO/unstructured/blob/crag/changelog-tweak/CHANGELOG.md
rather than just the git diff.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Related to #744. In the `sentence_count` function, there is a parameter
that sets a threshold for a minimum word count for a sentence to be
"counted" as a sentence. When a sentence is skipped in the count because
it doesn't meet the minimum word count, the log message is potentially
misleading as it mentions "skipping" the sentence. Without the above
context, it could be interpreted that the sentence is being skipped in
the partitioning process, which is not the case.
This PR is to reword the log message to make the situation clearer.
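
To make the situation concrete, here is a simplified sketch of a sentence counter with a minimum word threshold. It is not the library's `sentence_count`: the function name `count_sentences`, the regex-based sentence splitting, and the log wording are all assumptions chosen for illustration.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def count_sentences(text: str, min_length: Optional[int] = None) -> int:
    """Count sentences, ignoring those with fewer than `min_length` words."""
    count = 0
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        word_count = len(sentence.split())
        if min_length is not None and word_count < min_length:
            # Make it explicit that only the count is affected, not partitioning.
            logger.debug(
                "Sentence has %d words (< %d); not included in the sentence count.",
                word_count,
                min_length,
            )
            continue
        count += 1
    return count


print(count_sentences("Hi. This is a longer sentence.", min_length=3))  # 1
```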
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets an element's parent id if it doesn't already have one or if it
matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"
And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
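
A heavily simplified sketch of the idea, using the two-element example above. The `Element` dataclass and `assign_parent_ids` helper are hypothetical, and this version ignores the edge cases the real algorithm handles: it only remembers the most recent element of each parent category and applies the first matching rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    """Hypothetical minimal element for illustration."""

    id: str
    category: str
    parent_id: Optional[str] = None


def assign_parent_ids(
    elements: List[Element], ruleset: Dict[str, List[str]]
) -> List[Element]:
    """Set parent_id to the id of the latest preceding element whose category
    lists this element's category as a child in the ruleset."""
    last_seen: Dict[str, str] = {}  # parent category -> id of latest such element
    for element in elements:
        category = element.category.lower()
        if element.parent_id is None:  # keep any existing parent id
            for parent_category, children in ruleset.items():
                if category in children and parent_category in last_seen:
                    element.parent_id = last_seen[parent_category]
                    break
        if category in ruleset:
            last_seen[category] = element.id
    return elements


elements = assign_parent_ids(
    [Element("1", "Title"), Element("2", "Text")],
    {"title": ["text"]},
)
print([(el.id, el.parent_id) for el in elements])  # [('1', None), ('2', '1')]
```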
### Schema Changes
```
@dataclass
class ElementMetadata:
    coordinates: Optional[CoordinatesMetadata] = None
    data_source: Optional[DataSourceMetadata] = None
    filename: Optional[str] = None
    file_directory: Optional[str] = None
    last_modified: Optional[str] = None
    filetype: Optional[str] = None
    attached_to_filename: Optional[str] = None
+   parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+   category_depth: Optional[int] = None
    ...
```
### Testing
```
from unstructured.partition.auto import partition

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

# Print each element's category, text, id, parent id, and depth.
for element in elements:
    print(
        f"Category: {getattr(element, 'category', '')}\n"
        f"Text: {getattr(element, 'text', '')}\n"
        f"ID: {element.id}\n"
        f"Parent ID: {element.metadata.parent_id}\n"
        f"Depth: {element.metadata.category_depth}\n"
    )
```
### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.
I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>