929 Commits

Author SHA1 Message Date
Trevor Bossert
792232dcc5
Chore: move scarf to setup.py (#1569)
This also follows what I have seen as the recommended way to define a file
package like this.

Also bumps minor versions from pip compile

Testing:
`pip install -e .`
Everything should build as normal

`❯ pip install -e .
Obtaining file:///Users/trevor/dev/unstructured
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting scarf@ https://packages.unstructured.io/scarf.tgz (from
unstructured==0.10.17.dev16)
  Using cached https://packages.unstructured.io/scarf.tgz (1.1 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done`

When the new release goes out, I will test a plain `pip install` to verify
that the functionality still works.
2023-09-28 16:18:14 -07:00
qued
e5d08662d4
enhancement: memory efficient xml partitioning (#1547)
Closes #1236. Partitions XML documents iteratively in most cases*, never
loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter
elements. I was not able to find a way in Python to compare XPaths while
streaming the elements, aside from writing a custom XPath parser. So the
shortest way forward was to bite the bullet and load the whole tree in
memory when filtering by XPath.)

Memory usage is about 20% of usage on `main` when processing a 470MB XML
file. Time to process is 10s vs 900s.

Output is slightly different, but appears to be an improvement, adding
lines of text that are skipped in current partitioning. No text is lost.
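
A minimal sketch of the streaming idea described above, assuming the standard library's `xml.etree.ElementTree.iterparse` rather than the code in this PR. The helper name `iter_xml_text` is hypothetical, and tail text is deliberately ignored to keep the example short.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_xml_text(source: BinaryIO) -> Iterator[str]:
    """Yield the stripped text of each element as soon as it has been parsed."""
    # Hypothetical helper for illustration; not the function added in this PR.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # On an "end" event the element's own text is complete.
        text = (elem.text or "").strip()
        if text:
            yield text
        # Drop the element's children and text so memory use stays flat.
        elem.clear()


if __name__ == "__main__":
    sample = io.BytesIO(b"<root><a>one</a><b> two </b><c/></root>")
    for line in iter_xml_text(sample):
        print(line)
```

Because elements are cleared as they are consumed, a whole-document XPath query cannot be evaluated this way, which is consistent with the exception noted above for `xml_path`.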
2023-09-28 02:34:06 +00:00
Yao You
62b0557792
build: ignore failing delta lake test ingest for now (#1557) 2023-09-27 19:49:21 -05:00
rvztz
2e01c49d90
feat: adds data source properties to delta table connector. (#1464) 2023-09-27 17:46:01 -07:00
Trevor Bossert
fd79c5262c
Bump Dockerfile to use latest base image (#1553)
New base image includes security fixes. This is an ongoing process to
remediate security issues as they are identified.
2023-09-27 22:30:32 +00:00
Roman Isecke
9836235ead
Chunking support for SharePoint Connector (#1548)
### Description
Adds optional chunking to the CLI: a flag triggers chunking, and the
parameters used by the `chunk_by_title` method are exposed. Chunking
runs before the embedding step.


Opened to replace original PR
https://github.com/Unstructured-IO/unstructured/pull/1531
2023-09-27 21:05:55 +00:00
Ahmet Melek
b283962567
docs: update ingest readme (#1456)
Closes https://github.com/Unstructured-IO/unstructured/issues/1070

This PR updates the ingest readme file to reflect recent changes to the
ingest module.
2023-09-27 20:38:15 +00:00
Austin Walker
f34c277bca
fix: add backwards compatibility to ElementMetadata (#1526)
Fixes https://github.com/Unstructured-IO/unstructured-api/issues/237

The problem:
The `ElementMetadata` class was not able to ignore fields that it didn't
know about. This surfaced in `partition_via_api`, when the hosted api
schema is newer than the local `unstructured` version. In
`ElementMetadata.from_json()` we get errors such as `TypeError:
__init__() got an unexpected keyword argument 'parent_id'`.

The fix:
The `from_json` methods for these dataclasses should drop any unexpected
fields before calling `__init__`.

To verify:
This shouldn't throw an error
```
from unstructured.staging.base import elements_from_json
import json

test_api_result = json.dumps([
    {
        "type": "Title",
        "element_id": "2f7cc75f6467bba468022c4c2875335e",
        "metadata": {
            "filename": "layout-parser-paper.pdf",
            "filetype": "application/pdf",
            "page_number": 1,
            "new_field": "foo",
        },
        "text": "LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis"
    }
])

elements = elements_from_json(text=test_api_result)

print(elements)
```
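
The fix can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ExampleMetadata` is a hypothetical stand-in for `ElementMetadata` with only a few fields, and `from_dict` stands in for the `from_json`-style constructors mentioned above.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ExampleMetadata:
    # Hypothetical stand-in for ElementMetadata; the real class has many more fields.
    filename: Optional[str] = None
    filetype: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_dict(cls, input_dict: Dict[str, Any]) -> "ExampleMetadata":
        # Keep only the keys this version of the dataclass knows about,
        # so a newer API schema does not raise TypeError in __init__.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in input_dict.items() if k in known})


if __name__ == "__main__":
    payload = {"filename": "layout-parser-paper.pdf", "page_number": 1, "new_field": "foo"}
    print(ExampleMetadata.from_dict(payload))
```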
2023-09-27 18:40:56 +00:00
Klaijan
d26d591d6a
feat: get embedded url, associate text and start index for pdf (#1539)
**Executive Summary**

Adds the ability to capture hyperlinks (external or internal), along with
their associated text, when partitioning PDFs with the fast strategy.

**Technical Details**

- `pdfminer` associates an `annotation` (links and URIs) with a bounding
box rather than with text. The link-to-text matching is therefore not an
exact pairing; it is computed from the overlap between bounding boxes.
- There is no word-level bounding box, only a character-level one
(accessed using `LTChar`). To get to word level, a window slides through
the text. Alphanumeric and non-alphanumeric runs are captured
separately, so a word that contains both is split at the first
non-alphanumeric character.
- The bounding box for each word is computed from the start and stop
coordinates found above. The calculation simply uses the distance
between two points.

The result now contains `links` in `metadata` as shown below:

```
            "links": [
                {
                    "text": "link",
                    "url": "https://github.com/Unstructured-IO/unstructured",
                    "start_index": 12
                },
                {
                    "text": "email",
                    "url": "mailto:unstructuredai@earlygrowth.com",
                    "start_index": 30
                },
                {
                    "text": "phone number",
                    "url": "tel:6505124019",
                    "start_index": 49
                }
            ]
```
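
A rough sketch of overlap-based matching between an annotation box and word boxes. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are all assumptions made for illustration; the PR's actual calculation may differ.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def overlap_area(a: BBox, b: BBox) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def words_for_annotation(
    annotation_box: BBox,
    words: List[Tuple[str, BBox]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the PR
) -> List[str]:
    """Return the words whose boxes are mostly covered by the annotation box."""
    matched = []
    for text, box in words:
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area > 0 and overlap_area(annotation_box, box) / area >= threshold:
            matched.append(text)
    return matched


if __name__ == "__main__":
    words = [("see", (0, 0, 10, 10)), ("link", (12, 0, 30, 10)), ("here", (32, 0, 50, 10))]
    print(words_for_annotation((11, 0, 31, 10), words))
```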

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-27 13:43:32 -04:00
Newel H
55315cf645
Feat: Native hierarchies for docx element types (#1505)
Improves hierarchy from docx files by leveraging natural hierarchies
built into docx documents. Hierarchy can now be detected from an
indentation level for list bullets/numbers and by style name (e.g.
Heading 1, List Bullet 2, List Number).

Hierarchy detection is improved by determining category depth via the
following:
1. Check if the paragraph item has an indentation level (ilvl) xpath -
these are typically on list bullet/numbers. Return the indentation level
if it exists
2. Check the name of the paragraph style if it contains any category
depth information (e.g. Heading 1 vs Heading 2 or List Bullet vs List
Bullet 2). Return the category depth if found, else default to depth of
0.
3. Check the paragraph ilvl via the paragraph's style name. Outside of
the paragraph's metadata, docx stores default ilvls for various style
names, which requires a complex lookup. This check is stubbed out but
not yet implemented, since the methods above cover most use cases.
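
Steps 1 and 2 can be sketched as below. This is an illustration, not the PR's code: the function names are hypothetical, and mapping a trailing style number N to depth N - 1 (so that "Heading 1" and "List Bullet" both land on depth 0) is an assumption drawn from the examples above.

```python
import re
from typing import Optional


def depth_from_style_name(style_name: Optional[str]) -> int:
    """Derive a category depth from a style name such as 'List Bullet 2'."""
    if not style_name:
        return 0
    match = re.search(r"(\d+)\s*$", style_name)
    # Assumption: a trailing number N means depth N - 1; no number means depth 0.
    return max(int(match.group(1)) - 1, 0) if match else 0


def category_depth(ilvl: Optional[int], style_name: Optional[str]) -> int:
    """Prefer an explicit indentation level, then fall back to the style name."""
    if ilvl is not None:
        return ilvl
    return depth_from_style_name(style_name)


if __name__ == "__main__":
    for name in ("Heading 1", "Heading 2", "List Bullet", "List Bullet 2", "List Number"):
        print(name, category_depth(None, name))
```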
---
Co-authored-by: Steve Canny <stcanny@gmail.com>
2023-09-27 11:32:46 -04:00
Roman Isecke
5c7b4f586b
Roman/azure cognitive embeddings (#1524)
### Description
This PR is two-fold:  

**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.

**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type and a test script has been added to
test the new content being produced from the Sharepoint connector to
push the embedding content.

Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output if that was added. While
the code originally added the embedding content, when `to_dict` was
called to save the content as json, this was lost.
2023-09-26 23:24:21 +00:00
rvztz
d8a36af08c
fix: Sharepoint connector server_path issue (#1497) 2023-09-26 14:25:35 -07:00
Roman Isecke
81af879038
roman/increase ingest tests num processes (#1500)
### Description
In an effort to speed up the ingest tests, bumping the number of
processes for each to the max on the system
2023-09-26 16:06:53 -05:00
Steve Canny
ab29de8dbd
Rfctr: Refactor PPTX partitioning to more closely align with how pptx documents are structured
This refactor solves a problem or two, the big one being recursing into
group-shapes to get all shapes on the slide, but mostly lays the
groundwork to allow us to refine further aspects such as list-item
detection, off-slide shape detection, and image-capture going forward.
2023-09-26 15:43:55 -04:00
shreyanid
32bfebccf7
feat: introduce language detection function for text partitioning function (#1453)
### Summary
Uses `langdetect` to detect all languages present in the input document.

### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization

### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
from unstructured.partition.lang import detect_languages

text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']

---------

Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-26 18:09:27 +00:00
Ronny H
868cac5bd5
Fixed Sphinx warning errors (#1438)
Fixed issue #1437 - resolved the Warning errors when building sphinx
with `make html`.

test:
1. `cd docs` folder and `rm -rf build`
2. `pip install -r requirements.txt`
3. run `make html`
2023-09-26 04:20:16 +00:00
Trevor Bossert
2a24c81852
Update docker download url to use scarf gateway (#1523)
This updates the docker image download url to pass through the scarf
gateway, which allows anonymous tracking of downloads.

Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest

Result:
Image should download
2023-09-25 14:52:39 -07:00
Trevor Bossert
af5ef0c1a7
Add scarf archive to requirements (#1514)
This allows anonymous tracking of downloads

Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
pip install -r requirements/base.in

Result:
all packages should install as normal and it builds scarf package
2023-09-25 11:49:40 -07:00
David Potter
01a147eb1d
feat: improved salesforce partitioning (#1475)
* Partitions Salesforce data as xml instead of text for improved detail and flexibility
* Partitions htmlbody instead of textbody for Salesforce emails
2023-09-25 11:44:28 -07:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Takes each json element from the json
files created via partition and writes that content to an index.

**Bonus bug fix:** The default version of python used in the repo was
recently bumped to `3.10` from `3.8`, so running `pip-compile` now runs
it against that version rather than the lowest we support, which is
still `3.8`. This breaks the setup for those lower versions because some
of the versions pulled in by `pip-compile` exist for `3.10` but not
`3.8`. `pip-compile` was updated to run as a script that first checks
the version of python being used, which helps guarantee that all
dependencies meet the minimum python version requirement.
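
The version guard can be illustrated with a short Python sketch. The repo's own script is not shown here, so the function name `require_min_python` and the exact-match policy are assumptions made for illustration.

```python
import sys
from typing import Sequence

MIN_SUPPORTED = (3, 8)  # lowest Python version supported, per the description above


def require_min_python(version_info: Sequence[int] = sys.version_info) -> None:
    """Exit unless the interpreter matches the minimum supported minor version."""
    # Hypothetical guard: compile only against the lowest supported version so
    # every pinned dependency is known to exist for it.
    current = tuple(version_info[:2])
    if current != MIN_SUPPORTED:
        raise SystemExit(
            f"pip-compile must run under Python {MIN_SUPPORTED[0]}.{MIN_SUPPORTED[1]}, "
            f"found {current[0]}.{current[1]}"
        )
```

A wrapper script would call `require_min_python()` before invoking `pip-compile`.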

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Benjamin Torres
5d193c8e5a
fix/bad formed formula (#1481)
@ron-unstructured reported that loading files with:

```
from unstructured.partition.pdf import partition_pdf

elements_yolox = partition_pdf(filename="1706.03762.pdf", strategy='hi_res', model_name="yolox")
print(elements_yolox)
```

Throws an error. After debugging the execution I found that the issue is
that an object of class Formula is being created, but this class
doesn't contain an `__init__` method. This PR solves the issue by adding
a constructor method with an empty string for the element.

The file can be found at:

https://drive.google.com/drive/folders/1hDumyps0hA4_d-GZxs3Hij15Cpa5fjWY?usp=sharing

After this PR is merged this file is correctly processed
2023-09-23 02:36:22 +00:00
ryannikolaidis
48c52365dd
build(test): disable airtable-large ingest test (#1509) 2023-09-23 02:00:01 +00:00
Trevor Bossert
961223da2a
Chore: Update readme to using new download location to track download metrics (#1507)
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
`docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest`

There should be no additional steps needed.
2023-09-22 17:30:37 -07:00
ryannikolaidis
ca01b30c07
ci: more reliable release version alerts (#1479) 2023-09-22 21:19:26 +00:00
Trevor Bossert
e8dfbfdbe5
Add notification that we will be utilizing scarf for docker and python downloads (#1503)
We've created a custom domain, downloads.unstructured.io that redirects
to quay.io
(using https://scarf.sh/). This custom domain allows us to swap the
underlying container registry without impacting users. It also provides
us with important metrics about container and package usage, without
surfacing PII
like IP addresses.

Python package follows the same pattern at packages.unstructured.io
2023-09-22 12:59:58 -07:00
ryannikolaidis
955efac935
fix: SharePoint connector fails if any document has an unsupported filetype (#1493) 2023-09-22 18:47:28 +00:00
Trevor Bossert
3e04110bab
Chore: Pin unstructured-inference in extra-pdf-image (#1474)
This is so users are able to upgrade it when unstructured library is
updated.
2023-09-22 09:41:53 -07:00
Christine Straub
2d951722df
Feat/1332 save embedded images in pdf (#1371)
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing


```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"
# `strategy` was not defined in the original snippet; "hi_res" is assumed here
strategy = "hi_res"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy=strategy,
    extract_images_in_pdf=True,
    image_output_dir_path="<directory path>",  # replace with a real directory
)
```
2023-09-22 09:16:03 +00:00
cragwolfe
92ad7698fb
build(test): ignore notion ingest test failures for now (#1496)
There is a fix in progress here:
https://github.com/Unstructured-IO/unstructured/pull/1492 , but let's
see proven stability of a few days before allowing notion ingest test
failures to block CI.
2023-09-22 07:19:21 +00:00
Roman Isecke
e88f7d9eab
chore: ingest test file cleanup (#1366) 2023-09-21 11:51:08 -07:00
Ahmet Melek
9e88929a8c
feat: document embeddings (#1368)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements EmbeddingEncoder classes which track embedding related data
- implements embed_documents method which receives a list of Elements,
obtains embeddings for the text within Elements, updates the Elements
with an attribute named embeddings , and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.

To test the changes, run `examples/embed/example.py`
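
The flow described in the bullets can be sketched as follows. This is not the module's API: `ExampleElement`, `toy_embed`, and the function signature are hypothetical, and the toy embedding function stands in for the langchain-backed encoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExampleElement:
    # Hypothetical stand-in for an unstructured Element.
    text: str
    embeddings: Optional[List[float]] = None


def embed_documents(
    elements: List[ExampleElement],
    embed_texts: Callable[[List[str]], List[List[float]]],
) -> List[ExampleElement]:
    """Attach one embedding vector to each element and return the elements."""
    vectors = embed_texts([element.text for element in elements])
    for element, vector in zip(elements, vectors):
        element.embeddings = vector
    return elements


def toy_embed(texts: List[str]) -> List[List[float]]:
    # Placeholder "model": a real encoder would call an embedding service.
    return [[float(len(text)), float(text.count(" "))] for text in texts]


if __name__ == "__main__":
    for element in embed_documents([ExampleElement("hello world"), ExampleElement("hi")], toy_embed):
        print(element)
```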
2023-09-20 19:55:30 +00:00
ryannikolaidis
7a3828d292
chore: fix changelog (#1469)
Fix an earlier merge that resulted in the Tesseract enhancement
entry in a duplicated 0.10.15.
2023-09-20 09:07:36 -07:00
rvztz
424852ab39
feat: adds data source properties to Sharepoint and Outlook (#1278) 2023-09-20 09:13:35 +00:00
Ryan Nikolaidis
8c1d03e5cf update slack invite 2023-09-20 00:02:03 -07:00
rvztz
2f52df180f
Adds data source properties to onedrive, reddit and slack (#1281) 2023-09-20 04:26:36 +00:00
Amanda Cameron
e359afafbe
fix: coordinates bug on pdf parsing (#1462)
Addresses: https://github.com/Unstructured-IO/unstructured/issues/1460

We were raising an error on invalid coordinates, which prevented us
from returning the element and continuing to parse the pdf. Now,
instead of raising the error, we return early.

to test:
```
from unstructured.partition.auto import partition

elements = partition(url='https://www.apple.com/environment/pdf/Apple_Environmental_Progress_Report_2022.pdf', strategy="fast")
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
0.10.16
2023-09-19 19:25:31 -07:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
rvztz
9a3e24fcbb
Adds data source properties to elasticsearch, wikipedia and google-drive (#1282) 2023-09-19 20:25:38 +00:00
rvztz
92e18c3f58
feat: adds data source properties to airtable, confluence and discord (#1283) 2023-09-19 18:05:27 +00:00
Yuming Long
f962a1e57d
fix: fix ingest paddle hanging issue (#1441)
## Summary

Ingest tests are hitting a paddle OOM issue, which causes the tests to
hang forever. The fix here is to remove paddle from CI and set both OCR
env variables, `TABLE_OCR` and `ENTIRE_PAGE_OCR`, to `tesseract`. (A
follow-up PR will investigate why this is failing.)

## Test
please check ingest tests in CI
2023-09-19 17:20:23 +00:00
shreyanid
eb8ce89137
chore: function to map between standard and Tesseract language codes (#1421)
### Summary
In order to convert between incompatible language codes from packages
used for OCR, this change adds a function to map between any standard
language codes and tesseract OCR specific codes. Users can input
language information to `languages` in any Tesseract-supported langcode
or any ISO 639 standard language code.

### Details
- Introduces the
[python-iso639](https://pypi.org/project/python-iso639/) package for
matching standard language codes. Recompiles all dependencies.
- If a language is not already supplied by the user as a Tesseract
specific langcode, supplies all possible script/orthography variants of
the language to the Tesseract OCR agent.

### Test
Added many unit tests for a variety of language combinations, special
cases, and variants. For general testing, call partition functions with
any lang codes in the languages parameter (Tesseract or standard).

for example,
```
from unstructured.partition.auto import partition

elements = partition(filename="example-docs/layout-parser-paper.pdf", strategy="hi_res", languages=["en", "chi"])
print("\n\n".join([str(el) for el in elements]))
```
should supply eng+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert to Tesseract
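
A simplified sketch of the mapping idea, for illustration only. The lookup table below is hand-written and covers just the codes in the example above, whereas the PR relies on `python-iso639`; the function name `to_tesseract_lang` is hypothetical.

```python
from typing import Dict, List

# Hand-written table for this example only (assumption, not the PR's data).
TESSERACT_CODES = {"eng", "chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"}
CHINESE_VARIANTS = ["chi_sim", "chi_sim_vert", "chi_tra", "chi_tra_vert"]
STANDARD_TO_TESSERACT: Dict[str, List[str]] = {
    "en": ["eng"],
    "zh": CHINESE_VARIANTS,
    "chi": CHINESE_VARIANTS,
    "zho": CHINESE_VARIANTS,
}


def to_tesseract_lang(languages: List[str]) -> str:
    """Build a '+'-joined Tesseract language string from mixed language codes."""
    codes: List[str] = []
    for lang in languages:
        if lang in TESSERACT_CODES:
            # Already a Tesseract-specific code: pass it through unchanged.
            candidates = [lang]
        else:
            # Standard code: expand to every script/orthography variant.
            candidates = STANDARD_TO_TESSERACT.get(lang.lower(), [])
        for code in candidates:
            if code not in codes:
                codes.append(code)
    return "+".join(codes)


if __name__ == "__main__":
    print(to_tesseract_lang(["en", "chi"]))
```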
2023-09-18 08:42:02 -07:00
qued
3a07d1e6b4
chore: Fix typos in changelog (#1442) 2023-09-18 10:39:36 -05:00
Amanda Cameron
a9f18eddb8
chore: adding test case for odt tables (#1434)
ODT table extraction is happening! Just added to an existing example-doc
and an accompanying test case.
2023-09-16 22:29:44 -07:00
Yao You
b534b2a6cd
Chore: bump inference package version to 0.5.28 and new release (#1355)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.

---------

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
0.10.15
2023-09-15 18:26:15 -07:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like shown in ticket

Run the same process on your local machine (not inside docker) with same
test to verify the non aarch64 paddlepaddle got installed correctly

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00
cragwolfe
36d026cb1b
chore: update CHANGELOG.md bullets (#1436)
add "why does it matter" for a couple of bullets
2023-09-15 16:52:01 -07:00
John
6187dc0976
update links in integrations.rst (#1418)
A number of the links in integrations.rst don't seem to lead to the
intended section in the unstructured documentation.

For example:
```See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details```

It seems this link should direct to here instead: https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate
2023-09-15 16:50:55 -07:00
Roman Isecke
333558494e
roman/delta lake dest connector (#1385)
### Description
Add delta table downstream destination connector

Closes https://github.com/Unstructured-IO/unstructured/issues/1415
2023-09-15 22:13:39 +00:00
cragwolfe
98d3541909
Update CHANGELOG.md (#1435)
Update a bullet to reflect: What was the problem? What was fixed? Why
does it matter?
2023-09-15 15:26:49 -05:00
John
de4d496fcf
Fix bbox coordinates for ocr_only strategy (#1325)
### Summary
Duplicate PR of #1259 because of issues with checks
Closes #1227, which found that `nan` values were present in the
coordinates being generated for some elements.
This breaks logic out from `add_pytesseract_bbox_to_elements` to new
functions `_get_element_box` and
`convert_multiple_coordinates_to_new_system`. It also updates the logic
to check that the current bounding box matches the first character of
the element's text (so as to avoid the `~` characters that
`pytesseract.image_to_boxes` includes, but are not present in
`pytesseract.image_to_string`).

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

filename="example-docs/layout-parser-paper-with-table.jpg"
elements = partition_image(filename=filename, strategy="ocr_only")
image = Image.open(filename)
draw = ImageDraw.Draw(image)
for i, element in enumerate(elements):
    print(i, element.metadata.coordinates)
    if element.metadata.coordinates:
        draw.polygon(element.metadata.coordinates.points, outline="red", width=2)
output = "example-docs/box-layout-parser-paper-with-table.jpg"
image.save(output)
image.close()
```

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-09-15 15:11:16 -05:00