1447 Commits

Author SHA1 Message Date
Ahmet Melek
09cc4bfa5f
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email, and API token to authenticate into
Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud organization
  - or
    - issues in user-specified projects and boards
    - user-specified issues
- processes the following kinds of data:
  - text fields such as issue summary, description, and comments
  - dropdown fields such as issue type, status, priority, assignee,
    reporter, labels, and components
  - other data such as issue id, issue key, project id, and information on
    subtasks
  - attachment URLs, which are recorded but not processed
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
the unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263

To test the changes, make the necessary setups and run the relevant
ingest test scripts.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 10:10:48 +00:00
ryannikolaidis
92692ad8d7
fix: wrapped error handling for connectors (#1262)
The CustomError that we use to wrap custom ingest errors inherited from
BaseException rather than Exception (which it should, per the specification
[here](https://docs.python.org/3/library/exceptions.html#BaseException)).
As a result, exceptions were not raised as expected. This PR
changes the inheritance, which resolves the known issue.
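
A minimal sketch of why the base class matters, using only plain Python. The two wrapper classes below are hypothetical stand-ins, not the actual ingest classes. The point it illustrates is that a generic `except Exception` handler does not catch classes derived directly from `BaseException`.

```python
class WrappedOnBase(BaseException):
    """Hypothetical wrapper error derived directly from BaseException."""


class WrappedOnException(Exception):
    """Hypothetical wrapper error derived from Exception."""


def caught_by_generic_handler(exc_type):
    """Return True if a typical `except Exception` handler catches exc_type."""
    try:
        try:
            raise exc_type("boom")
        except Exception:
            # Only reached for classes that derive from Exception.
            return True
    except BaseException:
        # The error escaped the generic handler above.
        return False


print(caught_by_generic_handler(WrappedOnBase))       # False
print(caught_by_generic_handler(WrappedOnException))  # True
```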

Additionally, our base definition of get_file on IngestDoc was wrapped
with SourceConnectionError; however, the wrapper must explicitly decorate
each subclass definition in order to take effect. This PR does that.
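
The decorator behavior can be sketched in plain Python as follows. `wrap_errors`, `BaseDoc`, and the two subclasses are hypothetical names for illustration, and `SourceConnectionError` here is a local stand-in rather than the real ingest class. An override in a subclass replaces the decorated base method, so the wrapping is lost unless the override is decorated too.

```python
import functools


class SourceConnectionError(Exception):
    """Local stand-in for the wrapper error named in the commit message."""


def wrap_errors(func):
    """Hypothetical decorator: re-raise any Exception as SourceConnectionError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            raise SourceConnectionError(str(exc)) from exc
    return wrapper


class BaseDoc:
    @wrap_errors
    def get_file(self):
        raise NotImplementedError


class UndecoratedDoc(BaseDoc):
    def get_file(self):  # replaces the decorated base method entirely
        raise ValueError("download failed")


class DecoratedDoc(BaseDoc):
    @wrap_errors  # decorator repeated on the subclass definition
    def get_file(self):
        raise ValueError("download failed")


def error_name(doc):
    """Return the class name of the error raised by doc.get_file()."""
    try:
        doc.get_file()
    except Exception as exc:
        return type(exc).__name__


print(error_name(UndecoratedDoc()))  # ValueError
print(error_name(DecoratedDoc()))    # SourceConnectionError
```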

## Testing
Some unit test coverage was added for the error wrapping class, however
this wasn't properly recreating the issue we are seeing when running
ingest tests.

To recreate that issue one can intentionally raise an exception in the
[partition_file](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/interfaces.py#L214C9-L214C23)
definition and then run any ingest test. Prior to this change: the code
and logs suggest that everything ran without exception, but the
partitioned output was not generated (as a result the test will fail
without any clues as to what went wrong). With this update, the expected
custom partition error, error message, and stack trace will be visible.

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2023-09-04 20:52:32 +00:00
Jack Retterer
95b6295307
Jack/update documentation (#1190)
Updated:
- Added back supported document types for partitioning
- Added more tabs for python code in the API page
- Added a RAG section in Key Concepts
- Added a Common Use case section in overview
2023-09-04 16:15:50 +00:00
cragwolfe
c72014ffaf
build(release): bump to unstructured-inference==0.5.21 (#1293) 0.10.12 2023-09-03 19:09:18 -07:00
Benjamin Torres
f2af953ee8
feat: added YoloX types (#1284)
This PR adds extra element types so that additional output classes from Yolox may be mapped to those element types. E.g., a Yolox `List-item` class is now mapped to a ListItem element type, whereas before it would have been UncategorizedText.
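
A rough sketch of the mapping idea, assuming a simple lookup table with a fallback. The `Element` dataclass, `LABEL_TO_TYPE`, and `to_element` are hypothetical names; only the `List-item` to `ListItem` pairing and the `UncategorizedText` fallback come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical minimal element: text plus an element type name."""
    text: str
    type: str


# Hypothetical lookup table; only the List-item entry is taken from the commit message.
LABEL_TO_TYPE = {"List-item": "ListItem"}
FALLBACK_TYPE = "UncategorizedText"


def to_element(label, text):
    """Map a detected layout class label to an element, falling back when unmapped."""
    return Element(text=text, type=LABEL_TO_TYPE.get(label, FALLBACK_TYPE))


print(to_element("List-item", "First bullet").type)   # ListItem
print(to_element("Some-other-class", "Other").type)   # UncategorizedText
```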
2023-09-03 22:45:29 +00:00
cragwolfe
a475b447e8
doc: add colab link for xycut sorting (#1288) 2023-09-03 20:19:40 +00:00
David Potter
b710bafa89
feat: add salesforce connector (#1168) 2023-09-02 08:50:31 -07:00
Yao You
1a0b737e9c
revert pdf changes and add new pdf for empty page testing (#1255)
- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page that says "this page is
intentionally left blank" for tests
2023-09-01 22:33:06 +00:00
qued
fc9d251e4e
build(deps): Remove pillow pin (#1274)
Removed pin for `PIL` as `detectron2` repo has been updated, and so has
`unstructured-inference`.
2023-09-01 19:47:50 +00:00
Trevor Bossert
30cdc19cba
set sha for base image (#1276)
Provides more consistency and integrity to base image by including sha
2023-09-01 18:30:32 +00:00
ryannikolaidis
95c3e17af0
fix: version-sync (#1266) 2023-09-01 06:50:05 +00:00
cragwolfe
69c2c62978
build(image): patch-level base-image bump (#1265) 2023-09-01 05:48:47 +00:00
cragwolfe
65344117b1
enhancement: entire page OCR output included with hi_res (#1263)
Bumps unstructured-inference==0.5.19 to bring in @christinestraub's
enhancement
https://github.com/Unstructured-IO/unstructured-inference/pull/186 .

This is a **massive** improvement: previously, text was omitted from
`hi_res` output if the layout model had not put a bounding
box around it. In addition, the xycut sorting algorithm generally does a
good job of ordering the merged OCR-text-not-in-layout-model bboxes with
layout-model bboxes into "natural reading order." More details in
https://github.com/Unstructured-IO/unstructured-inference/pull/186#issuecomment-1700438645 .

Bonus: changelog fix.
0.10.11
2023-09-01 04:27:48 +00:00
Yao You
9191be7ae8
[issue 1237] fix empty coordinates break sorting bug (#1242)
This PR resolves #1237 by checking if any coordinates are `None`; if yes
do not attempt to sort with xy cut method and return the list as is.
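
The guard can be sketched as below, assuming elements are represented as `(text, coordinates)` pairs. `sort_if_coordinates_present` and `top_to_bottom` are hypothetical names, and `top_to_bottom` is only a simple stand-in for the xy-cut sort, not the real algorithm.

```python
def sort_if_coordinates_present(items, sort_fn):
    """items: list of (text, coordinates) pairs; coordinates may be None.

    If any coordinates are missing, return the list unchanged instead of sorting.
    """
    if any(coords is None for _, coords in items):
        return items
    return sort_fn(items)


def top_to_bottom(items):
    """Stand-in sort: order by y, then x, of an (x, y) coordinate pair."""
    return sorted(items, key=lambda item: (item[1][1], item[1][0]))


complete = [("b", (0, 20)), ("a", (0, 10))]
partial = [("b", (0, 20)), ("a", None)]

print(sort_if_coordinates_present(complete, top_to_bottom))  # sorted: "a" first
print(sort_if_coordinates_present(partial, top_to_bottom))   # returned as is
```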
2023-09-01 03:15:10 +00:00
Roman Isecke
ed7f991ab9
Add s3 writer (#1223)
### Description
Convert s3 cli code to also support writing to s3. Writers are added as
optional subcommands to the parent command with their own arguments.
Custom `click.Group` introduced to add some custom formatting and text
in help messages.

To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
2023-08-31 22:19:53 +00:00
cragwolfe
810cfc2c8a
chore: switch to non-release dev version (#1258)
attempting to break the cycle of multiple 'release' commits.
2023-08-31 20:20:38 +00:00
Yao You
b504a48e06
dev: add py-spy profiling (#1251)
This PR adds a new developer tool for profiling performance: `py-spy`.
Additionally it adds a new make command to start a docker with your
local `unstructured` repo mounted for quick testing code in a Rocky
Linux environment (see usage below for intent).

### py-spy

It is a sampling profiler https://github.com/benfred/py-spy and in
practice usually provides more readily usable information than commonly
used `cProfiler`. It also supports output to `speedscope` format,
[which](https://github.com/jlfwong/speedscope#usage) provides a rich
view of the profiling result.

### usage

The new tool is added to the existing `profile.sh` script and is readily
discoverable in the interactive interface. When you select the new
speedscope format profile, it opens in your local browser, provided you
followed the readme to install speedscope locally via `npm install -g
speedscope`.

On macOS the profiling tool needs superuser privilege. If you are not
comfortable with that feel free to run the profiling inside a Linux
container if your local dev env is macOS.
2023-08-31 19:26:29 +00:00
cragwolfe
a4ec43a85f
build(image): bump to rockylinux 9 (#1254) 0.10.10 2023-08-30 19:10:08 -07:00
ryannikolaidis
076b1e38f4
feat: serialize ingest docs as json (#1178) 2023-08-31 01:48:41 +00:00
Yao You
27773132b7
[issue 1247] fix element and bbox mismatch bug (#1250)
This PR resolves #1247 by using the matching elements and bbox for
coordinate computation.

This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us test:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )

Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all start
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.

This change also improves partition performance for multi-page pdfs as
it reduces the number of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on an M2 Mac + Rocky
docker shows it reduces partition time for the DA-619p.pdf file from around
1min to around 23s.
2023-08-30 23:34:55 +00:00
Matt Robinson
c49df62967
feat: partition_xml infers element type on each leaf node (#1249)
### Summary

Closes #1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_text`. If
`xml_keep_tags=True`, the file is still treated like a text file and
partitioning is still delegated to `partition_text`.

Also adds the option to pass `text` as an input to `partition_xml`.
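
To illustrate what "each leaf node" means, here is a small sketch using only the standard library's `xml.etree.ElementTree`. It is not the `partition_xml` implementation. `leaf_texts` is a hypothetical helper that collects the text of every element with no children, which is the unit on which an element type would then be inferred.

```python
import xml.etree.ElementTree as ET


def leaf_texts(xml_text):
    """Return the stripped text of every leaf node (an element with no children)."""
    root = ET.fromstring(xml_text)
    return [
        node.text.strip()
        for node in root.iter()
        if len(node) == 0 and node.text and node.text.strip()
    ]


sample = (
    "<birds><parrot><name>Conure</name>"
    "<description>A conure is a very friendly bird.</description>"
    "</parrot></birds>"
)
print(leaf_texts(sample))
# ['Conure', 'A conure is a very friendly bird.']
```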

### Testing

Create a `parrots.xml` file that looks like:

```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.

Conures are feathery and like to dance.</description></parrot></xml>
```

Run:

```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict

elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```

On `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.

```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure A conure is a very friendly bird.',
  'type': 'NarrativeText'},
 {'element_id': '859ecb332da6961acd2fb6a0185d1549',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```

On the feature branch, the output is the following, and the tags are
correctly separated.

```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure',
  'type': 'Title'},
 {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'A conure is a very friendly bird.\n'
          '\n'
          'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```
2023-08-30 17:07:10 -04:00
Charles
de855bb4ed
enhancement: new extract function for detecting image URLs (#1212)
- Adds new feature discussed in GitHub Issue #1117 and in slack
2023-08-30 11:29:15 -07:00
ryannikolaidis
d33d8b5d0b
fix: update .gitignores to include text comparison files (#1246) 2023-08-30 07:21:04 +00:00
cragwolfe
6ad497136d
build: docker image fix (#1245)
Moving to a non-root user in the docker image caused a failure in the
publication workflow.

This fix was used to publish the 0.10.9 unstructured image in this
workflow:

https://github.com/Unstructured-IO/unstructured/actions/runs/6020624226/job/16332230987
2023-08-29 23:27:52 -07:00
Benjamin Torres
7ce2659340
build(deps): bump unstructured-inference==0.5.18 (#1243)
Bumps unstructured-inference to 0.5.18, changes non-default detectron2 classification threshold.
2023-08-29 21:18:33 -07:00
Trevor Bossert
e4535d29ca
Set user for container to same as api image. (#1239)
This is a security best practice; a user can override this with their own
Dockerfile if required.
0.10.9
2023-08-30 01:01:44 +00:00
qued
dde3eb058b
fix: make cv2 dependency optional (#1234)
Makes CV2 dependency optional. It's not used in any of the functional code.
2023-08-29 17:14:00 -07:00
Benjamin Torres
5052e6cb3b
Added plain-text comparison for tests (#1180)
This PR adds a comparison during ingest test for the content of the
files in plain text (i.e.: without JSON format)
2023-08-29 23:23:14 +00:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, running `test_json` required the full requirements installation in order to read file types including, but not limited to, docx and pptx, so the test raised an error with the base installation. With the update, checks are also added to other test files to verify invariance with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Matt Robinson
f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element is over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
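
The rules above can be approximated in plain Python as follows. This is a simplified sketch, not the `chunk_by_title` implementation. It assumes elements are `(element_type, text)` tuples, ignores the metadata-change rule, and merges each small section into the one before it. `chunk_by_title_sketch` is a hypothetical name.

```python
def chunk_by_title_sketch(elements, new_after_n_chars=1500, combine_under_n_chars=500):
    """elements: list of (element_type, text) tuples. Returns a list of sections."""
    sections, current = [], []
    for el_type, text in elements:
        current_len = sum(len(t) for _, t in current)
        # Start a new section at a Title, or once the current section is too long.
        if current and (el_type == "Title" or current_len > new_after_n_chars):
            sections.append(current)
            current = []
        # Elements are never split, so one long element can exceed the threshold.
        current.append((el_type, text))
    if current:
        sections.append(current)

    # Simplification: fold each small section into the preceding section.
    combined = []
    for section in sections:
        section_len = sum(len(t) for _, t in section)
        if combined and section_len < combine_under_n_chars:
            combined[-1].extend(section)
        else:
            combined.append(section)
    return combined


long_doc = [
    ("Title", "A"), ("NarrativeText", "x" * 600),
    ("Title", "B"), ("NarrativeText", "y" * 600),
]
short_doc = [
    ("Title", "A"), ("NarrativeText", "short"),
    ("Title", "B"), ("NarrativeText", "short"),
]
print(len(chunk_by_title_sketch(long_doc)))   # 2
print(len(chunk_by_title_sketch(short_doc)))  # 1
```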

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-08-29 16:04:57 +00:00
qued
4a5a3022a3
fix: remove duplicate target in makefile (#1235)
Removed a duplicate make target in the `Makefile`.
2023-08-29 06:49:18 +00:00
Ronny H
2d5f931c3f
Update README to Python-3.10 (#1231) 2023-08-29 03:21:23 +00:00
ryannikolaidis
86d78073ee
fix: check number of outputs in ingest test (#1201) 2023-08-29 02:04:59 +00:00
Ahmet Melek
b22e18f7d8
uncomment confluence diff ingest test (#1217)
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-28 18:05:57 -07:00
Klaijan
c7cbfe6c79
chore: changelog repair (#1226)
Edit CHANGELOG from 0.10.8 to 0.10.9-dev0.
2023-08-28 13:00:47 -07:00
Klaijan
4b830e3b05
fix: return ocr coordinates points as tuple (#1219)
`add_pytesseract_bbox_to_elements` returned
`metadata.coordinates.points` as a `List`, whereas other strategies
returned a `Tuple`. This change makes it return a `Tuple` for consistency.

Previously: 
```
element.metadata.coordinates.points = [
            (x1, y1),
            (x2, y2),
            (x3, y3),
            (x4, y4),
]
```
Currently:
```
element.metadata.coordinates.points = (
            (x1, y1),
            (x2, y2),
            (x3, y3),
            (x4, y4),
)
```
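
A small illustration of why the container type matters to callers, in plain Python. `as_points_tuple` is a hypothetical helper, not part of the library. A list of points and a tuple of the same points do not compare equal, so mixed return types can break equality checks across strategies.

```python
def as_points_tuple(points):
    """Normalize any sequence of (x, y) pairs to a tuple of tuples."""
    return tuple(tuple(point) for point in points)


as_list = [(0, 0), (0, 10), (10, 10), (10, 0)]
as_tuple = ((0, 0), (0, 10), (10, 10), (10, 0))

print(as_list == as_tuple)                   # False
print(as_points_tuple(as_list) == as_tuple)  # True
```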
2023-08-28 13:31:55 -04:00
omahs
64b4287308
fix: typos (#1215)
fix: typos
2023-08-28 12:05:48 +00:00
cragwolfe
ba70828f4a
build(image): bump Dockerfile to python3.10 (#1214) 0.10.8 2023-08-27 18:30:17 -07:00
cragwolfe
4c13d12dc3
fix: prevent spammy ListItem's from images and PDF's (#1210)
The issue was that for blocks detected in an image such as:

![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a)
, where the full image is:

https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:


https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280

Test Instructions:

1. run the following snippet:

```
import json
import os
from datetime import datetime

from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
                                                                                                 
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)

print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):                                                                                                            
for strategy in ("hi_res",):
    d1 = datetime.now()
    elements = partition(filename=filename, strategy=strategy)
    elems_as_dicts = json.loads(elements_to_json(elements, indent=2))

    # strip out metadata for the sake of more readable results                                                                                          
    for element_dict in elems_as_dicts:
        del element_dict["metadata"]
    json_filename=f"{output_filename_part}-{strategy}.json"

    with open(json_filename, "w") as jsonf:
        jsonf.write(json.dumps(elems_as_dicts, indent=2))
    d2 = datetime.now()
    print(f"num elements for {strategy}: {len(elements)}")
    print(f"time elapsed     {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.

2. Open the json file that was written to your `output_dir`, named
IRS-form-1987.png-hi_res.json

Witness the new element:
```
  {
    "type": "ListItem",
    "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
    "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes."
  },
```
0.10.7
2023-08-26 21:01:07 -07:00
cragwolfe
3f1c90eef2
build: bump unstructured-inference==0.5.17, cut release (#1207)
Pulls in @awalker4's tesseract enhancement:
https://github.com/Unstructured-IO/unstructured-inference/pull/185
0.10.6
2023-08-26 01:05:48 +00:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP encrypted content is detected, a
warning is emitted and an empty list of elements is returned.
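
As a rough sketch of what detection based on RFC 2015 can look like, the snippet below uses only the standard library `email` package. It is not the `partition_email` implementation. `looks_pgp_encrypted` is a hypothetical helper, and the sample message is made up for illustration. RFC 2015 marks encrypted mail with a `multipart/encrypted` content type whose `protocol` parameter is `application/pgp-encrypted`.

```python
from email import message_from_string


def looks_pgp_encrypted(raw_email):
    """Return True if the top-level Content-Type follows the RFC 2015 pattern."""
    msg = message_from_string(raw_email)
    if msg.get_content_type() != "multipart/encrypted":
        return False
    return msg.get_param("protocol") == "application/pgp-encrypted"


# Made-up sample message in the shape described by RFC 2015.
ENCRYPTED = (
    "From: sender@example.com\n"
    "Mime-Version: 1.0\n"
    "Content-Type: multipart/encrypted; boundary=foo;\n"
    '   protocol="application/pgp-encrypted"\n'
    "\n"
    "--foo\n"
    "Content-Type: application/pgp-encrypted\n"
    "\n"
    "Version: 1\n"
    "\n"
    "--foo\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "-----BEGIN PGP MESSAGE-----\n"
    "-----END PGP MESSAGE-----\n"
    "\n"
    "--foo--\n"
)
PLAIN = "From: sender@example.com\nContent-Type: text/plain\n\nhello\n"

print(looks_pgp_encrypted(ENCRYPTED))  # True
print(looks_pgp_encrypted(PLAIN))      # False
```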

### Testing

```python
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition.msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msg(filename=filename)
```
2023-08-25 17:09:25 -07:00
John
5872fa23c3
Extract coordinates from PDFs and images when using OCR only strategy (#1163)
### Summary
Closes #983 
Creates new function `add_pytesseract_bbox_to_elements`
Fixes typos in docstrings

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

png_filename="example-docs/english-and-korean.png"
png_elements = partition_image(filename=png_filename, strategy="ocr_only")
png_image = Image.open(png_filename)
draw = ImageDraw.Draw(png_image)
draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2)
output = "example-docs/english-and-korean-box.png"
png_image.save(output)
png_image.close()
```
2023-08-25 05:32:12 +00:00
Matt Robinson
c578b85699
fix: respect <pre> tag order in partition_html (#1197)
### Summary

Closes #1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.

### Testing

The elements in the following example should be in the correct order.

```python
    from unstructured.partition.html import partition_html

    html_text = """
    <pre>The Big Brown Bear</pre>
    <div>The big brown bear is growling.</div>
    <pre>The big brown bear is sleeping.</pre>
    <div>The Big Blue Bear</div>
    """
    elements = partition_html(text=html_text)
    print("\n\n".join([str(el) for el in elements]))
```
2023-08-25 04:14:48 +00:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Trevor Bossert
f267cef329
feat: Adds in threaded replies (#1188)
- Puts threaded replies into the same text field as parent message,
allowing for a full thread to be under a single element_id
- Output is now XML instead of TXT to allow for easier parsing of new
format.

https://github.com/Unstructured-IO/unstructured/issues/1186
2023-08-24 12:12:29 -07:00
ryannikolaidis
566e947d13
fix: ARM build with constraint for safetensors <=0.3.2 (#1196) 2023-08-24 18:00:25 +00:00
Klaijan
1524841cd9
feat: supports multipage tiff (#1131)
Add test case test_partition_image_with_multipage_tiff that reads a multipage TIFF file and

- confirms that the function reads all the pages in the TIFF.

- confirms that the page number is added to the metadata

This PR is branched from and developed on top of 6d6be99 commit.
2023-08-24 15:12:50 +00:00
Matt Robinson
cdae53cc29
chore: deprecation warning for file_filename (#1191)
### Summary

Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
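
The deprecation pattern can be sketched with the standard `warnings` module. This is a hypothetical helper, not the library's code. `resolve_metadata_filename`, its messages, and the choice of `DeprecationWarning` and `ValueError` are assumptions. It mirrors the three cases exercised in the testing section: new kwarg only, deprecated kwarg only, and both together.

```python
import warnings


def resolve_metadata_filename(metadata_filename=None, file_filename=None):
    """Return the filename to record, warning when the deprecated kwarg is used."""
    if metadata_filename and file_filename:
        raise ValueError(
            "Only one of metadata_filename and file_filename may be specified."
        )
    if file_filename is not None:
        warnings.warn(
            "file_filename is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_style = resolve_metadata_filename(metadata_filename="test.epub")
    warnings_after_new = len(caught)
    old_style = resolve_metadata_filename(file_filename="test.epub")
    warnings_after_old = len(caught)

print(new_style, warnings_after_new)  # test.epub 0
print(old_style, warnings_after_old)  # test.epub 1
```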

### Testing

```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
2023-08-24 07:02:47 +00:00
ryannikolaidis
835378aba6
ci: fix documentation build flow (#1181) 2023-08-24 00:24:03 -05:00
cragwolfe
df4bd459d5
build(deps): bump unstructured-inference==0.5.16 (#1182)
Pulls in @newelh's fix:
https://github.com/Unstructured-IO/unstructured-inference/pull/184
2023-08-23 05:28:45 +00:00