### Description
Exposes the endpoint url as an access kwarg when using the s3 filesystem
library via the fsspec abstraction. This allows for any non-aws data
providers that support the s3 protocol to be used with the s3 connector
(i.e. minio)
Closes out https://github.com/Unstructured-IO/unstructured/issues/950
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
- bump `unstructured-inference` to `0.6.6`
- specify default model name for element detection to be
`detectron2_onnx` to keep current behavior
- NOTE: the updated inference package by default would use yolox as
element detection model; this will be evaluated and enabled in a
separated PR
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Occasionally the es test can fail because the index fail to be created
on the first try. Experiments show adding timeout doesn't help but add
retry mitigates the issue. See history of commits in branch:
yao/bump-inference-to-0.6.6
https://github.com/Unstructured-IO/unstructured/pull/1563
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Closes GH Issue #1233.
### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort
### Evaluation
Run the followin gcommand for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
**Executive Summary**
Adds PDF functionality to capture hyperlink (external or internal) for
pdf fast strategy along with associate text.
**Technical Details**
- `pdfminer` associates `annotation` (links and uris) with bounding box
rather than text. Therefore, the link and text matching is not a perfect
pair but rather a logic-based and calculation matching from bounding box
overlapping.
- There is no word-level bounding box. Only character-level (access
using `LTChar`). Thus in order to get to word-level, there is a window
slicing through the text. The words are captured in alphanumeric and
non-alphanumeric separately, meaning it will split the word if contains
both, on the first encounter of non-alphanumeric.)
- The bounding box calculation is calculated using start and stop
coordinates for the corresponding word calculated from above. The
calculation is simply using distance between two dots.
The result now contains `links` in `metadata` as shown below:
```
"links": [
{
"text": "link",
"url": "https://github.com/Unstructured-IO/unstructured",
"start_index": 12
},
{
"text": "email",
"url": "mailto:unstructuredai@earlygrowth.com",
"start_index": 30
},
{
"text": "phone number",
"url": "tel:6505124019",
"start_index": 49
}
]
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
### Summary
Uses `langdetect` to detect all languages present in the input document.
### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization
### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']
---------
Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
* Partitions Salesforce data as xlm instead of text for improved detail and flexibility
* Partitions htmlbody instead of textbody for Salesforce emails
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
from unstructured.partition.pdf import partition_pdf
f_path = "example-docs/embedded-images.pdf"
# default image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
)
# specific image output directory
elements = partition_pdf(
f_path,
strategy=strategy,
extract_images_in_pdf=True,
image_output_dir_path=<directory path>,
)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.
---------
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.
**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
Now it reads:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model' tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.
**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway either because it
was embedded in the PDF, or we found it using OCR.
This commit is to allow for attempting to categorize these uncategorized
elements using our text-based classification function,
`element_from_text`.
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.
### Details
Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.
Coming up: langcode standardization, language detection
### Test
Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.
ex:
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```
Adding table extraction to HTML partitioning.
This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
The default sorting algorithm for PDF's, "xycut," would cause an error
when partitioning a document if Y coordinate points were negative. This
change checks for that condition (or more broadly, any negative
coordinates) and falls back to the "basic" sort if that is the case.
This PR does not address the underlying issue of "bad points" which
still should be investigated. However, the sorting code should be less
brittle to unexpected bounding boxes in the first case.
Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296
Bumps unstructured-inference==05.23 to pull in @christinestraub's fix:
https://github.com/Unstructured-IO/unstructured-inference/pull/198 , so
embedded Images
in PDF's are now included in partition results ("hi_res").
From the perspective of elements with clean text, this is not a big win
as a lot of the images have OCR garbage. However, it is important to
preserve image elements for other downstream use cases, so overall this
is a step forward.
### Summary
Closes#1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
### Testing
The following is code previously produced one giant element on `main`.
```python
from unstructured.partition.html import partition_html
filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)
len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```
The output should be:
```python
January 2023
(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)
The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.
Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
The issue was that for blocks detected in an image such as:

, where the full image is:
https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:
https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280
Test Instructions:
1. run the following snippet:
```
import json
import os
from datetime import datetime
from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)
print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
d1 = datetime.now()
elements = partition(filename=filename, strategy=strategy)
elems_as_dicts = json.loads(elements_to_json(elements, indent=2))
# strip out metadata for the sake of more readable results
for element_dict in elems_as_dicts:
del element_dict["metadata"]
json_filename=f"{output_filename_part}-{strategy}.json"
with open(json_filename, "w") as jsonf:
jsonf.write(json.dumps(elems_as_dicts, indent=2))
d2 = datetime.now()
print(f"num elements for {strategy}: {len(elements)}")
print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.
2. Open the json file that was writen to your `output_dir`, named
IRS-form-1987.png-hi_res.json
Witness the new element:
```
{
"type": "ListItem",
"element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
"text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87
-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation
or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio
ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f
or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre
dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the
services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil
e Form 3115 for these changes."
},
```
### Summary
Closes#1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html
html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
- Puts threaded replies into the same text field as parent message,
allowing for a full thread to be under a single element_id
- Output is now XML instead of TXT to allow for easier parsing of new
format.
https://github.com/Unstructured-IO/unstructured/issues/1186
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.
I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
Bump to unstructured-inference==0.5.13, which includes:
Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
* add the first version of airtable connector
* change imports as inline to fail gracefully in case of lacking dependency
* parse tables as csv rather than plain text
* add relevant logic to be able to use --airtable-list-of-paths
* add script for creation of reseources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings
* fix ingest test names
* add scripts for the large table test
* remove large table test from diff test
* make base and table ids explicit
* add and remove comments
* use -ne instead of !=
* update code based on the recent ingest refactor, update changelog and version
* shellcheck fix
* update comments
* update check-num-rows-and-columns-output error message
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* update help comments
* update help comments
* update help comments
* update workflows to set auth tokens and to run make install
* add comments on create_scale_test_components
* separate component ids from the test script, add comments to document test component creation
* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id
* shellcheck fixes
* shellcheck fixes
* update docs
* update comment
* bump version
* add wrongly deleted file
* sort columns before saving to process
* Update ingest test fixtures (#1098)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* trigger CI
* trigger CI
* trigger CI
* do not ingest personal spaces in the diff test
* fix argument
* Update ingest test fixtures (#1083)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>