This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.
---------
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.
**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
Now it reads:
```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model' tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```
The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.
**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway either because it
was embedded in the PDF, or we found it using OCR.
This commit is to allow for attempting to categorize these uncategorized
elements using our text-based classification function,
`element_from_text`.
### Summary
In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.
### Details
Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.
Coming up: langcode standardization, language detection
### Test
Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.
ex:
```
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```
Adding table extraction to HTML partitioning.
This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
The default sorting algorithm for PDF's, "xycut," would cause an error
when partitioning a document if Y coordinate points were negative. This
change checks for that condition (or more broadly, any negative
coordinates) and falls back to the "basic" sort if that is the case.
This PR does not address the underlying issue of "bad points" which
still should be investigated. However, the sorting code should be less
brittle to unexpected bounding boxes in the first case.
Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296
Bumps unstructured-inference==05.23 to pull in @christinestraub's fix:
https://github.com/Unstructured-IO/unstructured-inference/pull/198 , so
embedded Images
in PDF's are now included in partition results ("hi_res").
From the perspective of elements with clean text, this is not a big win
as a lot of the images have OCR garbage. However, it is important to
preserve image elements for other downstream use cases, so overall this
is a step forward.
### Summary
Closes#1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
### Testing
The following is code previously produced one giant element on `main`.
```python
from unstructured.partition.html import partition_html
filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)
len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```
The output should be:
```python
January 2023
(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)
The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.
Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
The issue was that for blocks detected in an image such as:

, where the full image is:
https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png
, many ListItem's would be extracted that were not adding much value to
the output (assuming the block was determined to be of type List from
the layout model). This particular file is also used in ingest tests,
and you can see the prior output here:
https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280
Test Instructions:
1. run the following snippet:
```
import json
import os
from datetime import datetime
from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)
print(f"unstructured version: {__version__}")
#for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
d1 = datetime.now()
elements = partition(filename=filename, strategy=strategy)
elems_as_dicts = json.loads(elements_to_json(elements, indent=2))
# strip out metadata for the sake of more readable results
for element_dict in elems_as_dicts:
del element_dict["metadata"]
json_filename=f"{output_filename_part}-{strategy}.json"
with open(json_filename, "w") as jsonf:
jsonf.write(json.dumps(elems_as_dicts, indent=2))
d2 = datetime.now()
print(f"num elements for {strategy}: {len(elements)}")
print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}")
```
updating the `filename` and `output_dir` paths for your particular local
environment.
2. Open the json file that was writen to your `output_dir`, named
IRS-form-1987.png-hi_res.json
Witness the new element:
```
{
"type": "ListItem",
"element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
"text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87
-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation
or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio
ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f
or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre
dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the
services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil
e Form 3115 for these changes."
},
```
### Summary
Closes#1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html
html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
- Puts threaded replies into the same text field as parent message,
allowing for a full thread to be under a single element_id
- Output is now XML instead of TXT to allow for easier parsing of new
format.
https://github.com/Unstructured-IO/unstructured/issues/1186
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.
I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
Bump to unstructured-inference==0.5.13, which includes:
Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
* add the first version of airtable connector
* change imports as inline to fail gracefully in case of lacking dependency
* parse tables as csv rather than plain text
* add relevant logic to be able to use --airtable-list-of-paths
* add script for creation of reseources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings
* fix ingest test names
* add scripts for the large table test
* remove large table test from diff test
* make base and table ids explicit
* add and remove comments
* use -ne instead of !=
* update code based on the recent ingest refactor, update changelog and version
* shellcheck fix
* update comments
* update check-num-rows-and-columns-output error message
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* update help comments
* update help comments
* update help comments
* update workflows to set auth tokens and to run make install
* add comments on create_scale_test_components
* separate component ids from the test script, add comments to document test component creation
* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id
* shellcheck fixes
* shellcheck fixes
* update docs
* update comment
* bump version
* add wrongly deleted file
* sort columns before saving to process
* Update ingest test fixtures (#1098)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* trigger CI
* trigger CI
* trigger CI
* do not ingest personal spaces in the diff test
* fix argument
* Update ingest test fixtures (#1083)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add param
* expected test
* add option (to do doc nit)
* test with api for now
* typo
* test with api key
* use local only
* encoding -> partition-encoding
* changelog and version
* Update ingest test fixtures (#1055)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
* ignore coordinates
* no witespace lol
* Update ingest test fixtures (#1061)
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
* add auto_paragraph_grouper. add line break pattern.
* combine group_broken_paragraph and blank_line_grouper function
* fix make check errors
* fix make check errors
* fix make check errors
* fix make check errors
* run make tidy to fix errors
* tidy core.py and text.py
* fix blank-line breaker to extends the result and replace new line with space
* fix function name typo
* call group_broken_paragraphs for blank_line_grouper
* edit function name from one_line_grouper to new_line_grouper for consistency
* edit threshold from 0.5 to 0.1
* edit threshold from 0.5 to 0.1
* Revert "call group_broken_paragraphs for blank_line_grouper"
This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.
* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1
* edit test_text assertion. remove all BULLETS_PATTERN.
* Update ingest test fixtures (#1052)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* edit test case in test_xml_partition
* update assertion on test_auto
---------
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph
* chore: add docstring
* chore: fix lint errors
* feat: ignore spaces when extracting emphasized texts from a paragraph
* feat: add functionality to track emphasized text (`bold/italic` formatting) from table
* test: add test case for grabbing emphasized texts from element metadata
* chore: fix lint errors
* chore: update changelog & version
* Update ingest test fixtures (#1047)
* track tags in html
* pass through links as metadata
* add test for grabbing links
* one more link
* changelog and version
* update docs
* fix tests
* update empty link assertion
* ingest-test-fixtures-update
* Update ingest test fixtures (#961)
* Add confluence connector and an example script
* add test script, add dependency installations
* add authentication secret variables for ci tests and actions
* add dependency installation commands for workflows
* add dependency installation commands for workflows
* Update ingest test fixtures (#907)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add add ingest test fixtures update workflow for python 3.10, update example script with dummy values
* change workflow name to avoid confusion
* change workflow name to avoid confusion
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions
* Update ingest test fixtures (#911)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* revert back the test python version matrix
* recompile dependencies
* modifications for shellcheck
* update changelog and version
* changelog and version
* remove comments
* Update ingest test fixtures (#915)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add the option to state the number of spaces to be fetched
* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users
* add help message
* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test
* change test names
* rename connector arg
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* change arg name for connector
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* add comment to example
* change arg names
* add new tests to ingest test
* shellcheck remove redundant statement
* Update ingest test fixtures (#932)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* Update ingest test fixtures (#936)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* linting
* change file extensions to parse as html
* Update ingest test fixtures (#943)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* remove old fixtures
* update version to 0.8.2-dev3
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.