23 Commits

Author SHA1 Message Date
Roman Isecke
3eaf65a8c1
feat: refactor ingest (#3009)
### Description
This refactors the current ingest CLI process to support better
granularity in how the steps are ran
* Both multiprocessing and async now supported. Given that a lot of the
steps are IO-bound, such as downloading and uploading content, we can
achieve better parallelization by using async here
* Destination step broken up into a stager step and an upload step. This
will allow for steps that require manipulation of the data between
formats, such as converting the elements json into a csv format to
upload for tabular destinations, to be pulled out of the step that does
the actual upload.
* The process of writing the content to a local destination was now
pulled out as it's own dedicated destination connector, meaning you no
longer need to persist the content locally once the process is done if
the content was uploaded elsewhere.
* Quick update to the chunker/partition step to use the python client.
* Move the uncompress suppport as a pipeline step since this can
arbitrarily apply to any concrete files that have been downloaded,
regardless of where they came from.
* Leverage last modified date to mark files to be reprocessed, even if
the file already exists locally.

### Callouts
Retry configs haven't been moved over yet. This is an open question
because the intent was for it to wrap potential connection errors but
now any of the other steps that leverage an API might run into network
connection issues. Should those be isolated in each of the steps and
wrapped with the same retry configs? Or do we need to expose a unique
retry config for each step? This would bloat the input params even more.

### Testing
* If you want to run the new code as an SDK, there's an example file
that was added to highlight how to do that:
[example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py)
* If you want to run the new code as an isolated CLI:
```shell
PYTHONPATH=. python unstructured/ingest/v2/main.py --help
```
* If you want to see which commands have been migrated to the new
version, there's now a `v2` short help text next to those commands when
running the current cli:
```shell
PYTHONPATH=. python unstructured/ingest/main.py --help
Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help   

Options:
  --help  Show this message and exit.

Commands:
  airtable
  azure
  biomed
  box
  confluence
  delta-table
  discord
  dropbox
  elasticsearch
  fsspec
  gcs
  github
  gitlab
  google-drive
  hubspot
  jira
  local          v2
  mongodb
  notion
  onedrive
  opensearch
  outlook
  reddit
  s3             v2
  salesforce
  sftp
  sharepoint
  slack
  wikipedia
```

You can run any of the local or s3 specific ingest tests and these
should now work.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-05-21 17:01:49 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Matt Robinson
b4d9ad8130
enhancement: detect headers in partition_pdf with fast strategy (#2455)
### Summary

Detects headers and footers when using `partition_pdf` with the fast
strategy. Identifies elements that are positioned in the top or bottom
5% of the page as headers or footers. If no coordinate information is
available, an element won't be detected as a header or footer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-02-23 16:56:09 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
2023-11-15 00:52:58 -08:00
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```
elements = partition(filename=filename, strategy="fast")
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
John
1ead5a27df
Jj/2011 missing languages metadata (#2037)
### Summary
Closes #2011 

`languages` was missing from the metadata when partitioning pdfs via
`hi_res` and `fast` strategies and missing from image partitions via
`hi_res`. This PR adds `languages` to the relevant function calls so it
is included in the resulting elements.


### Testing
On the main branch, `partition_image` will include `languages` when
`strategy='ocr_only'`, but not when `strategy='hi_res'`:
```
filename = "example-docs/english-and-korean.png"
from unstructured.partition.image import partition_image

elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor'])
elements[0].metadata.languages

elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor'])
elements[0].metadata.languages
```

For `partition_pdf`, `'ocr_only'` will include `languages` in the
metadata, but `'fast'` and `'hi_res'` will not.
```
filename = "example-docs/korean-text-with-tables.pdf"
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename, strategy="ocr_only", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="fast", languages=['kor'])
elements[0].metadata.languages


elements = partition_pdf(filename, strategy="hi_res", languages=['kor'])
elements[0].metadata.languages
```

On this branch, `languages` is included in the metadata regardless of
strategy

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2023-11-13 16:47:05 +00:00
Roman Isecke
ba4477ac20
feat: support table conversion for tabular destination connectors (#1917)
### Description
* A full schema was introduced to map the type of all output content
from the json partition output and mapped to a flattened table structure
to leverage table-based destination connectors. The delta table
destination connector was updated at the moment to take advantage of
this.
* Existing method to convert to a dataframe was updated because it had a
bug in it. Object content in the metadata would have the key name
changed when flattened but then this would be omitted since it didn't
exist in the `_get_metadata_table_fieldnames` response.
* Unit test was added to make sure we handle all values possible in an
Element when converting to a table
* Delta table ingest test was split into a source and destination test
(looking ahead to split these up in CI)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-03 16:47:21 +00:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00
Roman Isecke
a2af72bb79
local connector metadata and deserialization fix (#1800)
### Description
* Priority of this was to fix deserialization of ingest docs. Currently
the source metadata wasn't being persisted
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* As part of serialization, it was forcing docs to fetch source metadata
if it hadn't already to add to the generated dict/json. This shouldn't
have happened if the underlying variable `_source_metadata` was `None`.
This way the doc can be serialized without any calls being made.
* Serialization was moved to the `to_dict` method to make it more
universal.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-23 15:51:52 +00:00
ryannikolaidis
80c3c24ca5
ingest retry strategy refactor <- Ingest test fixtures update (#1780)
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: benjats07 <benjats07@users.noreply.github.com>
2023-10-18 04:33:57 +00:00
Inscore
8ab40c20c1
fix: correct PDF list item parsing (#1693)
The current implementation removes elements from the beginning of the
element list and duplicates the list items

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-10-11 20:38:36 +00:00
Christine Straub
5d14a2aea0
feat: shrink bboxes by top left (#1633)
Closes #1573.
### Summary
- update `shrink_bbox()` to keep top left rather than center
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-10-06 05:16:11 +00:00
Christine Straub
b30d6a601e
Fix/1209 tweak xycut ordering output (#1630)
Closes GH Issue #1209.

### Summary
- add swapped `xycut` sorting
- update `xycut` sorting evaluation script

PDFs:
-
[sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf)
-
[multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf)
-
[11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf)
### Testing
```
elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res")
print("\n\n".join([str(el) for el in elements]))
```
### Evaluation
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only
```
2023-10-05 07:41:38 +00:00
Christine Straub
94fbbed189
feat: bbox shrinking in xycut algo, better natural reading order (#1560)
Closes GH Issue #1233.

### Summary
- add functionality to shrink all bounding boxes along x and y axes
(still centered around the same center point) before running xy-cut sort

### Evaluation
Run the followin gcommand for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).

PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
2023-09-29 03:48:02 +00:00
Klaijan
d26d591d6a
feat: get embedded url, associate text and start index for pdf (#1539)
**Executive Summary**

Adds PDF functionality to capture hyperlink (external or internal) for
pdf fast strategy along with associate text.

**Technical Details**

- `pdfminer` associates `annotation` (links and uris) with bounding box
rather than text. Therefore, the link and text matching is not a perfect
pair but rather a logic-based and calculation matching from bounding box
overlapping.
- There is no word-level bounding box. Only character-level (access
using `LTChar`). Thus in order to get to word-level, there is a window
slicing through the text. The words are captured in alphanumeric and
non-alphanumeric separately, meaning it will split the word if contains
both, on the first encounter of non-alphanumeric.)
- The bounding box calculation is calculated using start and stop
coordinates for the corresponding word calculated from above. The
calculation is simply using distance between two dots.

The result now contains `links` in `metadata` as shown below:

```
            "links": [
                {
                    "text": "link",
                    "url": "https://github.com/Unstructured-IO/unstructured",
                    "start_index": 12
                },
                {
                    "text": "email",
                    "url": "mailto:unstructuredai@earlygrowth.com",
                    "start_index": 30
                },
                {
                    "text": "phone number",
                    "url": "tel:6505124019",
                    "start_index": 49
                }
            ]
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-27 13:43:32 -04:00
shreyanid
32bfebccf7
feat: introduce language detection function for text partitioning function (#1453)
### Summary
Uses `langdetect` to detect all languages present in the input document.

### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization

### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']

---------

Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-26 18:09:27 +00:00
Klaijan
00181b88df
feat: pdf auto strategy groups broken numbered and bullet list items(#1393)
**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.

**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model'
'tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

Now it reads:

```
'1. An off-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the off-the-shelf usage'
'3. Comprehensive tools for efficient document image data annotation and model' tuning to support different levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.

**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-09-13 21:30:06 +00:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
qued
79f734d3f9
fix: better extractable check (#900)
auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.
2023-07-07 23:41:37 -05:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
kravetsmic
7fd7d7afae
feat(biomed connector): added additional params (#468) (#623)
Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay.

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-15 01:57:45 -07:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00