1393 Commits

Author SHA1 Message Date
ryannikolaidis
d22044a44c
fix: unstructured-ingest embedding KeyError (#1727)
Currently adding the embedding flag to any unstructured-ingest call
results in this failure:

```
2023-10-11 22:42:14,177 MainProcess ERROR    'b8a98c5d963a9dd75847a8f110cbf7c9'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/ryannikolaidis/Development/unstructured/unstructured/unstructured/ingest/pipeline/copy.py", line 14, in run
    ingest_doc_json = self.pipeline_context.ingest_docs_map[doc_hash]
  File "<string>", line 2, in __getitem__
  File "/Users/ryannikolaidis/.pyenv/versions/3.10.11/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'b8a98c5d963a9dd75847a8f110cbf7c9'
"""
```

This is because the run method for the embedding node is not adding the
IngestDoc to the context map. This PR adds that logic and adds a test to
validate that the embeddings option works as expected.
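
For illustration, here is a standalone sketch (not the library's actual classes) of why the copy step raised the KeyError and what registering the doc in the shared map fixes:

```python
# Standalone sketch: PipelineContext/EmbeddingNode/CopyNode are illustrative
# stand-ins, not the real classes in unstructured/ingest/pipeline/.
import hashlib


class PipelineContext:
    def __init__(self):
        self.ingest_docs_map: dict = {}  # shared doc-hash -> serialized doc map


class EmbeddingNode:
    def __init__(self, context: PipelineContext):
        self.context = context

    def run(self, ingest_doc_json: str) -> str:
        doc_hash = hashlib.md5(ingest_doc_json.encode()).hexdigest()
        # The fix described above: register the doc under its hash so that
        # downstream nodes (e.g. the copy step) can look it up.
        self.context.ingest_docs_map[doc_hash] = ingest_doc_json
        return ingest_doc_json


class CopyNode:
    def __init__(self, context: PipelineContext):
        self.context = context

    def run(self, doc_hash: str) -> str:
        # Without the registration above, this lookup raises KeyError,
        # matching the traceback in this commit message.
        return self.context.ingest_docs_map[doc_hash]
```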

NOTE: until https://github.com/Unstructured-IO/unstructured/pull/1719
goes in, the expected results include the duplicate element bug; however,
this does at least prove that embeddings are generated and the function
doesn't error.
2023-10-12 20:27:30 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually turned up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
Yuming Long
cb247d8cc4
doc: update comment for ingest test pdf-fast-reprocess (#1733) 2023-10-12 17:33:25 +00:00
Roman Isecke
22e568cf64
roman/bugfix fix default language ingest option (#1729)
### Description
Set language to None by default. Update ingest test to use local file
used in language unit tests to validate.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-12 17:31:23 +00:00
Roman Isecke
9b5d5e0f9e
roman/cli infer table arg (#1685)
### Description
Adds a new parameter that maps to the `skip_infer_table_types` partition
arg. It applies to the partition config, which is set on all connectors.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-12 16:14:53 +00:00
Roman Isecke
35852bba83
roman/bugfix ingest pipeline reformat dir (#1717)
Closes #1724 
### Description
Add file needed to make that dir discoverable
2023-10-12 15:44:30 +00:00
Trevor Bossert
9864086bc8
Drop patch version of python in Scarf anonymous analytics (#1725) 2023-10-12 01:36:47 +00:00
Trevor Bossert
569561e59b
Add more params for scarf (#1720)
Allows slicing on further metrics.
2023-10-12 00:09:19 +00:00
Inscore
8ab40c20c1
fix: correct PDF list item parsing (#1693)
The current implementation removes elements from the beginning of the
element list and duplicates the list items.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-10-11 20:38:36 +00:00
Trevor Bossert
6acd06987b
Remove extra index url from docs (#1711)
It’s no longer required to specify the extra index URL, as we now use a
different method of gathering anonymous install analytics.
2023-10-11 19:34:49 +00:00
Trevor Bossert
f0a63e2712
Add basic call to scarf to get anonymous analytics (#1705)
There is a built-in option to not send data by setting an env var,
SCARF_NO_ANALYTICS=true.

DoD:
- When importing or running the unstructured package, it will make a GET
call to Scarf
- When the env variable is set to not track, the call is not made
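
A rough sketch of the opt-out pattern (illustrative only, with a placeholder URL; it is not the package's actual implementation):

```python
import os
import urllib.request


def report_anonymous_analytics() -> None:
    """Fire a best-effort GET unless the user has opted out via env var."""
    if os.environ.get("SCARF_NO_ANALYTICS", "").lower() == "true":
        return  # opted out: make no network call
    try:
        # Placeholder endpoint; the real one is configured by the package.
        urllib.request.urlopen("https://example.invalid/anonymous-ping", timeout=1)
    except Exception:
        # Analytics must never break imports or normal usage.
        pass
```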
0.10.21
2023-10-11 09:15:36 -07:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using the `langdetect` package.
Creates new kwargs for the user to set the document language (`languages`)
or to detect the language at the element level instead of the default
document level (`detect_language_per_element`)
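
A usage sketch based on the kwarg names above (the file path is just an example document):

```python
from unstructured.partition.auto import partition

# Set the document language explicitly, or omit `languages` to let
# langdetect infer it; detect_language_per_element=True switches detection
# from the document level to the element level.
elements = partition(
    filename="example-docs/fake-text.txt",  # example path
    languages=["eng"],
    detect_language_per_element=False,
)
print(elements[0].metadata.languages)
```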

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
0.10.20
2023-10-11 01:47:56 +00:00
Klaijan
ee75ce25e2
feat: element type frequency (#1688)
**Executive Summary**

Adds a function that returns the frequency of given element types and depths.
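
The PR doesn't show the function here, so the following is only an illustrative sketch of counting (element type, category depth) pairs:

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

from unstructured.documents.elements import Element


def element_type_frequency(
    elements: List[Element],
) -> Dict[Tuple[str, Optional[int]], int]:
    """Count how often each (element type, category_depth) pair occurs."""
    counts: Counter = Counter()
    for el in elements:
        counts[(el.category, el.metadata.category_depth)] += 1
    return dict(counts)
```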

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-11 00:36:44 +00:00
rvztz
7fd61e3a7f
Adds data source properties to git connectors (#1280)
Adds data source properties to git connectors:
-  date_created
-  date_modified
-  version
-  record_locator
These properties are instantiated when supported by the connector.

Separates the logic between fetching the file from the source and
`get_file`. Retrieves file metadata when any of the properties are
called.

Adds logic to check whether the file exists in the remote source. For
connectors that don't directly support this, adds exception handling to
catch any issues while retrieving the file.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-10-10 22:56:56 +00:00
shreyanid
9d228c7ecb
feat: calculate metric for percent of text missing (#1701)
### Summary
Missing text is a particularly important metric of quality for the
Unstructured library because information from the document is not being
captured and therefore not usable by downstream applications.

Add function to calculate the percent of text missing relative to the
source transcription. Function takes 2 text strings (output and source)
as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and
concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each
word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]
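
A small self-contained sketch of the comparison described above (the library's actual function and module names may differ):

```python
import re
from collections import Counter


def percent_missing_text(output: str, source: str) -> float:
    """Fraction of source words (by frequency) that are absent from the output."""
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    source_bag = Counter(tokenize(source))
    output_bag = Counter(tokenize(output))
    total = sum(source_bag.values())
    if total == 0:
        return 0.0
    # Only shortfalls count; duplicated/extra text in the output is not penalized.
    missing = sum(
        max(count - output_bag.get(word, 0), 0) for word, count in source_bag.items()
    )
    return missing / total  # bounded to [0, 1]


print(percent_missing_text(output="the cat sat", source="the cat sat on the mat"))
```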

### Test
- Several edge cases are covered in the test function (missing text,
duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT
strings as "output" and "source"
2023-10-10 20:54:49 +00:00
Yuming Long
e597ec7a0f
Fix: skip empty annotation bbox (#1665)
Address: https://github.com/Unstructured-IO/unstructured/issues/1663
## Summary
While computing the overlap between an element bbox and an annotation
bbox, we find the intersection of the two bboxes and divide it by the
size of the annotation bbox; this causes a zero division error if the
size of the annotation bbox is 0.

* This PR fixes the zero division error in the function
`check_annotations_within_element`
* It also fixes the error `TypeError: unsupported operand type(s) for -:
'float' and 'NoneType'` by no longer inserting empty words with a None
bbox into the list of words in the function
`get_word_bounding_box_from_element`
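
A minimal illustration of the guard (not the library's code; bboxes are plain (x1, y1, x2, y2) tuples here):

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(bbox: BBox) -> float:
    x1, y1, x2, y2 = bbox
    return max(x2 - x1, 0.0) * max(y2 - y1, 0.0)


def annotation_overlap_ratio(element_bbox: BBox, annotation_bbox: BBox) -> float:
    """Share of the annotation bbox that falls inside the element bbox."""
    annotation_area = area(annotation_bbox)
    if annotation_area == 0:
        # Degenerate (empty) annotation bbox: skip it instead of dividing by zero.
        return 0.0
    ix1 = max(element_bbox[0], annotation_bbox[0])
    iy1 = max(element_bbox[1], annotation_bbox[1])
    ix2 = min(element_bbox[2], annotation_bbox[2])
    iy2 = min(element_bbox[3], annotation_bbox[3])
    return area((ix1, iy1, ix2, iy2)) / annotation_area
```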

## Test
Reproduce with the code and document the user mentioned; you should see
no error:
```
from unstructured.partition.auto import partition

elements = partition(
    filename="./IZSAM8.2_221012.pdf",
    strategy="fast",
)
```
2023-10-10 20:48:44 +00:00
Mallori Harrell
a5d7ae4611
Feat: Bag of words for testing metric (#1650)
This PR adds the `bag_of_words` function to count the frequency of words
for evaluation.

**Testing**
```Python
from unstructured.cleaners.core import bag_of_words

string = "The dog loved the cat, but the cat loved the cow."

print(bag_of_words(string))
```

---------

Co-authored-by: Mallori Harrell <mallori@Malloris-MacBook-Pro.local>
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-10 18:46:01 +00:00
Roman Isecke
b38a6b3022
feat: add Notion connector retry strategy (#1492)
### Description
Adds a retry strategy to the Notion HTTP calls, leveraging a generic
backoff library with some tweaks to pass in values from the CLI.
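
For reference, the core of that pattern with the `backoff` library looks something like this (the endpoint and retry values are illustrative; the real connector wires them in from CLI options):

```python
import backoff
import requests


@backoff.on_exception(
    backoff.expo,
    (requests.exceptions.ConnectionError, requests.exceptions.Timeout),
    max_tries=5,  # in the connector this would come from a CLI option
)
def get_notion_page(page_id: str, token: str) -> dict:
    response = requests.get(
        f"https://api.notion.com/v1/pages/{page_id}",
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": "2022-06-28",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```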
2023-10-10 17:41:18 +00:00
Klaijan
1d80beaaf2
fix: initialize uri before try-except (#1690)
Fix github issue
https://github.com/Unstructured-IO/unstructured/issues/1686
2023-10-10 17:29:10 +00:00
ryannikolaidis
3e101d3e4f
build(test): skip full python matrix for most ingest tests (#1687)
We’re probably unfairly (to the test) making a large volume of new
connections and requests to test services when all of our ingest tests
run across the full python test matrix and when a lot of PRs are firing
at once. Let's limit the full matrix run to a select few, but still have
all ingest tests run on python v3.10. This is done by checking the
version and skipping in ingest-test.sh.

Bonus: Bumps ingest test fixture workflow to use 3.10. This technically
shouldn't make a difference, but since we're making 3.10 the default of
the matrix strategy, it probably makes sense to use 3.10 for the ingest
fixture generation as well for consistency.

## Testing
-
[example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537900978?pr=1687)
running all tests in 3.10
-
[example](https://github.com/Unstructured-IO/unstructured/actions/runs/6460319121/job/17537899999?pr=1687)
skipping/running the expected tests in 3.8
2023-10-10 16:39:34 +00:00
Dev Khant
f09b87da23
Doc : replace link upstream connectors with source connectors (#1683)
Fixes #1502

Here I have replaced `stream_connectors.html` with
`source_connectors.html`.
2023-10-09 21:37:51 -07:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
        filename,
        chunking_strategy="by_title",
        combine_text_under_n_chars=0,
        new_after_n_chars=50,
        max_characters=100,
    )

for chunk in chunk_elements:
     print(len(chunk.text))

# previously we were only respecting the "soft max" (default of 500) for elements other than tables
# now we should see that all the elements have text fields under 100 chars.
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
David Potter
8b93217a33
built(test): exclude version metadata from google drive test (#1682) 2023-10-07 19:34:32 -07:00
cragwolfe
46cb1b642a
chore: don't cleanup ingest test outputs (non-CI) (#1680)
When running test-ingest test fixtures locally (but not in CI), keep
output .json's and other workdir artifacts around for the convenience of
debugging.

**Test Instructions**

Run 

    bash -x ./test_unstructured_ingest/test-ingest-azure.sh

and witness output .json's are visible. Yay! Now, to instead clean up
output .json's and workdir, run:

    UNSTRUCTURED_CLEANUP_DEV_FIXTURES=1 bash -x ./test_unstructured_ingest/test-ingest-azure.sh

and witness the files have been cleaned up. Yay!
2023-10-07 02:18:37 +00:00
Klaijan
33edbf84f5
feat: add calculate edit distance feature (#1656)
**Executive Summary**

Adds a function to calculate the edit distance (Levenshtein distance)
between two strings. The function can return either: 1. a score
(similarity = 1 - distance/source_len) 2. the distance (raw Levenshtein
distance)

**Technical details**
- The `weights` param defaults to (2,1,1) for (insertion, deletion,
substitution), meaning that we penalize the insertions we need to add to
the output (target) in comparison with the source (reference) more
heavily. In other words, missing extractions are penalized higher.
- The function takes in 2 strings on the assumption that both strings are
already clean and concatenated (CCT)
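
A self-contained sketch of a weighted Levenshtein distance with those (insertion, deletion, substitution) weights, plus the score form (the library's actual function name and signature may differ):

```python
def weighted_edit_distance(output: str, source: str, weights=(2, 1, 1)) -> int:
    """Levenshtein distance to transform `output` into `source`, with per-op weights."""
    ins_w, del_w, sub_w = weights  # (insertion, deletion, substitution)
    m, n = len(output), len(source)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_w
    for j in range(1, n + 1):
        dp[0][j] = j * ins_w
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if output[i - 1] == source[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + del_w,      # delete from output
                    dp[i][j - 1] + ins_w,      # insert into output (missing text)
                    dp[i - 1][j - 1] + sub_w,  # substitute
                )
    return dp[m][n]


source, output = "the quick brown fox", "the quick fox"
distance = weighted_edit_distance(output, source)
score = 1 - distance / len(source)  # the similarity form described above
print(distance, round(score, 3))
```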

**Important Note!**
The test case needs to be updated to use CCT once the function is ready.
It currently only tests the "functionality" of edit distance, not edit
distance with CCT as it is intended to be used.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:21:14 +00:00
Trevor Bossert
ce206f1f85
add extra-index-url for scarf anonymous tracking (#1668)
This adds extra-index-url to our docs to allow for anonymous install
analytics to help us understand and improve our product.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:16:38 +00:00
Jack Retterer
7e310ecac2
Update Getting Started Guide in Documentation (#1667)
- Fixed typo that stated "infer_table_structured" instead of
"infer_table_structure"

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:12:52 +00:00
Yuming Long
dcd6d0ff67
Refactor: support entire page OCR with ocr_mode and ocr_languages (#1579)
## Summary
Second part of the OCR refactor to move it from the inference repo to the
unstructured repo; the first part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/231. This
PR adds the OCR processing logic for entire-page OCR and supports two OCR
modes, "entire_page" and "individual_blocks".

The updated workflow for `hi_res` partition:
* pass the document as data/filename to the inference repo to get
`inferred_layout` (DocumentLayout)
* pass the document as data/filename to the OCR module, which first opens
the document (creating a temp file/dir as needed) and splits the document
by pages (converting PDF pages to image pages for PDF files)
* if the OCR mode is `"entire_page"`
    * OCR the entire image
    * merge the OCR layout with the inferred page layout
* if the OCR mode is `"individual_blocks"`
    * from the inferred page layout, find elements with no extracted text
and crop the entire image by the bboxes of those elements
    * replace the empty-text elements with the text obtained from OCR of
the cropped image
* return all merged PageLayouts and form a DocumentLayout object for
later processing

This PR also bumps `unstructured-inference==0.7.2` since the branch
relies on the OCR refactor from unstructured-inference.
  
## Test
```
from unstructured.partition.auto import partition

entire_page_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="entire_page", ocr_languages="eng+kor", strategy="hi_res")
individual_blocks_ocr_mode_elements = partition(filename="example-docs/english-and-korean.png", ocr_mode="individual_blocks", ocr_languages="eng+kor", strategy="hi_res")
print([el.text for el in entire_page_ocr_mode_elements])
print([el.text for el in individual_blocks_ocr_mode_elements])
```
latest output:
```
# entire_page
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'accounts.', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASUREWH HARUTOM|2] 팬 입니다. 팬 으 로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 불 공 평 함 을 LRU, 이 일 을 통해 저 희 의 의 혹 을 전 달 하여 귀 사 의 진지한 민 과 적극적인 답 변 을 받을 수 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were', 'successfully sent', '4. Use the hashtag of Haruto on your tweet to show that vou have sent vour email]', '메 고']
# individual_blocks
['RULES AND INSTRUCTIONS 1. Template for day 1 (korean) , for day 2 (English) for day 3 both English and korean. 2. Use all your accounts. use different emails to send. Its better to have many email', 'Note: Remember to write your own "OPENING MESSAGE" before you copy and paste the template. please always include [TREASURE HARUTO] for example:', '안녕하세요, 저 희 는 YGEAS 그룹 TREASURES HARUTOM| 2] 팬 입니다. 팬 으로서, HARUTO 씨 받 는 대 우 에 대해 의 구 심 과 habe ERO, 이 머 일 을 적극 저 희 의 ASS 전 달 하여 귀 사 의 진지한 고 2 있 기 를 바랍니다.', '3. CC Harutonations@gmail.com so we can keep track of how many emails were ciiccecefisliy cant', 'VULLESSIULY Set 4. Use the hashtag of Haruto on your tweet to show that you have sent your email']
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-06 22:54:49 +00:00
Roman Isecke
2e1404e02c
refactor: unstructured ingest as a pipeline (#1551)
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it would help to separate the responsibility of each
of these into their own processes, running each in parallel and using
json files to share data across them. This will also help guarantee the
data is serializable if this code is used in an actual pipeline.
Following is a flow diagram of the proposed changes. As part of this
change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it, or
not. Possible nodes at this moment:
  * Doc factory: creates all the ingest docs via the source connector
  * Source: reads/downloads all of the content to process to the local
    filesystem to the location set by the `download_dir` parameter.
  * Partition: runs partition on all of the downloaded content in json
    format.
  * Any number of reformat nodes that modify the partitioned content.
    This can include chunking, embedding, etc.
  * Write: push the final json into the destination via the destination
    connector
* This pipeline relies on the information of the ingest docs to be
available via their serialization. An optimization was introduced with
the `IngestDocJsonMixin` which adds in all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility to download vs partition the content in the pipeline.
* The pipeline class only makes the doc factory, source and partition
nodes required, keeping with the logic that has been supported so far.
All reformatting nodes and write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline specific configs, but for now
what exists was used to minimize changes in this PR.
* Final step to copy the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(i.e. partition configs) along with the previous step via the filename
used. This allows a step's output to be reused _if_ all the parameters
for it have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.manager.DictProxy` to make sure any interaction with it
is behind a lock.
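
The shared-dict pattern referred to, in a minimal standalone form:

```python
import multiprocessing


def register(shared_docs, doc_hash: str, doc_json: str) -> None:
    # Writes go through the manager process, so concurrent workers are safe.
    shared_docs[doc_hash] = doc_json


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    ingest_docs_map = manager.dict()  # a DictProxy shared across processes
    with multiprocessing.Pool(2) as pool:
        pool.starmap(
            register,
            [(ingest_docs_map, f"hash-{i}", "{}") for i in range(4)],
        )
    print(dict(ingest_docs_map))
```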

### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands.
* All writers moved to using the class approach which extracts a lot of
the common code so there's less copy-paste when new runners are added.
* Use `@property` for source metadata on base ingest doc to add logic to
call `update_source_metadata` if it's still `None` at the time it's
fetched.


### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method.
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment so was never caught.
* Fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata` so it was never being called.

### Flow Diagram


![ingest_pipeline](https://github.com/Unstructured-IO/unstructured/assets/136338424/be485606-cfe0-4931-8b81-c2bf569cf1e2)
2023-10-06 18:49:29 +00:00
qued
b9fa20ab46
fix: isolate metadata imports to doctype (#1671)
In a different PR, some no-extras tests started failing with import
errors when something innocuous was imported from
`unstructured.file_utils.metadata`. This turned out to be because of the
top-level, doctype-specific imports in that file. Importing a general
metadata object shouldn't require installation of modules like `PIL`,
`docx`, and `openpyxl`.

To fix, I moved these imports inside the functions that use them and
added the `requires_dependencies` decorator to those functions.
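
The pattern looks roughly like this (the function below is hypothetical; `requires_dependencies` comes from `unstructured.utils`):

```python
from unstructured.utils import requires_dependencies


@requires_dependencies(["openpyxl"], extras="xlsx")
def _get_xlsx_properties(filename: str) -> dict:
    # Imported inside the function so that merely importing the metadata
    # module does not require the optional dependency to be installed.
    import openpyxl

    workbook = openpyxl.load_workbook(filename, read_only=True)
    props = workbook.properties
    return {"author": props.creator, "last_modified": props.modified}
```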

#### Testing:

You should be able to run something like:
```python
from unstructured.file_utils.metadata import Metadata
```

Without `openpyxl` installed.
2023-10-06 18:49:03 +00:00
John
6b7fe4469f
add docs with multiple languages for testing (#1591)
### Summary
Closes #1536 
Adds .txt documents for testing language detection.
2023-10-06 18:41:40 +00:00
Christine Straub
5d14a2aea0
feat: shrink bboxes by top left (#1633)
Closes #1573.
### Summary
- update `shrink_bbox()` to keep top left rather than center
### Evaluation
Run the following command for this
[PDF](https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin/patent-11723901-page2.pdf).
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-10-06 05:16:11 +00:00
ryannikolaidis
1e32da6389
build: fix merge queue issues (#1654)
Closes #1482 

There are two known issues when attempting to merge PRs via merge queue:

- CodeQL fails with: 
```
Error: ref 'refs/heads/gh-readonly-queue/main/pr-968-499f37f64b27c66d4fc68446dbea519860d06cf7' not found in this repository
```

- CI.changelog fails with: 
```
Get current git ref
Error: The process '/usr/bin/git' failed with exit code 128
```

The error with CodeQL is a known and still [open
issue](https://github.com/github/codeql-action/issues/1572). We don't
currently enforce branch protection for CodeQL, so probably our best
compromise is to simply not run this on the merge queue event. There
could be a narrow margin where some issue is introduced via merge, but
we'll still see issues on individual branches and on pushes to main, so
this is probably acceptable.

The changelog job now has a checkout step prior to paths-filter which
guarantees the git ref exists before attempting to execute the filter
action.

## Testing

Prior to this change, I was able to validate both the
[CodeQL](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414128010)
and
[changelog](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414128007/job/17414065768)
test errors

With these changes, validated that the merge queue was able to
[successfully
run](https://github.com/ryan-nikolaidis/unstructured/actions/runs/6414511843/job/17415024319)
the changelog CI job.
2023-10-05 21:58:39 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for the `source` property from
`unstructured_inference`, allowing the user to see the origin of the data
in the `detection_origin` field when the environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true is set.

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
This will print 'yolox' as the source for all the elements.
2023-10-05 20:26:47 +00:00
Christine Straub
f673ea4c40
enhancement: add visualization script to annotate elements (#1613)
This PR was initially created to close GitHub Issue #1604 (Synchronizing the default
layout model), but since it was already resolved in PR
[#1607](https://github.com/Unstructured-IO/unstructured/pull/1607), this
PR now only adds the visualization script used to investigate the issue.

### Summary
- add python script to annotate elements

PDF:
[references.pdf](https://github.com/Unstructured-IO/unstructured/files/12778270/references.pdf)

### Evaluation
```
PYTHONPATH=. python examples/layout-analysis/visualization.py references.pdf hi_res
```
2023-10-05 12:53:16 -07:00
Ronny H
b1cbfb845c
Update Slack community link on README page (#1653) 2023-10-05 12:07:00 -07:00
Sebastian Laverde Alfonso
e90a979f45
fix: Better logic for setting category_depth metadata for Title elements (#1517)
This PR promotes the `category_depth` metadata for `Title` elements from
`None` to 0, whenever `Headline` and/or `Subheadline` types (that are
also mapped to `Title` elements with depth 1 and 2) are present. An
additional test to `test_common.py` has been added to check on the
improvement. More test of how this logic fixes the behaviour can be
found in a adapted version on the colab
[here](https://colab.research.google.com/drive/1LoScFJBYUhkM6X7pMp8cDaJLC_VoxGci?usp=sharing).

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-10-05 17:51:06 +00:00
Newel H
e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items, and other
slide text are properly nested under the slide title. This will enable
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check if the paragraph item has a level parameter via the python pptx
paragraph. If so, use the paragraph level as the category_depth level.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element will be set as a Title with a depth
corresponding to the enumerated paragraph increment (e.g. 1st line of
title shape is depth 0, second is depth 1 etc.).
- If the shape is not a title shape but the paragraph is a title, the
increment will match the level + 1, so that all paragraph titles are at
least depth 1, placing them below the slide title element.
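
A rough python-pptx sketch of deriving a depth along these lines (illustrative only, not the partitioner's actual code):

```python
from pptx import Presentation


def iter_paragraphs_with_depth(path: str):
    prs = Presentation(path)
    for slide in prs.slides:
        title_shape = slide.shapes.title  # may be None
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            is_title_shape = (
                title_shape is not None and shape.shape_id == title_shape.shape_id
            )
            for idx, paragraph in enumerate(shape.text_frame.paragraphs):
                if not paragraph.text.strip():
                    continue
                # Title shape: depth follows the enumerated paragraph (0, 1, ...);
                # other shapes: use the python-pptx paragraph level.
                depth = idx if is_title_shape else paragraph.level
                yield paragraph.text, depth


for text, depth in iter_paragraphs_with_depth("example.pptx"):  # placeholder path
    print(depth, text)
```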
2023-10-05 12:55:45 -04:00
Christine Straub
b30d6a601e
Fix/1209 tweak xycut ordering output (#1630)
Closes GH Issue #1209.

### Summary
- add swapped `xycut` sorting
- update `xycut` sorting evaluation script

PDFs:
-
[sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf)
-
[multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf)
-
[11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf)
### Testing
```
elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res")
print("\n\n".join([str(el) for el in elements]))
```
### Evaluation
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only
```
2023-10-05 07:41:38 +00:00
rvztz
6d8572d7d6
feat: Adds data source properties to the Jira connector (#1516) 2023-10-04 20:04:18 -07:00
Ronny H
8564d920ac
Update Metadata and Installation Documentation (#1646)
* Updated Metadata page: add common and additional metadata fields by
document types and connectors
* Updated specific installation extra by document types and connectors
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
0.10.19
2023-10-05 01:25:41 +00:00
ryannikolaidis
9960ce5f00
fix: chunking fails with detection_class_prob in metadata (#1637) 2023-10-04 22:14:21 +00:00
Klaijan
0a65fc2134
feat: xlsx subtable extraction (#1585)
**Executive Summary**
Unstructured is now able to capture subtables, along with other text
element types within the `.xlsx` sheet.

**Technical Details**
- The function now reads the excel *without* a header by default
- Leverages a connected-components search to find subtables within the
sheet. This search is based on a depth-first search (DFS); a sketch of
the idea is shown below
- It also handles overlapping table or text cases
- A row with only a single cell of data is not considered a table, and is
therefore passed on to determine the element type as text
- In connected elements, it is possible to have a table title, header, or
footer. We run the count for the first non-single empty rows from the top
and bottom to determine that text
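
A compact sketch of that connected-components search on a grid of cell values (illustrative; the real implementation also applies the header/footer rules described above):

```python
from typing import List, Set, Tuple


def connected_cell_regions(grid: List[List[str]]) -> List[Set[Tuple[int, int]]]:
    """Group non-empty cells into connected components via iterative DFS."""
    rows, cols = len(grid), len(grid[0]) if grid else 0
    seen: Set[Tuple[int, int]] = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == "" or (r, c) in seen:
                continue
            stack, region = [(r, c)], set()
            while stack:
                i, j = stack.pop()
                if (i, j) in seen:
                    continue
                seen.add((i, j))
                region.add((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols and grid[ni][nj] != "":
                        stack.append((ni, nj))
            regions.append(region)
    return regions


sheet = [
    ["Title", "", "", ""],
    ["A", "B", "", ""],
    ["1", "2", "", "Note"],
]
print(connected_cell_regions(sheet))  # two regions: the subtable and "Note"
```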

**Result**
This table now reads as:
<img width="747" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2">

```
[
    {
        "type": "Title",
        "element_id": "3315afd97f7f2ebcd450e7c939878429",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Financial performance"
    },
    {
        "type": "Table",
        "element_id": "17f5d512705be6f8812e5dbb801ba727",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "8a9db7161a02b427f8fda883656036e1",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Operational metrics"
    },
    {
        "type": "Table",
        "element_id": "d5d16f7bf9c7950cd45fae06e12e5847",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "f97e9da0e3b879f0a9df979ae260a5f7",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Other"
    },
    {
        "type": "Table",
        "element_id": "080e1a745a2a3f2df22b6a08d33d59bb",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n"
    }
]
```
2023-10-04 13:30:23 -04:00
Yao You
19d8bff275
feat: change default hi_res model to yolox quantized (#1607) 2023-10-04 03:28:47 +00:00
Manirevuri
13453d6358
Fix: Documentation for Unstructured API's (#1624)
Fixed "files=file_data" param for all python files

---------

Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-03 20:42:32 +00:00
Roman Isecke
8821689f36
Roman/s3 minio all cloud support (#1606)
### Description
Exposes the endpoint url as an access kwarg when using the s3 filesystem
library via the fsspec abstraction. This allows for any non-aws data
providers that support the s3 protocol to be used with the s3 connector
(i.e. minio)

Closes out https://github.com/Unstructured-IO/unstructured/issues/950
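
Underneath, this roughly maps to the standard s3fs/fsspec pattern of passing `endpoint_url` through `client_kwargs`, e.g. against a local MinIO instance (credentials and bucket name are placeholders):

```python
import fsspec

fs = fsspec.filesystem(
    "s3",
    key="minioadmin",      # placeholder credentials
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},  # non-AWS S3 endpoint
)
print(fs.ls("my-bucket"))  # placeholder bucket
```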

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-03 14:31:28 -04:00
Amanda Cameron
1fb464235a
chore: Table chunking (#1540)
This change is adding to our `add_chunking_strategy` logic so that we
are able to chunk Table elements' `text` and `text_as_html` params. In
order to keep the functionality under the same `by_title` chunking
strategy we have renamed the `combine_under_n_chars` to
`max_characters`. It functions the same way for the combining elements
under Title's, as well as specifying a chunk size (in chars) for
TableChunk elements.

*renaming the variable to `max_characters` will also reflect the 'hard
max' we will implement for large elements in followup PRs


Additionally -> some lint changes snuck in when I ran `make tidy` hence
the minor changes in unrelated files :)

TODO:
- add unit tests
  --> note: added where I could! In some unit tests I just clarified that
the chunking strategy was now 'by_title', because we don't have an
example file with Table elements to test the 'by_num_characters' chunking
strategy
- update changelog

To manually test:
```
In [1]: filename="example-docs/example-10k.html"

In [2]: from unstructured.chunking.title import chunk_table_element

In [3]: from unstructured.partition.auto import partition

In [4]: elements = partition(filename)

# element at -2 happens to be a Table, and we'll get chunks of char size 4 here
In [5]: chunks = chunk_table_element(elements[-2], 4)

# examine text and text_as_html params
In [6]: for c in chunks:
                    print(c.text)
                    print(c.metadata.text_as_html)
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-03 09:40:34 -07:00
Newel H
bcd0eee753
Feat: Detect all text in HTML Heading tags as titles (#1556)
## Summary
This will increase the accuracy of hierarchies in HTML documents and
provide more accurate element categorization. If text is in an HTML
heading tag and is not a list item, it is now categorized as a title.

## Testing
```
from unstructured.partition.html import partition_html
elements = partition_html(url="https://www.eda.gov/grants/2015")
```
Before, the date headers at the given url would not be correctly parsed
as titles; after this change they are now correctly identified.

A unit test to verify the functionality has been added:
`test_html_partition::test_html_heading_title_detection` that includes
values that were previously detected as narrative text and uncategorized
text
2023-10-03 11:54:36 -04:00
Klaijan
d6efd52b4b
fix: isalnum referenced before assignment (#1586)
**Executive Summary**
Fixes a bug in the `get_word_bounding_box_from_element` function that
prevented `partition_pdf` from running.

**Technical Details**
- The function originally defined `isalnum` only on the first index. It
is now switched to a conditional on the flag value.
2023-10-03 11:25:20 -04:00
Roman Isecke
b2e997635f
roman/es ingest test fixes (#1610)
### Description
Updates the elasticsearch docker setup to use docker-compose.

Would close out
https://github.com/Unstructured-IO/unstructured/issues/1609
2023-10-03 10:39:33 -04:00