1447 Commits

Yuming Long
01a0e003d9
Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850)
### Summary

A follow-up to
https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to
remove the lines that pass `extract_tables` to inference, and this PR notes the
table regression that occurs if we only do one OCR pass for the entire doc.

**Tech details:**
* stop passing `extract_tables` parameter to inference
* added table extraction ingest test for image, which was skipped
before, and the "text_as_html" field contains the OCR output from the
table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit
tests don't need to test additional parameters
* added `error_margin` as an ENV variable used when comparing the bounding boxes
of `ocr_region` and `table_element` (see the sketch below)
* added more tests for tables and noted the table regression in the test for
partition pdf
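
As an illustration of the `error_margin` idea, the comparison roughly amounts to checking whether an OCR region's bounding box fits inside a table element's box once the table box is padded by the margin. The env-var and helper names below are hypothetical, not the actual code:

```python
import os

# hypothetical env-var name -- a sketch of the idea, not the library code
ERROR_MARGIN = float(os.environ.get("OCR_BBOX_ERROR_MARGIN", "0"))

def region_in_table(ocr_bbox, table_bbox, margin=ERROR_MARGIN):
    """Return True when the OCR region lies inside the table bbox padded by the margin."""
    ox1, oy1, ox2, oy2 = ocr_bbox
    tx1, ty1, tx2, ty2 = table_bbox
    return (
        ox1 >= tx1 - margin
        and oy1 >= ty1 - margin
        and ox2 <= tx2 + margin
        and oy2 <= ty2 + margin
    )

region_in_table((105, 198, 220, 214), (100, 200, 500, 400), margin=5)  # True
```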

### Test
* to confirm we stopped passing the `extract_tables` parameter to inference, run the test
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` on the branch before this
one and you will see a warning like `Table OCR from get_tokens method
will be deprecated....`, which means the table OCR in the
inference repo was called. This branch removes the warning.
2023-10-24 17:13:28 +00:00
qued
44cef80c82
test: Add test to ensure languages trickle down to ocr (#1857)
Closes
[#93](https://github.com/Unstructured-IO/unstructured-inference/issues/93).

Adds a test to ensure language parameters are passed all the way from
`partition_pdf` down to the OCR calls.

#### Testing:

CI should pass.
2023-10-24 16:54:19 +00:00
Yao You
b530e0a2be
fix: partition docx from teams output (#1825)
This PR resolves #1816 
- the current docx partitioner assumes all content lives in sections
- this is not true for an MS Teams chat transcript exported to docx
- the code now checks whether there are sections; if not, it iterates
through the paragraphs and partitions the content found there (see the sketch below)
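
A minimal sketch of the described fallback using `python-docx` (the file path is an example and the partitioning details are omitted):

```python
from docx import Document

document = Document("teams-chat-transcript.docx")  # example path

if document.sections:
    # normal path: contents are organized into sections, partition them as before
    pass
else:
    # Teams exports may have no sections; fall back to iterating paragraphs directly
    paragraph_texts = [p.text for p in document.paragraphs if p.text.strip()]
```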
2023-10-24 15:17:02 +00:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00
Roman Isecke
37e841310a
refactor ingest CLI for better code reuse (#1846)
### Description
Much of the current CLI code is copy-paste across subcommands. To
alleviate this, most of the duplicate code was moved into base classes
for src and destination connector commands. This also allows for code
reuse when a destination command is called and it no longer has to jump
through hoops to dynamically recreate what _would_ have been called by a
source command.

The reason everything can't live in a single BaseCmd class is the
need for a dynamic map to the source command, which runs into a circular
dependency issue if it all lives in one class. Splitting it into a
`BaseSrcCmd` and a `BaseDestCmd` class avoids that issue (see the sketch below).
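
A rough sketch of the shape of that split (class and method names are illustrative; the real classes carry the Click option wiring and configs):

```python
class BaseSrcCmd:
    """Shared logic for source subcommands: builds options and runs the source connector."""

    def __init__(self, cmd_name: str):
        self.cmd_name = cmd_name

    def run(self, **options):
        print(f"running source connector for {self.cmd_name}")


class BaseDestCmd:
    """Shared logic for destination subcommands.

    Holds a reference to the source command it runs first, which is why it can't
    share a single base class with BaseSrcCmd without a circular dependency.
    """

    def __init__(self, cmd_name: str, source_cmd: BaseSrcCmd):
        self.cmd_name = cmd_name
        self.source_cmd = source_cmd

    def run(self, **options):
        self.source_cmd.run(**options)
        print(f"writing results via destination connector {self.cmd_name}")


BaseDestCmd("s3", source_cmd=BaseSrcCmd("local")).run()
```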
2023-10-24 12:57:33 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carries `skip_infer_table_types` through to `infer_table_structure` in the
partition flow. Now Table elements from PPT/X, DOC/X, etc. should not have a
`text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through the html flow; if we've already got the html it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
html table in these cases.
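
A small illustration of the intended behavior (a hypothetical helper, not the actual partition code): the table-structure flag is dropped for filetypes listed in `skip_infer_table_types`, so those Table elements never get `text_as_html`.

```python
def resolve_infer_table_structure(infer_table_structure, skip_infer_table_types, filetype):
    # hypothetical helper illustrating the carried-through flag
    return infer_table_structure and filetype not in (skip_infer_table_types or [])

resolve_infer_table_structure(True, ["pptx", "docx"], "docx")  # False -> no text_as_html
resolve_infer_table_structure(True, ["pptx", "docx"], "pdf")   # True
```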


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
Klaijan
6707cab250
build: text extraction evaluation metrics workflow added (#1757)
**Executive Summary**
This PR adds the evaluation metrics to our current workflow. It verifies
the flow: when code is pushed, it gets evaluated
against our gold standard and the results are output to a `.tsv` file.

**Technical Details**
- Adds evaluation metrics to the test-ingest workflow
- Makes use of the `structured-output` from `test-ingest` and compares it to the
gold standard uploaded in s3, downloading it locally when making the
comparison. The folder currently in use is
`s3://utic-dev-tech-fixtures/small-cct`. This dir is editable in the
shell script.
- With this PR, only one file from one connector is used for the comparison.

**Misc**
- There is not much overlap yet between the test-ingest files and the gold standard; more
files will be added.

**Outputs**
2 `.tsv` files are saved under `test_unstructured_ingest/metrics/`.


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/222e437c-1a94-4d7c-9320-81696633b1ae)


![image](https://github.com/Unstructured-IO/unstructured/assets/2177850/5c840322-6739-4634-8868-eba04b4ebc96)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-10-23 21:39:22 +00:00
Roman Isecke
a2af72bb79
local connector metadata and deserialization fix (#1800)
### Description
* The priority here was to fix deserialization of ingest docs: the
source metadata wasn't being persisted.
* To help debug this, source metadata was added to the local ingest doc
as well.
* A unit test was added to make sure the metadata itself is persisted.
* During serialization, docs were being forced to fetch source metadata,
if they hadn't already, to add it to the generated dict/json. This shouldn't
happen when the underlying variable `_source_metadata` is `None`; now the
doc can be serialized without any calls being made (see the sketch below).
* Serialization was moved to the `to_dict` method to make it more
universal.
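
A toy sketch of that serialization behavior (hypothetical classes, not the actual ingest doc implementation):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SourceMetadata:
    date_created: Optional[str] = None
    version: Optional[str] = None

@dataclass
class LocalIngestDoc:
    path: str
    _source_metadata: Optional[SourceMetadata] = field(default=None, repr=False)

    def update_source_metadata(self):
        # expensive call against the source system -- only runs when explicitly requested
        self._source_metadata = SourceMetadata(date_created="2023-10-23", version="1")

    def to_dict(self):
        # serialization no longer forces a metadata fetch when _source_metadata is None
        return {
            "path": self.path,
            "source_metadata": vars(self._source_metadata) if self._source_metadata else None,
        }

doc = LocalIngestDoc(path="example-docs/handbook-1p.docx")
print(doc.to_dict())  # no fetch triggered; source_metadata stays None
```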

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-23 15:51:52 +00:00
qued
7fdddfbc1e
chore: improve kwarg handling (#1810)
Closes `unstructured-inference` issue
[#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265).

Cleaned up the kwarg handling, taking opportunities to turn instances of
handling kwargs as dicts into using them as normal parameters in function
signatures (see the sketch below).
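
An illustrative before/after of the kind of change made (function and parameter names are made up for the example):

```python
# before: an option rides along inside a kwargs dict
def render_page_before(filename, **kwargs):
    dpi = kwargs.get("pdf_image_dpi", 200)
    return f"{filename} rendered at {dpi} dpi"

# after: the same option is an explicit, discoverable parameter in the signature
def render_page_after(filename, pdf_image_dpi=200):
    return f"{filename} rendered at {pdf_image_dpi} dpi"
```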

#### Testing:

Should just pass CI.
2023-10-23 04:48:28 +00:00
Steve Canny
82c8adba3f
fix: split-chunks appear out-of-order (#1824)
**Executive Summary.** Code inspection in preparation for adding the
chunk-overlap feature revealed a bug causing split-chunks to be inserted
out-of-order. For example, elements like this:

```
Text("One" + 400 chars)
Text("Two" + 400 chars)
Text("Three" + 600 chars)
Text("Four" + 400 chars)
Text("Five" + 600 chars)
```

Should produce chunks:
```
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("Four")              # (400 chars)
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
```

but produced this instead:
```
CompositeElement("Five ...")          # (500 chars)
CompositeElement("rest of Five ...")  # (100 chars)
CompositeElement("Three ...")         # (500 chars)
CompositeElement("rest of Three ...") # (100 chars)
CompositeElement("One ...")           # (400 chars)
CompositeElement("Two ...")           # (400 chars)
CompositeElement("Four")              # (400 chars)
```

This PR fixes that behavior, which was introduced on Oct 9 of this year in
commit f98d5e65 when chunk splitting was added.


**Technical Summary**

The essential transformation of chunking is:

```
elements          sections              chunks
List[Element] -> List[List[Element]] -> List[CompositeElement]
```

1. The _sectioner_ (`_split_elements_by_title_and_table()`) _groups_
semantically related elements into _sections_ (`List[Element]`); in the
best case, a section is a title (heading) and the text that follows it
(until the next title). A heading and its text are often referred to as a
_section_ in publishing parlance, hence the name.

2. The _chunker_ (`chunk_by_title()` currently) does two things:
1. first it _consolidates_ the elements of each section into a single
`ConsolidatedElement` object (a "chunk"). This includes both joining the
element text into a single string as well as consolidating the metadata
of the section elements.
2. then if necessary it _splits_ the chunk into two or more
`ConsolidatedElement` objects when the consolidated text is too long to
fit in the specified window (`max_characters`).

Chunk splitting is only required when a single element (like a big
paragraph) has text longer than the specified window. Otherwise a
section and the chunk that derives from it reflect an even element
boundary.

`chunk_by_title()` was elaborated in commit f98d5e65 to add this
"chunk-splitting" behavior.

At the time there was some notion of wanting to "split from the end
backward" such that any small remainder chunk would appear first, and
could possibly be combined with a small prior chunk. To accomplish this,
split chunks were _inserted_ at the beginning of the list instead of
_appended_ to the end.

The `chunked_elements` variable (`List[CompositeElement]`) holds the
sequence of chunks that result from the chunking operation and is the
return value of `chunk_by_title()`. The "split-from-the-end" chunks were
inserted at the beginning of this list, and that unfortunately produces the
out-of-order behavior: the insertion happened at the beginning of the
"all-chunks-in-document" list, not a sublist scoped to the current chunk.

Further, the "split-from-the-end" behavior can produce no benefit
because chunks are never combined, only _elements_ are combined (across
semantic boundaries into a single section when a section is small) and
sectioning occurs _prior_ to chunking.

The fix is to rework the chunk-splitting passage into a straightforward
iterative algorithm that works both when a chunk must be split and when
it doesn't. This algorithm is also very easily extended to implement
split-chunk-overlap which is coming up in an immediately following PR.

```python
# -- split chunk into CompositeElement objects maxlen or smaller --
text_len = len(text)
start = 0
remaining = text_len

while remaining > 0:
    end = min(start + max_characters, text_len)
    chunked_elements.append(CompositeElement(text=text[start:end], metadata=chunk_meta))
    start = end - overlap
    remaining = text_len - end
```

*Forensic analysis*
The out-of-order-chunks behavior was introduced in commit 4ea71683 on
10/09/2023 in the same PR in which chunk-splitting was introduced.

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
0.10.25
2023-10-21 01:37:34 +00:00
Yuming Long
ce40cdc55f
Chore (refactor): support table extraction with pre-computed ocr data (#1801)
### Summary

Table OCR refactor: move the OCR part of the table model from the inference repo
to the unst repo.
* Before this PR, the table model extracted OCR tokens (texts and
bounding boxes) and filled the tokens into the table structure in the inference
repo. This meant we needed an additional OCR pass for tables.
* After this PR, we use the OCR data from the entire-page OCR and pass the
OCR tokens to the inference repo, which means we only do one OCR pass for the
entire document.

**Tech details:**
* Combined the env vars `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this
means we use the same OCR agent for the entire page and for tables since we only
do one OCR pass.
* Bumped the inference repo to `0.7.9`, which allows the table model in inference
to use pre-computed OCR data from the unst repo. Please check the
[PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256).
* All notebook lint changes were made by `make tidy`
* This PR also fixes this
[issue](https://github.com/Unstructured-IO/unstructured/issues/1564);
I've added a test for it in
`test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
* Added the same scaling logic to the image [similar to the previous Table
OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113),
but scaling is now applied to the entire image

### Test
* Not much to test manually except that table extraction still works
* But due to the change in scaling and the use of pre-computed OCR data from the entire
page, there are some slight (better) changes in table output; here is a
comparison of test outputs I found from the same test
`test_partition_image_with_table_extraction`:

screenshot of the table in `layout-parser-paper-with-table.jpg`:
<img width="343" alt="expected"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c">
before refactor:
<img width="709" alt="before"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e">
after refactor:
<img width="705" alt="after"
src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d">

### TODO
(added as a ticket) There is still some cleanup to do in the inference repo
since the unst repo now has duplicate logic, but we can keep it as a fallback
plan. If we want to remove anything OCR-related in inference, here
are the items that are deprecated and can be removed:
*
[`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77)
(already noted in code)
* parameter `extract_tables` in inference
*
[`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88)
*
[`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197)
* env `TABLE_OCR` 

### Note
If we want to fall back to an additional table OCR pass (we may need this for
using paddle for tables), we need to:
* pass `infer_table_structure` to inference with `extract_tables`
parameter
* stop passing `infer_table_structure` to `ocr.py`

---------

Co-authored-by: Yao You <yao@unstructured.io>
2023-10-21 00:24:23 +00:00
Yao You
3437a23c91
fix: partition html fail with table without tbody (#1817)
This PR resolves #1807 
- fixes a bug where the code fails when a table's content contains no `tbody`
tag but uses a `thead` tag for the rows
- now when there is no `tbody` in a table section we look for
a `thead` instead (see the sketch below)
- when neither is found, an empty table is returned
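
A minimal sketch of the fallback with `lxml` (the real partitioner's traversal is more involved):

```python
from lxml import etree

html = "<table><thead><tr><td>Team</td><td>Cups</td></tr><tr><td>Blues</td><td>1</td></tr></thead></table>"
table = etree.fromstring(html)

rows = table.findall(".//tbody/tr")
if not rows:
    rows = table.findall(".//thead/tr")  # no tbody: fall back to rows under thead
cells = [[(cell.text or "") for cell in row] for row in rows]  # [] when neither tag is present
print(cells)  # [['Team', 'Cups'], ['Blues', '1']]
```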
2023-10-20 23:21:59 +00:00
Klaijan
de685fbc18
feat: add accuracy as wrapper for edit distance score (#1828)
Adds a `calculate_accuracy` function, a wrapper around
`calculate_edit_distance` that returns the "score" (see the sketch below).
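
The relationship between the two is roughly "accuracy = 1 - normalized edit distance"; the sketch below is a generic illustration, not the actual signatures in `unstructured.metrics`:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def accuracy_score(output: str, source: str) -> float:
    """Score in [0, 1]: 1.0 means the texts match exactly."""
    if not source:
        return 1.0 if not output else 0.0
    return max(0.0, 1 - edit_distance(output, source) / len(source))

print(accuracy_score("the quick brown fox", "the quick brown fix"))  # ~0.947
```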

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-20 23:04:13 +00:00
Yao You
aa7b7c87d6
fix: model_name being None raises attribution error (#1822)
This PR resolves #1754 
- the function wrapper tries to use `cast` to convert kwargs into `str`, but
when a value is `None`, `cast(str, None)` still returns `None`
- the fix replaces the conversion with a plain `str()` call (see the demonstration below)
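
For context, `typing.cast` is a no-op at runtime, so it never actually converts the value; a quick demonstration:

```python
from typing import cast

value = None
print(cast(str, value))  # None -- cast only informs the type checker
print(str(value))        # 'None' -- an actual conversion, so string operations no longer fail
```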
2023-10-20 21:08:17 +00:00
Klaijan
4a9d48859c
chore: add test_unstructured_ingest/metrics dummy dir (#1827)
Update path for metrics output to `test_unstructured_ingest/metrics` and
create a dummy folder `test_unstructured_ingest/metrics` to prevent ci
error.
2023-10-20 14:54:30 -07:00
shreyanid
bdf1925c8a
fix: ModuleNotFoundError for unstructured metrics folder (#1819)
### Summary
Fix `ModuleNotFoundError: No module named 'unstructured.metrics'` where
the `unstructured/metrics` folder is not discoverable from a package
import since it's missing an `__init__.py` file.
2023-10-20 19:23:05 +00:00
Roman Isecke
15b696925a
bugfix/mapping source connectors in destination cli commands (#1788)
### Description
Due to the dynamic nature of how the source connector is called when a
destination command is invoked, the configs need to be mapped and the
fsspec config needs to be dynamically added based on the type of runner
being used. This code was added to all currently supported destination
commands.
2023-10-20 17:21:06 +00:00
cragwolfe
1b90028501
chore: fix paths in ingest-test-fixtures-update-pr.yml (#1815)
Reference:
https://github.com/marketplace/actions/create-pull-request#add-specific-paths
2023-10-20 09:49:02 -07:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
2023-10-20 14:00:19 +00:00
John
fb2a1d42ce
Jj/1798 languages warning (#1805)
### Summary
Closes #1798 
Fixes language detection for elements with empty strings: this resolves a
warning message that `langdetect` raised when language detection was
attempted on an empty string. Language detection is now
skipped for empty strings.

### Testing
on the main branch this will log the warning "No features in text", but
it will not log anything on this branch.
```
from unstructured.documents.elements import NarrativeText, PageBreak
from unstructured.partition.lang import apply_lang_metadata

elements = [NarrativeText("Sample text."), PageBreak("")]
elements = list(
    apply_lang_metadata(
        elements=elements,
        languages=["auto"],
        detect_language_per_element=True,
    )
)
```

### Other
Also changes imports in test_lang.py so imports are explicit

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-20 04:15:28 +00:00
Steve Canny
d9c2516364
fix: chunks break on regex-meta changes and regex-meta start/stop not adjusted (#1779)
**Executive Summary.** Introducing strict type-checking as preparation
for adding the chunk-overlap feature revealed a type mismatch for
regex-metadata between chunking tests and the (authoritative)
ElementMetadata definition. The implementation of regex-metadata aspects
of chunking passed the tests but did not produce the appropriate
behaviors in production where the actual data-structure was different.
This PR fixes these two bugs.

1. **Over-chunking.** The presence of `regex-metadata` in an element was
incorrectly being interpreted as a semantic boundary, leading to such
elements being isolated in their own chunks.

2. **Discarded regex-metadata.** regex-metadata present on the second or
later elements in a section (chunk) was discarded.


**Technical Summary**

The type of `ElementMetadata.regex_metadata` is `Dict[str,
List[RegexMetadata]]`. `RegexMetadata` is a `TypedDict` like `{"text":
"this matched", "start": 7, "end": 19}`.

Multiple regexes can be specified, each with a name like "mail-stop",
"version", etc. Each of those may produce its own set of matches, like:

```python
>>> element.regex_metadata
{
    "mail-stop": [{"text": "MS-107", "start": 18, "end": 24}],
    "version": [
        {"text": "current: v1.7.2", "start": 7, "end": 21},
        {"text": "supersedes: v1.7.0", "start": 22, "end": 40},
    ],
}
```

*Forensic analysis*
* The regex-metadata feature was added by Matt Robinson on 06/16/2023
commit: 4ea71683. The regex_metadata data structure is the same as when
it was added.

* The chunk-by-title feature was added by Matt Robinson on 08/29/2023
commit: f6a745a7. The mistaken regex-metadata data structure in the
tests is present in that commit.

Looks to me like a mis-remembering of the regex-metadata data-structure
and insufficient type-checking rigor (type-checker strictness level set
too low) to warn of the mistake.


**Over-chunking Behavior**

The over-chunking looked like this:

Chunking three elements with regex metadata should combine them into a
single chunk (`CompositeElement` object), subject to maximum size rules
(default 500 chars).

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={"dolor": [RegexMetadata(text="dolor", start=12, end=17)]}
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]

chunks = chunk_by_title(elements)

assert chunks == [
    CompositeElement(
        "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
        " ipsum sed lectus porta volutpat."
    )
]
```

Observed behavior looked like this:

```python
chunks => [
    CompositeElement('Lorem Ipsum')
    CompositeElement('Lorem ipsum dolor sit amet consectetur adipiscing elit.')
    CompositeElement('In rhoncus ipsum sed lectus porta volutpat.')
]

```

The fix changed the approach from breaking on any metadata field not in
a specified group (`regex_metadata` was missing from this group) to only
breaking on specified fields (whitelisting instead of blacklisting).
This avoids overchunking every time we add a new metadata field and is
also simpler and easier to understand. This change in approach is
discussed in more detail in #1790.


**Dropping regex-metadata Behavior**

Chunking this section:

```python
elements: List[Element] = [
    Title(
        "Lorem Ipsum",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="Ipsum", start=6, end=11)]}
        ),
    ),
    Text(
        "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
        metadata=ElementMetadata(
            regex_metadata={
                "dolor": [RegexMetadata(text="dolor", start=12, end=17)],
                "ipsum": [RegexMetadata(text="ipsum", start=6, end=11)],
            }
        ),
    ),
    Text(
        "In rhoncus ipsum sed lectus porta volutpat.",
        metadata=ElementMetadata(
            regex_metadata={"ipsum": [RegexMetadata(text="ipsum", start=11, end=16)]}
        ),
    ),
]
```

...should produce this regex_metadata on the single produced chunk:

```python
assert chunk == CompositeElement(
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nIn rhoncus"
    " ipsum sed lectus porta volutpat."
)
assert chunk.metadata.regex_metadata == {
    "dolor": [RegexMetadata(text="dolor", start=25, end=30)],
    "ipsum": [
        RegexMetadata(text="Ipsum", start=6, end=11),
        RegexMetadata(text="ipsum", start=19, end=24),
        RegexMetadata(text="ipsum", start=81, end=86),
    ],
}
```

but instead produced this:

```python
regex_metadata == {"ipsum": [{"text": "Ipsum", "start": 6, "end": 11}]}
```

Which is the regex-metadata from the first element only.

The fix was to remove the consolidation+adjustment process from inside
the "list-attribute-processing" loop (because regex-metadata is not a
list) and process regex metadata separately.
2023-10-19 22:16:02 -05:00
Trevor Bossert
62aa4fc4ed
Move python setup above cache restore on ingest (#1802)
This moves the setup-python step in the ingest job above the cache restore;
otherwise the cache is restored and setup-python breaks symlinks. This
matches the pattern on other jobs.
2023-10-19 21:40:06 +00:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
`HuggingFaceEmbeddings` (see the sketch below).
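
A minimal sketch of the LangChain local-embedding path (the model name below is an example, not necessarily the brick's default):

```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector = embeddings.embed_query("Unstructured turns documents into elements.")
print(len(vector))  # dimensionality depends on the chosen sentence-transformers model
```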
2023-10-19 11:51:36 -05:00
qued
a0b44f7231
fix: import PDFResourceManager from pdfinterp (#1797)
Closes #1763.

**Import PDFResourceManager more directly** We were importing
`PDFResourceManager` from `pdfminer.converter` which was causing an
error for some users. We changed to import from the actual location of
`PDFResourceManager`, which is `pdfminer.pdfinterp`.
2023-10-19 00:18:19 -05:00
Jack Retterer
b8f24ba67e
Added AWS Bedrock embeddings (#1738)
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.

Test

- find your AWS secret access key and key id; make sure the account has
access to Bedrock's Titan embedding model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)
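
For reference, a hedged sketch of what the Bedrock call might look like through LangChain's `BedrockEmbeddings` wrapper (the region shown and the exact wiring into the brick are assumptions):

```python
from langchain.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    model_id="amazon.titan-tg1-large",
    region_name="us-west-2",  # example region; AWS credentials come from your environment
)
vector = embeddings.embed_query("Unstructured turns documents into elements.")
```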

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
2023-10-18 19:36:51 -05:00
Klaijan
98d54e3184
build: ingest fixtures workflow to include metrics dir (#1789)
Add `test_unstructured_ingest/metrics` path for evaluation metrics
master file.
2023-10-18 11:30:31 -07:00
ryannikolaidis
80c3c24ca5
ingest retry strategy refactor <- Ingest test fixtures update (#1780)
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: benjats07 <benjats07@users.noreply.github.com>
2023-10-18 04:33:57 +00:00
Roman Isecke
775bfb7588
ingest retry strategy refactor (#1708)
### Description
Pivoted from using the retry logic as a decorator, since that posed too many
limitations on what can be passed in as a parameter at runtime. Moved
this to a class-based approach that can now be instantiated with
appropriate loggers, leveraging the `--verbose` flag to set the log
level. This also mitigates how much new code is being forked from the
backoff library. The existing notion client that used the previous
decorator has been refactored to use the new class approach, and the
airtable connector was updated to support retry logic as well. Default
log handlers were introduced that apply to all instances of the retry
handler when it starts, backs off, and gives up.

A generic approach to configuring the retry parameters was added to the
CLI and included in the growing set of common configs shared across all CLI
commands.

Omitted CHANGELOG entry as this is mostly just a refactor of the retry
code. All other connectors will be updated to support retry in another
PR but this helps limit the number of changes to review in this one.

### Extra fixes
* Updated local and salesforce source connector to set `ingest_doc_cls`
in a `__post_init__` method since this variable can't be serialized.

### Testing
Both the airtable and notion ingest tests can be run locally. While they
might not pass due to text changes (to be expected when running
locally), the process can be viewed in the logs to validate.

Associated issue: #1488
2023-10-17 16:15:08 +00:00
Roman Isecke
adacd8e5b1
roman/update ingest pipeline docs (#1689)
### Description
* Update all existing connector docs to use new pipeline approach

### Additional changes:
* Some defaults were set for the runners to match those in the configs
to make them easy to handle, e.g. the biomed runner:
```python
max_retries: int = 5,
max_request_time: int = 45,
decay: float = 0.3,
```
2023-10-17 16:11:16 +00:00
cragwolfe
9ea3734fd0
fix: memory issue resolved for chipper v2 (#1772)
Co-authored-by: Austin Walker <austin@unstructured.io>
Co-authored-by: Austin Walker <awalk89@gmail.com>
0.10.24
2023-10-17 14:37:25 +00:00
Roman Isecke
aeaae5fd17
destination connector method elements input (#1674)
### Description
**Ingest destination connectors support writing a raw list of
elements.** Along with the default write method used in the ingest
pipeline to write the json content associated with the ingest docs, each
destination connector can now also write a raw list of elements to the
desired downstream location without having an ingest doc associated with
it.
2023-10-17 12:47:59 +00:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory,
but we support python files throughout the repo. It makes sense for all
those files to abide by the same linting rules, so the entire repo
is now inspected when the linters are run. Along with that,
autoflake was added as a linter, which has added benefits such
as removing unused imports for you that would otherwise break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
Christine Straub
237d04c896
feat: improve natural reading order by filtering OCR results (#1768)
### Summary
Some `OCR` elements with only spaces in their text have full-page-width
bounding boxes, which causes the `xycut` sorting to not work as
expected. The logic that parses OCR results now removes any elements containing
only spaces (more than one space); see the sketch below.
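
A tiny illustration of the filtering step (the region class here is a stand-in, not the actual inference-layer type):

```python
from dataclasses import dataclass

@dataclass
class MockTextRegion:
    text: str
    bbox: tuple

regions = [
    MockTextRegion("Introduction", (72, 90, 260, 110)),
    MockTextRegion("    ", (0, 120, 612, 140)),  # full-page-width whitespace region
]
# drop regions whose text is only spaces so their boxes can't skew xycut sorting
regions = [r for r in regions if r.text.strip()]
```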

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-16 23:05:55 +00:00
Léa
89fa88f076
fix: stop csv and tsv dropping the first line of the file (#1530)
The current code assumes the first line of csv and tsv files is a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.

Here is a snippet of code that demonstrates the current behavior and the
proposed fix:

```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
    Stanley Cups,,
    Team,Location,Stanley Cups
    Blues,STL,1
    Flyers,PHI,2
    Maple Leafs,TOR,13
    """

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
  
print("Suggested Improvement Keep First Line") 
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Looses First Line") 
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-16 17:59:35 -05:00
Yuming Long
4907d1e2b5
Fix: ModuleNotFoundError for partition.utils.ocr_models (#1767)
### Summary

Fix https://github.com/Unstructured-IO/unstructured-api/issues/286 where
`partition/utils/ocr_models` folder is not uploaded to PyPI since it's
missing an `__init__.py` file.
2023-10-16 19:47:09 +00:00
Klaijan
ba4c649cf0
feat: calculate element type percent match (#1723)
**Executive Summary**
Adds a function to calculate the percent match between two element-type
frequency outputs from the `get_element_type_frequency` function.

**Technical Detail**
- The function takes two `Dict` inputs, both of which should be output from
`get_element_type_frequency`
- Implementors can define the weight `category_depth_weight` they want to
give to matches where the `type` matches but the `category_depth` differs
- The function first loops through the output item list to find exact matches
and count the total, collecting the remaining values for both
output and source in new lists (of `dict` type). Then it loops through
the source items that were not an exact match to find
`type`-only matches, which are weighted by the `category_depth_weight` factor
defined earlier (default 0.5)

**Output**
output
```
{
  ("Title", 0): 2,
  ("Title", 1): 1,
  ("NarrativeText", None): 3,
  ("UncategorizedText", None): 1,
}
```

source
```
{
  ("Title", 0): 1,
  ("Title", 1): 2,
  ("NarrativeText", None): 5,
}
```

With this output and source, and a weight of 0.5, the % match yields
5.5 / 8: 5 exact matches plus 1 partial match at 0.5 weight (see the sketch below).
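
A sketch that reproduces the 5.5 / 8 arithmetic above (an illustrative re-implementation, not the library's function):

```python
from collections import Counter

def percent_type_match(output, source, category_depth_weight=0.5):
    output, source = Counter(output), Counter(source)
    total = sum(source.values())
    # exact matches on (type, category_depth)
    exact = sum(min(n, source[key]) for key, n in output.items())
    # leftovers after exact matching, keyed by element type only
    out_left, src_left = Counter(), Counter()
    for key, n in output.items():
        out_left[key[0]] += n - min(n, source[key])
    for key, n in source.items():
        src_left[key[0]] += n - min(n, output[key])
    # partial matches: same type, different category_depth, weighted
    partial = sum(min(n, src_left[element_type]) for element_type, n in out_left.items())
    return (exact + category_depth_weight * partial) / total

output = {("Title", 0): 2, ("Title", 1): 1, ("NarrativeText", None): 3, ("UncategorizedText", None): 1}
source = {("Title", 0): 1, ("Title", 1): 2, ("NarrativeText", None): 5}
print(percent_type_match(output, source))  # 0.6875 == 5.5 / 8
```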

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-10-16 17:57:28 +00:00
Roman Isecke
9c7ee8921a
roman/fsspec compression support (#1730)
### Description 
Opened to replace original PR:
[1443](https://github.com/Unstructured-IO/unstructured/pull/1443)
2023-10-16 14:26:30 +00:00
cragwolfe
282b8f700d
build: release unstructured==0.10.23 (#1762)
Cut the release.
0.10.23
2023-10-15 21:26:46 -07:00
John
6d7fe3ab02
fix: default to None for the languages metadata field (#1743)
### Summary
Closes #1714
Changes the default value for `languages` to `None` for elements that
don't have text or whose language can't be detected.

### Testing
```
from unstructured.partition.auto import partition
filename = "example-docs/handbook-1p.docx"
elements = partition(filename=filename, detect_language_per_element=True)

# PageBreak elements don't have text and will be collected here
none_langs = [element for element in elements if element.metadata.languages is None]
none_langs[0].text
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-14 22:46:24 +00:00
Amanda Cameron
d0c84d605c
chore: updating table docs with file extensions (#1702)
gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691

Adding filetype extensions from this
[list](f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200))
where applicable.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-10-14 14:14:52 -07:00
qued
cf31c9a2c4
fix: use nx to avoid recursion limit (#1761)
Fixes recursion limit error that was being raised when partitioning
Excel documents of a certain size.

Previously we used a recursive method to find subtables within an excel
sheet. However this would run afoul of Python's recursion depth limit
when there was a contiguous block of more than 1000 cells within a
sheet. This function has been updated to use the NetworkX library which
avoids Python recursion issues.

* Updated `_get_connected_components` to use `networkx` graph methods
rather than implementing our own algorithm for finding contiguous groups
of cells within a sheet (see the sketch below).
* Added a test and example doc that replicates the `RecursionError`
prior to the change.
* Added `networkx` to the `extra_xlsx` dependencies and ran `pip-compile`.
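
A small sketch of the `networkx`-based approach (the helper name and cell representation are illustrative):

```python
import networkx as nx

def connected_cell_groups(occupied_cells):
    """Group occupied (row, col) cells that touch horizontally or vertically."""
    graph = nx.Graph()
    graph.add_nodes_from(occupied_cells)
    for row, col in occupied_cells:
        for neighbor in ((row + 1, col), (row, col + 1)):
            if neighbor in occupied_cells:
                graph.add_edge((row, col), neighbor)
    return list(nx.connected_components(graph))

cells = {(0, 0), (0, 1), (1, 0), (5, 5)}
print(connected_cell_groups(cells))  # two groups: the 3-cell block and the lone cell
```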

#### Testing:
The following run from a Python terminal should raise a `RecursionError`
on `main` and succeed on this branch:
```python
import sys
from unstructured.partition.xlsx import partition_xlsx
old_recursion_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(1000)
    filename = "example-docs/more-than-1k-cells.xlsx"
    partition_xlsx(filename=filename)
finally:
    sys.setrecursionlimit(old_recursion_limit)

```
Note: the recursion limit is different in different contexts. Checking
my own system, the default in a notebook seems to be 3000, but in a
terminal it's 1000. The documented Python default recursion limit is
1000.
2023-10-14 19:38:21 +00:00
cragwolfe
3f32c6702a
feat: bump unstructured-inference=0.7.5 for faster chipper (#1756)
**Improved inference speed for Chipper V2** API requests with
'hi_res_model_name=chipper' now have ~2-3x faster responses.
2023-10-14 13:03:59 -07:00
Minwoo Byeon (Dylan)
3331c5c6c0
Remove the temporary files when the conversion is finished. (#1696)
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-13 18:51:44 -05:00
qued
95728ead0f
fix: zero divide in under_non_alpha_ratio (#1753)
The function `under_non_alpha_ratio` in
`unstructured.partition.text_type` was producing a divide-by-zero error.
After investigation I found this could happen when the function was
passed a string of all spaces (see the sketch below).
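
An illustrative reproduction of the failure mode and the kind of guard that fixes it (not the exact library implementation):

```python
def non_alpha_ratio(text: str) -> float:
    tokens = [ch for ch in text if not ch.isspace()]
    if not tokens:  # an all-space string previously led to a ZeroDivisionError here
        return 0.0
    return sum(1 for ch in tokens if not ch.isalpha()) / len(tokens)

print(non_alpha_ratio("     "))  # 0.0 instead of raising
```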

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-13 21:20:01 +00:00
M Bharat lal
21df17f7fa
fix: consider all the required lines instead of first line to detect file type as CSV (#1728)
The current csv file-detection logic in `file_utils/filetype.py` does not
consider all the lines when counting the number of commas; it considers
just the first line, which will always return true

```
lines = lines[: len(lines)] if len(lines) < 10 else lines[:10]
header_count = _count_commas(lines[0])
if any("," not in line for line in lines):
        return False
return all(_count_commas(line) == header_count for line in lines[:1])
```

Fixed the issue by considering all the lines except the first line, as shown
below:

```
lines = lines[: len(lines)] if len(lines) < 10 else lines[:10]
header_count = _count_commas(lines[0])
if any("," not in line for line in lines):
        return False
return all(_count_commas(line) == header_count for line in lines[1:])
```
2023-10-13 13:36:05 -07:00
Christine Straub
ef391e1a3e
feat: less precision in json floats (#1718)
Closes #1340.
### Summary
- add functionality to limit precision when serializing to JSON
### Testing
```
elements = partition(raw_doc.<extension>)
output_json = elements_to_json(elements)
print(output_json)
```
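
One possible shape for the precision-limiting step (an illustrative helper, not the actual serializer code):

```python
import json

def round_floats(obj, ndigits=4):
    """Recursively round floats (e.g. coordinates) before JSON serialization."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, list):
        return [round_floats(item, ndigits) for item in obj]
    if isinstance(obj, dict):
        return {key: round_floats(value, ndigits) for key, value in obj.items()}
    return obj

print(json.dumps(round_floats({"points": [[120.11111119, 300.98765432]]})))
# {"points": [[120.1111, 300.9877]]}
```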
2023-10-13 11:06:36 -07:00
Austin Walker
ad1b93dbaa
chore: cut the 0.10.22 release (#1749)
0.10.22
2023-10-13 17:17:21 +00:00
ryannikolaidis
d9a0bd741a
fix: build test failures (#1748)
* Fix missing HF_TOKEN when running containerized test for the build
process
* Fix pytest args when running specific test

## Testing
Example run with the HF_TOKEN assigned for the containerized test in the
build process:
https://github.com/Unstructured-IO/unstructured/actions/runs/6504556437/job/17666669155

Example run of the pytest args working for the arm test (ran in a new
workflow for testing on push):
https://github.com/Unstructured-IO/unstructured/actions/runs/6504213010
2023-10-13 01:08:27 -07:00
Steve Canny
4b84d596c2
docx: add hyperlink metadata (#1746)
2023-10-13 06:26:14 +00:00
qued
8100f1e7e2
chore: process chipper hierarchy (#1634)
PR to support schema changes introduced from [PR
232](https://github.com/Unstructured-IO/unstructured-inference/pull/232)
in `unstructured-inference`.

Specifically what needs to be supported is:
* Change to the way `LayoutElement` from `unstructured-inference` is
structured, specifically that this class is no longer a subclass of
`Rectangle`, and instead `LayoutElement` has a `bbox` property that
captures the location information and a `from_coords` method that allows
construction of a `LayoutElement` directly from coordinates.
* Removal of `LocationlessLayoutElement` since chipper now exports
bounding boxes, and if we need to support elements without bounding
boxes, we can make the `bbox` property mentioned above optional.
* Getting hierarchy data directly from the inference elements rather
than in post-processing
* Don't try to reorder elements received from chipper v2, as they should
already be ordered.

#### Testing:

The following demonstrates that the new version of chipper is inferring
hierarchy.

```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper")
children = [el for el in elements if el.metadata.parent_id is not None]
print(children)

```
Also verify that running the traditional `hi_res` gives different
results:
```python
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")

```

---------

Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-10-13 01:28:46 +00:00