1109 Commits

Author SHA1 Message Date
Roman Isecke
6700a7d8c4
feat: support generic inputs for partition kwargs from ingest CLI (#1923)
### Description
To always support the latest changes to the partition method and the
possible kwargs it supports, the ingest CLI has been refactored to take
in a valid json string to represent those values to allow a user more
flexibility with controlling the partition method.
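
A minimal sketch of the idea, not the actual CLI code: the helper name `parse_partition_kwargs` is hypothetical and only illustrates turning a JSON string into keyword arguments that could be forwarded to a partition call.

```python
import json
from typing import Any, Dict


def parse_partition_kwargs(raw: str) -> Dict[str, Any]:
    """Parse a JSON object string into keyword arguments (hypothetical helper)."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        # Only a JSON object maps cleanly onto keyword arguments.
        raise ValueError("partition kwargs must be a JSON object")
    return parsed


kwargs = parse_partition_kwargs('{"strategy": "hi_res", "languages": ["eng"]}')
# The resulting dict could then be unpacked into a call, e.g. partition(..., **kwargs).
```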
2023-11-02 21:19:29 +00:00
Roman Isecke
b58d0dde3e
Add CliMix class to wrap both BaseConfig and CliMixin (#1957)
### Description
Add new class to wrap base config and cli mixin to help with typing:
```python
class CliConfig(BaseConfig, CliMixin):
    pass
```
2023-11-02 21:18:40 +00:00
Roman Isecke
901704b6c0
update sphinx docs with ingest content (#1969)
### Description
Create a new structure for ingest content in the docs, update with all
configs
2023-11-02 20:40:35 +00:00
shreyanid
c24e6e056c
chore: add doctype to ingest evaluation functions (#1977)
### Summary
To combine ingest and holistic metrics efforts, add the `doctype` field
to the results from the functions in evaluate.py for use in subsequent
aggregation functions.

### Test
Run `sh ./test_unstructured_ingest/evaluation-metrics.sh
text-extraction` and there will be a new doctype column with the file's
doctype extension.
<img width="508" alt="Screenshot 2023-11-01 at 2 23 11 PM"
src="https://github.com/Unstructured-IO/unstructured/assets/42684285/44583da9-e7ef-4142-be72-c2247b954bcf">

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
2023-11-02 19:15:53 +00:00
Mallori Harrell
d07baed4a1
bug: empty-elements (#1252)
- This PR adds a function to check if a piece of text only contains a
bullet (no text) to prevent creating an empty element.
- Also fixed a test that had a typo.
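
A rough sketch of such a check, assuming a small set of bullet characters; the function name `is_bullet_only` and the pattern are illustrative and may differ from what the PR actually added.

```python
import re

# Assumed bullet characters; the library's real pattern may cover more.
BULLET_ONLY = re.compile(r"^\s*[\u2022\u2023\u25E6\u2043\-\*·]\s*$")


def is_bullet_only(text: str) -> bool:
    """Return True when the text is a lone bullet with no content after it."""
    return bool(BULLET_ONLY.match(text))
```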
2023-11-02 10:52:41 -05:00
Yao You
69265685ea
build(deps): add makefile to requirements (#1295)
This PR resolves #1294 by adding a Makefile to compile requirements.
This makefile respects the dependencies between file and will compile
them in order. E.g., extra-*.txt will be compiled __after__ base.txt is
updated.

Test locally by simply running `make pip-compile` or `cd requirements &&
make clean && make all`

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-02 10:17:35 -05:00
qued
1bee1b0038
chore: Remove chipper example (#1989)
Closes #1956.

Removed chipper example notebook as it is no longer functional with
Chipper private.
2023-11-02 10:14:49 -05:00
Matt Robinson
d9c035edb1
docs: no more bricks (#1967)
### Summary

We no longer use the "bricks" terminology for partitioning functions, etc.
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out, it'll likely be helpful to reorganize the docs a bit as well.

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-11-02 09:43:26 -05:00
ryannikolaidis
4a947dbc03
ci: remove activate_credentials_file arg from google cloud auth (#1983)
CI [was raising a warning](https://github.com/Unstructured-IO/unstructured/actions/runs/6725942303)
in the workflow annotations (scroll to the bottom).

It looks like this is not a supported argument for the version that we
are targeting. Since it looks like this is probably just getting ignored
anyways, removing.

## Testing
Note the [CI workflow
run](https://github.com/Unstructured-IO/unstructured/actions/runs/6726986305)
from this PR succeeded and no longer has those annotations.
2023-11-02 06:32:39 +00:00
Steve Canny
4e40999070
rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978)
*Reviewer:* May be quicker to review commit by commit as they are quite
distinct and well-groomed to each focus on a single clean-up task.

Clean up odds-and-ends in the docx partitioner in preparation for adding
nested-tables support in a closely following PR.

1. Remove obsolete TODOs now in GitHub issues, which is probably where
they belong in future anyway.
2. Remove local DOCX "workaround" code that has been implemented
upstream and is now obsolete.
3. "Clean" the docx tests, introducing strict typing, extracting a
fixture or two, and generally tightening things up.
4. Extract docx-local versions of
`unstructured.partition.common.convert_ms_office_table_to_text()` which
will be the base for adding nested-table support. More information on
why this is required in that commit.
2023-11-02 05:22:17 +00:00
Steve Canny
51d07b6434
fix: flaky chunk metadata (#1947)
**Executive Summary.** When the elements in a _section_ are combined
into a _chunk_, the metadata in each of the elements is _consolidated_
into a single `ElementMetadata` instance. There are two main problems
with the current implementation:

1. The current algorithm simply uses the metadata of the first element
as the metadata for the chunk. This produces:
- **empty chunk metadata** when the first element has no metadata, such
as a `PageBreak("")`
- **missing chunk metadata** when the first element contains only
partial metadata such as a `Header()` or `Footer()`
- **misleading metadata** when the first element contains values
applicable only to that element, such as `category_depth`, `coordinates`
(bounding-box), `header_footer_type`, or `parent_id`
2. Second, list metadata such as `emphasized_text_contents`,
`emphasized_text_tags`, `link_texts` and `link_urls` is only combined
when it is unique within the combined list. These lists are "unzipped"
pairs. For example, the first `link_texts` corresponds to the first
`link_urls` value. When an item is removed from one (because it matches
a prior entry) and not the other (say same text "here" but different
URL) the positional correspondence is broken and downstream processing
will at best be wrong, at worst raise an exception.
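
The second problem can be reproduced with a few lines. This is only an illustration with made-up URLs; `dedupe` is a hypothetical stand-in for the uniqueness filtering described above.

```python
def dedupe(items):
    """Keep the first occurrence of each item, preserving order."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


# Two links with the same text but different targets.
link_texts = ["here", "here"]
link_urls = ["https://example.com/a", "https://example.com/b"]

deduped_texts = dedupe(link_texts)  # one entry survives
deduped_urls = dedupe(link_urls)    # both entries survive

# The lists no longer line up, so zipping them silently loses the second link.
pairs = list(zip(deduped_texts, deduped_urls))
```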

### Technical Discussion
Element metadata cannot be determined in the general case simply by
sampling that of the first element. At the same time, a simple union of
all values is also not sufficient. To effectively consolidate the
current variety of metadata fields we need five distinct strategies,
selecting which to apply to each field based on that field's provenance
and other characteristics.

The five strategies are:
- `FIRST` - Select the first non-`None` value across all the elements.
Several fields are determined by the document source (`filename`,
`file_directory`, etc.) and will not change within the output of a
single partitioning run. They might not appear in every element, but
they will be the same whenever they do appear. This strategy takes the
first one that appears, if any, as proxy for the value for the entire
chunk.
- `LIST` - Consolidate the four list fields like
`emphasized_text_contents` and `link_urls` by concatenating them in
element order (no set semantics apply). All values from `elements[n]`
appear before those from `elements[n+1]` and existing order is
preserved.
- `LIST_UNIQUE` - Combine only unique elements across the (list) values
of the elements, preserving order in which a unique item first appeared.
- `REGEX` - Regex metadata has its own rules, including adjusting the
`start` and `end` offset of each match based on its new position in the
concatenated text.
- `DROP` - Not all metadata can or should appear in a chunk. For
example, a chunk cannot be guaranteed to have a single `category_depth`
or `parent_id`.
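
As an illustration of the `REGEX` offset adjustment, here is a sketch under two assumptions: element texts are joined with a blank line (`"\n\n"`), and each match is a dict with `text`, `start`, and `end` keys. The function name `consolidate_regex` is hypothetical.

```python
from typing import Any, Dict, List

SEPARATOR = "\n\n"  # assumed separator between element texts in a chunk


def consolidate_regex(
    texts: List[str], metas: List[Dict[str, List[Dict[str, Any]]]]
) -> Dict[str, List[Dict[str, Any]]]:
    """Merge per-element regex matches, shifting offsets into the joined text."""
    combined: Dict[str, List[Dict[str, Any]]] = {}
    offset = 0
    for text, meta in zip(texts, metas):
        for name, matches in meta.items():
            for match in matches:
                combined.setdefault(name, []).append(
                    {
                        "text": match["text"],
                        "start": match["start"] + offset,
                        "end": match["end"] + offset,
                    }
                )
        # The next element starts after this text plus the separator.
        offset += len(text) + len(SEPARATOR)
    return combined


texts = ["Call 555-1234.", "Or 555-9876."]
metas = [
    {"phone": [{"text": "555-1234", "start": 5, "end": 13}]},
    {"phone": [{"text": "555-9876", "start": 3, "end": 11}]},
]
merged = consolidate_regex(texts, metas)
joined = SEPARATOR.join(texts)
```

After adjustment, slicing `joined` with each match's `start` and `end` should give back that match's `text`.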

Other strategies such as `COORDINATES` could be added to consolidate the
bounding box of the chunk from the coordinates of its elements, roughly
`min(lefts)`, `max(rights)`, etc. Others could be `LAST`, `MAJORITY`, or
`SUM` depending on how metadata evolves.
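
A possible shape for a `COORDINATES` strategy, sketched under the assumption that each bounding box is a `(left, top, right, bottom)` tuple in a coordinate system where `top < bottom`; this is not part of the PR and `union_bbox` is a hypothetical name.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def union_bbox(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every input box."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))
```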

The proposed strategy assignments are these:

- `attached_to_filename`: FIRST,
- `category_depth`: DROP,
- `coordinates`: DROP,
- `data_source`: FIRST,
- `detection_class_prob`: DROP,  # -- ? confirm --
- `detection_origin`: DROP,      # -- ? confirm --
- `emphasized_text_contents`: LIST,
- `emphasized_text_tags`: LIST,
- `file_directory`: FIRST,
- `filename`: FIRST,
- `filetype`: FIRST,
- `header_footer_type`: DROP,
- `image_path`: DROP,
- `is_continuation`: DROP, # -- not expected, added by chunking, not
before --
- `languages`: LIST_UNIQUE,
- `last_modified`: FIRST,
- `link_texts`: LIST,
- `link_urls`: LIST,
- `links`: DROP,            # -- deprecated field --
- `max_characters`: DROP, # -- unused in code, probably remove from
ElementMetadata --
- `page_name`: FIRST,
- `page_number`: FIRST,
- `parent_id`: DROP,
- `regex_metadata`: REGEX,
- `section`: FIRST, # -- section unconditionally breaks on new section
--
- `sent_from`: FIRST,
- `sent_to`: FIRST,
- `subject`: FIRST,
- `text_as_html`: DROP, # -- not expected, only occurs in TableSection
--
- `url`: FIRST,

**Assumptions:**
- each .eml file is partitioned->chunked separately (not in batches),
therefore
  sent-from, sent-to, and subject will not change within a section.

### Implementation
Implementation of this behavior requires two steps:
1. **Collect** all non-`None` values from all elements, each in a
sequence by field-name. Fields not populated in any of the elements do
not appear in the collection.
```python
all_meta = {
    "filename": ["memo.docx", "memo.docx"],
    "link_texts": [["here", "here"], ["and here"]],
    "parent_id": ["f273a7cb", "808b4ced"],
}
```
2. **Apply** the specified strategy to each item in the overall
collection to produce the consolidated chunk meta (see implementation).
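
The two steps can be sketched roughly as follows. This is a simplified illustration, not the code in `_TextSection`: metadata is modeled as plain dicts, only a few fields are mapped, `REGEX` is left out, and the names `Strategy`, `collect`, and `consolidate` are assumptions.

```python
from enum import Enum
from typing import Any, Dict, List


class Strategy(Enum):
    FIRST = "first"
    LIST = "list"
    LIST_UNIQUE = "list_unique"
    DROP = "drop"


# A small subset of the field assignments, for illustration only.
STRATEGIES = {
    "filename": Strategy.FIRST,
    "link_texts": Strategy.LIST,
    "languages": Strategy.LIST_UNIQUE,
    "parent_id": Strategy.DROP,
}


def collect(element_metas: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Step 1: gather all non-None values, grouped by field name."""
    all_meta: Dict[str, List[Any]] = {}
    for meta in element_metas:
        for name, value in meta.items():
            if value is not None:
                all_meta.setdefault(name, []).append(value)
    return all_meta


def consolidate(element_metas: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Step 2: apply each field's strategy to its collected values."""
    result: Dict[str, Any] = {}
    for name, values in collect(element_metas).items():
        strategy = STRATEGIES[name]
        if strategy is Strategy.FIRST:
            result[name] = values[0]
        elif strategy is Strategy.LIST:
            # Plain concatenation in element order, duplicates kept.
            result[name] = [item for value in values for item in value]
        elif strategy is Strategy.LIST_UNIQUE:
            unique: List[Any] = []
            for value in values:
                for item in value:
                    if item not in unique:
                        unique.append(item)
            result[name] = unique
        # Strategy.DROP: the field is left out of the chunk metadata.
    return result


chunk_meta = consolidate(
    [
        {"filename": "memo.docx", "link_texts": ["here", "here"],
         "parent_id": "f273a7cb", "languages": ["eng"]},
        {"filename": "memo.docx", "link_texts": ["and here"],
         "parent_id": "808b4ced", "languages": ["eng", "spa"]},
    ]
)
```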

### Factoring
For the following reasons, the implementation of metadata consolidation
is extracted from its current location in `chunk_by_title()` to a
handful of collaborating methods in `_TextSection`.
- The current implementation of metadata consolidation "inline" in
`chunk_by_title()` already has too many moving pieces to be understood
without extended study. Adding strategies to that would make it worse.
- `_TextSection` is the only section type where metadata is consolidated
(the other two types always have exactly one element so already exactly
one metadata.)
- `_TextSection` is already the expert on all the information required
to consolidate metadata, in particular the elements that make up the
section and their text.

Some other problems were also fixed in that transition, such as mutation
of elements during the consolidation process.

### Technical Risk: adding new `ElementMetadata` field breaks metadata

If each metadata field requires a strategy assignment to be consolidated
and a developer adds a new `ElementMetadata` field without adding a
corresponding strategy mapping, metadata consolidation could break or
produce incorrect results.

This risk can be mitigated multiple ways:
1. Add a test that verifies a strategy is defined for each field
(Recommended).
2. Define a default strategy, either `DROP` or `FIRST` for scalar types,
`LIST` for list types.
3. Raise an exception when an unknown metadata field is encountered.

This PR implements option 1 such that a developer will be notified
before merge if they add a new metadata field but do not define a
strategy for it.
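
A test of this kind might look like the sketch below. It assumes the metadata class is a dataclass and uses a small stand-in (`FakeElementMetadata`) rather than the real `ElementMetadata`; the strategy mapping and helper name are also hypothetical.

```python
import dataclasses
from typing import Dict, List, Optional


@dataclasses.dataclass
class FakeElementMetadata:
    """Hypothetical stand-in for the real metadata class."""

    filename: Optional[str] = None
    link_urls: Optional[List[str]] = None
    parent_id: Optional[str] = None


STRATEGIES: Dict[str, str] = {
    "filename": "FIRST",
    "link_urls": "LIST",
    "parent_id": "DROP",
}


def fields_without_strategy(metadata_cls, strategies: Dict[str, str]) -> List[str]:
    """List metadata fields that have no consolidation strategy assigned."""
    return sorted(
        field.name
        for field in dataclasses.fields(metadata_cls)
        if field.name not in strategies
    )


def test_every_metadata_field_has_a_strategy():
    # Fails as soon as a field is added without a matching strategy entry.
    assert fields_without_strategy(FakeElementMetadata, STRATEGIES) == []
```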

### Other Considerations
- If end-users can in-future add arbitrary metadata fields _before_
chunking, then we'll need to define metadata-consolidation behavior for
such fields. Depending on how we implement user-defined metadata fields
we might:
- Require explicit definition of a new metadata field before use,
perhaps with a method like `ElementMetadata.add_custom_field()` which
requires a consolidation strategy to be defined (and/or has a default
value).
- Have a default strategy, perhaps `DROP` or `FIRST`, or `LIST` if the
field is type `list`.

### Further Context
Metadata is only consolidated for `TextSection` because the other two
section types (`TableSection` and `NonTextSection`) can only contain a
single element.

---

## Further discussion on consolidation strategy by field

### document-static
These fields are very likely to be the same for all elements in a single
document:

- `attached_to_filename`
- `data_source`
- `file_directory`
- `filename`
- `filetype`
- `last_modified`
- `sent_from`
- `sent_to`
- `subject`
- `url`

*Consolidation strategy:* `FIRST` - use first one found, if any.

### section-static
These fields are very likely to be the same for all elements in a single
section, which is the scope we really care about for metadata
consolidation:

- `section` - an EPUB document-section unconditionally starts new
section.

*Consolidation strategy:* `FIRST` - use first one found, if any.

### consolidated list-items
These `List` fields are consolidated by concatenating the lists from
each element that has one:

- `emphasized_text_contents`
- `emphasized_text_tags`
- `link_texts`
- `link_urls`
- `regex_metadata` - special case, this one gets indexes adjusted too.

*Consolidation strategy:* `LIST` - concatenate lists across elements.

### dynamic
These fields are likely to hold unique data for each element:

- `category_depth`
- `coordinates`
- `image_path`
- `parent_id`

*Consolidation strategy:*
- `DROP` as likely misleading.
- `COORDINATES` strategy could be added to compute the bounding box from
all bounding boxes.
- Consider allowing if they are all the same, perhaps an `ALL` strategy.

### slow-changing
These fields are somewhere in-between, likely to be common between
multiple elements but varied within a document:

- `header_footer_type` - *strategy:* drop as not-consolidatable
- `languages` - *strategy:* take first occurrence
- `page_name` - *strategy:* take first occurrence
- `page_number` - *strategy:* take first occurrence, will all be the same
when `multipage_sections` is `False`. Worst-case semantics are "this
chunk began on this page".

### N/A
These field types do not figure in metadata-consolidation:

- `detection_class_prob` - I'm thinking this is for debug and should not
appear in chunks, but need confirmation.
- `detection_origin` - for debug only
- `is_continuation` - is _produced_ by chunking, never by partitioning
(not in our code anyway).
- `links` (deprecated, probably should be dropped)
- `max_characters` - is unused as far as I can tell, is unreferenced in
source code. Should be removed from `ElementMetadata` as far as I can
tell.
- `text_as_html` - only appears in a `Table` element, each of which
appears in its own section so needs no consolidation. Never appears in
`TextSection`.

*Consolidation strategy:* `DROP` any that appear (several never will)
2023-11-02 01:49:20 +00:00
John
2f553333bd
refactor text.py (#1872)
### Summary
Closes #1520 
Partial solution to #1521 

- Adds an abstraction layer between the user API and the partitioner
implementation
- Adds comments explaining paragraph chunking
- Makes edits to pass strict type-checking for both text.py and
test_text.py
2023-11-01 17:44:55 -05:00
John
b92cab7fbd
fix languages 500 error with empty string for ocr_languages (#1968)
Closes #1870 
Defining both `languages` and `ocr_languages` raises a ValueError, but
the api defaults to `ocr_languages` being an empty string, so if users
define `languages` they are automatically hitting the ValueError.

This fix checks if `ocr_languages` is an empty string and converts it to
`None` to avoid this.
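
The conversion amounts to something like the following sketch; the helper name `normalize_ocr_languages` is hypothetical and the real change may live inline in the partitioning code.

```python
from typing import Optional


def normalize_ocr_languages(ocr_languages: Optional[str]) -> Optional[str]:
    """Treat an empty string the same as 'not provided'."""
    return None if ocr_languages == "" else ocr_languages
```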

### Testing
On the main branch, the following will raise the ValueError, but it will
correctly partition on this branch
```python
from unstructured.partition.auto import partition
filename = "example-docs/category-level.docx"
elements = partition(filename, languages=["spa"], ocr_languages="")

elements[0].metadata.languages
```

---------

Co-authored-by: yuming <305248291@qq.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
2023-11-01 22:02:00 +00:00
Klaijan
1893d5a669
fix: avoid loop through None (#1975)
Fix this issue https://unstructured-ai.atlassian.net/browse/CORE-2455.
Adding logical check if the variable is not None.
2023-11-01 20:50:34 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they share a different pattern:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destination tests leverage the local connector to produce results to
push to a destination and leverage overhead to create temporary
locations at those destinations to write to and delete when done.

Only the src tests create partitioned content that needs to be checked
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00
Christine Straub
210d53a7e0
Fix: missing columns on table ingest output after table OCR refactor (#1959)
Closes #1873.
### Summary
Table OCR refactoring changed the default padding value for table image
cropping from
[12](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L95)
to
[0](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/ocr.py#L260),
causing some columns in the table to be missing.
### Testing
```python
from unstructured.partition import pdf

filename = "example-docs/layout-parser-paper-with-table.pdf"
elements = pdf.partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
)
table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html]
assert "Large Model" in table[0]
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-01 18:34:27 +00:00
Klaijan
a06b151897
refactor: ci workflow refactor (#1907)
Refactor the evaluation scripts including
`unstructured/ingest/evaluation.py`
`test_unstructured_ingest/evaluation-metrics.sh` for more structured
code and usage.
- The script now uses only one Python script call with a param
- Adds a function to build the string for output_args
(`--output_dir --output_list`) and source_args (`--source_dir --source_args`)
- Now accepts evaluation to call as a param, currently only accepts
`text-extraction` and `element-type`

Example to call the function:
```sh evaluation-metrics.sh text-extraction```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-01 15:58:23 +00:00
qued
b08562ba1a
tests: separate chipper tests (#1939)
Separates chipper tests to speed up testing and CI.
2023-10-31 21:02:00 +00:00
Roman Isecke
123ad20f4c
support passing credentials from memory for google connectors (#1888)
### Description

### Google Drive
The existing service account parameter was expanded to support either a
file path or a json value to generate the credentials when instantiating
the google drive client.
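
One way to accept either form is sketched below. This is an assumption about the approach, not the connector's code: `load_service_account_info` is a hypothetical helper that first tries to read the value as JSON and otherwise treats it as a file path.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_service_account_info(value: str) -> Dict[str, Any]:
    """Return service account info from a JSON string or a path to a JSON file."""
    try:
        info = json.loads(value)
    except json.JSONDecodeError:
        # Not valid JSON, so assume it is a path to a key file.
        info = json.loads(Path(value).read_text())
    if not isinstance(info, dict):
        raise ValueError("service account info must be a JSON object")
    return info
```

The resulting dict could then be handed to whichever credentials constructor the client library provides.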

### GCS
Google Cloud Storage already supports the value being passed in, from
their docstring:
> - you may supply a token generated by the
      [gcloud](https://cloud.google.com/sdk/docs/)
      utility; this is either a python dictionary, the name of a file
containing the JSON returned by logging in with the gcloud CLI tool,
      or a Credentials object.


I tested this locally:

```python
from gcsfs import GCSFileSystem
import json

with open("/Users/romanisecke/.ssh/google-cloud-unstructured-ingest-test-d4fc30286d9d.json") as json_file:
    json_data = json.load(json_file)
    print(json_data)

    fs = GCSFileSystem(token=json_data)
    print(fs.ls(path="gs://utic-test-ingest-fixtures/"))
```
`['utic-test-ingest-fixtures/ideas-page.html',
'utic-test-ingest-fixtures/nested-1',
'utic-test-ingest-fixtures/nested-2']`
2023-10-31 17:12:04 +00:00
Roman Isecke
922bc84cee
Update fsspecs-specific source connector docs (#1898)
### Description
Add in the fsspec configs needed for the fsspec-based connectors

To match the behavior of the original CLI, the default used by the click
option was mirrored in the base config for the api endpoint.
2023-10-31 16:09:46 +00:00
Ahmet Melek
a9a3efd85c
bugfix: SharePoint permissions fetching should be opt-in (#1894)
Closes: #1891 (check the issue for more info)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-31 15:55:07 +00:00
qued
b83057ac66
build: don't save cache for cache existence check (#1953)
Using `actions/cache@v3` instead of `actions/cache/restore@v3` for the
cache lookup in `setup_ingest` is causing CI to save the cache twice
when there's a cache miss. This is unnecessary, but I'm also a little
concerned it's causing some sort of race condition since I've seen
instances of CI failing to save due to the cache already existing (which
shouldn't be the case on a cache miss).

This PR switches the lookup to a `restore` action to avoid duplicate
ingest cache saving.

#### Testing:

There should only be one "Post Run actions/cache@v3" step in each
`setup_ingest` job when there's a cache miss.


[Here](https://github.com/Unstructured-IO/unstructured/actions/runs/6707740917/job/18227300507)
is an example of a cache miss running with this PR.
2023-10-31 15:49:12 +00:00
Matt Robinson
21b45ae8b0
docs: update to new logo (#1937)
### Summary

Updates the docs and README to use the new Unstructured logo. The README
links to the raw GitHub user content, so the change isn't reflected in
the README on the branch, but will update after the image is merged to
main.

### Testing

Here's what the updated docs look like locally:

<img width="237" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/f13d8b4b-3098-4823-bd16-a6c8dfcffe67">

<img width="1509" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3b8aae5e-34aa-48c0-90f9-f5f3f0f1e26d">

<img width="1490" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/e82a876f-b19a-4573-b6bb-1c0215d2d7a9">
2023-10-31 15:39:19 +00:00
Roman Isecke
4f8cb04663
ingest download-only fix (#1943)
### Description
move check for download only after source node run
2023-10-31 14:05:37 +00:00
Roman Isecke
857195b6e6
expand retry logic in source connectors (#1889)
### Description
All http calls being made by the ingest source connectors have been
isolated and wrapped by the `SourceConnectionNetworkError` custom error,
which triggers the retry logic, if enabled, in the ingest pipeline.
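
The pattern can be sketched as follows. The exception class here is a local stand-in with the same name as the one mentioned above, and the decorator and retry loop (`wrap_network_errors`, `run_with_retry`) are hypothetical, simplified versions of what the pipeline might do.

```python
import functools
import time


class SourceConnectionNetworkError(Exception):
    """Local stand-in for the ingest error that signals a retryable failure."""


def wrap_network_errors(func):
    """Re-raise low-level network failures as the retryable custom error."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (ConnectionError, TimeoutError) as error:
            raise SourceConnectionNetworkError(str(error)) from error

    return wrapper


def run_with_retry(func, max_attempts=3, delay_seconds=0.0):
    """Call func, retrying only when the custom network error is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except SourceConnectionNetworkError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


attempts = {"count": 0}


@wrap_network_errors
def fetch_listing():
    # Simulated HTTP call that fails twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection reset")
    return ["doc-1", "doc-2"]


result = run_with_retry(fetch_listing)
```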
2023-10-31 14:02:28 +00:00
Roman Isecke
963ac35b9c
bugfix/correctly share session handler across ingest docs (#1806)
### Description
Fix session handler
2023-10-31 12:21:23 +00:00
Denis Lusson
f585d489c1
feat: Add include_header argument for partition_csv and partition_tsv (#1764)
This PR adds `include_header` argument for partition_csv and
partition_tsv. This is related to the following feature request
https://github.com/Unstructured-IO/unstructured/issues/1751.

`include_header` is already part of partition_xlsx. The work here is in
line with the current usage and testing of the `include_header` argument
in partition_xlsx.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 08:16:36 +00:00
Ronny H
f78d4d505a
Updated "join Slack" link (#1948)
Updated "join Slack" links on README page.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 00:02:21 -07:00
cragwolfe
ecbc4546e3
build: release commit for unstructured==0.10.28 (#1949) 0.10.28
2023-10-30 23:01:09 -07:00
cragwolfe
841a521790
build(ci): use larger CI runners for setup (#1946)
Hopefully avoid incomplete cache issues. Though to be fair, there is no
solid evidence pointing to runner size as the source of the issue.
2023-10-31 02:09:59 +00:00
Klaijan
a11d4634f1
fix: type error string indices bug (#1940)
Fix TypeError: string indices must be integers. The `annotation_dict`
variable is conditioned to be `None` if instance type is not dict. Then
we add logic to skip the attempt if the value is `None`.
2023-10-30 17:38:57 -07:00
Trevor Bossert
c3e42e9ffc
remove test login (#1945)
This was only used for debugging on a branch, not needed here. It was
failing because it didn't have the environment var set to "ci".
2023-10-30 15:47:43 -07:00
Christine Straub
1f0c563e0c
refactor: partition_pdf() for ocr_only strategy (#1811)
### Summary
Update `ocr_only` strategy in `partition_pdf()`. This PR adds the
functionality to get accurate coordinate data when partitioning PDFs and
Images with the `ocr_only` strategy.
- Add functionality to perform OCR region grouping based on the OCR text
taken from `pytesseract.image_to_string()`
- Add functionality to get layout elements from OCR regions (ocr_layout)
for both `tesseract` and `paddle`
- Add functionality to determine the `source` of merged text regions
when merging text regions in `merge_text_regions()`
- Merge multiple test functions related to "ocr_only" strategy into
`test_partition_pdf_with_ocr_only_strategy()`
- This PR also fixes [issue
#1792](https://github.com/Unstructured-IO/unstructured/issues/1792)
### Evaluation
```
# Image
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image

# PDF
PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf
```
### Test
- **Before update**
All elements have the same coordinate data 


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768)

- **After update**
All elements have accurate coordinate data


![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a)

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-10-30 20:13:29 +00:00
Roman Isecke
680cfbabd4
expand fsspec downstream connectors (#1777)
### Description
Replacing PR
[1383](https://github.com/Unstructured-IO/unstructured/pull/1383)

---------

Co-authored-by: Trevor Bossert <alanboss@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-30 20:09:49 +00:00
qued
645a0fb765
fix: md tables (#1924)
Courtesy @phowat, created a branch in the repo to make some changes and
merge quickly.

Closes #1486.

* **Fixes issue where tables from markdown documents were being treated
as text** Problem: Tables from markdown documents were being treated as
text, and not being extracted as tables. Solution: Enable the `tables`
extension when instantiating the `python-markdown` object. Importance:
This will allow users to extract structured data from tables in markdown
documents.

#### Testing:

On `main` run the following (run `git checkout fix/md-tables --
example-docs/simple-table.md` first to grab the example table from this
branch)
```python
from unstructured.partition.md import partition_md
elements = partition_md("example-docs/simple-table.md")
print(elements[0].category)

```
Output should be `UncategorizedText`. Then run the same code on this
branch and observe the output is `Table`.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-30 14:09:46 +00:00
Benjamin Torres
05c3cd1be2
feat: clean pdfminer elements inside tables (#1808)
This PR introduces `clean_pdfminer_inner_elements` , which deletes
pdfminer elements inside other detection origins such as YoloX or
detectron.
This function returns the clean document.

Also, the ingest-test fixtures were updated to reflect the new standard
output.

The best way to check that this function is working properly is check
the new test `test_clean_pdfminer_inner_elements` in
`test_unstructured/partition/utils/test_processing_elements.py`

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-10-30 07:10:51 +00:00
cragwolfe
2e1419867f
build(ci): bring back medium runners where appropriate (#1936)
Now that "medium" runners are available in GitHub Actions again,
re-enable them.

Intentionally not reverting
76213ecba7
the change to the setup-ingest job since additional CPU shouldn't make a
difference (e.g. it ran in 5 minutes here:
76213ecba7
).
2023-10-30 05:07:17 +00:00
Steve Canny
7373391aa4
fix: sectioner dissociated titles from their chunk (#1861)
### disassociated-titles

**Executive Summary**. Section titles are often combined with the prior
section and then missing from the section they belong to.

_Chunk combination_ is a behavior in which two successive small chunks
are combined into a single chunk that better fills the chunk window.
Chunking can be and by default is configured to combine sequential small
chunks that will together fit within the full chunk window (default 500
chars).

Combination is only valid for "whole" chunks. The current implementation
attempts to combine at the element level (in the sectioner), meaning a
small initial element (such as a `Title`) is combined with the prior
section without considering the remaining length of the section that
title belongs to. This frequently causes a title element to be removed
from the chunk it belongs to and added to the prior, otherwise
unrelated, chunk.

Example:
```python
elements: List[Element] = [
    Title("Lorem Ipsum"),  # 11
    Text("Lorem ipsum dolor sit amet consectetur adipiscing elit."),  # 55
    Title("Rhoncus"),  # 7
    Text("In rhoncus ipsum sed lectus porta volutpat. Ut fermentum."),  # 57
]

chunks = chunk_by_title(elements, max_characters=80, combine_text_under_n_chars=80)

# -- want --------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.')
CompositeElement('Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')

# -- got ---------------------
CompositeElement('Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.\n\nRhoncus')
CompositeElement('In rhoncus ipsum sed lectus porta volutpat. Ut fermentum.')
```

**Technical Summary.** Combination cannot be effectively performed at
the element level, at least not without complicating things with
arbitrary look-ahead into future elements. Much more straightforward is
to combine sections once they have been formed from the element stream.

**Fix.** Introduce an intermediate stream processor that accepts a
stream of sections and emits a stream of sometimes-combined sections.

The solution implemented in this PR builds upon introducing `_Section`
objects to replace the `List[Element]` primitive used previously:

- `_TextSection` gets the `.combine()` method and `.text_length`
property which allows a combining client to produce a combined section
(only text-sections are ever combined).
- `_SectionCombiner` is introduced to encapsulate the logic of
combination, acting as a "filter", accepting a stream of sections and
emitting the same type, just with some resulting from two or more
combined input sections: `(Iterable[_Section]) -> Iterator[_Section]`.
- `_TextSectionAccumulator` is a helper to `_SectionCombiner` that takes
responsibility for repeatedly accumulating sections, characterizing
their length and doing the actual combining (calling
`_Section.combine(other_section)`) when instructed. Very similar in
concept to `_TextSectionBuilder`, just at the section level instead of
element level.
- Remove attempts to combine sections at the element level from
`_split_elements_by_title_and_table()` and install `_SectionCombiner` as
filter between sectioner and chunker.
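
The filter idea can be illustrated with a much-reduced sketch. It models sections as plain strings, joins them with a blank line, and combines any run of consecutive sections that fits the window; the real `_SectionCombiner` works on `_Section` objects and also honors the small-section threshold, so the function name and simplifications here are assumptions.

```python
from typing import Iterable, Iterator, List


def combine_small_sections(
    sections: Iterable[str], max_characters: int, separator: str = "\n\n"
) -> Iterator[str]:
    """Emit sections, merging consecutive ones that fit within the window."""
    pending: List[str] = []
    pending_len = 0
    for section in sections:
        added = len(section) + (len(separator) if pending else 0)
        if pending and pending_len + added > max_characters:
            # The next whole section does not fit, so flush what we have.
            yield separator.join(pending)
            pending, pending_len = [], 0
            added = len(section)
        pending.append(section)
        pending_len += added
    if pending:
        yield separator.join(pending)


sections = [
    "Lorem Ipsum\n\nLorem ipsum dolor sit amet consectetur adipiscing elit.",
    "Rhoncus\n\nIn rhoncus ipsum sed lectus porta volutpat. Ut fermentum.",
]
chunks = list(combine_small_sections(sections, max_characters=80))
```

Because combination happens on whole sections, the `Rhoncus` title stays with its own text instead of being appended to the previous chunk.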
2023-10-30 04:20:27 +00:00
cragwolfe
76213ecba7
build(fixtures-update): all CI jobs on smaller worker (#1934)
2023-10-28 21:19:51 -07:00
cragwolfe
4e669d419f
build(fixtures-update): run on smaller worker (#1932)
Run fixtures-update workflow on smaller github runner until larger one
is available again.
2023-10-27 20:36:05 -07:00
cragwolfe
22b3edb226
build: re-enable ingest on normal CI workers (#1931)
temporarily, until large workers are working again.
2023-10-27 19:46:40 -07:00
cragwolfe
56b1c063a2
chore: moar changelog repair (#1930)
Per a2af72bb79, these changes were part of 0.10.26.
2023-10-27 18:05:56 -07:00
Benjamin Torres
25e7a68d4b
chore: changelog repair (#1929)
Removes duplicated entries in changelog
2023-10-27 17:46:50 -07:00
Yao You
f87731e085
feat: use yolox as default to table extraction for pdf/image (#1919)
- yolox has better recall than yolox_quantized, the current default
model, for table detection
- update logic so that when `infer_table_structure=True` the default
model is `yolox` instead of `yolox_quantized`
- user can still override the default by passing in a `model_name` or
set the env variable `UNSTRUCTURED_HI_RES_MODEL_NAME`
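
The default-selection rule in the bullets above could look something like the sketch below. The function name `resolve_hi_res_model_name` is hypothetical, and the precedence shown (explicit argument first, then environment variable, then default) is an assumption; the PR text only says that either override is possible.

```python
import os
from typing import Optional


def resolve_hi_res_model_name(
    model_name: Optional[str] = None, infer_table_structure: bool = False
) -> str:
    """Pick a hi_res model name (hypothetical helper, not the library's API)."""
    # Assumption: an explicit argument wins over the environment variable.
    if model_name:
        return model_name
    env_model_name = os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME")
    if env_model_name:
        return env_model_name
    # Per the description: yolox for table extraction, yolox_quantized otherwise.
    return "yolox" if infer_table_structure else "yolox_quantized"
```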

## Test:

Partition the attached file with 

```python
from unstructured.partition.pdf import partition_pdf

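filename = "AK_AK-PERS_CAFR_2008_3.pdf"  # the attached file, saved locally

# "hi_res" is the strategy name; yolox is now the default when inferring tables
yolox_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True)
yolox_quantized_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True, model_name="yolox_quantized")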
```

Compare the table elements from the two runs; the yolox (default)
elements should contain the more complete table.


[AK_AK-PERS_CAFR_2008_3.pdf](https://github.com/Unstructured-IO/unstructured/files/13191198/AK_AK-PERS_CAFR_2008_3.pdf)
2023-10-27 15:37:45 -05:00
cragwolfe
ff752e88df
chore: exit evaluation script if nothing to do (#1910)
Relates to CI ingest-tests. The last step of test-ingest.sh is to
calculate evaluation metrics (comparing gold-standard outputs with
actual output files). If no output files were created, as *should* be
the case right now in CI for all Python versions other than 3.10 (which
only test a limited number of files/connectors),
`unstructured/ingest/evaluate.py` would fail.
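
A guard of this kind might be sketched as below. This is only an illustration of the "exit if nothing to do" idea; `has_output_files` and `run_evaluation_if_needed` are hypothetical names, and the actual change in this PR may live in the shell script rather than in Python.

```python
from pathlib import Path


def has_output_files(output_dir: str) -> bool:
    """Return True if output_dir exists and contains at least one file."""
    root = Path(output_dir)
    return root.is_dir() and any(path.is_file() for path in root.rglob("*"))


def run_evaluation_if_needed(output_dir: str) -> bool:
    """Return True if evaluation ran, False if it was skipped."""
    if not has_output_files(output_dir):
        print(f"No output files under {output_dir}; skipping evaluation.")
        return False
    # Metric calculation would go here; it is omitted from this sketch.
    return True
```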
2023-10-27 13:29:05 -05:00
John
670687bb67
update .pre-commit-config to match linting used by CI (#1906)
Closes #1905 
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.

**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 13:24:55 -05:00
Yao You
42f8cf1997
chore: add metric helper for table structure eval (#1877)
- add helper to run inference over an image or pdf of table and compare
it against a ground truth csv file
- this metric generates a similarity score between 0 and 1, where 1 is
a perfect match and 0 is no match at all
- add example docs for testing
- NOTE: this metric is only relevant to table structure detection.
Therefore the input should be just the table area in an image/pdf file;
we are not evaluating table element detection in this metric
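
To make the 0-to-1 score concrete, here is a toy comparison of a predicted table against a ground-truth CSV. It is not the metric added in this PR: it swaps in `difflib.SequenceMatcher` over the flattened cell values purely for illustration, and `read_cells` and `table_similarity` are hypothetical names.

```python
import csv
import io
from difflib import SequenceMatcher
from typing import List


def read_cells(csv_text: str) -> List[str]:
    """Flatten CSV text into a row-major list of stripped cell values."""
    return [cell.strip() for row in csv.reader(io.StringIO(csv_text)) for cell in row]


def table_similarity(predicted_csv: str, ground_truth_csv: str) -> float:
    """Score in [0, 1]: 1 when the cell sequences are identical, 0 when nothing matches."""
    predicted = read_cells(predicted_csv)
    truth = read_cells(ground_truth_csv)
    # Stand-in technique: difflib's ratio over the two cell sequences.
    return SequenceMatcher(None, predicted, truth).ratio()
```

A real table-structure metric would also need to account for row and column alignment, which this flattened comparison ignores.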
2023-10-27 13:23:44 -05:00
Yuming Long
b1534af55c
Fix: replace wrong logger for paddle info (#1916)
### Summary

The logger imported in `paddle_ocr.py` is wrong; it should come from
`unstructured.logger`, since the module lives in the unst repo.

### Test
* install this branch of unst to an unst-api env with `pip install -e .`
* in unst-api repo, run `OCR_AGENT=paddle make run-web-app`
* curl with:
```
curl -X 'POST'   'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json'  \
-H 'Content-Type: multipart/form-data'  \
-F 'files=@sample-docs/layout-parser-paper.pdf'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
```
You should see log lines like:
```
2023-10-27 10:31:48,691 unstructured INFO Processing OCR with paddle...
2023-10-27 10:31:48,969 unstructured INFO Loading paddle with CPU on language=en...
```
not 
```
2023-10-27 10:16:08,654 unstructured_inference INFO Loading paddle with CPU on language=en...
```
even if paddle is not installed.
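
Why the logger choice changes the second column of those lines can be shown with the standard library alone. The sketch below does not use the `unstructured` logger; `make_logger` is a hypothetical helper, and the format string is an assumption chosen to resemble the output above.

```python
import io
import logging


def make_logger(name: str, stream: io.StringIO) -> logging.Logger:
    """Build a named logger that writes timestamp, name, level and message."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output confined to `stream`
    return logger


stream = io.StringIO()
# The same message is attributed to whichever package's logger emits it.
make_logger("unstructured", stream).info("Loading paddle with CPU on language=en...")
make_logger("unstructured_inference", stream).info("Loading paddle with CPU on language=en...")
lines = stream.getvalue().splitlines()
```

Each entry in `lines` carries the logger's name after the timestamp, which is the field that differs between the two log excerpts.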
2023-10-27 16:06:30 +00:00
cragwolfe
8aceda97dd
test: print slowest unittests (#1911)
Show which tests are slowing things down when running `make test`:

E.g., from the CI run in this PR:

```
2023-10-27T05:51:05.6256039Z 105.12s setup    test_unstructured/partition/pdf_image/test_pdf.py::test_chipper_has_hierarchy
2023-10-27T05:51:05.6257784Z 93.47s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[entire_page]
2023-10-27T05:51:05.6259866Z 93.09s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6261818Z 31.70s call     test_unstructured/partition/epub/test_epub.py::test_add_chunking_strategy_on_partition_epub_non_default
2023-10-27T05:51:05.6263774Z 17.22s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-filename]
2023-10-27T05:51:05.6265658Z 17.13s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-spool]
2023-10-27T05:51:05.6273195Z 16.95s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image_hi_res
2023-10-27T05:51:05.6275118Z 16.77s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-rb]
2023-10-27T05:51:05.6276759Z 14.64s call     test_unstructured/partition/test_text.py::test_partition_text_detects_more_than_3_languages
2023-10-27T05:51:05.6278381Z 13.86s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_with_multipage_tiff
2023-10-27T05:51:05.6280137Z 13.51s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_filename[False-None]
2023-10-27T05:51:05.6281995Z 13.41s call     test_unstructured/partition/test_html_partition.py::test_add_chunking_strategy_on_partition_html
2023-10-27T05:51:05.6283640Z 12.80s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_copy_protection
2023-10-27T05:51:05.6285305Z 12.46s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image
2023-10-27T05:51:05.6287250Z 12.39s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6289347Z 12.14s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6291329Z 12.12s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6293388Z 12.12s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_file[True-application/pdf]
2023-10-27T05:51:05.6294869Z 12.08s call     test_unstructured/partition/test_auto.py::test_auto_with_page_breaks
2023-10-27T05:51:05.6296396Z 12.02s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_metadata_date
2023-10-27T05:51:05.6298278Z 11.99s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_metadata_date
```
2023-10-27 11:40:55 -05:00
Ahmet Melek
c249d02fa8
bugfix: ingest pipeline with chunking and embedding does not persist data to the embedding step (#1893)
Closes: #1892 (check the issue for more info)
2023-10-27 13:07:00 +00:00