22 Commits

Author SHA1 Message Date
Steve Canny
9ece0b5ad2
fix: improve false-positive Title elements on Chinese text (#3836)
**Summary**
Improve element-type mapping for Chinese text. Fixes bug where Chinese
text would produce large numbers of false-positive `Title` elements.

Fixes #3084

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2024-12-18 01:16:42 +00:00
Roman Isecke
9049e4e2be
feat/remove ingest code, use new dep for tests (#3595)
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-10-15 10:01:34 -05:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
2024-08-19 18:56:53 +00:00
Roman Isecke
1df7908f03
feat: save file id for all fsspec connectors if present (#3405)
### Description

If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-07-19 13:30:21 +00:00
ryannikolaidis
17bc55e7be
fix: relative path / permissions issues with v2 fsspec connectors (#3186)
When the v2 fsspec connectors currently generate the relative path, they
may introduce a path with a leading slash (this happens in the case of
the Box connector, which is a subclass of fsspec). When this happens
this results in the paths unintentionally being treated as absolute
paths. As a result, the ingest pipeline attempts to write files to
directories at root level, which in turn raises permission issues.

Note: Box expected results needed to update now that it's no longer
failing.

Aside: found that our tests were unintentionally skipping `box.sh` tests
because we were intending to skip `dropbox.sh` and we use regex to match
if a given test is in skip tests. This adds changes to force an exact
match.

## Changes

* Strip leading slashes during the creating of relative paths in fsspec
connectors
* Add expected results for Box connector
* (bonus): `make tidy` altered an unrelated file by removing an
unnecessary call of `pass`
* (bonus): check exact match for skipped ingest tests which fixes Box
tests getting skipped

## Testing


[Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085)
for the Box connector was failing. It was accidentally getting skipped
(see changes above). It is now no longer skipped and passing.
2024-06-12 03:39:35 +00:00
Michał Martyniak
001fa17c86
Preparing the foundation for better element IDs (#2842)
Part one of the issue described here:
https://github.com/Unstructured-IO/unstructured/issues/2461

It does not change how hashing algorithm works, just reworks how ids are
assigned:
> Element ID Design Principles
> 
> 1. A partitioning function can assign only one of two available ID
types to a returned element: a hash or UUID.
> 2. All elements that are returned come with an ID, which is never
None.
> 3. No matter which type of ID is used, it will always be in string
format.
> 4. Partitioning a document returns elements with hashes as their
default IDs.

Big thanks to @scanny for explaining the current design and suggesting
ways to do it right, especially with chunking.


Here's the next PR in line:
https://github.com/Unstructured-IO/unstructured/pull/2673

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: micmarty-deepsense <micmarty-deepsense@users.noreply.github.com>
2024-04-16 21:14:53 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Steve Canny
ee9be2a3b2
fix: assorted partition_html() bugs (#2113)
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding

Each of these is addressed in a separate commit below.

Fixes #1928.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-11-20 16:29:32 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
2023-11-15 00:52:58 -08:00
Christine Straub
475066ba7c
Fix: fast strategy fallback to ocr only (#2055)
Closes #2038.
### Summary
The `fast` strategy should not fall back to a more expensive strategy.

### Testing
For
[9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf),
the following code should return an empty list.

```
elements = partition(filename=filename, strategy="fast")
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2023-11-14 18:46:41 +00:00
Steve Canny
d06bcc41bb
fix(docx): improve page-break detection (#2036)
Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements
present in the document XML. Page breaks are NOT reliably indicated by
"hard" page-breaks inserted by the author and when present are redundant
to a `w:lastRenderedPageBreak` element so cause over-counting if used.

Use rendered page-breaks only.
2023-11-09 20:34:30 +00:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
rvztz
3be9f089b3
feat: adds data source properties to fsspec-based connectors (#1279) 2023-09-15 05:56:44 +00:00
shreyanid
c2853e4ac3
refactor languages parameter for pdf partition functions (#1334)
### Summary

In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.

### Details

Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.

Coming up: langcode standardization, language detection

### Test

Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.

ex:
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```
2023-09-12 16:15:26 +00:00
Amanda Cameron
a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
2023-09-11 11:14:11 -07:00
Matt Robinson
22974f61ce
fix: separate elements by <br> tag in partition_html (#1314)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.


### Testing

The following is code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)

len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```

The output should be:

```python
January 2023

(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.

Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
2023-09-07 13:16:31 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes Github Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
2023-08-16 04:33:06 +00:00
Christine Straub
b76d2ee745
feat: track emphasized text msword (#1048)
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph

* chore: add docstring

* chore: fix lint errors

* feat: ignore spaces when extracting emphasized texts from a paragraph

* feat: add functionality to track emphasized text (`bold/italic` formatting) from table

* test: add test case for grabbing emphasized texts from element metadata

* chore: fix lint errors

* chore: update changelog & version

* Update ingest test fixtures (#1047)
2023-08-04 17:04:12 -04:00
Matt Robinson
f4ddf53590
feat: track emphasized text in partition_html (#1034)
* Feat/965 track emphasized text html (#1021)

* feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML

* feat: add `include_tail_text` parameter to `_construct_text`

* test: add test case for `_get_emphasized_texts_from_tag`

* test: add `emphasized_texts` to metadata

* chore: update changelog & version

* fix tests

* fix lint errors

* chore: update changelog

* chore: small comment updates

* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag

* chore: update changelog

* Update ingest test fixtures (#1026)

Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#1035)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-08-03 16:24:25 +00:00
David Potter
1542607892
feat: adds Box connector (#996) 2023-08-01 01:10:10 +00:00