12 Commits

Author SHA1 Message Date
Steve Canny
ee9be2a3b2
fix: assorted partition_html() bugs (#2113)
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding
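
As a rough illustration of the behaviors listed above (not the code in this commit), the sketch below collects table cells with the standard-library `html.parser`, keeps cells inside `<thead>` and `<tfoot>`, strips whitespace padding, and treats a table with no cell text as empty. `TableCellCollector`, `table_rows`, and `is_empty_table` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Hypothetical helper: gather cell text row by row from an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # <thead>, <tbody> and <tfoot> are not special-cased, so rows in any
        # of them are collected the same way.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace and strip padding around the text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    collector = TableCellCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


def is_empty_table(html):
    # A table counts as empty when no cell contains any text.
    return not any(cell for row in table_rows(html) for cell in row)
```

Under these assumptions, a caller could skip any table for which `is_empty_table()` returns `True` rather than emitting an element with no text.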

Each of these is addressed in a separate commit below.

Fixes #1928.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-11-20 16:29:32 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
in the _form_ of a JSON object (keys moving around) that do not change
the _content_ of that JSON object do not trigger an ingest-test
failure.
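
One common way to get this effect, shown here only as a minimal sketch and not necessarily what this commit does, is to re-serialize each JSON document with sorted keys and fixed formatting before comparing. `canonicalize_json` is a hypothetical helper name.

```python
import json


def canonicalize_json(text: str) -> str:
    """Hypothetical helper: return a canonical string form of a JSON document."""
    obj = json.loads(text)
    # sort_keys orders object keys at every nesting level; a fixed indent
    # removes formatting differences. List order is preserved, since it is
    # part of the content.
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Same content, different key order: the canonical forms are identical.
a = '{"b": 1, "a": {"y": 2, "x": 3}}'
b = '{"a": {"x": 3, "y": 2}, "b": 1}'
assert canonicalize_json(a) == canonicalize_json(b)
```

A real change in content (a different value, a missing key) still produces a different canonical string, so the comparison keeps its value as a test.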
2023-11-15 00:52:58 -08:00
Roman Isecke
680cfbabd4
expand fsspec downstream connectors (#1777)
### Description
Replacing PR
[1383](https://github.com/Unstructured-IO/unstructured/pull/1383)

---------

Co-authored-by: Trevor Bossert <alanboss@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-30 20:09:49 +00:00
Roman Isecke
a2af72bb79
local connector metadata and deserialization fix (#1800)
### Description
* The priority was to fix deserialization of ingest docs. Previously,
the source metadata wasn't being persisted.
* To help debug this, source metadata was added to the local ingest doc
as well.
* Unit test added to make sure the metadata itself was persisted.
* Serialization was forcing docs to fetch source metadata, if they hadn't
already, in order to add it to the generated dict/json. This shouldn't
happen when the underlying variable `_source_metadata` is `None`.
Now the doc can be serialized without any calls being made.
* Serialization was moved to the `to_dict` method to make it more
universal.
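
The serialization point above can be sketched as follows. This is an illustrative stand-in, not the connector code: `ExampleIngestDoc`, `update_source_metadata`, and `fetch_count` are hypothetical names, and only `_source_metadata` and `to_dict` come from the description.

```python
from typing import Any, Dict, Optional


class ExampleIngestDoc:
    """Hypothetical ingest doc with lazily fetched source metadata."""

    def __init__(self, path: str):
        self.path = path
        self._source_metadata: Optional[Dict[str, Any]] = None
        self.fetch_count = 0  # counts simulated calls to the source

    def update_source_metadata(self) -> None:
        # Stand-in for a call to the underlying data source.
        self.fetch_count += 1
        self._source_metadata = {"path": self.path, "exists": True}

    @property
    def source_metadata(self) -> Dict[str, Any]:
        # Reading the property fetches the metadata on first use.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata

    def to_dict(self) -> Dict[str, Any]:
        # Read the private attribute directly so that serializing a doc
        # never triggers a fetch; an unfetched doc serializes as None.
        return {"path": self.path, "_source_metadata": self._source_metadata}
```

With this shape, `to_dict()` on a fresh doc makes no calls, while a doc whose metadata was already fetched carries it through serialization.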

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-23 15:51:52 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects the document language using the `langdetect` package.
Adds new kwargs that let the user set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`).

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
rvztz
3be9f089b3
feat: adds data source properties to fsspec-based connectors (#1279)
2023-09-15 05:56:44 +00:00
Amanda Cameron
a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
You should see the table in the elements list.
2023-09-11 11:14:11 -07:00
Matt Robinson
22974f61ce
fix: separate elements by <br> tag in partition_html (#1314)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.
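
To make the idea concrete, here is a minimal standard-library sketch of splitting text on `<br>` tags. It is not the `partition_html` implementation; `BrSplitter` and `split_on_br` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Hypothetical helper: split text content into segments at <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and skip empty segments (e.g. "<br><br>").
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing "<br/>" via handle_startendtag.
        if tag == "br":
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last <br>


def split_on_br(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()
    return parser.segments
```

For example, `split_on_br("<p>January 2023<br><br>Knowledge grows fractally.</p>")` yields two segments instead of one combined string.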


### Testing

The following code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)

len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements]))
```

The output should be:

```
January 2023

(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.

Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
2023-09-07 13:16:31 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes GitHub Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
2023-08-16 04:33:06 +00:00
Matt Robinson
f4ddf53590
feat: track emphasized text in partition_html (#1034)
* Feat/965 track emphasized text html (#1021)

* feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML

* feat: add `include_tail_text` parameter to `_construct_text`

* test: add test case for `_get_emphasized_texts_from_tag`

* test: add `emphasized_texts` to metadata

* chore: update changelog & version

* fix tests

* fix lint errors

* chore: update changelog

* chore: small comment updates

* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag

* chore: update changelog

* Update ingest test fixtures (#1026)

Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#1035)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-08-03 16:24:25 +00:00
ryannikolaidis
70365ea42d
chore: add Dropbox secrets to CI environments (#1029)
2023-08-03 02:18:29 +00:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844)
2023-06-30 17:08:27 -07:00