585 Commits

Author SHA1 Message Date
Amanda Cameron
7fd81dc7df
Table processing test for RTF (#1388)
This PR does two things:
1. Adds test case (and alters sample doc) for rtf and epub files with
table
2. Adds `xls/x` file extension to `skip_infer_table_types` default list

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-09-12 18:27:05 -07:00
qued
6595632a57
enhancement: backup text categorization (#1322)
Currently there are some cases when `partition_pdf` is run using the
`hi_res` strategy, in which elements can come back with category
`UncategorizedText`. This happens when the detection model fails to
detect an element, but we're able to find it anyway either because it
was embedded in the PDF, or we found it using OCR.

This commit is to allow for attempting to categorize these uncategorized
elements using our text-based classification function,
`element_from_text`.
2023-09-12 20:32:48 +00:00
shreyanid
c2853e4ac3
refactor languages parameter for pdf partition functions (#1334)
### Summary

In order to support language functionality other than Tesseract OCR, we
want to represent languages provided for either partitioning accuracy or
OCR as a standard list of langcodes as strings.

### Details

Adds `languages` (a list of strings) as a parameter to pdf partitioning
functions. Marks `ocr_languages` for deprecation. Adds a new file
`lang.py` for language-related helper functions.

Coming up: langcode standardization, language detection

### Test

Call `partition_pdf` or `partition_pdf_or_image` with a variety of
strategies, languages, or `ocr_languages`.
- inclusion of `ocr_languages` as a parameter should display a
deprecation warning
- the other valid call outputs should be no different from the current
outputs.

ex:
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="example-docs/DA-1p.pdf", strategy="hi_res", languages=["eng", "spa"])
print("\n\n".join([str(el) for el in elements]))
```
2023-09-12 16:15:26 +00:00
John
c58b261feb
chunk_by_title decorator (#1304)
### Summary

Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.

Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.

### Testing

from unstructured.partition.html import partition_html

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
2023-09-11 21:00:14 +00:00
Amanda Cameron
a501d1d18f
Adding table extraction to partition_html (#1324)
Adding table extraction to HTML partitioning.

This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.

```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
2023-09-11 11:14:11 -07:00
cragwolfe
d0749d181f
fix: avoid PDF sorting error on negative coords (#1361)
The default sorting algorithm for PDF's, "xycut," would cause an error
when partitioning a document if Y coordinate points were negative. This
change checks for that condition (or more broadly, any negative
coordinates) and falls back to the "basic" sort if that is the case.

This PR does not address the underlying issue of "bad points" which
still should be investigated. However, the sorting code should be less
brittle to unexpected bounding boxes in the first case.

Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296
2023-09-10 19:29:49 -07:00
suraj chauhan
f4bf1fa270
Chore: Libmagic detection for "application/octet-stream" when it is not a zip file. (#1347)
Addressed the issue #494 .
Updated the `_detect_filetype_from_octet_stream()` function to use
libmagic to infer the content type of file when it is not a zip file.
2023-09-08 18:49:00 +00:00
pravin-unstructured
8641fe39dc
Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323)
If a layout model is used from unstructured-inference, you get back
class probabilities in the element metadata from partition.
extra-pdf-image-in in requirements already has the newest version of
unstructured-inference in there without a pinned version. Is there any
place else that the unstructured-inference version needs to be updated
to the required release version, 0.5.22?
2023-09-07 22:56:43 -04:00
Wahab Alshahin
e4e25c9feb
Add clean_ligatures to core cleaners (#1326)
# Background


[Ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)#Ligatures_in_Unicode_(Latin_alphabets))
can sometimes show up during the text extraction process when they
should not. Very common examples of this are with the Latin `f` related
ligatures which can be **very subtle** to spot by eye (see example
below), but can wreak havoc later.

```python
"ff": "ff",
"fi": "fi",
"fl": "fl",
"ffi": "ffi",
"ffl": "ffl",
```

Several libraries already do something like this. Most recently,
`pdfplumber` added this sort of capability as part of the text
extraction process, see https://github.com/jsvine/pdfplumber/issues/598

Instead of incorporating any sort of breaking change to the PDF text
processing in `unstructured`, it is best to add this as another cleaner
and allow users to opt in. In turn, the `clean_ligatures` method has
been added in this PR - with accompanying tests.

# Example

Here is an example PDF that causes the issue. For example: `Benefits`,
which should be `Benefits`.


[example.pdf](https://github.com/Unstructured-IO/unstructured/files/12544344/example.pdf)

```bash
curl -X 'POST' \
    'https://api.unstructured.io/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -H 'unstructured-api-key: ${UNSTRUCTURED_API_KEY}' \
    -F 'files=@example.pdf' \
    -s | jq -C .
```

# Notes

An initial list of mappings was added with the most common ligatures.
There is some subjectivity to this, but this should be a relatively safe
starting set. Can always be expanded as needed.
2023-09-07 21:30:18 +00:00
Matt Robinson
22974f61ce
fix: separate elements by <br> tag in partition_html (#1314)
### Summary

Closes #1230. Updates `partition_html` to split on `<br>` tags that
appear within text elements.


### Testing

The following is code previously produced one giant element on `main`.

```python
from unstructured.partition.html import partition_html

filename = "example-docs/ideas-page.html"
elements = partition_html(filename=filename)

len(elements) # Should be 4
print("\n\n".join([str(el) for el in elements)])
```

The output should be:

```python
January 2023

(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)

The way to get new ideas is to notice anomalies: what seems strange,
or missing, or broken? You can see anomalies in everyday life (much
of standup comedy is based on this), but the best place to look for
them is at the frontiers of knowledge.

Knowledge grows fractally.
From a distance its edges look smooth, but when you learn enough
to get close to one, you'll notice it's full of gaps. These gaps
will seem obvious; it will seem inexplicable that no one has tried
x or wondered about y. In the best case, exploring such gaps yields
whole new fractal buds.
```
2023-09-07 13:16:31 +00:00
Yao You
1a0b737e9c
revert pdf changes and add new pdf for empty page testing (#1255)
- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page says "this page is
intentionally left blank" for tests
2023-09-01 22:33:06 +00:00
Yao You
9191be7ae8
[issue 1237] fix empty coordinates break sorting bug (#1242)
This PR resolves #1237 by checking if any coordinates are `None`; if yes
do not attempt to sort with xy cut method and return the list as is.
2023-09-01 03:15:10 +00:00
Yao You
27773132b7
[issue 1247] fix element and bbox mismatch bug (#1250)
This PR resolves #1247 by using the matching elements and bbox for
coordinate computation.

This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )

Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.

This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
2023-08-30 23:34:55 +00:00
Matt Robinson
c49df62967
feat: partition_xml infers element type on each leaf node (#1249)
### Summary

Closes #1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_xml`. If
`xml_keep_tags=True`, the file is treated like a text file still and
partitioning is still delegated to `partition_text`.

Also adds the option to pass `text` as an input to `partition_xml`.

### Testing

Create a `parrots.xml` file that looks like:

```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.

Conures are feathery and like to dance.</description></parrot></xml>
```

Run:

```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict

elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```

One `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.

```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure A conure is a very friendly bird.',
  'type': 'NarrativeText'},
 {'element_id': '859ecb332da6961acd2fb6a0185d1549',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```

One the feature branch, the output is the following, and the tags are
correctly separated.

```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure',
  'type': 'Title'},
 {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'A conure is a very friendly bird.\n'
          '\n'
          'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```
2023-08-30 17:07:10 -04:00
Charles
de855bb4ed
enhancement: new extract function for detecting image URLs (#1212)
- Adds new feature discussed in GitHub Issue #1117 and in slack
2023-08-30 11:29:15 -07:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Matt Robinson
f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element if over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Section under `combine_under_n_chars` characters are combined. The
default is `500`.

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-08-29 16:04:57 +00:00
Klaijan
4b830e3b05
fix: return ocr coordinates points as tuple (#1219)
The `add_pytesseract_bbox_to_elements` returned the
`metadata.coordinates.points` as `Tuple` whereas other strategies
returned as `List`. Make change accordingly for consistency.

Previously: 
```
element.metadata.coordinates.points = [
            (x1, y1),
            (x2, y2),
            (x3, y3),
            (x4, y4),
]
```
Currently:
```
element.metadata.coordinates.points = (
            (x1, y1),
            (x2, y2),
            (x3, y3),
            (x4, y4),
)
```
2023-08-28 13:31:55 -04:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.

### Testing

```python
from unstructured.partition_email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition_msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
2023-08-25 17:09:25 -07:00
John
5872fa23c3
Extract coordinates from PDFs and images when using OCR only strategy (#1163)
### Summary
Closes #983 
Creates new function `add_pytesseract_bbox_to_elements`
Fixes typos in docstrings

### Testing
```
from unstructured.partition.image import partition_image
from PIL import Image, ImageDraw

png_filename="example-docs/english-and-korean.png"
png_elements = partition_image(filename=png_filename, strategy="ocr_only")
png_image = Image.open(png_filename)
draw = ImageDraw.Draw(png_image)
draw.polygon(png_elements[0].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[1].metadata.coordinates.points, outline="red", width=2)
draw.polygon(png_elements[2].metadata.coordinates.points, outline="red", width=2)
output = "example-docs/english-and-korean-box.png"
png_image.save(output)
png_image.close()
```
2023-08-25 05:32:12 +00:00
Matt Robinson
c578b85699
fix: respect <pre> tag order in partition_html (#1197)
### Summary

Closes #1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.

### Testing

The elements in the following example should be in the correct order.

```python
    from unstructured.partition.html import partition_html

    html_text = """
    <pre>The Big Brown Bear</pre>
    <div>The big brown bear is growling.</div>
    <pre>The big brown bear is sleeping.</pre>
    <div>The Big Blue Bear</div>
    """
    elements = partition_html(text=html_text)
    print("\n\n".join([str(el) for el in elements]))
```
2023-08-25 04:14:48 +00:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Klaijan
1524841cd9
feat: supports multipage tiff (#1131)
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and

- confirms that the function reads all the pages in the TIFF.

- page number is added to the metadata

This PR is branched from and developed on top of 6d6be99 commit.
2023-08-24 15:12:50 +00:00
Matt Robinson
cdae53cc29
chore: deprecation warning for file_filename (#1191)
### Summary

Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.

### Testing

```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
2023-08-24 07:02:47 +00:00
Charles
1ddf542e14
fix: Don't call extractable_elements if strategy is ocr_only (#1160)
- fixes #1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
2023-08-22 19:43:33 -07:00
Austin Walker
e7d189fcc8
chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
2023-08-22 16:05:02 -07:00
Matt Robinson
ad595d32f6
enhancement: tell users to install missing extras (#1167)
### Summary

Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.

### Testing

First `pip uninstall ebooklib`. Then run

```python
from unstructured.partition.auto import partition

partition(filename="example-docs/winter-sports.epub")
```

The error should look like

```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
2023-08-22 03:00:21 +00:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00
John
69edffb0c0
bug: update partition_msg and partition_email so attachments also receive metadata_last_modified kwarg (#1134)
### Summary
Closes #1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.

### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email

filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")

# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified 
elements[-1].text
elements[-1].metadata.last_modified

email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")

email_elements[1].metadata.last_modified 
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
2023-08-18 23:21:11 +00:00
Austin Walker
dd243b4fd9
chore: pass ocr_mode in partition_pdf_or_image (#1154)
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).

I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
2023-08-18 20:59:08 +00:00
cragwolfe
1456f06b2d
chore: skip consistently failing test in main (#1150)
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .

This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
2023-08-18 10:06:17 -07:00
John
9f7bd6127b
enhancement: Add include_header kwarg for xlsx, default True(#1125)
Closes Github issue #1121

Adds include_header kwarg to partition_xlsx and change default behavior to True.
2023-08-17 04:16:23 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes Github Issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
2023-08-16 04:33:06 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes Github issue #1010

adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines
2023-08-15 09:35:54 -07:00
Mike Lay
79a1eb8683
Handle inline and lacking filename (#1109)
Handle Content-Disposition: inline and attachment without filename

* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
2023-08-14 18:38:53 +00:00
Mike Lay
2e0ab86c6a
Fix attachments with = in filename (#1110)
Fix attachments with = in filename

* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
2023-08-13 20:35:18 -07:00
Christine Straub
fc2699ff06
Fix/1057 etree parser error tsv (#1106)
* feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji
* chore: update changelog & version
2023-08-14 01:22:36 +00:00
Christine Straub
4a3176885f
Fix/1057 etree parser error xlsx (#1094)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version

* feat: add functionality to switch html text parser based on whether the html text contains emoji

* chore: update changelog & version

* fix lint errors

* test: revert the `EXPECTED_XLS_TEXT_LEN` value back

* feat: always use `soupparser_fromstring` to parse `html text`

* fix lint error
2023-08-13 12:20:33 -07:00
cragwolfe
02af625b93
chore: fix fickle test to not be so time sensitive (#1105) 2023-08-13 10:58:46 -07:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
2023-08-11 11:02:37 +00:00
Christine Straub
d26ab1deac
fix: etree parser error (#1077)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version
2023-08-10 23:28:57 +00:00
cragwolfe
6779918406
build(release): bump unstructured-inference (#1074)
* build(release): bump unstructured-inference

Related to downstream issue:
Unstructured-IO/unstructured-api#182

And upstream PR:
Unstructured-IO/unstructured-inference#165

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
2023-08-10 20:57:46 +00:00
Chris Pappalardo
ef5091f276
feat: added UUID option for element_id arg in element constructor (#1076)
* added UUID option for element_id arg in element constructor and updated unit tests

* updated CHANGELOG and bumped to dev2
2023-08-09 18:32:20 -04:00
Yuming Long
112347aa0d
doc: update API doc to sync with new parameter in prod API (#1049)
* doc doc

* changelog and version

* sample docs -> example docs

* nit on compute cost doc

* pass empty dict not none

* note note

* cutting release
2023-08-09 11:09:37 -04:00
Klaijan
ad386af8b5
Klaijan/auto paragraph grouper (#994)
* add auto_paragraph_grouper. add line break pattern.

* combine group_broken_paragraph and blank_line_grouper function

* fix make check errors

* fix make check errors

* fix make check errors

* fix make check errors

* run make tidy to fix errors

* tidy core.py and text.py

* fix blank-line breaker to extends the result and replace new line with space

* fix function name typo

* call group_broken_paragraphs for blank_line_grouper

* edit function name from one_line_grouper to new_line_grouper for consistency

* edit threshold from 0.5 to 0.1

* edit threshold from 0.5 to 0.1

* Revert "call group_broken_paragraphs for blank_line_grouper"

This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.

* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1

* edit test_text assertion. remove all BULLETS_PATTERN.

* Update ingest test fixtures (#1052)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* edit test case in test_xml_partition

* update assertion on test_auto

---------

Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-07 18:37:18 -04:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
Christine Straub
b76d2ee745
feat: track emphasized text msword (#1048)
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph

* chore: add docstring

* chore: fix lint errors

* feat: ignore spaces when extracting emphasized texts from a paragraph

* feat: add functionality to track emphasized text (`bold/italic` formatting) from table

* test: add test case for grabbing emphasized texts from element metadata

* chore: fix lint errors

* chore: update changelog & version

* Update ingest test fixtures (#1047)
2023-08-04 17:04:12 -04:00
kravetsmic
bef93aef6e
fix: email addresses shouldn't be flagged as titles (#957)
* feat: add func for checking on EmailAddress type

* feat: add EmailAddress type

* feat: add check for email type

* feat: add test for cheking EmailAdress type

* feat: update existing example files with email

* feat: add new exampe fileds with email in the text

* fix: apply linter

* feat: update changelog file

* feat: add test for is_email_address function

* don't push

* fix: clean up code

* apply linter

* fix: clean up

* fix: remove file chaanges

* fix: remove not used  files for email address test

* fix: remove not necessary tests

* clean up

* fix: apply linter

* fix: update CHANGELOG

* fix: change version

* fix: fix  msg test

* fix: apply linter for tests

* fix: remove spaces

* fix: apply linter with longer line

* feat: update documentation

* fix: remove duplicates

* Update getting_started.rst

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 11:28:36 -04:00
Hynek Kydlíček
47b20119c3
fix: extract emojis with partition_xlsx (#1009)
* 🐛 fixxed emoji xlsx bug

* update version and changelog

* check if beautifulsoup exists

* update docs

* fix html parser call

* fix failing attachment test

*   added emoji test, added requirment fixed dependency

* 🐛 dependency

* 🐛 correct depeendency

* linting, linting, linting

* check for bs4

* skip auto xls filename test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 10:14:08 -04:00