5 Commits

Author SHA1 Message Date
Matt Robinson
c49df62967
feat: partition_xml infers element type on each leaf node (#1249)
### Summary

Closes #1229. Updates `partition_xml` so that, when `xml_keep_tags=False`,
the element type is inferred on each leaf node instead of delegating
splitting and partitioning to `partition_text`. If `xml_keep_tags=True`,
the file is still treated like a text file and partitioning is still
delegated to `partition_text`.

Also adds the option to pass `text` as an input to `partition_xml`.
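
As a rough illustration of what "inferring the type on each leaf node" means, here is a minimal standard-library sketch. It is not the code in this PR: `iter_leaf_text` and `guess_type` are hypothetical helpers, the heuristic is a stand-in for the real element classifier, and the sample document is a simplified variant of the `parrots.xml` file used in the testing steps.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def iter_leaf_text(xml_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (tag, text) for every element that has no child elements."""
    root = ET.fromstring(xml_text)
    for node in root.iter():
        # len(node) counts child elements, so 0 means this is a leaf.
        if len(node) == 0 and node.text and node.text.strip():
            yield node.tag, node.text.strip()


def guess_type(text: str) -> str:
    """Hypothetical stand-in heuristic, NOT the library's classifier."""
    return "NarrativeText" if text.endswith(".") else "Title"


sample = (
    "<parrots><parrot><name>Conure</name>"
    "<description>A conure is a very friendly bird.\n\n"
    "Conures are feathery and like to dance.</description>"
    "</parrot></parrots>"
)

# Each leaf is typed on its own, so <name> is never merged into <description>.
leaves = [(tag, guess_type(text)) for tag, text in iter_leaf_text(sample)]
print(leaves)
```

Because each leaf is classified separately, the text of one tag cannot bleed into a neighbouring tag, which is the behaviour change described above.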

### Testing

Create a `parrots.xml` file that looks like:

```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.

Conures are feathery and like to dance.</description></parrot></xml>
```

Run:

```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict

elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```

On `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.

```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure A conure is a very friendly bird.',
  'type': 'NarrativeText'},
 {'element_id': '859ecb332da6961acd2fb6a0185d1549',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```

On the feature branch, the output is the following, and the tags are
correctly separated.

```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure',
  'type': 'Title'},
 {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'A conure is a very friendly bird.\n'
          '\n'
          'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```
2023-08-30 17:07:10 -04:00
Matt Robinson
f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element is over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
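
The length-based rules above can be sketched in plain Python. This is a simplified illustration, not the implementation in this PR: elements are modelled as hypothetical `(type, text)` tuples, the metadata rule is left out, and the exact comparison and merge order are assumptions.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text); hypothetical stand-in


def section_length(section: List[Element]) -> int:
    return sum(len(text) for _, text in section)


def chunk_by_title_sketch(
    elements: List[Element],
    new_after_n_chars: int = 1500,
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    # Pass 1: open a new section at each Title, or once the current
    # section has already grown past new_after_n_chars. Elements are
    # never split, so one long element can push a section over the limit.
    sections: List[List[Element]] = []
    for kind, text in elements:
        current = sections[-1] if sections else None
        if (
            current is None
            or kind == "Title"
            or section_length(current) > new_after_n_chars
        ):
            sections.append([(kind, text)])
        else:
            current.append((kind, text))

    # Pass 2 (assumed merge order): fold a section into the previous one
    # when the previous one is still under combine_under_n_chars.
    combined: List[List[Element]] = []
    for section in sections:
        if combined and section_length(combined[-1]) < combine_under_n_chars:
            combined[-1].extend(section)
        else:
            combined.append(section)
    return combined


doc = [
    ("Title", "A"),
    ("NarrativeText", "x" * 600),
    ("Title", "B"),
    ("NarrativeText", "y" * 100),
    ("Title", "C"),
    ("NarrativeText", "z" * 100),
]
chunks = chunk_by_title_sketch(doc)
print([section_length(c) for c in chunks])
```

In this sketch the two short sections under titles `B` and `C` end up in one chunk, while the longer section under `A` stays on its own.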

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-08-29 16:04:57 +00:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP encrypted content is detected, a
warning is emitted and an empty list of elements is returned.
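
For context, RFC 2015 marks an encrypted message with a `multipart/encrypted` content type whose `protocol` parameter is `application/pgp-encrypted`. A minimal detection sketch using only the standard-library `email` package might look like the following. `is_pgp_encrypted` is a hypothetical helper and the check in this PR may differ.

```python
from email import message_from_string
from email.message import Message


def is_pgp_encrypted(msg: Message) -> bool:
    """Return True if the top-level content type follows RFC 2015."""
    if msg.get_content_type() != "multipart/encrypted":
        return False
    protocol = msg.get_param("protocol")
    # get_param returns None when the parameter is absent.
    return isinstance(protocol, str) and protocol.lower() == "application/pgp-encrypted"


# Made-up sample message shaped like the example in the RFC.
raw_encrypted = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Mime-Version: 1.0\n"
    "Content-Type: multipart/encrypted; boundary=foo;\n"
    '    protocol="application/pgp-encrypted"\n'
    "\n"
    "--foo\n"
    "Content-Type: application/pgp-encrypted\n"
    "\n"
    "Version: 1\n"
    "\n"
    "--foo\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "-----BEGIN PGP MESSAGE-----\n"
    "-----END PGP MESSAGE-----\n"
    "\n"
    "--foo--\n"
)
raw_plain = "From: alice@example.com\nContent-Type: text/plain\n\nhello\n"

print(is_pgp_encrypted(message_from_string(raw_encrypted)))
print(is_pgp_encrypted(message_from_string(raw_plain)))
```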

### Testing

```python
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition.msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msg(filename=filename)
```
2023-08-25 17:09:25 -07:00
Matt Robinson
cdae53cc29
chore: deprecation warning for file_filename (#1191)
### Summary

Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
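
The general pattern for deprecating a kwarg in favor of a new one can be sketched as follows. `resolve_metadata_filename` is a hypothetical helper, and both the warning text and the choice of `ValueError` are assumptions; the PR only states that passing both kwargs raises an error.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Prefer metadata_filename, accept the deprecated file_filename with a warning."""
    if metadata_filename is not None and file_filename is not None:
        # Assumption: the real functions may raise a different error type.
        raise ValueError(
            "Only one of metadata_filename and file_filename may be specified."
        )
    if file_filename is not None:
        warnings.warn(
            "The file_filename kwarg is deprecated. Use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename


# The deprecated kwarg still works, but a DeprecationWarning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_metadata_filename(file_filename="test.epub")

print(resolved, [w.category.__name__ for w in caught])
```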

### Testing

```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
2023-08-24 07:02:47 +00:00
Jack Retterer
a35ff890e0
Update docs jack (#1157)
Documentation Overhaul

- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
2023-08-21 10:27:32 -07:00