mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-27 23:24:27 +00:00
### Summary
Currently, the email partitioner strips only `=\n` sequences
(quoted-printable soft line breaks) when cleaning email content. However,
email content sometimes contains `=\r\n` sequences instead, especially
when read from file-like objects such as `SpooledTemporaryFile` (the file
type used in our API). This PR updates the email partitioner to strip
both `=\n` and `=\r\n` during cleaning.
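The idea can be sketched with a single regular expression that treats the carriage return as optional. This is a minimal illustration, not the partitioner's actual code: the helper name `strip_soft_line_breaks` and the `SOFT_LINE_BREAK` pattern are hypothetical.

```python
import re

# Hypothetical helper for illustration; not the function used in `unstructured`.
# Matches "=" followed by either "\n" or "\r\n".
SOFT_LINE_BREAK = re.compile(r"=\r?\n")


def strip_soft_line_breaks(text: str) -> str:
    """Remove quoted-printable soft line breaks with LF or CRLF endings."""
    return SOFT_LINE_BREAK.sub("", text)


lf_text = "Make sure to =\nRSVP!"
crlf_text = "Make sure to =\r\nRSVP!"

print(strip_soft_line_breaks(lf_text))
print(strip_soft_line_breaks(crlf_text))
```

Both calls should print the same joined sentence, because the pattern covers both line-ending styles.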
### Testing
```
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"

# Partition directly from the file path.
elements = partition_email(filename=filename)
print(f"From filename: {elements[3].text}")

# Partition from a SpooledTemporaryFile, the file type used in the API.
with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
    print(f"From spooled_temp_file: {elements[3].text}")
```
**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
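One plausible explanation for the difference on `main` is newline translation. Python's text-mode `open` converts `\r\n` to `\n` on read, while bytes copied into a `SpooledTemporaryFile` keep their original line endings. The sketch below demonstrates only that standard-library behavior. Whether `partition_email` reads a `filename` in text mode is an assumption here, and the sample bytes and `sample.eml` path are made up for illustration.

```python
import os
import tempfile

# Made-up sample: a quoted-printable soft line break with a CRLF ending.
raw = b"Make sure to =\r\nRSVP!"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.eml")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode applies universal newline translation: "\r\n" becomes "\n".
    with open(path, "r", encoding="utf-8") as f:
        from_text_mode = f.read()

    # A binary copy into a SpooledTemporaryFile keeps the bytes unchanged.
    with open(path, "rb") as f:
        spooled = tempfile.SpooledTemporaryFile()
        spooled.write(f.read())
        spooled.seek(0)
        from_spooled = spooled.read().decode("utf-8")
        spooled.close()

print(repr(from_text_mode))
print(repr(from_spooled))
```

A cleaner that only looks for `=\n` would match the first string but not the second, which is consistent with the stray `=` seen in the `main` output.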
39 lines
1.3 KiB
Plaintext
MIME-Version: 1.0
Date: Wed, 21 Dec 2022 10:28:53 -0600
Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com>
Subject: Family Day
From: Mallori Harrell <mallori@unstructured.io>
To: Mallori Harrell <mallori@unstructured.io>
Content-Type: multipart/alternative; boundary="0000000000005c115405f0590ce4"

--0000000000005c115405f0590ce4
Content-Type: text/plain; charset="UTF-8"

Hi All,

Get excited for our first annual family day!

There will be face painting, a petting zoo, funnel cake and more.

Make sure to RSVP!

Best.

--
Mallori Harrell
Unstructured Technologies
Data Scientist

--0000000000005c115405f0590ce4
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi All,<div><br></div><div>Get excited for our first annua=
l family day!=C2=A0</div><div><br></div><div>There will be face painting, =
a petting zoo, funnel cake and more.</div><div><br></div><div>Make sure to =
RSVP!</div><div><br></div><div>Best.<br clear=3D"all"><div><br></div>-- <br=
><div dir=3D"ltr" class=3D"gmail_signature" data-smartmail=3D"gmail_signatu=
re"><div dir=3D"ltr">Mallori Harrell<div>Unstructured Technologies<br><div>=
Data Scientist</div><div><br></div></div></div></div></div></div>

--0000000000005c115405f0590ce4--