1794 Commits

Author SHA1 Message Date
Lawrence Elitzer (LoLo)
ae0efca897
fix: pin deltalake<1.3.0 for ARM64 Docker builds (#4157)
## Summary
- Pin `deltalake<1.3.0` to fix ARM64 Docker build failures

## Problem
`deltalake` 1.3.0 is missing Linux ARM64 wheels due to a builder OOM
issue on their CI. When pip can't find a wheel, it tries to build from
source, which fails because the Wolfi base image doesn't have a C
compiler (`cc`).

This causes the `unstructured-ingest[delta-table]` install to fail,
breaking the ARM64 Docker image.

https://github.com/delta-io/delta-rs/issues/4041

## Solution
Temporarily pin `deltalake<1.3.0` until:
- deltalake publishes ARM64 wheels for 1.3.0+, OR
- unstructured-ingest adds the pin to its `delta-table` extra

## Test plan
- [ ] ARM64 Docker build succeeds

🤖 Generated with [Claude Code](https://claude.com/claude-code)


---

> [!NOTE]
> Pins a dependency to unblock ARM64 builds and publishes a patch
release.
> 
> - Add `deltalake<1.3.0` to `requirements/ingest/ingest.txt` to avoid
missing Linux ARM64 wheels breaking Docker builds
> - Bump version to `0.18.26` and add corresponding CHANGELOG entry

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
0.18.26
2026-01-05 14:16:39 -06:00
Lawrence Elitzer (LoLo)
9f3d994e31
Remove pdfminer.six version constraint, bump dependencies to address high severity CVEs (#4156)
> [!NOTE]
> Security-focused dependency updates and alignment with new pdfminer
behavior.
> 
> - Remove `pdfminer.six` constraint; bump `pdfminer-six` to `20251230`
and `urllib3` to `2.6.2` across requirement sets, plus assorted minor
dependency bumps
> - Update tests (`test_pdfminer_processing`) to reflect pdfminer’s
hidden OCR text handling; add clarifying docstring in `text_is_embedded`
> - Bump version to `0.18.25` and update `CHANGELOG.md`

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-05 09:58:11 -06:00
utic-renovate[bot]
7f2cb4ce25
fix(deps): Update security updates [SECURITY] (#4154)
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [filelock](https://redirect.github.com/tox-dev/py-filelock) | `==3.20.0` → `==3.20.1` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/filelock/3.20.1?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/filelock/3.20.0/3.20.1?slim=true) |
| [marshmallow](https://redirect.github.com/marshmallow-code/marshmallow) ([changelog](https://marshmallow.readthedocs.io/en/latest/changelog.html)) | `==3.26.1` → `==3.26.2` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/marshmallow/3.26.2?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/marshmallow/3.26.1/3.26.2?slim=true) |
| [pypdf](https://redirect.github.com/py-pdf/pypdf) ([changelog](https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html)) | `==6.3.0` → `==6.4.0` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/pypdf/6.4.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/pypdf/6.3.0/6.4.0?slim=true) |
| [urllib3](https://redirect.github.com/urllib3/urllib3) ([changelog](https://redirect.github.com/urllib3/urllib3/blob/main/CHANGES.rst)) | `==2.5.0` → `==2.6.0` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/urllib3/2.6.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/urllib3/2.5.0/2.6.0?slim=true) |

### GitHub Vulnerability Alerts

#### [CVE-2025-68146](https://redirect.github.com/tox-dev/filelock/security/advisories/GHSA-w853-jp5j-5j7f)

### Impact

A Time-of-Check-Time-of-Use (TOCTOU) race condition allows local
attackers to corrupt or truncate arbitrary user files through symlink
attacks. The vulnerability exists in both Unix and Windows lock file
creation where filelock checks if a file exists before opening it with
O_TRUNC. An attacker can create a symlink pointing to a victim file in
the time gap between the check and open, causing os.open() to follow the
symlink and truncate the target file.

**Who is impacted:**

All users of filelock on Unix, Linux, macOS, and Windows systems. The
vulnerability cascades to dependent libraries:

- **virtualenv users**: Configuration files can be overwritten with
virtualenv metadata, leaking sensitive paths
- **PyTorch users**: CPU ISA cache or model checkpoints can be
corrupted, causing crashes or ML pipeline failures
- **poetry/tox users**: through using virtualenv or filelock on their
own.

Attack requires local filesystem access and ability to create symlinks
(standard user permissions on Unix; Developer Mode on Windows 10+).
Exploitation succeeds within 1-3 attempts when lock file paths are
predictable.

### Patches

Fixed in version **3.20.1**.

**Unix/Linux/macOS fix:** Added O_NOFOLLOW flag to os.open() in
UnixFileLock._acquire() to prevent symlink following.

**Windows fix:** Added GetFileAttributesW API check to detect reparse
points (symlinks/junctions) before opening files in
WindowsFileLock._acquire().
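
For illustration only (not the exact upstream patch), the Unix-side hardening amounts to opening the lock file with `O_NOFOLLOW`, so a symlink planted at the lock path makes `os.open()` fail with `ELOOP` instead of truncating the symlink's target:

```python
import os

def open_lock_fd(lock_file: str) -> int:
    # Hypothetical helper: O_NOFOLLOW makes os.open() raise OSError (ELOOP)
    # when lock_file is a symlink, so an attacker-planted link can no longer
    # redirect the O_TRUNC onto a victim file.
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC | os.O_NOFOLLOW
    return os.open(lock_file, flags, 0o644)
```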

**Users should upgrade to filelock 3.20.1 or later immediately.**

### Workarounds

If immediate upgrade is not possible:

1. Use SoftFileLock instead of UnixFileLock/WindowsFileLock (note:
different locking semantics, may not be suitable for all use cases; a
usage sketch follows below)
2. Ensure lock file directories have restrictive permissions (chmod
0700) to prevent untrusted users from creating symlinks
3. Monitor lock file directories for suspicious symlinks before running
trusted applications

**Warning:** These workarounds provide only partial mitigation. The race
condition remains exploitable. Upgrading to version 3.20.1 is strongly
recommended.
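
For workaround 1, the swap is a one-liner; a minimal sketch (keeping in mind the different semantics noted above):

```python
from filelock import SoftFileLock

# SoftFileLock acquires by creating the lock file (and failing if it already
# exists) rather than truncating whatever sits at the lock path, at the cost
# of stale lock files surviving a crashed process.
lock = SoftFileLock("/tmp/myapp.lock")
with lock:
    ...  # critical section
```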

______________________________________________________________________

## Technical Details: How the Exploit Works

### The Vulnerable Code Pattern

**Unix/Linux/macOS** (`src/filelock/_unix.py:39-44`):

```python
def _acquire(self) -> None:
    ensure_directory_exists(self.lock_file)
    open_flags = os.O_RDWR | os.O_TRUNC  # (1) Prepare to truncate
    if not Path(self.lock_file).exists():  # (2) CHECK: Does file exist?
        open_flags |= os.O_CREAT
    fd = os.open(self.lock_file, open_flags, ...)  # (3) USE: Open and truncate
```

**Windows** (`src/filelock/_windows.py:19-28`):

```python
def _acquire(self) -> None:
    raise_on_not_writable_file(self.lock_file)  # (1) Check writability
    ensure_directory_exists(self.lock_file)
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC  # (2) Prepare to truncate
    fd = os.open(self.lock_file, flags, ...)  # (3) Open and truncate
```

### The Race Window

The vulnerability exists in the gap between operations:

**Unix variant:**

```
Time    Victim Thread                          Attacker Thread
----    -------------                          ---------------
T0      Check: lock_file exists? → False
T1                                             ↓ RACE WINDOW
T2                                             Create symlink: lock → victim_file
T3      Open lock_file with O_TRUNC
        → Follows symlink
        → Opens victim_file
        → Truncates victim_file to 0 bytes! ☠️
```

**Windows variant:**

```
Time    Victim Thread                          Attacker Thread
----    -------------                          ---------------
T0      Check: lock_file writable?
T1                                             ↓ RACE WINDOW
T2                                             Create symlink: lock → victim_file
T3      Open lock_file with O_TRUNC
        → Follows symlink/junction
        → Opens victim_file
        → Truncates victim_file to 0 bytes! ☠️
```

### Step-by-Step Attack Flow

**1. Attacker Setup:**

```python
# Attacker identifies target application using filelock
lock_path = "/tmp/myapp.lock"  # Predictable lock path
victim_file = "/home/victim/.ssh/config"  # High-value target
```

**2. Attacker Creates Race Condition:**

```python
import os
import threading

def attacker_thread():
    # Remove any existing lock file
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        pass

    # Create symlink pointing to victim file
    os.symlink(victim_file, lock_path)
    print(f"[Attacker] Created: {lock_path} → {victim_file}")

# Launch attack
threading.Thread(target=attacker_thread).start()
```

**3. Victim Application Runs:**

```python
from filelock import UnixFileLock

# Normal application code
lock = UnixFileLock("/tmp/myapp.lock")
lock.acquire()  # ← VULNERABILITY TRIGGERED HERE

# At this point, /home/victim/.ssh/config is now 0 bytes!
```

**4. What Happens Inside os.open():**

On Unix systems, when `os.open()` is called:

```c
// Linux kernel behavior (simplified)
int open(const char *pathname, int flags) {
    struct file *f = path_lookup(pathname);  // Resolves symlinks by default!

    if (flags & O_TRUNC) {
        truncate_file(f);  // ← Truncates the TARGET of the symlink
    }

    return file_descriptor;
}
```

Without `O_NOFOLLOW` flag, the kernel follows the symlink and truncates
the target file.

### Why the Attack Succeeds Reliably

**Timing Characteristics:**

- **Check operation** (Path.exists()): ~100-500 nanoseconds
- **Symlink creation** (os.symlink()): ~1-10 microseconds
- **Race window**: ~1-5 microseconds (very small but exploitable)
- **Thread scheduling quantum**: ~1-10 milliseconds

**Success factors:**

1. **Tight loop**: Running attack in a loop hits the race window within
1-3 attempts
2. **CPU scheduling**: Modern OS thread schedulers frequently
context-switch during I/O operations
3. **No synchronization**: No atomic file creation prevents the race
4. **Symlink speed**: Creating symlinks is extremely fast (metadata-only
operation)

### Real-World Attack Scenarios

**Scenario 1: virtualenv Exploitation**

```python
# Victim runs: python -m venv /tmp/myenv
# Attacker racing to create:
os.symlink("/home/victim/.bashrc", "/tmp/myenv/pyvenv.cfg")

# Result: /home/victim/.bashrc overwritten with:
# home = /usr/bin/python3
# include-system-site-packages = false
# version = 3.11.2
# ← Original .bashrc contents LOST + virtualenv metadata LEAKED to attacker
```

**Scenario 2: PyTorch Cache Poisoning**

```python
# Victim runs: import torch
# PyTorch checks CPU capabilities, uses filelock on cache
# Attacker racing to create:
os.symlink("/home/victim/.torch/compiled_model.pt", "/home/victim/.cache/torch/cpu_isa_check.lock")

# Result: Trained ML model checkpoint truncated to 0 bytes
# Impact: Weeks of training lost, ML pipeline DoS
```

### Why Standard Defenses Don't Help

**File permissions don't prevent this:**

- Attacker doesn't need write access to victim_file
- os.open() with O_TRUNC follows symlinks using the *victim's*
permissions
- The victim process truncates its own file

**Directory permissions help but aren't always feasible:**

- Lock files often created in shared /tmp directory (mode 1777)
- Applications may not control lock file location
- Many apps use predictable paths in user-writable directories

**File locking doesn't prevent this:**

- The truncation happens *during* the open() call, before any lock is
acquired
- fcntl.flock() only prevents concurrent lock acquisition, not symlink
attacks

### Exploitation Proof-of-Concept Results

From empirical testing with the provided PoCs:

**Simple Direct Attack** (`filelock_simple_poc.py`):

- Success rate: 33% per attempt (1 in 3 tries)
- Average attempts to success: 2.1
- Target file reduced to 0 bytes in \<100ms

**virtualenv Attack** (`weaponized_virtualenv.py`):

- Success rate: ~90% on first attempt (deterministic timing)
- Information leaked: File paths, Python version, system configuration
- Data corruption: Complete loss of original file contents

**PyTorch Attack** (`weaponized_pytorch.py`):

- Success rate: 25-40% per attempt
- Impact: Application crashes, model loading failures
- Recovery: Requires cache rebuild or model retraining

**Discovered and reported by:** George Tsigourakos (@tsigouris007)

#### [CVE-2025-68480](https://redirect.github.com/marshmallow-code/marshmallow/security/advisories/GHSA-428g-f7cq-pgp5)

### Impact

`Schema.load(data, many=True)` is vulnerable to denial of service
attacks. A moderately sized request can consume a disproportionate
amount of CPU time.

### Patches

4.1.2, 3.26.2

### Workarounds

```py
from marshmallow import ValidationError

# Fail fast
def load_many(schema, data, **kwargs):
    if not isinstance(data, list):
        raise ValidationError(['Invalid input type.'])
    return [schema.load(item, **kwargs) for item in data]
```

#### [CVE-2025-66019](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j)

### Impact

An attacker who uses this vulnerability can craft a PDF which leads to a
memory usage of up to 1 GB per stream. This requires parsing the content
stream of a page using the LZWDecode filter.

This is a follow up to
[GHSA-jfx9-29x2-rv3j](https://redirect.github.com/py-pdf/pypdf/security/advisories/GHSA-jfx9-29x2-rv3j)
to align the default limit with the one for *zlib*.

### Patches
This has been fixed in
[pypdf==6.4.0](https://redirect.github.com/py-pdf/pypdf/releases/tag/6.4.0).

### Workarounds
If users cannot upgrade yet, use the line below to overwrite the default
in their code:

```python
import pypdf.filters

pypdf.filters.LZW_MAX_OUTPUT_LENGTH = 75_000_000
```

#### [CVE-2025-66418](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53)

## Impact

urllib3 supports chained HTTP encoding algorithms for response content
according to RFC 9110 (e.g., `Content-Encoding: gzip, zstd`).

However, the number of links in the decompression chain was unbounded
allowing a malicious server to insert a virtually unlimited number of
compression steps leading to high CPU usage and massive memory
allocation for the decompressed data.

## Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier for
HTTP requests to untrusted sources unless they disable content decoding
explicitly.

## Remediation

Upgrade to at least urllib3 v2.6.0 in which the library limits the
number of links to 5.

If upgrading is not immediately possible, use
[`preload_content=False`](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o)
and ensure that `resp.headers["content-encoding"]` contains a safe
number of encodings before reading the response content.
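
Taking the `preload_content=False` route, a minimal sketch of that check (the limit constant and error handling here are illustrative):

```python
import urllib3

MAX_ENCODINGS = 5  # mirrors the limit urllib3 2.6.0 enforces internally

http = urllib3.PoolManager()
resp = http.request("GET", "https://example.com/", preload_content=False)

# Refuse to decode responses advertising a suspiciously long encoding chain.
raw = resp.headers.get("content-encoding", "")
encodings = [e.strip() for e in raw.split(",") if e.strip()]
if len(encodings) > MAX_ENCODINGS:
    resp.release_conn()
    raise ValueError(f"refusing to decode {len(encodings)} chained encodings")

body = resp.read()
```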

#### [CVE-2025-66471](https://redirect.github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37)

### Impact

urllib3's [streaming
API](https://urllib3.readthedocs.io/en/2.5.0/advanced-usage.html#streaming-and-i-o)
is designed for the efficient handling of large HTTP responses by
reading the content in chunks, rather than loading the entire response
body into memory at once.

When streaming a compressed response, urllib3 can perform decoding or
decompression based on the HTTP `Content-Encoding` header (e.g., `gzip`,
`deflate`, `br`, or `zstd`). The library must read compressed data from
the network and decompress it until the requested chunk size is met. Any
resulting decompressed data that exceeds the requested amount is held in
an internal buffer for the next read operation.

The decompression logic could cause urllib3 to fully decode a small
amount of highly compressed data in a single operation. This can result
in excessive resource consumption (high CPU usage and massive memory
allocation for the decompressed data; CWE-409) on the client side, even
if the application only requested a small chunk of data.

### Affected usages

Applications and libraries using urllib3 version 2.5.0 and earlier to
stream large compressed responses or content from untrusted sources.

`stream()`, `read(amt=256)`, `read1(amt=256)`, `read_chunked(amt=256)`,
`readinto(b)` are examples of `urllib3.HTTPResponse` method calls using
the affected logic unless decoding is disabled explicitly.

### Remediation

Upgrade to at least urllib3 v2.6.0 in which the library avoids
decompressing data that exceeds the requested amount.

If your environment contains a package facilitating the Brotli encoding,
upgrade to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 too. These
versions are enforced by the `urllib3[brotli]` extra in the patched
versions of urllib3.

### Credits

The issue was reported by @Cycloctane.
Supplemental information was provided by @stamparm during a security
audit performed by [7ASecurity](https://7asecurity.com/) and
facilitated by [OSTIF](https://ostif.org/).

---

### Release Notes

<details>
<summary>tox-dev/py-filelock (filelock)</summary>

### [`v3.20.1`](https://redirect.github.com/tox-dev/filelock/releases/tag/3.20.1)

[Compare Source](https://redirect.github.com/tox-dev/py-filelock/compare/3.20.0...3.20.1)


##### What's Changed

- CVE-2025-68146: Fix TOCTOU symlink vulnerability in lock file creation
by [@gaborbernat](https://redirect.github.com/gaborbernat) in
[tox-dev/filelock#461](https://redirect.github.com/tox-dev/filelock/pull/461)

**Full Changelog**:
<https://github.com/tox-dev/filelock/compare/3.20.0...3.20.1>

</details>

<details>
<summary>marshmallow-code/marshmallow (marshmallow)</summary>

### [`v3.26.2`](https://redirect.github.com/marshmallow-code/marshmallow/blob/HEAD/CHANGELOG.rst#3262-2025-12-19)

[Compare Source](https://redirect.github.com/marshmallow-code/marshmallow/compare/3.26.1...3.26.2)

Bug fixes:

- CVE-2025-68480: Merge error store messages without rebuilding
collections. Thanks 카푸치노 for reporting and @deckar01 for the fix.

</details>

<details>
<summary>py-pdf/pypdf (pypdf)</summary>

### [`v6.4.0`](https://redirect.github.com/py-pdf/pypdf/blob/HEAD/CHANGELOG.md#Version-641-2025-12-07)

[Compare Source](https://redirect.github.com/py-pdf/pypdf/compare/6.3.0...6.4.0)

##### Performance Improvements (PI)

- Optimize loop for layout mode text extraction ([#3543](https://redirect.github.com/py-pdf/pypdf/issues/3543))

##### Bug Fixes (BUG)

- Do not fail on choice field without /Opt key ([#3540](https://redirect.github.com/py-pdf/pypdf/issues/3540))

##### Documentation (DOC)

- Document possible issues with merge_page and clipping ([#3546](https://redirect.github.com/py-pdf/pypdf/issues/3546))
- Add some notes about library security ([#3545](https://redirect.github.com/py-pdf/pypdf/issues/3545))

##### Maintenance (MAINT)

- Use CORE_FONT_METRICS for widths where possible ([#3526](https://redirect.github.com/py-pdf/pypdf/issues/3526))

[Full Changelog](https://redirect.github.com/py-pdf/pypdf/compare/6.4.0...6.4.1)

</details>

<details>
<summary>urllib3/urllib3 (urllib3)</summary>

### [`v2.6.0`](https://redirect.github.com/urllib3/urllib3/blob/HEAD/CHANGES.rst#260-2025-12-05)

[Compare Source](https://redirect.github.com/urllib3/urllib3/compare/2.5.0...2.6.0)

## Security

- Fixed a security issue where the streaming API could improperly handle
highly compressed HTTP content ("decompression bombs") leading to
excessive resource consumption even when a small amount of data was
requested. Reading small chunks of compressed data is safer and much
more efficient now.
([GHSA-2xpw-w6gg-jr37](https://github.com/urllib3/urllib3/security/advisories/GHSA-2xpw-w6gg-jr37))
- Fixed a security issue where an attacker could compose an HTTP
response with virtually unlimited links in the `Content-Encoding`
header, potentially leading to a denial of service (DoS) attack by
exhausting system resources during decoding. The number of allowed
chained encodings is now limited to 5.
([GHSA-gm62-xv2j-4w53](https://github.com/urllib3/urllib3/security/advisories/GHSA-gm62-xv2j-4w53))

> [!CAUTION]
> - If urllib3 is not installed with the optional `urllib3[brotli]` extra,
> but your environment contains a Brotli/brotlicffi/brotlipy package anyway,
> make sure to upgrade it to at least Brotli 1.2.0 or brotlicffi 1.2.0.0 to
> benefit from the security fixes and avoid warnings. Prefer using
> `urllib3[brotli]` to install a compatible Brotli package automatically.
> - If you use custom decompressors, please make sure to update them to
> respect the changed API of `urllib3.response.ContentDecoder`.

## Features

- Enabled retrieval, deletion, and membership testing in
`HTTPHeaderDict` using bytes keys.
([#3653](https://github.com/urllib3/urllib3/issues/3653))
- Added host and port information to string representations of
`HTTPConnection`.
([#3666](https://github.com/urllib3/urllib3/issues/3666))
- Added support for Python 3.14 free-threading builds explicitly.
([#3696](https://github.com/urllib3/urllib3/issues/3696))

## Removals

- Removed the `HTTPResponse.getheaders()` method in favor of
`HTTPResponse.headers`.
Removed the `HTTPResponse.getheader(name, default)` method in favor of
`HTTPResponse.headers.get(name, default)`.
([#3622](https://github.com/urllib3/urllib3/issues/3622))

## Bugfixes

- Fixed redirect handling in `urllib3.PoolManager` when an integer is
passed for the retries parameter.
([#3649](https://github.com/urllib3/urllib3/issues/3649))
- Fixed `HTTPConnectionPool` when used in Emscripten with no explicit
port. ([#3664](https://github.com/urllib3/urllib3/issues/3664))
- Fixed handling of `SSLKEYLOGFILE` with expandable variables.
([#3700](https://github.com/urllib3/urllib3/issues/3700))

## Misc

- Changed the `zstd` extra to install `backports.zstd` instead of
`zstandard` on Python 3.13 and before.
([#3693](https://github.com/urllib3/urllib3/issues/3693))
- Improved the performance of content decoding by optimizing the
`BytesQueueBuffer` class.
([#3710](https://github.com/urllib3/urllib3/issues/3710))
- Allowed building the urllib3 package with newer setuptools-scm v9.x.
([#3652](https://github.com/urllib3/urllib3/issues/3652))
- Ensured successful urllib3 builds by setting the Hatchling requirement
to >= 1.27.0. ([#3638](https://github.com/urllib3/urllib3/issues/3638))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get
[config
help](https://redirect.github.com/renovatebot/renovate/discussions) if
that's undesired.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Renovate
Bot](https://redirect.github.com/renovatebot/renovate).


Co-authored-by: utic-renovate[bot] <235200891+utic-renovate[bot]@users.noreply.github.com>
0.18.24
2025-12-24 23:03:32 +00:00
Lawrence Elitzer (LoLo)
eb15cd721b
fix: use vulnerabilityAlerts config instead of invalid packageRule matcher (#4147)

Move postUpgradeTasks from packageRules to vulnerabilityAlerts object.
The matchIsVulnerabilityAlert option doesn't exist in Renovate's schema.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> Shifts Renovate config to correctly trigger version bump tasks on
security alerts.
> 
> - Removes `packageRules` with non-existent `matchIsVulnerabilityAlert`
> - Adds `vulnerabilityAlerts.postUpgradeTasks` to run
`scripts/renovate-security-bump.sh` with specified `fileFilters` and
`executionMode: branch`

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 15:23:52 -06:00
Lawrence Elitzer (LoLo)
100f6f5700
feat: add automatic version bumping for security updates (#4145)
Add version bumping script and enable postUpgradeTasks for Python
security updates via Renovate.

Changes:
- Add scripts/renovate-security-bump.sh from renovate-config repo
- Configure postUpgradeTasks in renovate.json5 to run the script
- Script automatically bumps version and updates CHANGELOG on security
fixes

When Renovate creates a Python security update PR, it will now:
1. Detect changed dependencies
2. Bump patch version (or release current -dev version; see the sketch
after this list)
3. Add security fix entry to CHANGELOG.md
4. Include version and CHANGELOG changes in the PR
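
The script itself is shell; as a rough Python sketch of just the version-bump rule it applies (release a `-dev` version as-is, otherwise increment the patch component; the function name is illustrative):

```python
def next_version(current: str) -> str:
    # "0.18.24-dev3" -> "0.18.24": release the in-progress dev version.
    if "-dev" in current:
        return current.split("-dev")[0]
    # "0.18.24" -> "0.18.25": bump the patch component.
    major, minor, patch = current.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"

assert next_version("0.18.24-dev3") == "0.18.24"
assert next_version("0.18.24") == "0.18.25"
```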

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> Automates release housekeeping for Python security updates via
Renovate.
> 
> - Adds `scripts/renovate-security-bump.sh` to bump
`unstructured/__version__.py` (strip `-dev` or increment patch), detect
changed dependencies, and append a security entry to `CHANGELOG.md`
> - Updates `renovate.json5` to run the script as a `postUpgradeTasks`
step for `pypi` vulnerability alerts on the PR branch

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
2025-12-24 14:59:36 -06:00
Lawrence Elitzer (LoLo)
ce1ea7b9f7
feat: switch to renovate (#4131)
> [!NOTE]
> Migrates automated dependency updates from Dependabot to Renovate.
> 
> - Removes `.github/dependabot.yml`
> - Adds `renovate.json5` extending
`github>unstructured-io/renovate-config` to manage updates via Renovate
2025-12-24 11:18:14 -06:00
Saurabh Misra
1f32cdaa08
⚡️ Speed up method OCRAgentTesseract.extract_word_from_hocr by 35% (#4130)
Saurabh's comments - This looks like a good, easy, straightforward, and
impactful optimization
#### 📄 35% (0.35x) speedup for
***`OCRAgentTesseract.extract_word_from_hocr` in
`unstructured/partition/utils/ocr_models/tesseract_ocr.py`***

⏱️ Runtime : **`7.18 milliseconds`** **→** **`5.31 milliseconds`** (best
of `13` runs)

#### 📝 Explanation and details


The optimized code achieves a **35% speedup** through two key
performance improvements:

**1. Regex Precompilation**
The original code calls `re.search(r"x_conf (\d+\.\d+)", char_title)`
inside the loop, recompiling the regex pattern on every iteration. The
optimization moves this to module level as `_RE_X_CONF =
re.compile(r"x_conf (\d+\.\d+)")`, compiling it once at import time. The
line profiler shows the regex search time improved from 12.73ms (42.9%
of total time) to 3.02ms (16.2% of total time) - a **76% reduction** in
regex overhead.

**2. Efficient String Building**
The original code uses string concatenation (`word_text += char`) which
creates a new string object each time due to Python's immutable strings.
With 6,339 character additions in the profiled run, this becomes
expensive. The optimization collects characters in a list
(`chars.append(char)`) and builds the final string once with
`"".join(chars)`. This reduces the character accumulation overhead from
1.52ms to 1.58ms for appends plus a single 46μs join operation.

**Performance Impact**
These optimizations are particularly effective for OCR processing where:
- The same regex pattern is applied thousands of times per document
- Words contain multiple characters that need accumulation
- The function is likely called frequently during document processing

The 35% speedup directly translates to faster document processing in OCR
workflows, with the most significant gains occurring when processing
documents with many detected characters that pass the confidence
threshold.
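
As a hedged sketch (not the exact Codeflash patch), the two changes follow this pattern, where `char_spans` and `extract_word` are hypothetical stand-ins for the data the real method iterates over:

```python
import re

# Compiled once at import time instead of on every call inside the loop.
_RE_X_CONF = re.compile(r"x_conf (\d+\.\d+)")

def extract_word(char_spans, confidence_threshold=80.0):
    """Illustrative only: char_spans is a sequence of
    (character, hOCR title attribute) pairs."""
    chars = []  # accumulate in a list; join once at the end
    for char, char_title in char_spans:
        match = _RE_X_CONF.search(char_title)
        if match and float(match.group(1)) >= confidence_threshold:
            chars.append(char)
    return "".join(chars)

print(extract_word([("H", "x_conf 96.0"), ("i", "x_conf 95.2"), ("?", "x_conf 12.3")]))  # "Hi"
```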



 **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests |  **27 Passed** |
| 🌀 Generated Regression Tests |  **22 Passed** |
|  Replay Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage       | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/pdf_image/test_ocr.py::test_extract_word_from_hocr` | 63.2μs | 49.1μs | 28.7% |

</details>



To edit these changes `git checkout
codeflash/optimize-OCRAgentTesseract.extract_word_from_hocr-mjcarjk8`
and push.



---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
2025-12-19 04:44:19 +00:00
Yao You
dce5345370
Feat: is extracted only true when text has trivial amount of invisible text (#4128)
## Summary
This PR modifies the logic that decides whether text is `IsExtracted.TRUE`.
- Text is `IsExtracted.TRUE` when it is embedded text rendered on the
page -> this is useful downstream for judging the quality of the text
vs. the PDF render
- The current logic labels any text found by `pdfminer` as
`IsExtracted.TRUE`
- However, a PDF may be a scanned page whose metadata also contains
OCR-sourced text. That text often has poor quality and is always
invisible, so it is not rendered on the page
- The modified logic checks whether a piece of text found by `pdfminer`
contains a non-trivial amount of invisible text; if it does not, the
text is labeled `IsExtracted.TRUE` (a sketch of this check follows
below)

For example, the newly added test file `pdf-with-ocr-text.pdf` is a
scanned page with one line at the bottom that is truly embedded text
(text rendered on the page). But `pdfminer` finds more text than just
that line, and that extra text only roughly matches the content of the
scanned page. It is not embedded text, since it is not rendered on the
page and is hidden. The updated logic correctly identifies that text as
not `IsExtracted.TRUE`.
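
A hedged sketch of that check (the helper names, predicate, and threshold are illustrative, not the exact implementation):

```python
def has_nontrivial_invisible_text(chars, is_invisible, threshold=0.5):
    """`chars` stands in for the characters pdfminer found for one element;
    `is_invisible` is a predicate such as "text render mode 3 (neither fill
    nor stroke)". Both names are hypothetical."""
    if not chars:
        return False
    invisible = sum(1 for ch in chars if is_invisible(ch))
    return invisible / len(chars) >= threshold

def label_is_extracted(chars, is_invisible):
    # Embedded text rendered on the page -> IsExtracted.TRUE; text dominated
    # by invisible (OCR-sourced) characters is not.
    return not has_nontrivial_invisible_text(chars, is_invisible)
```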

## Pitfall of the current solution
This solution does not handle cases where the text color matches the
background color and the text therefore becomes invisible.
2025-12-16 22:57:19 +00:00
Nick Franck
afd911891b
fix(deps): Bump fonttools to address cve (#4125)
> [!NOTE]
> Constrain fonttools to >=4.60.2 (CVE-2025-66034), bump extras to
4.61.0, switch setup_ingest to ubuntu-latest-m, and release 0.18.22.
> 
> - **Dependencies**:
> - Constrain `fonttools>=4.60.2` in `requirements/deps/constraints.txt`
to address CVE-2025-66034.
> - Bump `fonttools` to `4.61.0` in `requirements/extra-*.txt`; refresh
files via uv and align constraint references.
> - **CI**:
> - Update `setup_ingest` job in `.github/workflows/ci.yml` to run on
`ubuntu-latest-m`.
> - **Release**:
>   - Bump version to `0.18.22` and update `CHANGELOG.md`.
0.18.22
2025-12-10 17:11:53 +00:00
Lawrence Elitzer (LoLo)
91a9888d35
chore: dependency updates to resolve cves (#4124)
> [!NOTE]
> Release 0.18.21 with broad dependency pin updates across requirements
(notably unstructured-inference 1.1.2) to remediate CVEs.
> 
> - **Release**
>   - Set version to `0.18.21` and update `CHANGELOG.md`.
> - **Dependencies**
> - Upgrade `unstructured-inference` to `1.1.2` in
`requirements/extra-pdf-image.txt` to address CVEs.
> - Refresh pins across `requirements/*.txt` (base, dev, test, and
extras), including updates like `certifi`, `click`, `pypdf`, `pypandoc`,
`paddlepaddle`, `torch`/`torchvision`, `google-auth` stack, `protobuf`,
`safetensors`, etc.; normalize pip-compile headers and constraint paths.
0.18.21
2025-11-21 22:16:43 +00:00
qued
6c1bbb379c
test: add check crop box padding to save_elements test (#4123)
Updated `save_elements` test to check the behavior of the environment
variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` that pad the crop box for image
extraction.

---

> [!NOTE]
> Enhances save_elements tests to validate crop-box padding via env vars
and image dimensions for both payload and file outputs; bumps version
and updates changelog.
> 
> - **Tests (pdf_image_utils)**:
> - `test_save_elements` now parametrizes
`horizontal_padding`/`vertical_padding`, sets
`EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and
`EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD`, and asserts padded image
dimensions for both `extract_image_block_to_payload` paths (decoding
`image_base64` or reading saved file).
>   - Adds required imports (`base64`, `io`).
> - **Versioning**:
>   - Update `unstructured/__version__.py` to `0.18.21-dev0`.
>   - Add CHANGELOG entry noting the unit test enhancement.
2025-11-18 22:01:50 +00:00
cragwolfe
7c4d0b984b
chore: minor CHANGELOG.md update (#4122)
last release actually should have been 0.18.19. let's skip it and just
fix the CHANGELOG
0.18.20
2025-11-14 16:12:40 -08:00
fzowl
063e6cee63
(feat) VoyageAI integration improvements (#4109)
Supporting VoyageAI's contextual model
Counting tokens and creating efficient batches
Documentation change: https://github.com/Unstructured-IO/docs/pull/790
2025-11-14 16:03:03 -08:00
qued
15253a53ea
feat: track text source (#4112)
The purpose of this PR is to use the newly created `is_extracted`
parameter in `TextRegion` (and the corresponding vector version
`is_extracted_array` in `TextRegions`), flagging elements that were
extracted directly from PDFs as such.

This also involved:
- New tests
- A version update to bring in the new `unstructured-inference`
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review is that all avenues by which an element is
extracted and ends up in the output of a partition are covered... fast,
hi_res, etc.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: luke-kucing <luke@unstructured.io>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: qued <qued@users.noreply.github.com>
2025-11-11 19:04:01 +00:00
luke-kucing
b01d35b237
fix: sanitize MSG attachment filenames to prevent path traversal (GHSA-gm8q-m8mv-jj5m) (#4117)
Summary

Fixes path traversal vulnerability in email and MSG attachment filename
handling (GHSA-gm8q-m8mv-jj5m).

Changes

Security Fix
- Sanitizes attachment filenames in `_AttachmentPartitioner` for both
email.py and msg.py
- Uses `os.path.basename()` to strip path components from filenames
- Normalizes backslashes to forward slashes to handle Windows paths on
Unix systems
- Removes null bytes and other control characters
- Handles edge cases (empty strings, ".", "..")
- Defaults to "unknown" for invalid or dangerous filenames
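
A minimal sketch of that sanitization (hypothetical helper name; the real logic lives in `_AttachmentPartitioner`):

```python
import os
import re

def sanitize_attachment_filename(filename: str | None) -> str:
    """Illustrative only: strip path components and control characters."""
    if not filename:
        return "unknown"
    # Normalize Windows separators so basename() also strips them on Unix.
    name = filename.replace("\\", "/")
    name = os.path.basename(name)
    # Drop null bytes and other control characters.
    name = re.sub(r"[\x00-\x1f\x7f]", "", name)
    if name in ("", ".", ".."):
        return "unknown"
    return name

assert sanitize_attachment_filename("../../../etc/passwd") == "passwd"
assert sanitize_attachment_filename("C:\\Windows\\System32\\config\\sam") == "sam"
assert sanitize_attachment_filename("file\x00.txt") == "file.txt"
assert sanitize_attachment_filename("..") == "unknown"
```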
Test Coverage

Added 17 comprehensive tests covering:
- Path traversal attempts (`../../../etc/passwd`)
- Absolute Unix paths (`/etc/passwd`)
- Absolute Windows paths (`C:\Windows\System32\config\sam`)
- Null byte injection (`file\x00.txt`)
- Dot and dotdot filenames (`.` and `..`)
- Missing/empty filenames
- Complex mixed path separators
- Valid filenames (ensuring they pass through unchanged)

Test Results
- All 17 new security tests pass
- All 129 existing tests pass
- No regressions

Security Impact

Prevents attackers from using malicious attachment filenames to write
files outside the intended directory, which could lead to arbitrary file
write vulnerabilities.

Changes include comprehensive test coverage for various attack vectors
and a version bump to 0.18.18.

---------

Co-authored-by: Claude <noreply@anthropic.com>
0.18.18
2025-11-06 23:14:56 +00:00
luke-kucing
1c519efef5
Security Fixes - CVE Remediation (#4115)
Main Changes:

1. Removed Clarifai Dependency
- Completely removed the clarifai dependency which is no longer used in
the codebase
- Removed clarifai from the unstructured-ingest extras list in
requirements/ingest/ingest.txt:1
- Removed clarifai test script reference from
test_unstructured_ingest/test-ingest-dest.sh:23

2. Updated Dependencies to Resolve CVEs
- pypdf: Updated from 6.1.1 → 6.1.3 (fixes GHSA-vr63-x8vc-m265)
- pip: Added explicit upgrade to >=25.3 in Dockerfile (fixes
GHSA-4xh5-x5gv-qwph)
- uv: Addressed GHSA-8qf3-x8v5-2pj8 and GHSA-pqhf-p39g-3x64

3. Dockerfile Security Enhancements (Dockerfile:17,28-29)
- Added Alpine package upgrade for py3.12-pip
- Added explicit pip upgrade step before installing Python dependencies

4. General Dependency Updates
Ran pip-compile across all requirement files, resulting in updates to:
- cryptography: 46.0.2 → 46.0.3
- psutil: 7.1.0 → 7.1.3
- rapidfuzz: 3.14.1 → 3.14.3
- regex: 2025.9.18 → 2025.11.3
- wrapt: 1.17.3 → 2.0.0
- Plus many other transitive dependencies across all extra requirement
files

5. Version Bump
- Updated version from 0.18.16 → 0.18.17 in
unstructured/__version__.py:1
- Updated CHANGELOG.md with security fixes documentation

Impact:

This PR resolves 4 CVEs total without introducing breaking changes,
making it a pure security maintenance release.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-06 21:01:32 +00:00
luke-kucing
c79cf3a9f6
updated dependencies to resolve open CVEs and cut a new version (#4108)
Summary

Version bump to 0.18.16 - Security patch release

Changes

Security Fixes: Updated multiple dependencies via pip-compile to resolve
critical CVEs:
- authlib: GHSA-pq5p-34cr-23v9
- python-3.12/python-3.12-base: CVE-2025-8291, GHSA-49g5-f6qw-8mm7
- libcrypto3/libssl3: CVE-2025-9230, CVE-2025-9231, CVE-2025-9232,
GHSA-76r2-c3cg-f5r9, GHSA-9mrx-mqmg-gwj9

Enhancement: Speed up function _assign_hash_ids by 34% (codeflash)

Files Changed (13 files, +104/-92 lines)

- unstructured/__version__.py - Version bumped to 0.18.16
- CHANGELOG.md - Added release notes
- All requirement files updated with new dependency versions:
  - requirements/base.txt
  - requirements/dev.txt
  - requirements/extra-*.txt (csv, docx, odt, paddleocr, pdf-image,
pptx, xlsx)
  - requirements/huggingface.txt
  - requirements/test.txt

This is a security-focused patch release that addresses multiple CVEs
while also including a performance enhancement.
2025-10-15 17:58:08 +00:00
qued
8fd07fd9f6
feat: Add simple script to sync fork with local branch (#4102)
#### Testing:

From the base folder of this repo, run:
```bash
./scripts/sync_fork.sh git@github.com:aseembits93/unstructured.git optimize-_assign_hash_ids-memtfran
```
Check to make sure the only remote is `origin` with:
```bash
git remote
```
Check the diff from `main` with:
```bash
git diff main
```
2025-09-26 16:36:56 +00:00
qued
ef68384985
enhancement: Speed up function _assign_hash_ids by 34% (#4101)
In-repo duplicate of #4089.

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
2025-09-25 20:49:15 +00:00
luke-kucing
2d44d73a88
Luke/sept16 CVE (#4094)
Dependency bump and version bump, mainly to resolve the critical CVE in deepdiff

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
0.18.15
2025-09-17 01:33:53 +00:00
Aseem Saxena
ab55d86f5c
⚡️ Speed up method ElementHtml._get_children_html by 234% (#4087)
### 📄 234% (2.34x) speedup for ***`ElementHtml._get_children_html` in
`unstructured/partition/html/convert.py`***

⏱️ Runtime : **`12.3 milliseconds`** **→** **`3.69 milliseconds`** (best
of `101` runs)
### 📝 Explanation and details

Here is a **faster rewrite** of your program, based on your line
profiling results, the imported code constraints, and the code logic.

### Key optimizations

- **Avoid repeated parsing:** The hotspot is in recursive calls to
`child.get_html_element(**kwargs)`, each of which is re-creating a new
`BeautifulSoup` object in every call.
Solution: **Pass down and reuse a single `BeautifulSoup` instance** when
building child HTML elements.
- **Minimize object creation:** Create `soup` once at the *topmost* call
and reuse for all children and subchildren.
- **Reduce .get_text_as_html use:** Optimize to only use the soup
instance when really necessary and avoid repeated blank parses.
- **Avoid double wrapping:** Only allocate wrappers and new tags if
absolutely required.
- **General micro-optimizations:** Use `None` instead of `or []`,
fast-path checks on empty children, etc.
- **Preserve all comments and signatures as specified.**

Below is the optimized version.



### Explanation of improvements

- **Soup passing**: The `get_html_element` method now optionally
receives a `_soup` kwarg. At the top of the tree, it is `None`, so a new
one is created. Then, for all descendants, the same `soup` instance is
passed via `_soup`, avoiding repeated parsing and allocation.
- **Children check**: `self.children` is checked once, and the attribute
itself is kept as a list (not or-ed with empty list at every call).
- **No unnecessary soup parsing**: `get_text_as_html()` doesn't need a
soup argument, since it only returns a Tag (from the parent module).
- **No changes to existing comments, new comments added only where logic
was changed.**
- **Behavior (output and signature) preserved.**

This **avoids creating thousands of BeautifulSoup objects recursively**,
which was the primary bottleneck found in the profiler. The result is
vastly improved performance, especially for large/complex trees.


 **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests |  **768 Passed** |
|  Replay Tests |  **1 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage       | 100.0% |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from abc import ABC
from typing import Any, List, Optional, Union

# imports
import pytest  # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml

# --- Minimal stubs for dependencies ---

class Metadata:
    def __init__(self, text_as_html: Optional[str] = None):
        self.text_as_html = text_as_html

class Element:
    def __init__(self, text="", category="default", id="0", metadata=None):
        self.text = text
        self.category = category
        self.id = id
        self.metadata = metadata or Metadata()

# --- The function and class under test ---

HTML_PARSER = "html.parser"

# --- Test helpers ---

class DummyElementHtml(ElementHtml):
    """A concrete subclass for testing, with optional custom tag."""
    def __init__(self, element, children=None, html_tag="div"):
        super().__init__(element, children)
        self._html_tag = html_tag

# --- Unit tests for _get_children_html ---

@pytest.fixture
def soup():
    # Fixture for a BeautifulSoup object
    return BeautifulSoup("", HTML_PARSER)

def make_tag(soup, name, text=None, **attrs):
    tag = soup.new_tag(name)
    if text:
        tag.string = text
    for k, v in attrs.items():
        tag[k] = v
    return tag

# 1. BASIC TEST CASES

def test_single_child_basic(soup):
    """Single child: Should wrap parent and child in a div, in order."""
    parent_el = Element("Parent", category="parent", id="p1")
    child_el = Element("Child", category="child", id="c1")
    child = DummyElementHtml(child_el)
    parent = DummyElementHtml(parent_el, children=[child])
    # Prepare the parent tag
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    # Call _get_children_html
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    divs = result.find_all("div", recursive=False)

def test_multiple_children_basic(soup):
    """Multiple children: All children should be appended in order."""
    parent_el = Element("Parent", category="parent", id="p1")
    child1_el = Element("Child1", category="child", id="c1")
    child2_el = Element("Child2", category="child", id="c2")
    child1 = DummyElementHtml(child1_el)
    child2 = DummyElementHtml(child2_el)
    parent = DummyElementHtml(parent_el, children=[child1, child2])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    divs = result.find_all("div", recursive=False)

def test_no_children_returns_parent_wrapped(soup):
    """No children: Should still wrap parent in a div."""
    parent_el = Element("Parent", category="parent", id="p1")
    parent = DummyElementHtml(parent_el, children=[])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    inner_divs = result.find_all("div", recursive=False)

def test_children_with_different_tags(soup):
    """Children with different HTML tags should be preserved."""
    parent_el = Element("Parent", category="parent", id="p1")
    child_el = Element("Child", category="child", id="c1")
    child = DummyElementHtml(child_el, html_tag="span")
    parent = DummyElementHtml(parent_el, children=[child])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output

# 2. EDGE TEST CASES

def test_empty_element_text_and_children(soup):
    """Parent and children have empty text."""
    parent_el = Element("", category="parent", id="p1")
    child_el = Element("", category="child", id="c1")
    child = DummyElementHtml(child_el)
    parent = DummyElementHtml(parent_el, children=[child])
    parent_tag = make_tag(soup, "div", "", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    divs = result.find_all("div", recursive=False)

def test_deeply_nested_children(soup):
    """Test with deep nesting (e.g., 5 levels)."""
    # Build a chain: root -> c1 -> c2 -> c3 -> c4 -> c5
    el = Element("root", category="cat0", id="id0")
    node = DummyElementHtml(el)
    for i in range(1, 6):
        el = Element(f"c{i}", category=f"cat{i}", id=f"id{i}")
        node = DummyElementHtml(el, children=[node])
    # At the top, node is the outermost parent
    parent_tag = make_tag(soup, "div", "c5", **{"class": "cat5", "id": "id5"})
    codeflash_output = node._get_children_html(soup, parent_tag); result = codeflash_output
    # Should have one child at each level
    current = result
    for i in range(6):
        divs = [c for c in current.contents if isinstance(c, Tag)]
        current = divs[0]

def test_html_injection_in_text(soup):
    """Child text that looks like HTML should be escaped, not parsed as HTML."""
    parent_el = Element("Parent", category="parent", id="p1")
    child_el = Element("<b>bold</b>", category="child", id="c1")
    child = DummyElementHtml(child_el)
    parent = DummyElementHtml(parent_el, children=[child])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    # The child div should have literal text, not a <b> tag inside
    child_div = result.find_all("div", recursive=False)[1]

def test_children_with_duplicate_ids(soup):
    """Multiple children with the same id."""
    parent_el = Element("Parent", category="parent", id="p1")
    child1_el = Element("Child1", category="child", id="dup")
    child2_el = Element("Child2", category="child", id="dup")
    child1 = DummyElementHtml(child1_el)
    child2 = DummyElementHtml(child2_el)
    parent = DummyElementHtml(parent_el, children=[child1, child2])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    # Both children should be present, even with duplicate ids
    divs = result.find_all("div", recursive=False)

def test_children_with_none(soup):
    """Children list contains None (should ignore or raise)."""
    parent_el = Element("Parent", category="parent", id="p1")
    child_el = Element("Child", category="child", id="c1")
    child = DummyElementHtml(child_el)
    parent = DummyElementHtml(parent_el, children=[child, None])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    # Should raise AttributeError when trying to call get_html_element on None
    with pytest.raises(AttributeError):
        parent._get_children_html(soup, parent_tag)

# 3. LARGE SCALE TEST CASES

def test_many_children_performance(soup):
    """Test with 500 children: structure and order."""
    parent_el = Element("Parent", category="parent", id="p1")
    children = [DummyElementHtml(Element(f"Child{i}", category="child", id=f"c{i}")) for i in range(500)]
    parent = DummyElementHtml(parent_el, children=children)
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    divs = result.find_all("div", recursive=False)

def test_large_tree_width_and_depth(soup):
    """Test with a tree of width 10 and depth 3 (total 1 + 10 + 100 = 111 nodes)."""
    def make_tree(depth, width):
        if depth == 0:
            return []
        return [
            DummyElementHtml(
                Element(f"Child{depth}_{i}", category="cat", id=f"id{depth}_{i}"),
                children=make_tree(depth-1, width)
            )
            for i in range(width)
        ]
    parent_el = Element("Root", category="root", id="root")
    children = make_tree(2, 10)  # depth=2, width=10 at each node
    parent = DummyElementHtml(parent_el, children=children)
    parent_tag = make_tag(soup, "div", "Root", **{"class": "root", "id": "root"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    # The first level should have 1 parent + 10 children
    divs = result.find_all("div", recursive=False)
    # Each child should have its own children (10 each)
    for child_div in divs[1:]:
        sub_divs = child_div.find_all("div", recursive=False)

def test_large_text_content(soup):
    """Test with a single child with a very large text string."""
    large_text = "A" * 10000
    parent_el = Element("Parent", category="parent", id="p1")
    child_el = Element(large_text, category="child", id="c1")
    child = DummyElementHtml(child_el)
    parent = DummyElementHtml(parent_el, children=[child])
    parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    # The child div should contain the large text exactly
    child_div = result.find_all("div", recursive=False)[1]

def test_children_with_varied_tags_and_attributes(soup):
    """Test children with different tags and extra attributes."""
    parent_el = Element("P", category="parent", id="p")
    child1_el = Element("C1", category="c1", id="c1")
    child2_el = Element("C2", category="c2", id="c2")
    child1 = DummyElementHtml(child1_el, html_tag="section")
    child2 = DummyElementHtml(child2_el, html_tag="article")
    parent = DummyElementHtml(parent_el, children=[child1, child2])
    parent_tag = make_tag(soup, "header", "P", **{"class": "parent", "id": "p"})
    codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from abc import ABC
from typing import Any, Optional, Union

# imports
import pytest  # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml


# Minimal stub for Element and its metadata
class Metadata:
    def __init__(self, text_as_html=None):
        self.text_as_html = text_as_html

class Element:
    def __init__(self, text="", category=None, id=None, metadata=None):
        self.text = text
        self.category = category or "default-category"
        self.id = id or "default-id"
        self.metadata = metadata or Metadata()

HTML_PARSER = "html.parser"

# ---------------------------
# Unit tests for _get_children_html
# ---------------------------

# Helper subclass to expose _get_children_html for testing
class TestElementHtml(ElementHtml):
    def public_get_children_html(self, soup, element_html, **kwargs):
        return self._get_children_html(soup, element_html, **kwargs)

    # Override get_html_element to avoid recursion issues in tests
    def get_html_element(self, **kwargs: Any) -> Tag:
        soup = BeautifulSoup("", HTML_PARSER)
        element_html = self.get_text_as_html()
        if element_html is None:
            element_html = soup.new_tag(name=self.html_tag)
            self._inject_html_element_content(element_html, **kwargs)
        element_html["class"] = self.element.category
        element_html["id"] = self.element.id
        self._inject_html_element_attrs(element_html)
        if self.children:
            return self._get_children_html(soup, element_html, **kwargs)
        return element_html

# ---- BASIC TEST CASES ----

def test_single_child_basic():
    # Test with one parent and one child
    parent_elem = Element(text="Parent", category="parent-cat", id="parent-id")
    child_elem = Element(text="Child", category="child-cat", id="child-id")
    child = TestElementHtml(child_elem)
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    # Call the function
    result = parent.public_get_children_html(soup, parent_html)

def test_multiple_children_basic():
    # Parent with two children
    parent_elem = Element(text="P", category="p-cat", id="p-id")
    child1_elem = Element(text="C1", category="c1-cat", id="c1-id")
    child2_elem = Element(text="C2", category="c2-cat", id="c2-id")
    child1 = TestElementHtml(child1_elem)
    child2 = TestElementHtml(child2_elem)
    parent = TestElementHtml(parent_elem, children=[child1, child2])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_no_children_returns_wrapper_with_only_parent():
    # Parent with no children, should still wrap parent_html in a div
    parent_elem = Element(text="Solo", category="solo-cat", id="solo-id")
    parent = TestElementHtml(parent_elem, children=[])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_children_are_nested():
    # Test with a deeper hierarchy: parent -> child -> grandchild
    grandchild_elem = Element(text="GC", category="gc-cat", id="gc-id")
    grandchild = TestElementHtml(grandchild_elem)
    child_elem = Element(text="C", category="c-cat", id="c-id")
    child = TestElementHtml(child_elem, children=[grandchild])
    parent_elem = Element(text="P", category="p-cat", id="p-id")
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)
    child_div = result.contents[1]
    grandchild_div = child_div.contents[1]

# ---- EDGE TEST CASES ----

def test_empty_text_and_attributes():
    # Parent and child with empty text and missing attributes
    parent_elem = Element(text="", category="", id="")
    child_elem = Element(text="", category="", id="")
    child = TestElementHtml(child_elem)
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_child_with_html_content():
    # Child with HTML in text_as_html, should parse as HTML element
    child_elem = Element(text="ignored", category="cat", id="cid",
                         metadata=Metadata(text_as_html="<span>HTMLChild</span>"))
    child = TestElementHtml(child_elem)
    parent_elem = Element(text="Parent", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)
    child_html = result.contents[1]

def test_parent_with_html_content_and_children():
    # Parent with HTML in text_as_html, children as normal
    parent_elem = Element(text="ignored", category="pcat", id="pid",
                          metadata=Metadata(text_as_html="<h1>Header</h1>"))
    child_elem = Element(text="Child", category="ccat", id="cid")
    child = TestElementHtml(child_elem)
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = parent.get_text_as_html()
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_children_with_duplicate_ids():
    # Children with the same id, should not raise errors, but both ids should be present
    child_elem1 = Element(text="A", category="cat", id="dup")
    child_elem2 = Element(text="B", category="cat", id="dup")
    child1 = TestElementHtml(child_elem1)
    child2 = TestElementHtml(child_elem2)
    parent_elem = Element(text="P", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=[child1, child2])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_children_with_various_html_tags():
    # Children with different html_tag settings
    class CustomElementHtml(TestElementHtml):
        _html_tag = "section"

    child_elem = Element(text="Sec", category="cat", id="cid")
    child = CustomElementHtml(child_elem)
    parent_elem = Element(text="P", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_html_tag_property_override():
    # Test that html_tag property is respected
    class CustomElementHtml(TestElementHtml):
        @property
        def html_tag(self):
            return "article"

    child_elem = Element(text="Art", category="cat", id="cid")
    child = CustomElementHtml(child_elem)
    parent_elem = Element(text="P", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_inject_html_element_attrs_is_called():
    # Test that _inject_html_element_attrs is called (by side effect)
    class AttrElementHtml(TestElementHtml):
        def _inject_html_element_attrs(self, element_html: Tag) -> None:
            element_html["data-test"] = "called"

    child_elem = Element(text="Child", category="cat", id="cid")
    child = AttrElementHtml(child_elem)
    parent_elem = Element(text="P", category="pcat", id="pid")
    parent = AttrElementHtml(parent_elem, children=[child])

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

# ---- LARGE SCALE TEST CASES ----

def test_large_number_of_children():
    # Test with 500 children
    num_children = 500
    children = [TestElementHtml(Element(text=f"Child{i}", category="cat", id=f"id{i}")) for i in range(num_children)]
    parent_elem = Element(text="Parent", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=children)

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)

def test_large_depth_of_nesting():
    # Test with 100 nested single-child levels
    depth = 100
    current = TestElementHtml(Element(text=f"Level{depth}", category="cat", id=f"id{depth}"))
    for i in range(depth-1, 0, -1):
        current = TestElementHtml(Element(text=f"Level{i}", category="cat", id=f"id{i}"), children=[current])
    parent = current

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent.element.text
    parent_html["class"] = parent.element.category
    parent_html["id"] = parent.element.id

    result = parent.public_get_children_html(soup, parent_html)
    # Traverse down the nesting, checking text at each level
    node = result
    for i in range(1, depth+1):
        if len(node.contents) > 1:
            node = node.contents[1]
        else:
            break

def test_large_tree_with_breadth_and_depth():
    # 10 children, each with 10 children (total 1 + 10 + 100 = 111 nodes)
    children = []
    for i in range(10):
        grandchildren = [TestElementHtml(Element(text=f"GC{i}-{j}", category="gcat", id=f"gid{i}-{j}")) for j in range(10)]
        child = TestElementHtml(Element(text=f"C{i}", category="ccat", id=f"cid{i}"), children=grandchildren)
        children.append(child)
    parent_elem = Element(text="P", category="pcat", id="pid")
    parent = TestElementHtml(parent_elem, children=children)

    soup = BeautifulSoup("", HTML_PARSER)
    parent_html = soup.new_tag("div")
    parent_html.string = parent_elem.text
    parent_html["class"] = parent_elem.category
    parent_html["id"] = parent_elem.id

    result = parent.public_get_children_html(soup, parent_html)
    for i, child_div in enumerate(result.contents[1:]):
        for j, gc_div in enumerate(child_div.contents[1:]):
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes, `git checkout codeflash/optimize-ElementHtml._get_children_html-mcsd67co` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <misra.saurabh1@gmail.com>
2025-09-10 07:42:45 +00:00
Aseem Saxena
6aee1311a0
⚡️ Speed up function group_broken_paragraphs by 30% (#4088)
### 📄 30% (0.30x) speedup for ***`group_broken_paragraphs` in
`unstructured/cleaners/core.py`***

⏱️ Runtime : **`21.2 milliseconds`** **→** **`16.3 milliseconds`** (best
of `66` runs)
### 📝 Explanation and details

Here’s an optimized version of your code, preserving all function signatures, return values, and comments.
**Key improvements:**
- **Precompile regexes** once instead of recompiling them inside the functions where they are used repeatedly.
- **Avoid repeated `.strip()` and `.split()` calls** in tight loops by working with already-stripped data directly.
- **Reduce intermediate allocations** (such as unnecessary list comprehensions).
- **Optimize the `all_lines_short` computation** by short-circuiting iteration (`any` with negated logic instead of `all`), as shown in the sketch below.
- Minimize regex replace calls by using direct substitution when possible.
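
Below is a minimal sketch of the short-circuit pattern described above. It is illustrative only: `PARAGRAPH_SPLIT_RE` and the function body are hypothetical stand-ins, not the actual code in `unstructured/cleaners/core.py`.

```python
import re

# Compiled once at import time rather than on every call (hypothetical pattern).
PARAGRAPH_SPLIT_RE = re.compile(r"\s*\n\s*")


def all_lines_short(paragraph: str, max_words: int = 5) -> bool:
    """Return True when every line has fewer than `max_words` words.

    Rather than building a list of word counts and passing it to all(...),
    negate the condition and use any(...), which stops scanning at the
    first long line it encounters.
    """
    lines = PARAGRAPH_SPLIT_RE.split(paragraph.strip())
    return not any(len(line.split()) >= max_words for line in lines)
```

With `any`, a paragraph whose first line is already long is rejected without scanning the remaining lines, which is where the fast path comes from.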



**Summary of key speedups:**
- Precompiled regex references up front, with no repeated compilation.
- Reordered bullet-matching logic so the common case hits an early fast-path `continue`.
- Short-circuited `all_lines_short`: break on the first long line.
- Avoided unnecessary double stripping/splitting.
- Used precompiled regexes even when the pattern constants may be strings.

This version will be noticeably faster, especially for large documents
or tight loops.


✅ **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **58 Passed** |
| 🌀 Generated Regression Tests | ✅ **49 Passed** |
| ⏪ Replay Tests | ✅ **6 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `cleaners/test_core.py::test_group_broken_paragraphs` | 19.5μs | 16.1μs | 21.0% |
| `cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings` | 23.9μs | 21.7μs | 10.2% |
| `partition/test_text.py::test_partition_text_groups_broken_paragraphs` | 1.97ms | 1.96ms | 0.347% |
| `test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs` | 161μs | 119μs | 34.9% |
</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy patterns for testing (since unstructured.nlp.patterns is unavailable)
# These are simplified versions for the sake of testing
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^\s*e\s+", re.MULTILINE)
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
# Unicode bullets for test
UNICODE_BULLETS_RE = re.compile(r"^\s*[•○·]", re.MULTILINE)
from unstructured.cleaners.core import group_broken_paragraphs

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_empty_string():
    # Test that empty input returns empty string
    codeflash_output = group_broken_paragraphs('') # 1.38μs -> 2.69μs (48.7% slower)

def test_single_line():
    # Test that a single line is returned unchanged
    codeflash_output = group_broken_paragraphs('Hello world.') # 6.58μs -> 6.83μs (3.68% slower)

def test_two_paragraphs_with_double_newline():
    # Test that two paragraphs separated by double newline are preserved
    text = "First paragraph.\nSecond line.\n\nSecond paragraph.\nAnother line."
    expected = "First paragraph. Second line.\n\nSecond paragraph. Another line."
    codeflash_output = group_broken_paragraphs(text) # 13.7μs -> 14.2μs (3.07% slower)

def test_paragraphs_with_single_line_breaks():
    # Test that lines in a paragraph are joined with spaces
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 18.8μs -> 16.2μs (15.7% faster)

def test_bullet_points():
    # Test bullet points are handled and line breaks inside bullets are joined
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = [
        "• The big red fox is walking down the lane.",
        "• At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 33.4μs -> 19.7μs (69.7% faster)

def test_e_bullet_points():
    # Test pytesseract e-bullet conversion is handled
    text = "e The big red fox\nis walking down the lane.\n\ne At the end of the lane\nthe fox met a bear."
    # e should be converted to ·
    expected = [
        "· The big red fox is walking down the lane.",
        "· At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.8μs -> 16.9μs (64.3% faster)

def test_short_lines_not_grouped():
    # Test that lines with <5 words are not grouped
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.5μs -> 11.5μs (8.37% slower)

def test_mixed_bullet_and_normal():
    # Test that a mix of bullets and normal paragraphs works
    text = (
        "• First bullet\nis split\n\n"
        "A normal paragraph\nwith line break.\n\n"
        "• Second bullet\nis also split"
    )
    expected = [
        "• First bullet is split",
        "A normal paragraph with line break.",
        "• Second bullet is also split"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 31.2μs -> 21.3μs (46.3% faster)

# -------------------- EDGE TEST CASES --------------------

def test_all_whitespace():
    # Test input of only whitespace returns empty string
    codeflash_output = group_broken_paragraphs('   \n   ') # 3.52μs -> 4.19μs (16.1% slower)

def test_only_newlines():
    # Test input of only newlines returns empty string
    codeflash_output = group_broken_paragraphs('\n\n\n') # 2.44μs -> 3.46μs (29.7% slower)

def test_single_bullet_with_no_linebreaks():
    # Test bullet point with no line breaks is preserved
    text = "• A bullet point with no line breaks."
    codeflash_output = group_broken_paragraphs(text) # 15.3μs -> 8.46μs (81.1% faster)

def test_paragraph_with_multiple_consecutive_newlines():
    # Test that multiple consecutive newlines are treated as paragraph breaks
    text = "First para.\n\n\nSecond para.\n\n\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.4μs -> 11.6μs (1.56% slower)

def test_leading_and_trailing_newlines():
    # Test that leading and trailing newlines are ignored
    text = "\n\nFirst para.\nSecond line.\n\nSecond para.\n\n"
    expected = "First para. Second line.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 11.9μs -> 12.5μs (4.58% slower)

def test_bullet_point_with_leading_spaces():
    # Test bullet with leading whitespace is handled
    text = "   • Bullet with leading spaces\nand a line break."
    expected = "• Bullet with leading spaces and a line break."
    codeflash_output = group_broken_paragraphs(text) # 18.4μs -> 10.6μs (73.3% faster)

def test_unicode_bullets():
    # Test that various unicode bullets are handled
    text = "○ Unicode bullet\nline two.\n\n· Another unicode bullet\nline two."
    expected = [
        "○ Unicode bullet line two.",
        "· Another unicode bullet line two."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.7μs -> 15.7μs (75.8% faster)

def test_short_lines_with_blank_lines():
    # Test that short lines with blank lines are preserved and not grouped
    text = "Title\n\nSubtitle\n\n2024"
    expected = "Title\n\nSubtitle\n\n2024"
    codeflash_output = group_broken_paragraphs(text) # 9.66μs -> 10.1μs (4.73% slower)

def test_mixed_short_and_long_lines():
    # Test a paragraph with both short and long lines
    text = "Title\nThis is a long line that should be grouped with the next.\nAnother long line."
    expected = "Title This is a long line that should be grouped with the next. Another long line."
    codeflash_output = group_broken_paragraphs(text) # 14.9μs -> 13.2μs (13.3% faster)

def test_bullet_point_with_inner_blank_lines():
    # Test bullet points with inner blank lines
    text = "• Bullet one\n\n• Bullet two\n\n• Bullet three"
    expected = [
        "• Bullet one",
        "• Bullet two",
        "• Bullet three"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 24.9μs -> 13.7μs (81.4% faster)

def test_paragraph_with_tabs_and_spaces():
    # Test paragraphs with tabs and spaces are grouped correctly
    text = "First\tparagraph\nis here.\n\n\tSecond paragraph\nis here."
    expected = "First\tparagraph is here.\n\n\tSecond paragraph is here."
    codeflash_output = group_broken_paragraphs(text) # 12.4μs -> 12.4μs (0.314% slower)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_number_of_paragraphs():
    # Test function with 500 paragraphs
    paras = ["Paragraph {} line 1\nParagraph {} line 2".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Paragraph {} line 1 Paragraph {} line 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.79ms -> 1.69ms (5.66% faster)

def test_large_number_of_bullets():
    # Test function with 500 bullet points, each split over two lines
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(500)]
    text = "\n\n".join(bullets)
    expected = "\n\n".join(["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 3.72ms -> 1.88ms (97.3% faster)

def test_large_mixed_content():
    # Test function with 200 normal paragraphs and 200 bullet paragraphs
    paras = ["Normal para {} line 1\nNormal para {} line 2".format(i, i) for i in range(200)]
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(200)]
    # Interleave them
    text = "\n\n".join([item for pair in zip(paras, bullets) for item in pair])
    # Since the input interleaves normal and bullet paragraphs, the
    # expected output must interleave them as well.
    expected = "\n\n".join([
        val for pair in zip(
            ["Normal para {} line 1 Normal para {} line 2".format(i, i) for i in range(200)],
            ["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(200)]
        ) for val in pair
    ])
    codeflash_output = group_broken_paragraphs(text) # 2.48ms -> 1.59ms (55.8% faster)

def test_performance_on_large_text():
    # Test that the function can handle a large block of text efficiently (not a correctness test)
    big_text = "This is a line in a very big paragraph.\n" * 999
    # Should be grouped into a single paragraph with spaces
    expected = " ".join(["This is a line in a very big paragraph."] * 999)
    codeflash_output = group_broken_paragraphs(big_text) # 2.62ms -> 2.62ms (0.161% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy regexes for test purposes (since we don't have unstructured.nlp.patterns)
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^e\s")
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
UNICODE_BULLETS_RE = re.compile(r"^[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25CF\u25CB\u25A0\u25A1\u25B2\u25B3\u25BC\u25BD\u25C6\u25C7\u25C9\u25CB\u25D8\u25D9\u25E6\u2605\u2606\u2765\u2767\u29BE\u29BF\u25A0-\u25FF]")
from unstructured.cleaners.core import group_broken_paragraphs

# unit tests

# -------------------------------
# 1. Basic Test Cases
# -------------------------------

def test_single_paragraph_joined():
    # Should join lines in a single paragraph into one line
    text = "The big red fox\nis walking down the lane."
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 11.2μs -> 9.78μs (14.9% faster)

def test_multiple_paragraphs():
    # Should join lines in each paragraph, and keep paragraphs separate
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 17.7μs -> 15.7μs (13.0% faster)

def test_preserve_double_newlines():
    # Double newlines should be preserved as paragraph breaks
    text = "Para one line one\nPara one line two.\n\nPara two line one\nPara two line two."
    expected = "Para one line one Para one line two.\n\nPara two line one Para two line two."
    codeflash_output = group_broken_paragraphs(text) # 13.8μs -> 14.0μs (1.43% slower)

def test_short_lines_not_joined():
    # Short lines (less than 5 words) should not be joined, but kept as separate lines
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.7μs -> 11.2μs (4.59% slower)

def test_bullet_points_grouped():
    # Bullet points with line breaks should be joined into single lines per bullet
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = "• The big red fox is walking down the lane.\n\n• At the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 35.4μs -> 21.1μs (68.0% faster)

def test_e_bullet_points_grouped():
    # 'e' as bullet should be replaced and grouped
    text = "e The big red fox\nis walking down the lane."
    expected = "· The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 17.5μs -> 10.9μs (61.7% faster)

# -------------------------------
# 2. Edge Test Cases
# -------------------------------

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = group_broken_paragraphs("") # 1.13μs -> 2.03μs (44.3% slower)

def test_only_newlines():
    # String of only newlines should return empty string
    codeflash_output = group_broken_paragraphs("\n\n\n") # 2.70μs -> 3.52μs (23.1% slower)

def test_spaces_and_newlines():
    # String of spaces and newlines should return empty string
    codeflash_output = group_broken_paragraphs("   \n  \n\n  ") # 2.91μs -> 3.90μs (25.4% slower)

def test_single_word():
    # Single word should be returned as is
    codeflash_output = group_broken_paragraphs("Hello") # 5.77μs -> 6.09μs (5.24% slower)

def test_single_line_paragraphs():
    # Multiple single-line paragraphs separated by double newlines
    text = "First para.\n\nSecond para.\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.3μs -> 12.0μs (5.89% slower)

def test_paragraph_with_trailing_newlines():
    # Paragraph with trailing newlines should be handled
    text = "The big red fox\nis walking down the lane.\n\n"
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 12.7μs -> 11.1μs (13.6% faster)

def test_bullet_with_extra_spaces():
    # Bullet with extra spaces and newlines
    text = "  •   The quick brown\nfox jumps over\n  the lazy dog.  "
    expected = "•   The quick brown fox jumps over   the lazy dog.  "
    codeflash_output = group_broken_paragraphs(text) # 22.5μs -> 12.6μs (78.1% faster)

def test_mixed_bullets_and_normal():
    # Mixed bullet and non-bullet paragraphs
    text = "• Bullet one\ncontinues here.\n\nNormal para\ncontinues here."
    expected = "• Bullet one continues here.\n\nNormal para continues here."
    codeflash_output = group_broken_paragraphs(text) # 22.0μs -> 15.6μs (40.8% faster)

def test_multiple_bullet_styles():
    # Multiple Unicode bullet styles
    text = "• Bullet A\nline two.\n\n◦ Bullet B\nline two."
    expected = "• Bullet A line two.\n\n◦ Bullet B line two."
    codeflash_output = group_broken_paragraphs(text) # 23.7μs -> 12.4μs (90.4% faster)

def test_short_and_long_lines_mixed():
    # A paragraph with both short and long lines
    text = "Short\nThis is a much longer line that should be joined\nAnother short"
    # Only the first and last lines are short, but the presence of a long line means the paragraph will be joined
    expected = "Short This is a much longer line that should be joined Another short"
    codeflash_output = group_broken_paragraphs(text) # 14.1μs -> 12.7μs (10.9% faster)

def test_paragraph_with_tabs():
    # Paragraph with tabs instead of spaces
    text = "The big red fox\tis walking down the lane."
    expected = "The big red fox\tis walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 9.45μs -> 7.96μs (18.7% faster)

def test_bullet_with_leading_newline():
    # Bullet point with a leading newline
    text = "\n• Bullet with leading newline\ncontinues here."
    expected = "• Bullet with leading newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 18.7μs -> 9.98μs (87.2% faster)

def test_bullet_with_trailing_newline():
    # Bullet point with a trailing newline
    text = "• Bullet with trailing newline\ncontinues here.\n"
    expected = "• Bullet with trailing newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 17.2μs -> 9.58μs (79.6% faster)

def test_unicode_bullet_variants():
    # Test with a variety of Unicode bullets
    text = "● Unicode bullet one\ncontinues\n\n○ Unicode bullet two\ncontinues"
    expected = "● Unicode bullet one continues\n\n○ Unicode bullet two continues"
    codeflash_output = group_broken_paragraphs(text) # 24.3μs -> 13.8μs (76.7% faster)

def test_multiple_empty_paragraphs():
    # Multiple empty paragraphs between text
    text = "First para.\n\n\n\nSecond para."
    expected = "First para.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 9.26μs -> 9.85μs (6.00% slower)

# -------------------------------
# 3. Large Scale Test Cases
# -------------------------------

def test_large_number_of_paragraphs():
    # 500 paragraphs, each with two lines to be joined
    paras = ["Line one {}\nLine two {}".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Line one {} Line two {}".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.36ms -> 1.29ms (5.79% faster)

def test_large_number_of_bullets():
    # 300 bullet points, each with two lines
    paras = ["• Bullet {}\ncontinues here.".format(i) for i in range(300)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["• Bullet {} continues here.".format(i) for i in range(300)])
    codeflash_output = group_broken_paragraphs(text) # 1.98ms -> 969μs (104% faster)

def test_large_mixed_content():
    # Mix of 200 normal paras and 200 bullets
    normal_paras = ["Normal {}\ncontinues".format(i) for i in range(200)]
    bullet_paras = ["• Bullet {}\ncontinues".format(i) for i in range(200)]
    all_paras = []
    for i in range(200):
        all_paras.append(normal_paras[i])
        all_paras.append(bullet_paras[i])
    text = "\n\n".join(all_paras)
    # The function processes paragraphs in order, so the expected output
    # interleaves normal and bullet paragraphs just like the input.
    interleaved = []
    for i in range(200):
        interleaved.append("Normal {} continues".format(i))
        interleaved.append("• Bullet {} continues".format(i))
    expected = "\n\n".join(interleaved)
    codeflash_output = group_broken_paragraphs(text)

def test_large_short_lines():
    # 1000 short lines, all should be preserved as is (not joined)
    text = "\n".join(["A {}".format(i) for i in range(1000)])
    expected = "\n".join(["A {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 605μs -> 565μs (7.11% faster)

def test_large_paragraph_with_long_lines():
    # One paragraph with 1000 long lines (should be joined into one)
    text = "\n".join(["This is a long line number {}".format(i) for i in range(1000)])
    expected = " ".join(["This is a long line number {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 2.11ms -> 2.09ms (1.10% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes, `git checkout codeflash/optimize-group_broken_paragraphs-mcg8s57e` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)

---------

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
2025-09-09 23:32:42 +00:00
qued
1030a696a1
fix: update deps to resolve cve (#4093)
There's a
[CVE](https://github.com/Unstructured-IO/unstructured/actions/runs/17506946725/job/49892516686#step:4:27)
in `deepdiff` that's resolved in 8.6.1, so I'm bumping deps.
2025-09-09 21:43:40 +00:00
Saurabh Misra
e3854d279f
Setup Codeflash Github Actions to optimize all future code (#4082)
- This Pull Request sets up the `codeflash.yml` file, which will run on every new Pull Request that modifies source code in the `unstructured` directory.
- We set up the codeflash config in the pyproject.toml file. This defines the basic project config for codeflash.
- The workflow uses uv to install the CI dependencies faster than the current caching solution. Speed is useful for getting quicker optimizations.
- Please take a look at the requirements that are being installed, and feel free to add more to the install list. Codeflash tries to execute code, and if it is missing a dependency needed to make something run, it will fail to optimize.
- Codeflash is installed fresh on every CI run. This keeps the workflow on the latest version of codeflash as it improves rapidly. Feel free to add codeflash as a dev dependency as well, since we are about to release more local optimization tools like VS Code and Claude Code extensions.
- Feel free to modify this GitHub Action any way you want.

**Actions required to make this work:**

- Install the Codeflash GitHub app on this repo from [this link](https://github.com/apps/codeflash-ai/installations/select_target). This is required for our GitHub bot to comment and create suggestions on the repo.
- Create a new `CODEFLASH_API_KEY` after signing up for [Codeflash on our website](https://www.codeflash.ai/). The onboarding will ask you to create an API key and show instructions on how to save it in your repo secrets.

Then, after this PR is merged, it will start generating new optimizations 🎉

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>
Co-authored-by: cragwolfe <cragcw@gmail.com>
2025-09-02 20:15:29 +00:00
luke-kucing
fed8942cf8
manual fix for open CVEs (#4085) 0.18.14 unstructured_0.18.14 2025-08-26 00:57:22 +00:00
Saurabh Misra
51425ddb32
⚡️ Speed up function sentence_count by 59% (#4080)
Saurabh's comments: The changes look good, especially because they have been rigorously tested with a variety of cases, which makes me feel confident.

### 📄 59% (0.59x) speedup for ***`sentence_count` in
`unstructured/partition/text_type.py`***

⏱️ Runtime : **`190 milliseconds`** **→** **`119 milliseconds`** (best
of `39` runs)
### 📝 Explanation and details

Major speedups:
- Replace list comprehensions with generator expressions in counting scenarios to avoid building intermediate lists.
- Use a simple word count (splitting on whitespace with `str.split()`) after punctuation removal, rather than an expensive `word_tokenize` call, since only the token count is used and punctuation is already stripped (see the sketch below).
- Avoid calling `remove_punctuation` and `word_tokenize` on very short sentences when there is a `min_length` filter: zero-length text is filtered out immediately.

If you wish to maximize compatibility with sentences containing non-whitespace-separable tokens (e.g. CJK languages), consider further optimization of the token-counting line as needed for your domain. Otherwise, `str.split()` after punctuation removal suffices and is far faster than a full NLP tokenizer.
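
As a rough illustration of the cheap token-counting trick, here is a minimal sketch. `_PUNCT_RE` and `sentence_count_sketch` are hypothetical stand-ins, not the actual code in `unstructured/partition/text_type.py`.

```python
import re
from typing import Optional

from nltk import sent_tokenize  # assumed available, as in the test environment

# Hypothetical stand-in for remove_punctuation, shown for illustration only.
_PUNCT_RE = re.compile(r"[^\w\s]")


def sentence_count_sketch(text: str, min_length: Optional[int] = None) -> int:
    """Count sentences, optionally keeping only those with >= min_length words.

    After punctuation is stripped, str.split() yields the token count far
    more cheaply than nltk.word_tokenize, and the generator expression
    avoids materializing an intermediate list.
    """
    sentences = sent_tokenize(text)
    if min_length is None:
        return len(sentences)
    return sum(
        1
        for sentence in sentences
        if len(_PUNCT_RE.sub("", sentence).split()) >= min_length
    )
```

Note that whitespace splitting undercounts tokens for CJK text, which is exactly why the paragraph above suggests revisiting that line for such domains.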


✅ **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **92 Passed** |
| ⏪ Replay Tests | ✅ **695 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_item_titles` | 92.5μs | 47.8μs | 93.6% |
| `partition/test_text_type.py::test_sentence_count` | 47.8μs | 4.67μs | 924% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_sentence_count` | 4.37ms | 2.11ms | 107% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_sentence_count` | 13.0ms | 4.69ms | 177% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_sentence_count` | 5.29ms | 3.02ms | 75.3% |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import random
import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.cleaners.core import remove_punctuation
from unstructured.logger import trace_logger
from unstructured.nlp.tokenize import sent_tokenize, word_tokenize
from unstructured.partition.text_type import sentence_count

# unit tests

# --------------------
# BASIC TEST CASES
# --------------------

def test_empty_string():
    # Empty string should return 0 sentences
    codeflash_output = sentence_count("") # 7.66μs -> 6.98μs (9.74% faster)

def test_single_sentence():
    # Single sentence with period
    codeflash_output = sentence_count("This is a sentence.") # 36.7μs -> 9.46μs (288% faster)

def test_single_sentence_no_punctuation():
    # Single sentence, no punctuation (should still count as 1 by NLTK)
    codeflash_output = sentence_count("This is a sentence") # 29.5μs -> 7.63μs (286% faster)

def test_two_sentences():
    # Two sentences separated by period
    codeflash_output = sentence_count("This is one. This is two.") # 72.4μs -> 37.0μs (95.7% faster)

def test_multiple_sentences_with_various_punctuation():
    # Sentences ending with ! and ?
    codeflash_output = sentence_count("Is this working? Yes! It is.") # 93.2μs -> 44.5μs (109% faster)

def test_sentence_with_abbreviation():
    # Abbreviations shouldn't split sentences
    codeflash_output = sentence_count("Dr. Smith went home. He was tired.") # 80.0μs -> 42.3μs (89.0% faster)

def test_sentence_with_ellipsis():
    # Ellipsis should not split sentences
    codeflash_output = sentence_count("Wait... what happened? I don't know.") # 76.9μs -> 42.5μs (81.0% faster)

def test_sentence_with_newlines():
    # Sentences separated by newlines
    codeflash_output = sentence_count("First sentence.\nSecond sentence.\nThird sentence.") # 91.7μs -> 43.4μs (111% faster)

def test_sentence_with_min_length_met():
    # min_length is met for all sentences
    codeflash_output = sentence_count("One two three. Four five six.", min_length=2) # 63.1μs -> 30.2μs (109% faster)

def test_sentence_with_min_length_not_met():
    # Only one sentence meets min_length
    codeflash_output = sentence_count("One. Two three four.", min_length=3) # 62.8μs -> 31.9μs (97.1% faster)

def test_sentence_with_min_length_none_met():
    # No sentence meets min_length
    codeflash_output = sentence_count("A. B.", min_length=2) # 60.9μs -> 32.3μs (88.5% faster)

def test_sentence_with_min_length_equals_length():
    # Sentence with exactly min_length words
    codeflash_output = sentence_count("One two three.", min_length=3) # 27.3μs -> 9.54μs (187% faster)

def test_sentence_with_trailing_space():
    # Sentence with trailing spaces
    codeflash_output = sentence_count("Hello world.   ") # 27.8μs -> 8.60μs (223% faster)

# --------------------
# EDGE TEST CASES
# --------------------

def test_only_punctuation():
    # Only punctuation, no words
    codeflash_output = sentence_count("...!!!") # 39.1μs -> 33.5μs (16.8% faster)

def test_only_whitespace():
    # Only whitespace
    codeflash_output = sentence_count("    \n\t   ") # 5.21μs -> 5.49μs (5.05% slower)

def test_sentence_with_numbers_and_symbols():
    # Sentence with numbers and symbols
    codeflash_output = sentence_count("12345! $%^&*()") # 66.6μs -> 32.7μs (104% faster)

def test_sentence_with_unicode_characters():
    # Sentences with unicode and emoji
    codeflash_output = sentence_count("Hello 😊. How are you?") # 75.9μs -> 37.7μs (102% faster)

def test_sentence_with_mixed_scripts():
    # Sentences with mixed scripts (e.g., English and Japanese)
    codeflash_output = sentence_count("Hello. こんにちは。How are you?") # 71.2μs -> 34.9μs (104% faster)

def test_sentence_with_multiple_spaces():
    # Sentences with irregular spacing
    codeflash_output = sentence_count("This   is   spaced.   And   so   is   this.") # 69.5μs -> 30.3μs (129% faster)

def test_sentence_with_no_word_characters():
    # Only punctuation and numbers
    codeflash_output = sentence_count("... 123 ...") # 42.1μs -> 25.5μs (65.2% faster)

def test_sentence_with_long_word():
    # Sentence with a single long word
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.") # 42.0μs -> 7.55μs (457% faster)

def test_sentence_with_long_word_and_min_length():
    # Sentence with long word, min_length > 1
    long_word = "a" * 100
    codeflash_output = sentence_count(f"{long_word}.", min_length=2) # 43.1μs -> 10.7μs (303% faster)

def test_sentence_with_only_abbreviation():
    # Sentence is only an abbreviation
    codeflash_output = sentence_count("U.S.A.") # 23.0μs -> 7.46μs (208% faster)

def test_sentence_with_nonbreaking_space():
    # Sentence with non-breaking space
    text = "Hello\u00A0world. How are you?"
    codeflash_output = sentence_count(text) # 74.4μs -> 37.3μs (99.6% faster)

def test_sentence_with_tab_characters():
    # Sentences separated by tabs
    text = "Hello world.\tHow are you?\tFine."
    codeflash_output = sentence_count(text) # 100μs -> 44.5μs (125% faster)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "Wait!! What?? Really..."
    codeflash_output = sentence_count(text) # 88.9μs -> 48.3μs (83.9% faster)

def test_sentence_with_leading_and_trailing_punctuation():
    # Sentence surrounded by punctuation
    text = "...Hello world!..."
    codeflash_output = sentence_count(text) # 25.9μs -> 8.64μs (200% faster)

def test_sentence_with_quotes():
    # Sentences with quotes
    text = '"Hello," she said. "How are you?"'
    codeflash_output = sentence_count(text) # 85.3μs -> 46.6μs (83.1% faster)

def test_sentence_with_parentheses():
    # Sentence with parentheses
    text = "This is a sentence (with parentheses). This is another."
    codeflash_output = sentence_count(text) # 74.7μs -> 33.5μs (123% faster)

def test_sentence_with_semicolons():
    # Semicolons should not split sentences
    text = "This is a sentence; this is not a new sentence."
    codeflash_output = sentence_count(text) # 37.1μs -> 8.74μs (324% faster)

def test_sentence_with_colons():
    # Colons should not split sentences
    text = "This is a sentence: it continues here."
    codeflash_output = sentence_count(text) # 33.2μs -> 8.44μs (294% faster)

def test_sentence_with_dash():
    # Dashes should not split sentences
    text = "This is a sentence - it continues here."
    codeflash_output = sentence_count(text) # 34.8μs -> 8.52μs (309% faster)

def test_sentence_with_multiple_dots():
    # Multiple dots but not ellipsis
    text = "This is a sentence.... This is another."
    codeflash_output = sentence_count(text) # 79.9μs -> 38.2μs (109% faster)

def test_sentence_with_min_length_and_punctuation():
    # min_length with sentences containing only punctuation
    text = "!!! ... ???"
    codeflash_output = sentence_count(text, min_length=1) # 80.2μs -> 77.0μs (4.21% faster)

def test_sentence_with_min_length_and_numbers():
    # min_length with numbers as words
    text = "1 2 3 4. 5 6."
    codeflash_output = sentence_count(text, min_length=4) # 69.6μs -> 34.6μs (101% faster)

def test_sentence_with_min_length_and_unicode():
    # min_length with unicode
    text = "😊 😊 😊 😊. Hello!"
    codeflash_output = sentence_count(text, min_length=4) # 76.1μs -> 41.7μs (82.5% faster)

def test_sentence_with_non_ascii_punctuation():
    # Sentence with non-ASCII punctuation (e.g., Chinese full stop)
    text = "Hello world。How are you?"
    codeflash_output = sentence_count(text) # 31.9μs -> 9.34μs (242% faster)

def test_sentence_with_repeated_newlines():
    # Sentences separated by multiple newlines
    text = "First sentence.\n\n\nSecond sentence."
    codeflash_output = sentence_count(text) # 71.9μs -> 33.4μs (115% faster)

# --------------------
# LARGE SCALE TEST CASES
# --------------------

def test_large_number_of_sentences():
    # 1000 sentences, each "Sentence X."
    n = 1000
    text = " ".join([f"Sentence {i}." for i in range(n)])
    codeflash_output = sentence_count(text) # 21.2ms -> 8.43ms (151% faster)

def test_large_number_of_sentences_with_min_length():
    # 1000 sentences, every even-indexed has 3 words, odd-indexed has 1 word
    n = 1000
    sentences = []
    for i in range(n):
        if i % 2 == 0:
            sentences.append(f"Word1 Word2 Word3.")
        else:
            sentences.append(f"Word.")
    text = " ".join(sentences)
    # Only even-indexed sentences should count for min_length=3
    codeflash_output = sentence_count(text, min_length=3)

def test_large_sentence():
    # One very long sentence (999 words)
    sentence = " ".join(["word"] * 999) + "."
    codeflash_output = sentence_count(sentence) # 1.15ms -> 29.0μs (3854% faster)
    codeflash_output = sentence_count(sentence, min_length=999) # 18.7μs -> 41.1μs (54.5% slower)
    codeflash_output = sentence_count(sentence, min_length=1000) # 18.4μs -> 38.6μs (52.3% slower)

def test_large_text_with_varied_sentence_lengths():
    # 500 short sentences, 500 long sentences (5 and 20 words)
    n_short = 500
    n_long = 500
    short_sentence = "a b c d e."
    long_sentence = " ".join(["word"] * 20) + "."
    text = " ".join([short_sentence]*n_short + [long_sentence]*n_long)
    # min_length=10 should only count long sentences
    codeflash_output = sentence_count(text, min_length=10) # 8.33ms -> 7.46ms (11.7% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 412μs -> 722μs (42.9% slower)

def test_large_text_with_unicode_and_punctuation():
    # 1000 sentences, each with emoji and punctuation
    n = 1000
    text = " ".join([f"Hello 😊! How are you?"] * n)
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 15.8ms -> 15.3ms (3.27% faster)

def test_large_text_with_random_punctuation():
    # 1000 sentences with random punctuation at the end
    n = 1000
    punctuations = [".", "!", "?"]
    text = " ".join([f"Sentence {i}{random.choice(punctuations)}" for i in range(n)])
    codeflash_output = sentence_count(text) # 20.6ms -> 8.02ms (157% faster)

def test_large_text_with_abbreviations():
    # 1000 sentences, some with abbreviations
    n = 1000
    text = " ".join([f"Dr. Smith went home. He was tired."] * (n // 2))
    # Each repetition has 2 sentences
    codeflash_output = sentence_count(text) # 11.6ms -> 11.2ms (3.99% faster)

def test_large_text_with_newlines_and_tabs():
    # 500 sentences separated by newlines, 500 by tabs
    n = 500
    text1 = "\n".join([f"Sentence {i}." for i in range(n)])
    text2 = "\t".join([f"Sentence {i}." for i in range(n, 2*n)])
    text = text1 + "\n" + text2
    codeflash_output = sentence_count(text) # 21.3ms -> 8.57ms (149% faster)

def test_large_text_with_min_length_and_unicode():
    # 1000 sentences, half with 5 emojis, half with 1 emoji
    n = 1000
    text = " ".join(["😊 " * 5 + "." if i % 2 == 0 else "😊." for i in range(n)])
    # min_length=5 should count only even-indexed
    codeflash_output = sentence_count(text, min_length=5) # 7.88ms -> 7.83ms (0.679% faster)
    # min_length=1 should count all
    codeflash_output = sentence_count(text, min_length=1) # 573μs -> 742μs (22.7% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import string
import sys
import unicodedata
from functools import lru_cache
from typing import Final, List, Optional

# imports
import pytest  # used for our unit tests
from nltk import sent_tokenize as _sent_tokenize
from nltk import word_tokenize as _word_tokenize
from unstructured.partition.text_type import sentence_count


# Dummy trace_logger for test purposes (since real logger is not available)
class DummyLogger:
    def detail(self, msg):
        pass
trace_logger = DummyLogger()
from unstructured.partition.text_type import sentence_count

# unit tests

# ---------------- BASIC TEST CASES ----------------

def test_single_sentence():
    # A simple sentence
    codeflash_output = sentence_count("This is a sentence.") # 34.7μs -> 10.1μs (244% faster)

def test_multiple_sentences():
    # Two distinct sentences
    codeflash_output = sentence_count("This is the first sentence. This is the second.") # 77.6μs -> 33.9μs (129% faster)

def test_sentence_with_min_length_met():
    # Sentence with enough words for min_length
    codeflash_output = sentence_count("This is a long enough sentence.", min_length=5) # 33.7μs -> 10.9μs (209% faster)

def test_sentence_with_min_length_not_met():
    # Sentence with too few words for min_length
    codeflash_output = sentence_count("Too short.", min_length=3) # 28.4μs -> 11.1μs (156% faster)

def test_multiple_sentences_with_min_length():
    # Only one of two sentences meets min_length
    text = "Short. This one is long enough."
    codeflash_output = sentence_count(text, min_length=4) # 73.7μs -> 37.0μs (99.1% faster)

def test_sentence_with_punctuation():
    # Sentence with internal punctuation
    text = "Hello, world! How are you?"
    codeflash_output = sentence_count(text) # 66.2μs -> 29.9μs (121% faster)

def test_sentence_with_abbreviations():
    # Sentence with abbreviation that should not split sentences
    text = "Dr. Smith went to Washington. He arrived at 3 p.m. It was sunny."
    codeflash_output = sentence_count(text) # 117μs -> 61.3μs (92.3% faster)

# ---------------- EDGE TEST CASES ----------------

def test_empty_string():
    # Empty string should yield 0 sentences
    codeflash_output = sentence_count("") # 5.53μs -> 5.59μs (1.02% slower)

def test_whitespace_only():
    # String with only whitespace
    codeflash_output = sentence_count("   ") # 5.11μs -> 5.48μs (6.73% slower)

def test_no_sentence_ending_punctuation():
    # No periods, exclamation or question marks
    codeflash_output = sentence_count("This is not split into sentences") # 35.0μs -> 7.85μs (346% faster)

def test_sentence_with_only_punctuation():
    # String with only punctuation marks
    codeflash_output = sentence_count("!!!...???") # 47.1μs -> 41.7μs (12.9% faster)

def test_sentence_with_newlines():
    # Sentences split by newlines
    text = "First sentence.\nSecond sentence.\n\nThird sentence."
    codeflash_output = sentence_count(text) # 98.5μs -> 47.0μs (110% faster)

def test_sentence_with_multiple_spaces():
    # Sentences separated by multiple spaces
    text = "Sentence one.   Sentence two.      Sentence three."
    codeflash_output = sentence_count(text) # 95.3μs -> 42.1μs (126% faster)

def test_sentence_with_unicode_punctuation():
    # Sentences with unicode punctuation (em dash, ellipsis, etc.)
    text = "Hello… How are you—good?"
    codeflash_output = sentence_count(text) # 32.4μs -> 11.0μs (196% faster)

def test_sentence_with_non_ascii_characters():
    # Sentences with non-ASCII (e.g., accented) characters
    text = "C'est la vie. Voilà!"
    codeflash_output = sentence_count(text) # 73.6μs -> 34.4μs (114% faster)

def test_sentence_with_numbers_and_periods():
    # Numbers with periods should not split sentences
    text = "Version 3.2 is out. Please update."
    codeflash_output = sentence_count(text) # 66.2μs -> 29.1μs (128% faster)

def test_sentence_with_emoji():
    # Sentences with emoji
    text = "I am happy 😊. Are you?"
    codeflash_output = sentence_count(text) # 73.7μs -> 33.8μs (118% faster)

def test_sentence_with_tabs_and_spaces():
    # Sentences separated by tabs and spaces
    text = "First sentence.\tSecond sentence.   Third sentence."
    codeflash_output = sentence_count(text) # 45.6μs -> 44.1μs (3.20% faster)

def test_sentence_with_min_length_zero():
    # min_length=0 should count all sentences
    text = "One. Two. Three."
    codeflash_output = sentence_count(text, min_length=0) # 88.5μs -> 40.5μs (119% faster)

def test_sentence_with_min_length_equals_num_words():
    # min_length equal to the number of words in a sentence
    text = "This is five words."
    codeflash_output = sentence_count(text, min_length=5) # 31.6μs -> 12.3μs (157% faster)

def test_sentence_with_min_length_greater_than_any_sentence():
    # min_length greater than any sentence's word count
    text = "Short. Tiny. Small."
    codeflash_output = sentence_count(text, min_length=10) # 79.2μs -> 47.8μs (65.7% faster)

def test_sentence_with_trailing_and_leading_spaces():
    # Sentences with leading/trailing spaces
    text = "   First sentence. Second sentence.   "
    codeflash_output = sentence_count(text) # 50.5μs -> 29.3μs (72.2% faster)

def test_sentence_with_only_newlines():
    # Only newlines
    text = "\n\n\n"
    codeflash_output = sentence_count(text) # 5.11μs -> 5.40μs (5.31% slower)

def test_sentence_with_multiple_punctuation_marks():
    # Sentences ending with multiple punctuation marks
    text = "What?! Really?! Yes."
    codeflash_output = sentence_count(text) # 99.1μs -> 49.6μs (99.9% faster)

def test_sentence_with_quoted_text():
    # Sentences with quoted text
    text = '"Hello there." She said. "How are you?"'
    codeflash_output = sentence_count(text) # 97.5μs -> 58.4μs (66.9% faster)

def test_sentence_with_parentheses():
    # Sentences with parentheses
    text = "This is a sentence (with extra info). Another sentence."
    codeflash_output = sentence_count(text) # 77.1μs -> 32.4μs (138% faster)

def test_sentence_with_semicolons_and_colons():
    # Semicolons and colons should not split sentences
    text = "First part; still same sentence: more info. Next sentence."
    codeflash_output = sentence_count(text) # 72.1μs -> 28.5μs (153% faster)

def test_sentence_with_single_word():
    # Single word, with and without punctuation
    codeflash_output = sentence_count("Hello.") # 23.6μs -> 7.43μs (218% faster)
    codeflash_output = sentence_count("Hello") # 3.91μs -> 5.14μs (24.0% slower)

def test_sentence_with_multiple_periods():
    # Ellipsis should not split into multiple sentences
    text = "Wait... What happened?"
    codeflash_output = sentence_count(text) # 52.3μs -> 29.7μs (76.4% faster)

def test_sentence_with_uppercase_acronyms():
    # Acronyms with periods should not split sentences
    text = "I work at U.S.A. headquarters. It's nice."
    codeflash_output = sentence_count(text) # 90.6μs -> 50.5μs (79.4% faster)

def test_sentence_with_decimal_numbers():
    # Decimal numbers should not split sentences
    text = "The value is 3.14. That's pi."
    codeflash_output = sentence_count(text) # 75.2μs -> 35.4μs (112% faster)

def test_sentence_with_bullet_points():
    # Bullet points without ending punctuation
    text = "• First item\n• Second item\n• Third item"
    codeflash_output = sentence_count(text) # 35.3μs -> 10.4μs (240% faster)

def test_sentence_with_dash_and_hyphen():
    # Dashes and hyphens should not split sentences
    text = "Well-known fact—it's true. Next sentence."
    codeflash_output = sentence_count(text) # 55.1μs -> 35.1μs (56.9% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_text_many_sentences():
    # Test with a large number of sentences
    text = " ".join([f"Sentence number {i}." for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.5ms -> 4.39ms (162% faster)

def test_large_text_with_min_length():
    # Large text, only some sentences meet min_length
    text = "Short. " * 500 + "This is a sufficiently long sentence for counting. " * 200
    # Only the long sentences (8 words each) should be counted
    codeflash_output = sentence_count(text, min_length=7) # 6.24ms -> 6.19ms (0.716% faster)

def test_large_text_no_sentences():
    # Large text with no sentence-ending punctuation
    text = " ".join(["word"] * 1000)
    codeflash_output = sentence_count(text) # 1.12ms -> 27.9μs (3935% faster)

def test_large_text_all_sentences_filtered_by_min_length():
    # All sentences too short for min_length
    text = "A. B. C. D. " * 250
    codeflash_output = sentence_count(text, min_length=5) # 7.01ms -> 6.93ms (1.12% faster)

def test_large_text_with_varied_sentence_lengths():
    # Mix of short and long sentences
    short = "Hi. " * 300
    long = "This is a longer sentence for testing. " * 100
    text = short + long
    codeflash_output = sentence_count(text, min_length=6) # 3.39ms -> 3.33ms (1.73% faster)

def test_large_text_with_unicode_and_emoji():
    # Large text with unicode and emoji in sentences
    text = "😊 Hello world! " * 400 + "C'est la vie. Voilà! " * 100
    codeflash_output = sentence_count(text) # 5.41ms -> 5.16ms (4.84% faster)

def test_large_text_with_newlines_and_tabs():
    # Large text with newlines and tabs between sentences
    text = "\n".join([f"Sentence {i}.\t" for i in range(1, 501)])
    codeflash_output = sentence_count(text) # 11.0ms -> 4.52ms (144% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes `git checkout
codeflash/optimize-sentence_count-mcglwwcn` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
2025-08-22 05:27:36 +00:00
Saurabh Misra
57cadf8192
⚡️ Speed up function check_for_nltk_package by 111% (#4081)
### 📄 111% (1.11x) speedup for ***`check_for_nltk_package` in
`unstructured/nlp/tokenize.py`***

⏱️ Runtime : **`57.7 milliseconds`** **→** **`27.3 milliseconds`** (best
of `101` runs)
### 📝 Explanation and details

Here’s an optimized version of your program. The main improvements are:

- Eliminates the unnecessary list and loop for constructing `paths`; a
generator expression avoids allocating an intermediate list.
- Uses `os.path.join` only when needed, otherwise leaves the original
path unchanged.
- Keeps the result in a local variable inside the function instead of
constructing the list first.
- Overall: fewer allocations and faster iteration.
- Avoids creating and storing a full list of potentially many paths;
they are generated lazily as `nltk.find` consumes them.

This is as fast as possible, given the external dependencies (nltk’s own
`find()` algorithm).
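
For reference, a minimal sketch of the optimized shape described above (names follow the generated tests; this is an illustration, not the PR's exact diff). Since `nltk.data.find` may traverse the search paths more than once, the sketch materializes the candidate list rather than passing a bare generator:

```python
import os

import nltk


def check_for_nltk_package(package_name: str, package_category: str) -> bool:
    """Return True if the package is locatable under any nltk data path."""
    # Append "nltk_data" only to paths that do not already end with it.
    paths = [
        path if path.endswith("nltk_data") else os.path.join(path, "nltk_data")
        for path in nltk.data.path
    ]
    try:
        nltk.data.find(f"{package_category}/{package_name}", paths=paths)
        return True
    except (LookupError, OSError):
        return False
```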


 **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **796 Passed** |
| ⏪ Replay Tests | ✅ **8 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

import os
import shutil
import tempfile

import nltk
# imports
import pytest  # used for our unit tests
from unstructured.nlp.tokenize import check_for_nltk_package

# unit tests

# -------------------
# Basic Test Cases
# -------------------

def test_existing_corpus():
    # Test with a standard corpus that is usually present if nltk_data is installed
    # 'punkt' is a common tokenizer model
    codeflash_output = check_for_nltk_package('punkt', 'tokenizers') # 117μs -> 76.8μs (53.7% faster)
    # If 'punkt' is present, should return True
    # If not present, should return False
    # We check both to allow for environments where punkt is not installed

def test_nonexistent_package():
    # Test with a package that does not exist
    codeflash_output = check_for_nltk_package('nonexistent_package_xyz', 'corpora') # 100μs -> 59.6μs (68.8% faster)

def test_existing_wordnet_corpus():
    # Test with a common corpus
    codeflash_output = check_for_nltk_package('wordnet', 'corpora') # 97.5μs -> 55.7μs (75.2% faster)

def test_existing_stopwords():
    # Test with another common corpus
    codeflash_output = check_for_nltk_package('stopwords', 'corpora') # 96.0μs -> 55.3μs (73.6% faster)

# -------------------
# Edge Test Cases
# -------------------

def test_empty_package_name():
    # Empty package name should not be found
    codeflash_output = check_for_nltk_package('', 'corpora') # 99.5μs -> 57.4μs (73.3% faster)

def test_empty_package_category():
    # Empty category should not be found
    codeflash_output = check_for_nltk_package('punkt', '') # 98.4μs -> 56.2μs (75.2% faster)

def test_empty_both():
    # Both empty should not be found
    codeflash_output = check_for_nltk_package('', '') # 18.1μs -> 19.3μs (5.86% slower)

def test_special_characters_in_name():
    # Special characters in package name should not be found
    codeflash_output = check_for_nltk_package('!@#$%^&*()', 'corpora') # 119μs -> 72.4μs (65.1% faster)

def test_special_characters_in_category():
    # Special characters in category should not be found
    codeflash_output = check_for_nltk_package('punkt', '!!!') # 96.8μs -> 56.3μs (71.9% faster)

def test_case_sensitivity():
    # NLTK is case-sensitive, so wrong case should not be found
    codeflash_output = check_for_nltk_package('PUNKT', 'tokenizers') # 96.5μs -> 55.9μs (72.6% faster)

def test_path_without_nltk_data():
    # Simulate a path without 'nltk_data' at the end
    # Create a temporary directory structure
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create a fake nltk_data/tokenizers/punkt directory
        nltk_data_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
        os.makedirs(nltk_data_dir)
        # Place a dummy file for 'punkt'
        with open(os.path.join(nltk_data_dir, 'punkt'), 'w') as f:
            f.write('dummy')
        # Temporarily override nltk.data.path
        orig_paths = list(nltk.data.path)
        nltk.data.path.insert(0, tmpdir)
        try:
            # Should find the package now
            codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
        finally:
            nltk.data.path = orig_paths

def test_path_with_nltk_data():
    # Simulate a path that already ends with 'nltk_data'
    with tempfile.TemporaryDirectory() as tmpdir:
        nltk_data_dir = os.path.join(tmpdir, 'nltk_data')
        tokenizers_dir = os.path.join(nltk_data_dir, 'tokenizers')
        os.makedirs(tokenizers_dir)
        with open(os.path.join(tokenizers_dir, 'punkt'), 'w') as f:
            f.write('dummy')
        orig_paths = list(nltk.data.path)
        nltk.data.path.insert(0, nltk_data_dir)
        try:
            codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
        finally:
            nltk.data.path = orig_paths

def test_oserror_on_invalid_path(monkeypatch):
    # Simulate an OSError by passing in a path that cannot be accessed
    # We'll monkeypatch nltk.data.path to a directory that doesn't exist
    orig_paths = list(nltk.data.path)
    nltk.data.path.insert(0, '/nonexistent_dir_xyz_123')
    try:
        # Should not raise, but return False
        codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
    finally:
        nltk.data.path = orig_paths

def test_unicode_package_name():
    # Unicode in package name should not be found
    codeflash_output = check_for_nltk_package('punkté', 'tokenizers') # 108μs -> 64.8μs (66.7% faster)

def test_unicode_category_name():
    # Unicode in category name should not be found
    codeflash_output = check_for_nltk_package('punkt', 'tokenizersé') # 102μs -> 59.0μs (73.0% faster)

# -------------------
# Large Scale Test Cases
# -------------------

def test_large_number_of_paths():
    # Simulate a large number of nltk.data.path entries
    orig_paths = list(nltk.data.path)
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create many fake paths, only one contains the package
        fake_paths = []
        for i in range(100):
            fake_dir = os.path.join(tmpdir, f"fake_{i}")
            os.makedirs(fake_dir)
            fake_paths.append(fake_dir)
        # Add the real one at the end
        real_dir = os.path.join(tmpdir, 'real_nltk_data', 'tokenizers')
        os.makedirs(real_dir)
        with open(os.path.join(real_dir, 'punkt'), 'w') as f:
            f.write('dummy')
        nltk.data.path[:] = fake_paths + [os.path.join(tmpdir, 'real_nltk_data')]
        # Should find the package
        codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
        nltk.data.path = orig_paths

def test_large_number_of_missing_packages():
    # Test that all missing packages are not found efficiently
    for i in range(100):
        codeflash_output = check_for_nltk_package(f'nonexistent_pkg_{i}', 'corpora')

def test_large_number_of_categories():
    # Test many different categories, all missing
    for i in range(100):
        codeflash_output = check_for_nltk_package('punkt', f'category_{i}')

def test_many_paths_with_some_invalid():
    # Mix valid and invalid paths
    orig_paths = list(nltk.data.path)
    with tempfile.TemporaryDirectory() as tmpdir:
        valid_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
        os.makedirs(valid_dir)
        with open(os.path.join(valid_dir, 'punkt'), 'w') as f:
            f.write('dummy')
        fake_paths = [f'/nonexistent_{i}' for i in range(50)]
        nltk.data.path[:] = fake_paths + [os.path.join(tmpdir, 'nltk_data')]
        codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
        nltk.data.path = orig_paths

def test_performance_many_checks():
    # Performance: check the same valid package many times
    with tempfile.TemporaryDirectory() as tmpdir:
        nltk_data_dir = os.path.join(tmpdir, 'nltk_data', 'tokenizers')
        os.makedirs(nltk_data_dir)
        with open(os.path.join(nltk_data_dir, 'punkt'), 'w') as f:
            f.write('dummy')
        orig_paths = list(nltk.data.path)
        nltk.data.path.insert(0, os.path.join(tmpdir, 'nltk_data'))
        try:
            for _ in range(100):
                codeflash_output = check_for_nltk_package('punkt', 'tokenizers')
        finally:
            nltk.data.path = orig_paths
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import os

import nltk
# imports
import pytest  # used for our unit tests
from unstructured.nlp.tokenize import check_for_nltk_package

# unit tests

# ----------- BASIC TEST CASES -----------

def test_existing_corpus_package():
    # Test with a commonly available corpus package, e.g., 'punkt'
    # Should return True if 'punkt' is installed
    codeflash_output = check_for_nltk_package('punkt', 'tokenizers'); result = codeflash_output # 110μs -> 66.0μs (68.2% faster)

def test_nonexistent_package_returns_false():
    # Test with a clearly non-existent package
    codeflash_output = check_for_nltk_package('not_a_real_package', 'corpora') # 100μs -> 59.0μs (70.2% faster)

def test_existing_grammar_package():
    # Test with a grammar package that may exist
    codeflash_output = check_for_nltk_package('sample_grammar', 'grammars'); result = codeflash_output # 98.2μs -> 56.2μs (74.8% faster)

def test_existing_corpus_category():
    # Test with a corpus that is often installed by default
    codeflash_output = check_for_nltk_package('words', 'corpora'); result = codeflash_output # 96.9μs -> 55.1μs (75.8% faster)

def test_existing_stemmer_package():
    # Test for a stemmer package
    codeflash_output = check_for_nltk_package('porter.pickle', 'stemmers'); result = codeflash_output # 98.0μs -> 55.3μs (77.2% faster)

# ----------- EDGE TEST CASES -----------

def test_empty_package_name():
    # Test with empty package name
    codeflash_output = check_for_nltk_package('', 'corpora') # 99.0μs -> 57.0μs (73.9% faster)

def test_empty_category_name():
    # Test with empty category name
    codeflash_output = check_for_nltk_package('punkt', '') # 96.7μs -> 54.9μs (76.1% faster)

def test_both_empty():
    # Test with both package and category names empty
    codeflash_output = check_for_nltk_package('', '') # 18.1μs -> 19.4μs (6.87% slower)

def test_package_name_with_special_characters():
    # Test with special characters in package name
    codeflash_output = check_for_nltk_package('!@#', 'corpora') # 101μs -> 58.5μs (73.4% faster)

def test_category_name_with_special_characters():
    # Test with special characters in category name
    codeflash_output = check_for_nltk_package('punkt', '!@#') # 97.8μs -> 55.7μs (75.4% faster)

def test_package_name_with_path_traversal():
    # Test with directory traversal in package name
    codeflash_output = check_for_nltk_package('../punkt', 'tokenizers') # 63.7μs -> 44.7μs (42.5% faster)

def test_category_name_with_path_traversal():
    # Test with directory traversal in category name
    codeflash_output = check_for_nltk_package('punkt', '../tokenizers') # 178μs -> 75.5μs (137% faster)

def test_case_sensitivity():
    # NLTK is case-sensitive: 'Punkt' should not be found if only 'punkt' exists
    codeflash_output = check_for_nltk_package('punkt', 'tokenizers'); result_lower = codeflash_output # 95.6μs -> 54.0μs (77.0% faster)
    codeflash_output = check_for_nltk_package('Punkt', 'tokenizers'); result_upper = codeflash_output # 81.4μs -> 41.5μs (96.2% faster)
    # If lower is True, upper should be False
    if result_lower:
        pass

def test_leading_trailing_spaces():
    # Leading/trailing spaces should not resolve to a valid package
    codeflash_output = check_for_nltk_package(' punkt ', 'tokenizers') # 96.2μs -> 54.0μs (78.2% faster)
    codeflash_output = check_for_nltk_package('punkt', ' tokenizers ') # 82.0μs -> 42.2μs (94.3% faster)

def test_numeric_package_and_category():
    # Numeric names are very unlikely to exist
    codeflash_output = check_for_nltk_package('12345', '67890') # 93.6μs -> 53.1μs (76.4% faster)

def test_package_name_with_unicode():
    # Test with unicode characters in package name
    codeflash_output = check_for_nltk_package('😀', 'corpora') # 110μs -> 66.9μs (64.6% faster)

def test_category_name_with_unicode():
    # Test with unicode characters in category name
    codeflash_output = check_for_nltk_package('punkt', '😀') # 103μs -> 60.1μs (72.3% faster)

def test_package_and_category_with_long_names():
    # Very long names should not exist and should not cause errors
    long_name = 'a' * 255
    codeflash_output = check_for_nltk_package(long_name, long_name) # 127μs -> 79.0μs (61.1% faster)

def test_package_and_category_with_slashes():
    # Slashes in names should not resolve to valid packages
    codeflash_output = check_for_nltk_package('punkt/other', 'tokenizers') # 125μs -> 62.9μs (99.4% faster)
    codeflash_output = check_for_nltk_package('punkt', 'tokenizers/other') # 108μs -> 47.8μs (127% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_number_of_nonexistent_packages():
    # Test performance/scalability with many non-existent packages
    for i in range(100):
        name = f"not_a_real_package_{i}"
        codeflash_output = check_for_nltk_package(name, 'corpora')

def test_large_number_of_nonexistent_categories():
    # Test performance/scalability with many non-existent categories
    for i in range(100):
        cat = f"not_a_real_category_{i}"
        codeflash_output = check_for_nltk_package('punkt', cat)

def test_large_number_of_random_combinations():
    # Test a large number of random package/category combinations
    for i in range(100):
        pkg = f"pkg_{i}"
        cat = f"cat_{i}"
        codeflash_output = check_for_nltk_package(pkg, cat)

def test_large_scale_existing_and_nonexisting():
    # Mix of likely existing and non-existing packages
    likely_existing = ['punkt', 'words', 'stopwords', 'averaged_perceptron_tagger']
    for pkg in likely_existing:
        codeflash_output = check_for_nltk_package(pkg, 'corpora'); result = codeflash_output # 74.8μs -> 34.1μs (119% faster)
    # Now add a batch of non-existing ones
    for i in range(50):
        codeflash_output = check_for_nltk_package(f"noexist_{i}", 'corpora')

def test_large_scale_edge_cases():
    # Edge-like names in large scale
    for i in range(50):
        weird_name = f"../noexist_{i}"
        codeflash_output = check_for_nltk_package(weird_name, 'corpora')
        codeflash_output = check_for_nltk_package('punkt', weird_name)

# ----------- DETERMINISM AND TYPE TESTS -----------

def test_return_type_is_bool():
    # The function should always return a bool, regardless of input
    inputs = [
        ('punkt', 'tokenizers'),
        ('not_a_real_package', 'corpora'),
        ('', ''),
        ('😀', '😀'),
        ('../punkt', 'tokenizers'),
        ('punkt', '../tokenizers'),
    ]
    for pkg, cat in inputs:
        pass

def test_function_is_deterministic():
    # The function should return the same result for the same input
    pkg, cat = 'punkt', 'tokenizers'
    codeflash_output = check_for_nltk_package(pkg, cat); result1 = codeflash_output # 105μs -> 57.4μs (83.5% faster)
    codeflash_output = check_for_nltk_package(pkg, cat); result2 = codeflash_output # 81.0μs -> 41.0μs (97.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes `git checkout
codeflash/optimize-check_for_nltk_package-mcftixl5` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
2025-08-22 00:31:04 +00:00
Saurabh Misra
cc635c97df
⚡️ Speed up function under_non_alpha_ratio by 76% (#4079)
### 📄 76% (0.76x) speedup for ***`under_non_alpha_ratio` in
`unstructured/partition/text_type.py`***

⏱️ Runtime : **`9.53 milliseconds`** **→** **`5.41 milliseconds`** (best
of `91` runs)
### 📝 Explanation and details

Here's an optimized version of your function.
Major improvements:
- Only **one pass** through the text string instead of two list
comprehensions (saves a ton of memory and CPU).
- No lists are constructed, only simple integer counters.
- `char.strip()` was only used to detect non-space characters;
`char.isspace()` checks that directly.

Here's the optimized code with all original comments retained.



This approach processes the string only **once** and uses **O(1)
memory** (just two ints). The use of `char.isspace()` is a fast way to
check for all Unicode whitespace, just as before. This will
significantly speed up your function and eliminate almost all time spent
in the original two list comprehensions.
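
A minimal single-pass sketch consistent with the description and the generated tests below (the PR diff holds the actual code):

```python
def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Return True when the alpha ratio of non-space characters is under threshold."""
    alpha_count = 0
    total_count = 0
    for char in text:          # one pass, two integer counters, no lists
        if char.isspace():
            continue           # whitespace never counts toward the ratio
        total_count += 1
        if char.isalpha():
            alpha_count += 1
    if total_count == 0:
        return False           # empty or whitespace-only input
    return alpha_count / total_count < threshold
```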


 **Correctness verification report:**

| Test                        | Status            |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | ✅ **21 Passed** |
| 🌀 Generated Regression Tests | ✅ **80 Passed** |
| ⏪ Replay Tests | ✅ **594 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 📊 Tests Coverage | 100.0% |
<details>
<summary>⚙️ Existing Unit Tests and Runtime</summary>

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| `partition/test_text_type.py::test_under_non_alpha_ratio_zero_divide` | 1.14μs | 991ns | 15.1% |
| `test_tracer_py__replay_test_0.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 820μs | 412μs | 98.8% |
| `test_tracer_py__replay_test_2.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 5.62ms | 3.21ms | 75.3% |
| `test_tracer_py__replay_test_3.py::test_unstructured_partition_text_type_under_non_alpha_ratio` | 1.95ms | 1.09ms | 79.2% |

</details>

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# -------------------------
# BASIC TEST CASES
# -------------------------

def test_all_alpha_below_threshold():
    # All alphabetic, so ratio is 1.0, which is not < threshold (default 0.5)
    codeflash_output = not under_non_alpha_ratio("HelloWorld") # 2.48μs -> 1.63μs (52.1% faster)

def test_all_alpha_above_threshold():
    # All alphabetic, but threshold is 1.1, so ratio is < threshold
    codeflash_output = under_non_alpha_ratio("HelloWorld", threshold=1.1) # 2.19μs -> 1.47μs (49.1% faster)

def test_all_non_alpha():
    # All non-alpha, so ratio is 0, which is < threshold
    codeflash_output = under_non_alpha_ratio("1234567890!@#$%^&*()_+-=[]{}|;':,.<>/?", threshold=0.5) # 3.38μs -> 2.00μs (68.8% faster)

def test_mixed_alpha_non_alpha_below_threshold():
    # 4 alpha, 6 non-alpha (excluding spaces): ratio = 4/10 = 0.4 < 0.5
    codeflash_output = under_non_alpha_ratio("a1b2c3d4!!", threshold=0.5) # 2.16μs -> 1.44μs (50.7% faster)

def test_mixed_alpha_non_alpha_above_threshold():
    # 6 alpha, 2 non-alpha: ratio = 6/8 = 0.75 > 0.5, so not under threshold
    codeflash_output = not under_non_alpha_ratio("abCD12ef", threshold=0.5) # 2.04μs -> 1.43μs (42.9% faster)

def test_spaces_are_ignored():
    # Only 'a', 'b', 'c', '1', '2', '3' are counted (spaces ignored)
    # 3 alpha, 3 non-alpha: ratio = 3/6 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a b c 1 2 3", threshold=0.5) # 2.16μs -> 1.39μs (55.3% faster)
    # If threshold is 0.6, ratio 0.5 < 0.6, so True
    codeflash_output = under_non_alpha_ratio("a b c 1 2 3", threshold=0.6) # 1.25μs -> 705ns (77.7% faster)

def test_threshold_edge_case_exact():
    # 2 alpha, 2 non-alpha: ratio = 2/4 = 0.5, not < threshold
    codeflash_output = not under_non_alpha_ratio("a1b2", threshold=0.5) # 1.76μs -> 1.26μs (39.2% faster)
    # If threshold is 0.51, ratio 0.5 < 0.51, so True
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.51) # 781ns -> 541ns (44.4% faster)

# -------------------------
# EDGE TEST CASES
# -------------------------

def test_empty_string():
    # Empty string should always return False
    codeflash_output = under_non_alpha_ratio("") # 450ns -> 379ns (18.7% faster)

def test_only_spaces():
    # Only spaces, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.24μs -> 822ns (50.9% faster)

def test_only_newlines_and_tabs():
    # Only whitespace, so total_count == 0, should return False
    codeflash_output = under_non_alpha_ratio("\n\t  \t") # 1.16μs -> 745ns (55.7% faster)

def test_only_one_alpha():
    # Single alpha, total_count == 1, ratio == 1.0
    codeflash_output = not under_non_alpha_ratio("A") # 1.43μs -> 1.06μs (35.5% faster)
    codeflash_output = under_non_alpha_ratio("A", threshold=1.1) # 689ns -> 589ns (17.0% faster)

def test_only_one_non_alpha():
    # Single non-alpha, total_count == 1, ratio == 0.0
    codeflash_output = under_non_alpha_ratio("1") # 1.27μs -> 937ns (35.4% faster)
    codeflash_output = under_non_alpha_ratio("!", threshold=0.1) # 775ns -> 550ns (40.9% faster)

def test_unicode_alpha_and_non_alpha():
    # Unicode alpha: 'é', 'ü', 'ß' are isalpha()
    # Unicode non-alpha: '1', '!', '。'
    # 3 alpha, 3 non-alpha, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio("éüß1!。", threshold=0.5) # 3.03μs -> 2.21μs (37.3% faster)
    codeflash_output = under_non_alpha_ratio("éüß1!。", threshold=0.6) # 1.11μs -> 789ns (40.2% faster)

def test_mixed_with_whitespace():
    # Alpha: a, b, c; Non-alpha: 1, 2, 3; Spaces ignored
    codeflash_output = under_non_alpha_ratio(" a 1 b 2 c 3 ", threshold=0.6) # 2.46μs -> 1.60μs (54.0% faster)

def test_threshold_zero():
    # Any non-zero alpha ratio is not < 0, so always False unless all non-alpha
    codeflash_output = not under_non_alpha_ratio("abc123", threshold=0.0) # 2.19μs -> 1.49μs (46.5% faster)
    # All non-alpha: ratio = 0, not < 0, so False
    codeflash_output = not under_non_alpha_ratio("123", threshold=0.0) # 859ns -> 594ns (44.6% faster)

def test_threshold_one():
    # Any ratio < 1.0 should return True if not all alpha
    codeflash_output = under_non_alpha_ratio("abc123", threshold=1.0) # 1.94μs -> 1.24μs (56.4% faster)
    # All alpha: ratio = 1.0, not < 1.0, so False
    codeflash_output = not under_non_alpha_ratio("abcdef", threshold=1.0) # 1.17μs -> 602ns (93.7% faster)

def test_leading_trailing_whitespace():
    # Whitespace should be ignored
    codeflash_output = under_non_alpha_ratio("   a1b2c3   ", threshold=0.6) # 2.32μs -> 1.44μs (61.3% faster)

def test_only_symbols():
    # Only symbols, ratio = 0, so < threshold
    codeflash_output = under_non_alpha_ratio("!@#$%^&*", threshold=0.5) # 1.89μs -> 1.15μs (65.0% faster)

def test_long_string_all_spaces_and_newlines():
    # All whitespace, should return False
    codeflash_output = under_non_alpha_ratio(" \n " * 100) # 11.4μs -> 3.65μs (213% faster)

def test_single_space():
    # Single space, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.04μs -> 741ns (39.8% faster)

def test_non_ascii_non_alpha():
    # Non-ASCII, non-alpha (emoji)
    codeflash_output = under_non_alpha_ratio("😀😀😀", threshold=0.5) # 2.42μs -> 1.79μs (34.9% faster)

def test_mixed_emojis_and_alpha():
    # 2 alpha, 2 emoji: ratio = 2/4 = 0.5
    codeflash_output = not under_non_alpha_ratio("a😀b😀", threshold=0.5) # 2.24μs -> 1.50μs (49.3% faster)
    codeflash_output = under_non_alpha_ratio("a😀b😀", threshold=0.6) # 948ns -> 658ns (44.1% faster)

# -------------------------
# LARGE SCALE TEST CASES
# -------------------------

def test_large_all_alpha():
    # 1000 alpha, ratio = 1.0
    s = "a" * 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 47.9μs -> 33.4μs (43.4% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha, ratio = 0.0
    s = "1" * 1000
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 40.9μs -> 24.9μs (64.3% faster)

def test_large_mixed_half_and_half():
    # 500 alpha, 500 non-alpha, ratio = 0.5
    s = "a" * 500 + "1" * 500
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 45.2μs -> 28.4μs (59.0% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 43.3μs -> 27.8μs (55.9% faster)

def test_large_with_spaces_ignored():
    # 400 alpha, 400 non-alpha, 200 spaces (should be ignored)
    s = "a" * 400 + "1" * 400 + " " * 200
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 43.5μs -> 24.5μs (77.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.6) # 41.8μs -> 23.8μs (75.6% faster)

def test_large_unicode_mixed():
    # 300 unicode alpha, 300 unicode non-alpha, 400 ascii alpha
    s = "é" * 300 + "😀" * 300 + "a" * 400
    # alpha: 300 (é) + 400 (a) = 700, non-alpha: 300 (😀), total = 1000
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.8) # 65.3μs -> 36.8μs (77.6% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.71) # 59.0μs -> 34.8μs (69.8% faster)
    # ratio = 700/1000 = 0.7

def test_large_threshold_zero_one():
    # All alpha, threshold=0.0, should be False
    s = "b" * 999
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.0) # 46.2μs -> 33.1μs (39.4% faster)
    # All non-alpha, threshold=1.0, should be True
    s = "!" * 999
    codeflash_output = under_non_alpha_ratio(s, threshold=1.0) # 40.4μs -> 24.3μs (66.0% faster)

def test_large_string_with_whitespace_only():
    # 1000 spaces, should return False
    s = " " * 1000
    codeflash_output = under_non_alpha_ratio(s) # 33.7μs -> 10.0μs (236% faster)

def test_large_string_with_mixed_whitespace_and_chars():
    # 333 alpha, 333 non-alpha, 334 whitespace (ignored)
    s = "a" * 333 + "1" * 333 + " " * 334
    # total_count = 666, alpha = 333, ratio = 0.5
    codeflash_output = not under_non_alpha_ratio(s, threshold=0.5) # 41.2μs -> 21.8μs (89.4% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.51) # 40.1μs -> 21.0μs (90.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

# imports
import pytest  # used for our unit tests
from unstructured.partition.text_type import under_non_alpha_ratio

# unit tests

# --- Basic Test Cases ---

def test_all_alpha_default_threshold():
    # All alphabetic, should be False (ratio = 1.0, not under 0.5)
    codeflash_output = under_non_alpha_ratio("HelloWorld") # 3.09μs -> 2.04μs (51.3% faster)

def test_all_non_alpha_default_threshold():
    # All non-alpha (punctuation), should be True (ratio = 0.0)
    codeflash_output = under_non_alpha_ratio("!!!???---") # 2.02μs -> 1.24μs (63.3% faster)

def test_mixed_alpha_non_alpha_default_threshold():
    # 5 alpha, 5 non-alpha, ratio = 0.5, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("abc12!@#de") # 2.20μs -> 1.37μs (61.1% faster)

def test_mixed_alpha_non_alpha_just_under_threshold():
    # 2 alpha, 3 non-alpha, ratio = 0.4, should be True (under threshold)
    codeflash_output = under_non_alpha_ratio("a1!b2") # 1.67μs -> 1.19μs (40.9% faster)

def test_spaces_are_ignored():
    # Spaces should not count toward total_count
    # 3 alpha, 2 non-alpha, 2 spaces; ratio = 3/5 = 0.6, should be False
    codeflash_output = under_non_alpha_ratio("a b! c?") # 1.88μs -> 1.19μs (57.4% faster)

def test_threshold_parameter():
    # 2 alpha, 3 non-alpha, total=5, ratio=0.4
    # threshold=0.3 -> False (not under), threshold=0.5 -> True (under)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.3) # 1.71μs -> 1.29μs (32.4% faster)
    codeflash_output = under_non_alpha_ratio("a1!b2", threshold=0.5) # 900ns -> 536ns (67.9% faster)

# --- Edge Test Cases ---

def test_empty_string():
    # Empty string should return False
    codeflash_output = under_non_alpha_ratio("") # 425ns -> 367ns (15.8% faster)

def test_only_spaces():
    # Only spaces, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio("     ") # 1.16μs -> 776ns (49.9% faster)

def test_only_alpha_with_spaces():
    # Only alpha and spaces, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("a b c d e") # 1.79μs -> 1.24μs (44.3% faster)

def test_only_non_alpha_with_spaces():
    # Only non-alpha and spaces, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("! @ # $ %") # 1.73μs -> 1.08μs (59.6% faster)

def test_single_alpha():
    # Single alpha, ratio=1.0, should return False
    codeflash_output = under_non_alpha_ratio("A") # 1.38μs -> 1.03μs (33.9% faster)

def test_single_non_alpha():
    # Single non-alpha, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("?") # 1.23μs -> 978ns (26.1% faster)

def test_single_space():
    # Single space, total_count=0, should return False
    codeflash_output = under_non_alpha_ratio(" ") # 1.00μs -> 729ns (37.7% faster)

def test_all_digits():
    # All digits, ratio=0.0, should return True
    codeflash_output = under_non_alpha_ratio("1234567890") # 2.00μs -> 1.21μs (65.4% faster)

def test_unicode_alpha():
    # Unicode alphabetic characters (e.g. accented letters)
    # 3 alpha, 2 non-alpha, ratio=0.6, should be False
    codeflash_output = under_non_alpha_ratio("éàü!!") # 2.19μs -> 1.58μs (39.3% faster)

def test_unicode_non_alpha():
    # Unicode non-alpha (emoji, symbols)
    # 2 non-alpha, 2 alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("a😀b!") # 2.46μs -> 1.69μs (45.9% faster)

def test_threshold_1_0():
    # threshold=1.0, any string with <100% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 < 1.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=1.0) # 1.73μs -> 1.36μs (27.4% faster)

def test_threshold_0_0():
    # threshold=0.0, only strings with 0% alpha should return True
    # 2 alpha, 2 non-alpha, ratio=0.5 > 0.0
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.0) # 1.73μs -> 1.29μs (33.7% faster)
    # All non-alpha, ratio=0.0 == 0.0, should be False (not under threshold)
    codeflash_output = under_non_alpha_ratio("1234", threshold=0.0) # 883ns -> 653ns (35.2% faster)

def test_threshold_exactly_equal():
    # Ratio equals threshold: should return False (not under threshold)
    # 2 alpha, 2 non-alpha, ratio=0.5 == threshold
    codeflash_output = under_non_alpha_ratio("a1b2", threshold=0.5) # 1.60μs -> 1.20μs (33.1% faster)

def test_tabs_and_newlines_ignored():
    # Tabs and newlines are whitespace, so ignored
    # 2 alpha, 1 non-alpha, 2 whitespace (ignored); ratio = 2/3 ≈ 0.667, should be False
    codeflash_output = under_non_alpha_ratio("a\tb\n!") # 1.80μs -> 1.16μs (55.6% faster)

def test_long_repeated_pattern():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = "a1" * 500
    codeflash_output = under_non_alpha_ratio(s) # 45.3μs -> 29.1μs (55.5% faster)
    # 499 alpha, 501 non-alpha, ratio=499/1000=0.499, should be True
    s2 = "a1" * 499 + "1!"
    codeflash_output = under_non_alpha_ratio(s2) # 43.8μs -> 28.0μs (56.3% faster)

# --- Large Scale Test Cases ---

def test_large_all_alpha():
    # 1000 alphabetic characters, ratio=1.0, should be False
    s = "a" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 46.5μs -> 32.9μs (41.2% faster)

def test_large_all_non_alpha():
    # 1000 non-alpha characters, ratio=0.0, should be True
    s = "!" * 1000
    codeflash_output = under_non_alpha_ratio(s) # 41.5μs -> 24.6μs (68.4% faster)

def test_large_half_alpha_half_non_alpha():
    # 500 alpha, 500 non-alpha, ratio=0.5, should be False
    s = ("a!" * 500)
    codeflash_output = under_non_alpha_ratio(s) # 44.6μs -> 28.6μs (56.1% faster)

def test_large_sparse_alpha():
    # 10 alpha, 990 non-alpha, ratio=0.01, should be True
    s = "a" + "!" * 99
    s = s * 10  # 10 alpha, 990 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 41.4μs -> 25.0μs (65.6% faster)

def test_large_sparse_non_alpha():
    # 990 alpha, 10 non-alpha, ratio=0.99, should be False
    s = "a" * 99 + "!"  # 99 alpha, 1 non-alpha
    s = s * 10  # 990 alpha, 10 non-alpha
    codeflash_output = under_non_alpha_ratio(s) # 46.2μs -> 33.0μs (39.9% faster)

def test_large_with_spaces():
    # 500 alpha, 500 non-alpha, 100 spaces (should be ignored)
    s = ("a!" * 500) + (" " * 100)
    codeflash_output = under_non_alpha_ratio(s) # 47.7μs -> 29.7μs (60.6% faster)

def test_large_thresholds():
    # 600 alpha, 400 non-alpha, ratio=0.6
    s = "a" * 600 + "!" * 400
    codeflash_output = under_non_alpha_ratio(s, threshold=0.5) # 44.5μs -> 29.4μs (51.1% faster)
    codeflash_output = under_non_alpha_ratio(s, threshold=0.7) # 42.6μs -> 28.5μs (49.4% faster)

# --- Additional Robustness Tests ---

def test_mixed_case_and_symbols():
    # Mixed uppercase, lowercase, digits, symbols
    # 3 alpha, 3 non-alpha, ratio=0.5, should be False
    codeflash_output = under_non_alpha_ratio("A1b2C!") # 1.79μs -> 1.17μs (53.7% faster)

def test_realistic_sentence():
    # Realistic sentence, mostly alpha, some punctuation
    # 24 alpha, 2 non-alpha (comma, period); ratio = 24/26 ≈ 0.923, should be False
    codeflash_output = under_non_alpha_ratio("Hello, this is a test sentence.") # 3.39μs -> 1.94μs (74.3% faster)

def test_realistic_break_line():
    # Typical break line, mostly non-alpha
    # 5 alpha, 8 non-alpha; ratio = 5/13 ≈ 0.385, should be True
    codeflash_output = under_non_alpha_ratio("----BREAK----") # 2.08μs -> 1.32μs (57.9% faster)

def test_space_heavy_string():
    # Spaces should be ignored, only non-space chars count
    # 2 alpha, 2 non-alpha, 10 spaces, ratio=2/4=0.5, should be False
    codeflash_output = under_non_alpha_ratio(" a ! b ?          ") # 2.25μs -> 1.29μs (74.3% faster)

def test_only_whitespace_variety():
    # Only tabs, spaces, newlines, should return False
    codeflash_output = under_non_alpha_ratio(" \t\n\r") # 1.08μs -> 714ns (51.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>


To edit these changes `git checkout
codeflash/optimize-under_non_alpha_ratio-mcgm6dor` and push.


[![Codeflash](https://img.shields.io/badge/Optimized%20with-Codeflash-yellow?style=flat&color=%23ffc428&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iNDgwIiBoZWlnaHQ9ImF1dG8iIHZpZXdCb3g9IjAgMCA0ODAgMjgwIiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgo8cGF0aCBmaWxsLXJ1bGU9ImV2ZW5vZGQiIGNsaXAtcnVsZT0iZXZlbm9kZCIgZD0iTTI4Ni43IDAuMzc4NDE4SDIwMS43NTFMNTAuOTAxIDE0OC45MTFIMTM1Ljg1MUwwLjk2MDkzOCAyODEuOTk5SDk1LjQzNTJMMjgyLjMyNCA4OS45NjE2SDE5Ni4zNDVMMjg2LjcgMC4zNzg0MThaIiBmaWxsPSIjRkZDMDQzIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzExLjYwNyAwLjM3ODkwNkwyNTguNTc4IDU0Ljk1MjZIMzc5LjU2N0w0MzIuMzM5IDAuMzc4OTA2SDMxMS42MDdaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMzA5LjU0NyA4OS45NjAxTDI1Ni41MTggMTQ0LjI3NkgzNzcuNTA2TDQzMC4wMjEgODkuNzAyNkgzMDkuNTQ3Vjg5Ljk2MDFaIiBmaWxsPSIjMEIwQTBBIi8+CjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJNMjQyLjg3MyAxNjQuNjZMMTg5Ljg0NCAyMTkuMjM0SDMxMC44MzNMMzYzLjM0NyAxNjQuNjZIMjQyLjg3M1oiIGZpbGw9IiMwQjBBMEEiLz4KPC9zdmc+Cg==)](https://codeflash.ai)

---------

Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
2025-08-21 22:43:58 +00:00
Yao You
76d7a5c3d0
Chore/change lang detection logging level to avoid warning log spamming (#4078)
This PR changes the log line for defaulting short text to English to
debug level.
- this log does not indicate a failure or an exception being handled; it is expected behavior
- short text is common, so the original code could emit a flood of warnings, spamming the log and potentially causing users to miss other important warning-level messages
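
Illustratively, the change amounts to the following (logger name and message text are placeholders, not the actual code):

```python
import logging

logger = logging.getLogger("unstructured")

# Before: logger.warning("Short text; defaulting detected language to English.")
# After: the same event at debug level, so it no longer floods production logs.
logger.debug("Short text; defaulting detected language to English.")
```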
2025-08-20 15:34:16 +00:00
David Potter
0d20f6a9b1
email date format flexibility (#4072)
We are seeing some .eml files come through the VLM partitioner, which I
believe then downgrades to hi-res.

For some reason they have a date format that is not the standard email
format, but it is still legitimate.

This uses a more robust date package (already installed as a dependency)
to parse the date.
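
A sketch of the idea, assuming the robust package is `python-dateutil` (the PR does not name it, and the date string below is illustrative):

```python
from dateutil import parser

# email.utils.parsedate_to_datetime() rejects some nonstandard-but-legitimate
# formats; dateutil parses a much wider range of date strings.
nonstandard_date = "2025-07-01 09:30:00 -0500"
print(parser.parse(nonstandard_date).isoformat())
```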

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: potter-potter <potter-potter@users.noreply.github.com>
0.18.13
2025-08-13 18:55:24 +00:00
Nick Franck
b8c14a7a4f
fix: replace UnicodeDecodeError to prevent large payload logging (#4071)
Replace UnicodeDecodeError with UnprocessableEntityError in encoding
detection to avoid logging entire file contents.
UnicodeDecodeError.object automatically stores complete input data,
causing memory issues with large files in logging and error reporting
systems.
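
A minimal sketch of the pattern (the helper and the locally defined error class are stand-ins for unstructured's actual code):

```python
class UnprocessableEntityError(Exception):
    """Stand-in for the library's error type named above."""


def decode_or_raise(data: bytes, encoding: str) -> str:
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as e:
        # UnicodeDecodeError.object retains the entire input, so logging the
        # exception would log the whole file; re-raise with safe details only.
        raise UnprocessableEntityError(
            f"Failed to decode {len(data)} bytes as {encoding}: {e.reason}"
        ) from None
```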
0.18.12
2025-07-25 21:01:37 +00:00
Maksymilian Operlejn
591729c0b8
bump version and release (#4070) 0.18.11 2025-07-23 11:49:33 +00:00
qued
d83df422a6
chore: switch to charset normalizer (#4060)
Closes
[SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible).

Removes `chardet` as a dependency, standardizing on
`charset-normalizer`.
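
For reference, a minimal usage sketch of the replacement library (the sample bytes are illustrative):

```python
from charset_normalizer import from_bytes

raw = "déjà vu, déjà lu, déjà vécu".encode("latin-1")
best = from_bytes(raw).best()  # most plausible match, or None
if best is not None:
    print(best.encoding)  # detected encoding label
    print(str(best))      # bytes decoded with that encoding
# Note: very short inputs can be misdetected, as the fixture change below shows.
```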

This involved:
- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test... `charset-normalizer` misdiagnosed the encoding of
a file used as a test fixture. My guess is that the ~10 characters in
the file were not enough for `charset-normalizer` to do a proper
inference, so I re-encoded another slightly longer file that's also used
for encoding testing, and it got that one.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the
expected markdown results (this was a task I missed when adding the
markdown ingest tests)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>
2025-07-22 19:02:40 +00:00
Maksymilian Operlejn
536819719c
feat: map <input> tags by type + add coverage (#4068)
Implements type-aware classification of `<input>` elements in
`extract_tag_and_ontology_class_from_tag` (checkbox → `Checkbox`, radio
→ `RadioButton`, else → `FormFieldValue`) and updates/extends the
HTML-to-ontology test suite to validate the new behaviour.
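
A reduced sketch of the mapping (the real function handles many other tags, and the class names come from the ontology module):

```python
def classify_input_tag(type_attr: str | None) -> str:
    """Map an <input> element's type attribute to an ontology class name."""
    input_type = (type_attr or "").lower()
    if input_type == "checkbox":
        return "Checkbox"
    if input_type == "radio":
        return "RadioButton"
    return "FormFieldValue"


assert classify_input_tag("checkbox") == "Checkbox"
assert classify_input_tag("radio") == "RadioButton"
assert classify_input_tag("text") == "FormFieldValue"
```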
2025-07-22 16:13:07 +00:00
jiajun-unstructured
d24dec5e04
add '|' as a delimiter in csv files (#4059)
This PR fixes the error “Failure to process CSV: Expected 2 fields in
line 2, saw 4” seen when '|' is used as the delimiter in a CSV file.
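
A sketch of the effect using the stdlib sniffer; the candidate delimiter set here is illustrative, not necessarily the one in the PR:

```python
import csv
import io

sample = "name|team|score\nada|blue|3\n"
# With "|" among the candidate delimiters, the sniffer resolves the dialect
# instead of the reader failing with a field-count mismatch.
dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
print(list(csv.reader(io.StringIO(sample), dialect)))
```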
2025-07-18 17:56:24 +00:00
Nick Franck
a040483a7e
Add OCR_AGENT_CACHE_SIZE environment variable (#4066)
## Problem
OCR agents used unlimited caching, causing excessive memory usage. Each
cached OCR agent consumes a different amount of memory, but a single
agent can easily consume ~800MB.

## Solution
Add `OCR_AGENT_CACHE_SIZE` environment variable to limit cached OCR
agents per process.

- **Default**: 1 cached agent
- **Configurable**: Set to 0 to disable caching, or higher for more
languages
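
A hypothetical sketch of the mechanism (names assumed; `lru_cache(maxsize=0)` naturally matches the "set to 0 to disable" behavior):

```python
import os
from functools import lru_cache

OCR_AGENT_CACHE_SIZE = int(os.environ.get("OCR_AGENT_CACHE_SIZE", "1"))


@lru_cache(maxsize=OCR_AGENT_CACHE_SIZE)
def get_ocr_agent(language: str):
    """Build (or reuse) an OCR agent; each can hold on the order of ~800MB."""
    return object()  # stand-in for real agent construction
```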
0.18.10
2025-07-18 15:47:55 +00:00
qued
869ef457fe
build: Update CodeQL GHA to v3 (#4065)
We were using CodeQL v2, which has been [deprecated since
January](https://github.blog/changelog/2025-01-10-code-scanning-codeql-action-v2-is-now-deprecated/).
2025-07-17 21:07:22 +00:00
Yao You
909716f310
feat: keep input tag's class attr in table (#4064)
This change affects partition html.

Previously, when there was a table in the HTML, we cleaned all tags
inside the table of their class and id attributes, except for the class
attribute of `img` tags. This change also preserves the class attribute
of `input` tags inside a table, which is reflected in a table element's
metadata.text_as_html attribute.
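
A reduced sketch (with assumed names) of the cleaning rule after this change, covering both preserved tags: `input` from this PR and `img` from the earlier #4050, which appears further down this log:

```python
from lxml import etree


def clean_table_attributes(table: etree._Element) -> None:
    """Strip class/id inside a table, keeping class on img and input tags."""
    for el in table.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        el.attrib.pop("id", None)
        if el.tag not in ("img", "input"):
            el.attrib.pop("class", None)
```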
0.18.9
2025-07-16 21:46:58 +00:00
shreyanid
446826885b
fix: add empty string case for language metadata (#4062)
Add an empty string edge case for when the element text field is None or
not a string.

most of the diff is `make tidy`
2025-07-16 21:35:00 +00:00
qued
c7c3e3c082
feat: convert elements to markdown (#4055)
Creates a staging function `elements_to_md` to convert lists of
`Elements` to markdown strings (or a markdown file). Includes unit tests
as well as ingest tests and expected output fixtures.
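
A usage sketch under assumptions: the log gives only the function name and that it returns a string (or writes a file), so the import path here is a guess:

```python
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_md  # import path assumed

elements = partition("example-docs/fake-text.txt")
md = elements_to_md(elements)  # markdown string for the element list
print(md[:100])
```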
2025-07-16 14:34:29 +00:00
Filip Knefel
f66562b1cb
fix: properly handle password protected xlsx (#4057)
### Issue
Attempting to partition a password-protected XLSX file results in an
obscure exception
> Can't find workbook in OLE2 compound document

### Solution
Utilize the [msoffcrypto-tool](https://pypi.org/project/msoffcrypto-tool/)
package (MIT License) to load the XLSX file and check whether it is
encrypted; if so, raise an `UnprocessableEntityError` exception
detailing the reason for rejecting the file.
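
A sketch of the described check (the exception type is named above; the helper and its handling details are assumptions):

```python
import msoffcrypto


class UnprocessableEntityError(Exception):
    """Stand-in for unstructured's error type."""


def raise_if_encrypted(path: str) -> None:
    with open(path, "rb") as f:
        try:
            encrypted = msoffcrypto.OfficeFile(f).is_encrypted()
        except Exception:
            return  # not an OLE container, so not an encrypted Office file
    if encrypted:
        raise UnprocessableEntityError(
            f"{path} is password protected and cannot be partitioned."
        )
```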

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2025-07-16 13:19:14 +00:00
shreyanid
344202fa6d
feat: detect language for PDFs (#4051)
The `@apply_metadata` decorator already contains logic to detect the
language of the element text (at either the document or element level).
Update PDFs, and later images, to use this decorator so that accurate
element-language results are output.

Test
```
from unstructured.partition.auto import partition

def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    elements = partition(pdf_path)  # optionally set `detect_language_per_element=True`
    print(f"Number of elements partitioned: {len(elements)}")

    # Check if elements are returned
    assert len(elements) > 0, "No elements were partitioned from the PDF."

    # check language outputted for each element
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")

test_partition_pdf()
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
0.18.7
2025-07-15 18:53:28 +00:00
Ahmet Melek
2ffaf6f323
fix: type for serialized TableChunks (#4056)
#### To test, simply serialize a TableChunk element with and without the
changes in the PR

____
**Without the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x110113410>

In [3]: TableChunk("hi").to_dict()
Out[3]: 
{'type': 'Table',
 'element_id': '6267e99a-46d8-4f2d-a206-51c691469c72',
 'text': 'hi',
 'metadata': {}}
```

____
**With the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x10367f050>

In [3]: TableChunk("hi").to_dict()
Out[3]: 
{'type': 'TableChunk',
 'element_id': 'f91af3ac-0dea-4dc4-8a6a-69c28cfcca3b',
 'text': 'hi',
 'metadata': {}}
```
____
0.18.6
2025-07-15 17:29:02 +00:00
mateuszkuprowski
37800c3523
feat: added new exception type to epub conversions (#4052)
Added UnprocessableEpubError to better handle the case where an incoming
epub file is damaged, which makes the pandoc library crash with exit
code 64.
2025-07-15 10:56:22 +00:00
Yao You
73d239fb28
feat: keep img tag's class attr (#4050)
This change affects partition html.

Previously, when there was a table in the HTML, we cleaned all tags
inside the table of their `class` and `id` attributes. However, images
(`img` tags) are sometimes present in a table, and their `class`
attribute can identify important information about the image. This
change preserves the `class` attribute of `img` tags inside a table,
which is reflected in a table element's `metadata.text_as_html`
attribute.
0.18.5
2025-07-10 20:46:28 +00:00
qued
7764fb6fd4
build: drop remaining Python 3.9 refs (#4049)
Dropped variables that said we support Python 3.9 in `setup.py`, as well
as any remaining references to Python 3.9.

I also checked the pins and removed several that don't seem necessary
any more.
2025-07-10 16:43:15 +00:00
jiajun-unstructured
92965fb286
add fenced-code extension to the md parser (#4044)
https://github.com/Unstructured-IO/unstructured/issues/3578
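
Assuming the md parser is Python-Markdown, the fix corresponds to enabling its `fenced_code` extension, roughly:

```python
import markdown

fenced = "~~~python\nprint('hi')\n~~~"  # tilde fence avoids nesting backticks here
html = markdown.markdown(fenced, extensions=["fenced_code"])
print(html)  # renders a <pre><code> block instead of a literal paragraph
```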

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
2025-07-07 21:05:54 +00:00
Filip Knefel
f078cd923b
fix(partition, csv): increase csv field limit (#4046)
Increase the csv field limit to support partitioning of files with large
data in fields.
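
The stdlib knob involved is `csv.field_size_limit`; a sketch follows (the actual limit chosen in the PR may differ):

```python
import csv

# The default field limit is 128 KB; raising it lets rows with very large
# cells parse instead of raising "_csv.Error: field larger than field limit".
csv.field_size_limit(2**31 - 1)  # fits a C long on all platforms
```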

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
0.18.4
2025-07-07 14:12:53 +00:00
Austin Walker
8a9abddb16
chore: bump pillow to address a CVE (#4045) 0.18.3 2025-07-05 18:33:15 +00:00
Yao You
d7dfda9ecb
bump version to make a release (#4042) 0.18.2 2025-07-01 23:06:02 +00:00