Massive refactor from pdelfin to olmocr

This commit is contained in:
Jake Poznanski 2025-01-27 18:30:41 +00:00
parent 7261bfc0b9
commit b2894d0280
94 changed files with 184 additions and 184 deletions

View File

@ -6,10 +6,10 @@ Thanks for considering contributing! Please read this document to learn the vari
### Did you find a bug? ### Did you find a bug?
First, do [a quick search](https://github.com/allenai/pdelfin/issues) to see whether your issue has already been reported. First, do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your issue has already been reported.
If your issue has already been reported, please comment on the existing issue. If your issue has already been reported, please comment on the existing issue.
Otherwise, open [a new GitHub issue](https://github.com/allenai/pdelfin/issues). Be sure to include a clear title Otherwise, open [a new GitHub issue](https://github.com/allenai/olmocr/issues). Be sure to include a clear title
and description. The description should include as much relevant information as possible. The description should and description. The description should include as much relevant information as possible. The description should
explain how to reproduce the erroneous behavior as well as the behavior you expect to see. Ideally you would include a explain how to reproduce the erroneous behavior as well as the behavior you expect to see. Ideally you would include a
code sample or an executable test case demonstrating the expected behavior. code sample or an executable test case demonstrating the expected behavior.
@ -21,7 +21,7 @@ We use GitHub issues to track feature requests. Before you create a feature requ
* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing * Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
it first on a GitHub issue. it first on a GitHub issue.
* Check the documentation to make sure your feature does not already exist. * Check the documentation to make sure your feature does not already exist.
* Do [a quick search](https://github.com/allenai/pdelfin/issues) to see whether your feature has already been suggested. * Do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your feature has already been suggested.
When creating your request, please: When creating your request, please:
@ -41,31 +41,31 @@ When you're ready to contribute code to address an open issue, please follow the
Then clone your fork locally with Then clone your fork locally with
git clone https://github.com/USERNAME/pdelfin.git git clone https://github.com/USERNAME/olmocr.git
or or
git clone git@github.com:USERNAME/pdelfin.git git clone git@github.com:USERNAME/olmocr.git
At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/pdelfin.git, but doesn't know anything about the *main* repo, [https://github.com/allenai/pdelfin.git](https://github.com/allenai/pdelfin). You can see this by running At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/olmocr.git, but doesn't know anything about the *main* repo, [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). You can see this by running
git remote -v git remote -v
which will output something like this: which will output something like this:
origin https://github.com/USERNAME/pdelfin.git (fetch) origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/pdelfin.git (push) origin https://github.com/USERNAME/olmocr.git (push)
This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/pdelfin.git](https://github.com/allenai/pdelfin). To do this, run the following: This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). To do this, run the following:
git remote add upstream https://github.com/allenai/pdelfin.git git remote add upstream https://github.com/allenai/olmocr.git
Now if you do `git remote -v` again, you'll see Now if you do `git remote -v` again, you'll see
origin https://github.com/USERNAME/pdelfin.git (fetch) origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/pdelfin.git (push) origin https://github.com/USERNAME/olmocr.git (push)
upstream https://github.com/allenai/pdelfin.git (fetch) upstream https://github.com/allenai/olmocr.git (fetch)
upstream https://github.com/allenai/pdelfin.git (push) upstream https://github.com/allenai/olmocr.git (push)
Finally, you'll need to create a Python 3 virtual environment suitable for working on this project. There are a number of tools out there that make working with virtual environments easier. Finally, you'll need to create a Python 3 virtual environment suitable for working on this project. There are a number of tools out there that make working with virtual environments easier.
The most direct way is with the [`venv` module](https://docs.python.org/3.7/library/venv.html) in the standard library, but if you're new to Python or you don't already have a recent Python 3 version installed on your machine, The most direct way is with the [`venv` module](https://docs.python.org/3.7/library/venv.html) in the standard library, but if you're new to Python or you don't already have a recent Python 3 version installed on your machine,
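For reference, a minimal sketch of the `venv` route, assuming a recent system `python3`:

```bash
# Create an isolated environment in .venv and activate it
python3 -m venv .venv
source .venv/bin/activate
```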
@ -77,8 +77,8 @@ When you're ready to contribute code to address an open issue, please follow the
Then you can create and activate a new Python environment by running: Then you can create and activate a new Python environment by running:
conda create -n pdelfin python=3.9 conda create -n olmocr python=3.9
conda activate pdelfin conda activate olmocr
Once your virtual environment is activated, you can install your local clone in "editable mode" with Once your virtual environment is activated, you can install your local clone in "editable mode" with
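A minimal sketch of that editable install, assuming the `dev` extra declared in `pyproject.toml`:

```bash
# Install the local clone in editable mode with development dependencies
pip install -e '.[dev]'
```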
@ -93,7 +93,7 @@ When you're ready to contribute code to address an open issue, please follow the
<details><summary>Expand details 👇</summary><br/> <details><summary>Expand details 👇</summary><br/>
Once you've added an "upstream" remote pointing to [https://github.com/allenai/pdelfin.git](https://github.com/allenai/pdelfin), keeping your fork up-to-date is easy: Once you've added an "upstream" remote pointing to [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr), keeping your fork up-to-date is easy:
git checkout main # if not already on main git checkout main # if not already on main
git pull --rebase upstream main git pull --rebase upstream main
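Afterwards, a sketch of pushing the rebased branch back to your fork (assuming `origin` points at your fork, as configured above):

```bash
# Update the fork's main branch on GitHub
git push origin main
```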
@ -119,7 +119,7 @@ When you're ready to contribute code to address an open issue, please follow the
<details><summary>Expand details 👇</summary><br/> <details><summary>Expand details 👇</summary><br/>
Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/pdelfin/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us. Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/olmocr/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us.
First, you should run [`isort`](https://github.com/PyCQA/isort) and [`black`](https://github.com/psf/black) to make sure your code is formatted consistently. First, you should run [`isort`](https://github.com/PyCQA/isort) and [`black`](https://github.com/psf/black) to make sure your code is formatted consistently.
Many IDEs support code formatters as plugins, so you may be able to set up isort and black to run automatically every time you save. Many IDEs support code formatters as plugins, so you may be able to set up isort and black to run automatically every time you save.
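A minimal sketch of running both formatters by hand from the repository root:

```bash
# Sort imports with isort, then apply black's formatting
isort .
black .
```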
@ -137,9 +137,9 @@ When you're ready to contribute code to address an open issue, please follow the
mypy . mypy .
We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/pdelfin/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed. We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/olmocr/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed.
For example, if you've fixed a bug in `pdelfin/a/b.py`, you can run the tests specific to that module with For example, if you've fixed a bug in `olmocr/a/b.py`, you can run the tests specific to that module with
pytest -v tests/a/b_test.py pytest -v tests/a/b_test.py
@ -152,9 +152,9 @@ When you're ready to contribute code to address an open issue, please follow the
If the build fails, it's most likely due to small formatting issues. If the error message isn't clear, feel free to comment on this in your pull request. If the build fails, it's most likely due to small formatting issues. If the error message isn't clear, feel free to comment on this in your pull request.
And finally, please update the [CHANGELOG](https://github.com/allenai/pdelfin/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top. And finally, please update the [CHANGELOG](https://github.com/allenai/olmocr/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top.
After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/pdelfin/pulls). After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/olmocr/pulls).
Make sure you have a clear description of the problem and the solution, and include a link to relevant issues. Make sure you have a clear description of the problem and the solution, and include a link to relevant issues.
We look forward to reviewing your PR! We look forward to reviewing your PR!

View File

@ -6,7 +6,7 @@ body:
- type: markdown - type: markdown
attributes: attributes:
value: > value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/allenai/pdelfin/issues?q=is%3Aissue+sort%3Acreated-desc+). #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/allenai/olmocr/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea - type: textarea
attributes: attributes:
label: 🐛 Describe the bug label: 🐛 Describe the bug
@ -17,7 +17,7 @@ body:
```python ```python
# All necessary imports at the beginning # All necessary imports at the beginning
import pdelfin import olmocr
# A succinct reproducing example trimmed down to the essential parts: # A succinct reproducing example trimmed down to the essential parts:
assert False is True, "Oh no!" assert False is True, "Oh no!"

View File

@ -1,5 +1,5 @@
name: 📚 Documentation name: 📚 Documentation
description: Report an issue related to https://pdelfin.readthedocs.io/latest description: Report an issue related to https://olmocr.readthedocs.io/latest
labels: 'documentation' labels: 'documentation'
body: body:
@ -7,7 +7,7 @@ body:
attributes: attributes:
label: 📚 The doc issue label: 📚 The doc issue
description: > description: >
A clear and concise description of what content in https://pdelfin.readthedocs.io/latest is an issue. A clear and concise description of what content in https://olmocr.readthedocs.io/latest is an issue.
validations: validations:
required: true required: true
- type: textarea - type: textarea

View File

@ -10,9 +10,9 @@ Changes proposed in this pull request:
## Before submitting ## Before submitting
<!-- Please complete this checklist BEFORE submitting your PR to speed along the review process. --> <!-- Please complete this checklist BEFORE submitting your PR to speed along the review process. -->
- [ ] I've read and followed all steps in the [Making a pull request](https://github.com/allenai/pdelfin/blob/main/.github/CONTRIBUTING.md#making-a-pull-request) - [ ] I've read and followed all steps in the [Making a pull request](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#making-a-pull-request)
section of the `CONTRIBUTING` docs. section of the `CONTRIBUTING` docs.
- [ ] I've updated or added any relevant docstrings following the syntax described in the - [ ] I've updated or added any relevant docstrings following the syntax described in the
[Writing docstrings](https://github.com/allenai/pdelfin/blob/main/.github/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs. [Writing docstrings](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs.
- [ ] If this PR fixes a bug, I've added a test that will fail without my fix. - [ ] If this PR fixes a bug, I've added a test that will fail without my fix.
- [ ] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality. - [ ] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

View File

@ -97,7 +97,7 @@ jobs:
if: always() if: always()
run: | run: |
. .venv/bin/activate . .venv/bin/activate
pip uninstall -y pdelfin pip uninstall -y olmocr
release: release:
name: Release name: Release

View File

@ -9,7 +9,7 @@ on:
branches: branches:
- main - main
paths: paths:
- 'pdelfin/**' - 'olmocr/**'
jobs: jobs:
changelog: changelog:

View File

@ -7,12 +7,12 @@ Toolkit for training language models to work with PDF documents in the wild.
What is included: What is included:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/data/buildsilver.py) - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- An eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/eval/runeval.py) - An eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/filter/filter.py) - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- Finetuning code for Qwen2-VL (and soon other VLMs) - [train.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/train/train.py) - Finetuning code for Qwen2-VL (and soon other VLMs) - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- Processing millions of PDFs through a finetuned model using SGLang - [beakerpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/beakerpipeline.py) - Processing millions of PDFs through a finetuned model using SGLang - [beakerpipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/beakerpipeline.py)
- Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/viewer/dolmaviewer.py) - Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
### Installation ### Installation
@ -22,10 +22,10 @@ You will need to install poppler-utils and then also some fonts on your computer
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
``` ```
Then, clone and install the pdelfin package Then, clone and install the olmocr package
```bash ```bash
git clone https://github.com/allenai/pdelfin.git git clone https://github.com/allenai/olmocr.git
cd pdelfin cd olmocr
pip install -e . pip install -e .
``` ```
@ -43,7 +43,7 @@ It also runs at 2,800+ tokens per second per H100 GPU.
For example: For example:
```bash ```bash
python -m pdelfin.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker python -m olmocr.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker
``` ```
This will convert all the PDFs at `s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf` and output Dolma-formatted documents at `s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename]/results` This will convert all the PDFs at `s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf` and output Dolma-formatted documents at `s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename]/results`
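Judging from the `build_dolma_document` code later in this commit, each result is a JSON object roughly of this shape (all field values below are illustrative):

```json
{
  "id": "...",
  "text": "Recovered plain text of the PDF...",
  "source": "olmocr",
  "added": "2025-01-27",
  "created": "2025-01-27",
  "metadata": {
    "Source-File": "s3://ai2-oe-data/jakep/gnarly_pdfs/example.pdf",
    "olmocr-version": "0.1.0",
    "pdf-total-pages": 4,
    "total-input-tokens": 8000,
    "total-output-tokens": 1500
  }
}
```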
@ -53,7 +53,7 @@ With default settings, it should work fine on any available GPUs.
```bash ```bash
python -m pdelfin.beakerpipeline --help python -m olmocr.beakerpipeline --help
usage: beakerpipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] usage: beakerpipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
[--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--stats] [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--stats]
[--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]

View File

@ -2,7 +2,7 @@
## Steps ## Steps
1. Update the version in `pdelfin/version.py`. 1. Update the version in `olmocr/version.py`.
3. Run the release script: 3. Run the release script:

View File

@ -18,11 +18,11 @@ from datetime import datetime
sys.path.insert(0, os.path.abspath("../../")) sys.path.insert(0, os.path.abspath("../../"))
from pdelfin import VERSION, VERSION_SHORT # noqa: E402 from olmocr import VERSION, VERSION_SHORT # noqa: E402
# -- Project information ----------------------------------------------------- # -- Project information -----------------------------------------------------
project = "pdelfin" project = "olmocr"
copyright = f"{datetime.today().year}, Allen Institute for Artificial Intelligence" copyright = f"{datetime.today().year}, Allen Institute for Artificial Intelligence"
author = "Allen Institute for Artificial Intelligence" author = "Allen Institute for Artificial Intelligence"
version = VERSION_SHORT version = VERSION_SHORT
@ -82,7 +82,7 @@ typehints_defaults = "comma"
# #
html_theme = "furo" html_theme = "furo"
html_title = f"pdelfin v{VERSION}" html_title = f"olmocr v{VERSION}"
# Add any paths that contain custom static files (such as style sheets) here, # Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files, # relative to this directory. They are copied after the builtin static files,
@ -97,7 +97,7 @@ html_theme_options = {
"footer_icons": [ "footer_icons": [
{ {
"name": "GitHub", "name": "GitHub",
"url": "https://github.com/allenai/pdelfin", "url": "https://github.com/allenai/olmocr",
"html": """ "html": """
<svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 16 16"> <svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 16 16">
<path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path> <path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path>

View File

@ -1,4 +1,4 @@
# **pdelfin** # **olmocr**
```{toctree} ```{toctree}
:maxdepth: 2 :maxdepth: 2
@ -15,8 +15,8 @@ overview
CHANGELOG CHANGELOG
CONTRIBUTING CONTRIBUTING
License <https://raw.githubusercontent.com/allenai/pdelfin/main/LICENSE> License <https://raw.githubusercontent.com/allenai/olmocr/main/LICENSE>
GitHub Repository <https://github.com/allenai/pdelfin> GitHub Repository <https://github.com/allenai/olmocr>
``` ```
## Indices and tables ## Indices and tables

View File

@ -1,23 +1,23 @@
Installation Installation
============ ============
**pdelfin** supports Python >= 3.8. **olmocr** supports Python >= 3.8.
## Installing with `pip` ## Installing with `pip`
**pdelfin** is available [on PyPI](https://pypi.org/project/pdelfin/). Just run **olmocr** is available [on PyPI](https://pypi.org/project/olmocr/). Just run
```bash ```bash
pip install pdelfin pip install olmocr
``` ```
## Installing from source ## Installing from source
To install **pdelfin** from source, first clone [the repository](https://github.com/allenai/pdelfin): To install **olmocr** from source, first clone [the repository](https://github.com/allenai/olmocr):
```bash ```bash
git clone https://github.com/allenai/pdelfin.git git clone https://github.com/allenai/olmocr.git
cd pdelfin cd olmocr
``` ```
Then run Then run
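Presumably the same editable install shown in the README:

```bash
pip install -e .
```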

View File

@ -31,15 +31,15 @@ from typing import Optional, Tuple, List, Dict, Set
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from concurrent.futures.process import BrokenProcessPool from concurrent.futures.process import BrokenProcessPool
from pdelfin.s3_queue import S3WorkQueue, WorkItem from olmocr.s3_queue import S3WorkQueue, WorkItem
from pdelfin.s3_utils import expand_s3_glob, get_s3_bytes, get_s3_bytes_with_backoff, parse_s3_path, download_zstd_csv, upload_zstd_csv, download_directory from olmocr.s3_utils import expand_s3_glob, get_s3_bytes, get_s3_bytes_with_backoff, parse_s3_path, download_zstd_csv, upload_zstd_csv, download_directory
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.filter.filter import PdfFilter, Language from olmocr.filter.filter import PdfFilter, Language
from pdelfin.prompts import build_finetuning_prompt, PageResponse from olmocr.prompts import build_finetuning_prompt, PageResponse
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.check import check_poppler_version from olmocr.check import check_poppler_version
from pdelfin.metrics import MetricsKeeper, WorkerTracker from olmocr.metrics import MetricsKeeper, WorkerTracker
from pdelfin.version import VERSION from olmocr.version import VERSION
# Initialize logger # Initialize logger
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -380,7 +380,7 @@ def build_dolma_document(pdf_s3_path, page_results):
# Build the Dolma document # Build the Dolma document
metadata = { metadata = {
"Source-File": pdf_s3_path, "Source-File": pdf_s3_path,
"pdelfin-version": VERSION, "olmocr-version": VERSION,
"pdf-total-pages": len(page_results), "pdf-total-pages": len(page_results),
"total-input-tokens": sum(page.input_tokens for page in page_results), "total-input-tokens": sum(page.input_tokens for page in page_results),
"total-output-tokens": sum(page.output_tokens for page in page_results), "total-output-tokens": sum(page.output_tokens for page in page_results),
@ -392,7 +392,7 @@ def build_dolma_document(pdf_s3_path, page_results):
dolma_doc = { dolma_doc = {
"id": id_, "id": id_,
"text": document_text, "text": document_text,
"source": "pdelfin", "source": "olmocr",
"added": datetime.datetime.now().strftime("%Y-%m-%d"), "added": datetime.datetime.now().strftime("%Y-%m-%d"),
"created": datetime.datetime.now().strftime("%Y-%m-%d"), "created": datetime.datetime.now().strftime("%Y-%m-%d"),
"metadata": metadata, "metadata": metadata,
@ -463,7 +463,7 @@ async def worker(args, work_queue: S3WorkQueue, semaphore, worker_id):
async def sglang_server_task(args, semaphore): async def sglang_server_task(args, semaphore):
model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'pdelfin', 'model') model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'olmocr', 'model')
download_directory(args.model, model_cache_dir) download_directory(args.model, model_cache_dir)
# Check the rope config and make sure it's got the proper key # Check the rope config and make sure it's got the proper key
@ -642,7 +642,7 @@ def submit_beaker_job(args):
owner = account.name owner = account.name
beaker_image = f"jakep/pdelfin-inference-{VERSION}" beaker_image = f"jakep/pdelfin-inference-{VERSION}"
task_name = f"pdelfin-{os.path.basename(args.workspace.rstrip('/'))}" task_name = f"olmocr-{os.path.basename(args.workspace.rstrip('/'))}"
# Take out --beaker flag so the workers will just run things # Take out --beaker flag so the workers will just run things
args_list = [arg for arg in sys.argv[1:] if arg != "--beaker"] args_list = [arg for arg in sys.argv[1:] if arg != "--beaker"]
@ -695,7 +695,7 @@ def submit_beaker_job(args):
preemptible=True, preemptible=True,
), ),
image=ImageSource(beaker=beaker_image), image=ImageSource(beaker=beaker_image),
command=["python", "-m", "pdelfin.beakerpipeline"] + args_list, command=["python", "-m", "olmocr.beakerpipeline"] + args_list,
env_vars=[ env_vars=[
EnvVar(name="BEAKER_JOB_NAME", value=task_name), EnvVar(name="BEAKER_JOB_NAME", value=task_name),
EnvVar(name="OWNER", value=owner), EnvVar(name="OWNER", value=owner),
@ -857,7 +857,7 @@ async def main():
# Beaker/job running stuff # Beaker/job running stuff
parser.add_argument('--beaker', action='store_true', help='Submit this job to beaker instead of running locally') parser.add_argument('--beaker', action='store_true', help='Submit this job to beaker instead of running locally')
parser.add_argument('--beaker_workspace', help='Beaker workspace to submit to', default='ai2/pdelfin') parser.add_argument('--beaker_workspace', help='Beaker workspace to submit to', default='ai2/olmocr')
parser.add_argument('--beaker_cluster', help='Beaker clusters you want to run on', default=["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", "ai2/neptune-cirrascale", "ai2/saturn-cirrascale", "ai2/augusta-google-1"]) parser.add_argument('--beaker_cluster', help='Beaker clusters you want to run on', default=["ai2/jupiter-cirrascale-2", "ai2/ceres-cirrascale", "ai2/neptune-cirrascale", "ai2/saturn-cirrascale", "ai2/augusta-google-1"])
parser.add_argument('--beaker_gpus', type=int, default=1, help="Number of gpu replicas to run") parser.add_argument('--beaker_gpus', type=int, default=1, help="Number of gpu replicas to run")
parser.add_argument('--beaker_priority', type=str, default="normal", help="Beaker priority level for the job") parser.add_argument('--beaker_priority', type=str, default="normal", help="Beaker priority level for the job")

View File

@ -25,11 +25,11 @@ from urllib.parse import urlparse
import concurrent.futures import concurrent.futures
from concurrent.futures import ProcessPoolExecutor, as_completed from concurrent.futures import ProcessPoolExecutor, as_completed
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.prompts import build_finetuning_prompt, PageResponse from olmocr.prompts import build_finetuning_prompt, PageResponse
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.s3_utils import parse_custom_id, expand_s3_glob, get_s3_bytes, parse_s3_path from olmocr.s3_utils import parse_custom_id, expand_s3_glob, get_s3_bytes, parse_s3_path
from pdelfin.check import check_poppler_version from olmocr.check import check_poppler_version
# Initialize logger # Initialize logger
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -79,7 +79,7 @@ class DatabaseManager:
def __init__(self, s3_workspace: str, skip_init: bool=False): def __init__(self, s3_workspace: str, skip_init: bool=False):
cache_key = hashlib.sha256(s3_workspace.strip().lower().encode('utf-8')).hexdigest() cache_key = hashlib.sha256(s3_workspace.strip().lower().encode('utf-8')).hexdigest()
home_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'pdelfin', cache_key) home_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'olmocr', cache_key)
os.makedirs(home_cache_dir, exist_ok=True) os.makedirs(home_cache_dir, exist_ok=True)
self.db_path = os.path.join(home_cache_dir, 'index.db') self.db_path = os.path.join(home_cache_dir, 'index.db')
@ -618,7 +618,7 @@ def build_dolma_doc(s3_workspace: str, pdf: DatabaseManager.PDFRecord) -> Option
dolma_doc = { dolma_doc = {
"id": id_, "id": id_,
"text": document_text, "text": document_text,
"source": "pdelfin", "source": "olmocr",
"added": datetime.datetime.now().strftime("%Y-%m-%d"), "added": datetime.datetime.now().strftime("%Y-%m-%d"),
"created": datetime.datetime.now().strftime("%Y-%m-%d"), "created": datetime.datetime.now().strftime("%Y-%m-%d"),
"metadata": metadata, "metadata": metadata,

View File

@ -14,7 +14,7 @@ def check_poppler_version():
sys.exit(1) sys.exit(1)
except FileNotFoundError: except FileNotFoundError:
logger.error("pdftoppm is not installed.") logger.error("pdftoppm is not installed.")
logger.error("Check the README in the https://github.com/allenai/pdelfin/blob/main/README.md for installation instructions") logger.error("Check the README in the https://github.com/allenai/olmocr/blob/main/README.md for installation instructions")
sys.exit(1) sys.exit(1)
if __name__ == "__main__": if __name__ == "__main__":

View File

@ -12,10 +12,10 @@ from typing import Generator
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from urllib.parse import urlparse from urllib.parse import urlparse
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.prompts import build_openai_silver_data_prompt, openai_response_format_schema from olmocr.prompts import build_openai_silver_data_prompt, openai_response_format_schema
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.filter import PdfFilter from olmocr.filter import PdfFilter
TARGET_IMAGE_DIM = 2048 TARGET_IMAGE_DIM = 2048

View File

@ -10,8 +10,8 @@ from concurrent.futures import ProcessPoolExecutor, as_completed
from urllib.parse import urlparse from urllib.parse import urlparse
from typing import List from typing import List
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.filter import PdfFilter from olmocr.filter import PdfFilter
pdf_filter = PdfFilter() pdf_filter = PdfFilter()

View File

@ -10,9 +10,9 @@ import os
import smart_open import smart_open
import boto3 import boto3
from pdelfin.prompts import build_finetuning_prompt from olmocr.prompts import build_finetuning_prompt
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
# Import Plotly for plotting # Import Plotly for plotting
import plotly.express as px import plotly.express as px

View File

@ -9,7 +9,7 @@ import logging
import smart_open import smart_open
from cached_path import cached_path from cached_path import cached_path
from pdelfin.prompts import build_finetuning_prompt from olmocr.prompts import build_finetuning_prompt
def setup_logging(): def setup_logging():
@ -73,12 +73,12 @@ def process_file(input_file: str, output_file: str, rewrite_prompt_str: bool):
# Save the pdf to a temporary cache folder # Save the pdf to a temporary cache folder
local_pdf_path = cached_path(s3_path, quiet=True) local_pdf_path = cached_path(s3_path, quiet=True)
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.data.buildsilver import build_page_query from olmocr.data.buildsilver import build_page_query
obj = build_page_query(local_pdf_path, s3_path, page) obj = build_page_query(local_pdf_path, s3_path, page)
# raw_page_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport") # raw_page_text = get_anchor_text(local_pdf_path, page, pdf_engine="pdfreport")
# from pdelfin.prompts import build_openai_silver_data_prompt # from olmocr.prompts import build_openai_silver_data_prompt
# obj["body"]["messages"][0]["content"][0]["text"] = build_openai_silver_data_prompt(raw_page_text) # obj["body"]["messages"][0]["content"][0]["text"] = build_openai_silver_data_prompt(raw_page_text)
if obj is not None: if obj is not None:

View File

@ -8,12 +8,12 @@ import functools
from tqdm import tqdm from tqdm import tqdm
from itertools import combinations from itertools import combinations
from pdelfin.s3_utils import parse_s3_path, expand_s3_glob, get_s3_bytes from olmocr.s3_utils import parse_s3_path, expand_s3_glob, get_s3_bytes
from dolma_refine.evaluate.metrics import DocumentEditSimilarity from dolma_refine.evaluate.metrics import DocumentEditSimilarity
from dolma_refine.evaluate.segmenters import SpacySegmenter from dolma_refine.evaluate.segmenters import SpacySegmenter
from dolma_refine.evaluate.aligners import HirschbergAligner from dolma_refine.evaluate.aligners import HirschbergAligner
from pdelfin.eval.evalhtml import create_review_html from olmocr.eval.evalhtml import create_review_html
@dataclasses.dataclass @dataclasses.dataclass
class Comparison: class Comparison:

View File

@ -7,7 +7,7 @@ from jinja2 import Template
from urllib.parse import urlparse from urllib.parse import urlparse
from difflib import SequenceMatcher from difflib import SequenceMatcher
from tqdm import tqdm from tqdm import tqdm
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
session = boto3.Session(profile_name='s2') session = boto3.Session(profile_name='s2')
s3_client = session.client('s3') s3_client = session.client('s3')

View File

@ -125,7 +125,7 @@ class PdfFilter:
if __name__ == "__main__": if __name__ == "__main__":
import tempfile import tempfile
import boto3 import boto3
from pdelfin.s3_utils import parse_s3_path from olmocr.s3_utils import parse_s3_path
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED
from tqdm import tqdm from tqdm import tqdm

View File

@ -19,11 +19,11 @@ from functools import lru_cache
import pypdfium2 as pdfium import pypdfium2 as pdfium
import pymupdf import pymupdf
from pdelfin.filter.coherency import get_document_coherency from olmocr.filter.coherency import get_document_coherency
from pypdf import PdfReader from pypdf import PdfReader
from pypdf.generic import RectangleObject from pypdf.generic import RectangleObject
from pdelfin.prompts._adv_anchor import mult from olmocr.prompts._adv_anchor import mult
def get_anchor_text(local_pdf_path: str, page: int, pdf_engine: Literal["pdftotext", "pdfium", "pymupdf", "pypdf", "topcoherency", "pdfreport"], target_length: int=4000) -> str: def get_anchor_text(local_pdf_path: str, page: int, pdf_engine: Literal["pdftotext", "pdfium", "pymupdf", "pypdf", "topcoherency", "pdfreport"], target_length: int=4000) -> str:

View File

@ -9,7 +9,7 @@ from dataclasses import dataclass
import asyncio import asyncio
from functools import partial from functools import partial
from pdelfin.s3_utils import ( from olmocr.s3_utils import (
expand_s3_glob, expand_s3_glob,
download_zstd_csv, download_zstd_csv,
upload_zstd_csv, upload_zstd_csv,

View File

@ -5,7 +5,7 @@ import os
import boto3 import boto3
from datasets import Dataset from datasets import Dataset
from botocore.exceptions import NoCredentialsError, PartialCredentialsError from botocore.exceptions import NoCredentialsError, PartialCredentialsError
from pdelfin.train.dataloader import build_batch_query_response_vision_dataset from olmocr.train.dataloader import build_batch_query_response_vision_dataset
def save_dataset_in_parquet(dataset: Dataset, output_dir: str, rows_per_file: int = 10000, s3_endpoint_url: str = None): def save_dataset_in_parquet(dataset: Dataset, output_dir: str, rows_per_file: int = 10000, s3_endpoint_url: str = None):

View File

@ -16,9 +16,9 @@ import boto3
from datasets import Dataset, Features, Value, load_dataset, concatenate_datasets, DatasetDict from datasets import Dataset, Features, Value, load_dataset, concatenate_datasets, DatasetDict
from .core.config import DataConfig, SourceConfig from .core.config import DataConfig, SourceConfig
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.s3_utils import parse_custom_id, get_s3_bytes, parse_s3_path from olmocr.s3_utils import parse_custom_id, get_s3_bytes, parse_s3_path
from pdelfin.data.renderpdf import get_pdf_media_box_width_height from olmocr.data.renderpdf import get_pdf_media_box_width_height
# Configure logging # Configure logging
logging.basicConfig(level=logging.INFO) logging.basicConfig(level=logging.INFO)
@ -143,7 +143,7 @@ def cache_s3_files(dataset: Dataset, pdf_cache_location: str, num_proc: int = 32
def build_finetuning_dataset(response_glob_path: str, pdf_cache_location: Optional[str]=None, num_proc: int=32) -> Dataset: def build_finetuning_dataset(response_glob_path: str, pdf_cache_location: Optional[str]=None, num_proc: int=32) -> Dataset:
if pdf_cache_location is None: if pdf_cache_location is None:
pdf_cache_location = os.path.join(os.path.expanduser('~'), '.cache', 'pdelfin_pdfs') pdf_cache_location = os.path.join(os.path.expanduser('~'), '.cache', 'olmocr_pdfs')
logger.info("Loading fine tuning dataset from OpenAI style batch responses") logger.info("Loading fine tuning dataset from OpenAI style batch responses")
response_data = load_jsonl_into_ds(response_glob_path) response_data = load_jsonl_into_ds(response_glob_path)

View File

@ -6,9 +6,9 @@ import base64
import random import random
import torch # Make sure to import torch as it's used in the DataCollator import torch # Make sure to import torch as it's used in the DataCollator
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.prompts import build_finetuning_prompt from olmocr.prompts import build_finetuning_prompt
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
def prepare_data_for_qwen2_training(example, processor, target_longest_image_dim: Union[int, list[int]], target_anchor_text_len: Union[int, list[int]]): def prepare_data_for_qwen2_training(example, processor, target_longest_image_dim: Union[int, list[int]], target_anchor_text_len: Union[int, list[int]]):

View File

@ -9,7 +9,7 @@ import concurrent.futures
from smart_open import smart_open from smart_open import smart_open
from transformers import Qwen2VLForConditionalGeneration from transformers import Qwen2VLForConditionalGeneration
from pdelfin.s3_utils import parse_s3_path from olmocr.s3_utils import parse_s3_path
s3_client = boto3.client('s3') s3_client = boto3.client('s3')

View File

@ -29,11 +29,11 @@ from transformers import (
) )
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.prompts.anchor import get_anchor_text from olmocr.prompts.anchor import get_anchor_text
from pdelfin.prompts.prompts import build_finetuning_prompt from olmocr.prompts.prompts import build_finetuning_prompt
from pdelfin.train.dataprep import prepare_data_for_qwen2_inference from olmocr.train.dataprep import prepare_data_for_qwen2_inference
def build_page_query(local_pdf_path: str, page: int) -> dict: def build_page_query(local_pdf_path: str, page: int) -> dict:
image_base64 = render_pdf_to_base64png(local_pdf_path, page, 1024) image_base64 = render_pdf_to_base64png(local_pdf_path, page, 1024)

View File

@ -3,8 +3,8 @@ from transformers import (
DataCollatorForSeq2Seq DataCollatorForSeq2Seq
) )
from pdelfin.train.core.cli import make_cli from olmocr.train.core.cli import make_cli
from pdelfin.train.core.config import TrainConfig from olmocr.train.core.config import TrainConfig
from tqdm import tqdm from tqdm import tqdm
from .utils import ( from .utils import (
make_dataset, TruncatingCollator make_dataset, TruncatingCollator

View File

@ -36,11 +36,11 @@ from torch.utils.data import DataLoader
import wandb import wandb
from pdelfin.train.core.cli import make_cli, save_config, to_native_types from olmocr.train.core.cli import make_cli, save_config, to_native_types
from pdelfin.train.core.config import TrainConfig from olmocr.train.core.config import TrainConfig
from pdelfin.train.core.loggers import get_logger from olmocr.train.core.loggers import get_logger
from pdelfin.train.core.paths import copy_dir, join_path from olmocr.train.core.paths import copy_dir, join_path
from pdelfin.train.core.state import BeakerState from olmocr.train.core.state import BeakerState
from .utils import ( from .utils import (
RunName, RunName,

View File

@ -28,8 +28,8 @@ from .core.state import BeakerState
T = TypeVar("T") T = TypeVar("T")
from pdelfin.train.dataloader import build_finetuning_dataset, list_dataset_files from olmocr.train.dataloader import build_finetuning_dataset, list_dataset_files
from pdelfin.train.dataprep import batch_prepare_data_for_qwen2_training, batch_prepare_data_for_molmo_training from olmocr.train.dataprep import batch_prepare_data_for_qwen2_training, batch_prepare_data_for_molmo_training
def accelerator_to_dtype(accelerator: Accelerator) -> torch.dtype: def accelerator_to_dtype(accelerator: Accelerator) -> torch.dtype:

View File

@ -11,8 +11,8 @@ from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed from concurrent.futures import ThreadPoolExecutor, as_completed
import markdown2 import markdown2
from pdelfin.s3_utils import get_s3_bytes from olmocr.s3_utils import get_s3_bytes
from pdelfin.data.renderpdf import render_pdf_to_base64webp from olmocr.data.renderpdf import render_pdf_to_base64webp
def read_jsonl(path): def read_jsonl(path):
with smart_open.smart_open(path, 'r', encoding='utf-8') as f: with smart_open.smart_open(path, 'r', encoding='utf-8') as f:

View File

@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
# See https://setuptools.pypa.io/en/latest/userguide/quickstart.html for more project configuration options. # See https://setuptools.pypa.io/en/latest/userguide/quickstart.html for more project configuration options.
name = "pdelfin" name = "olmocr"
dynamic = ["version"] dynamic = ["version"]
readme = "README.md" readme = "README.md"
classifiers = [ classifiers = [
@ -42,10 +42,10 @@ dependencies = [
license = {file = "LICENSE"} license = {file = "LICENSE"}
[project.urls] [project.urls]
Homepage = "https://github.com/allenai/pdelfin" Homepage = "https://github.com/allenai/olmocr"
Repository = "https://github.com/allenai/pdelfin" Repository = "https://github.com/allenai/olmocr"
Changelog = "https://github.com/allenai/pdelfin/blob/main/CHANGELOG.md" Changelog = "https://github.com/allenai/olmocr/blob/main/CHANGELOG.md"
# Documentation = "https://pdelfin.readthedocs.io/" # Documentation = "https://olmocr.readthedocs.io/"
[project.optional-dependencies] [project.optional-dependencies]
dev = [ dev = [
@ -103,10 +103,10 @@ exclude = [
include-package-data = true include-package-data = true
[tool.setuptools.package-data] [tool.setuptools.package-data]
pdelfin = ["py.typed"] olmocr = ["py.typed"]
[tool.setuptools.dynamic] [tool.setuptools.dynamic]
version = {attr = "pdelfin.version.VERSION"} version = {attr = "olmocr.version.VERSION"}
[tool.black] [tool.black]
line-length = 120 line-length = 120

View File

@ -36,17 +36,17 @@ RUN /root/.local/bin/uv pip install --system flashinfer -i https://flashinfer.ai
ENV PYTHONUNBUFFERED=1 ENV PYTHONUNBUFFERED=1
WORKDIR /root WORKDIR /root
COPY pyproject.toml pyproject.toml COPY pyproject.toml pyproject.toml
COPY pdelfin/version.py pdelfin/version.py COPY olmocr/version.py olmocr/version.py
RUN /root/.local/bin/uv pip install --system --no-cache -e .[inference] RUN /root/.local/bin/uv pip install --system --no-cache -e .[inference]
COPY pdelfin pdelfin COPY olmocr olmocr
# TODO You can remove this once pypdf > 5.10 comes out # TODO You can remove this once pypdf > 5.10 comes out
RUN /root/.local/bin/uv pip install --system --no-cache git+https://github.com/py-pdf/pypdf.git@c6e43374ab002d76811ec85333fdc2c82c268251 RUN /root/.local/bin/uv pip install --system --no-cache git+https://github.com/py-pdf/pypdf.git@c6e43374ab002d76811ec85333fdc2c82c268251
WORKDIR /root WORKDIR /root
COPY pdelfin pdelfin COPY olmocr olmocr
RUN python3 -m sglang.launch_server --help RUN python3 -m sglang.launch_server --help
RUN python3 -m pdelfin.beakerpipeline --help RUN python3 -m olmocr.beakerpipeline --help

View File

@ -2,8 +2,8 @@
set -e set -e
VERSION=$(python -c 'import pdelfin.version; print(pdelfin.version.VERSION)') VERSION=$(python -c 'import olmocr.version; print(olmocr.version.VERSION)')
echo "$VERSION" echo "$VERSION"
docker build --platform linux/amd64 -f ./scripts/beaker/Dockerfile-inference -t pdelfin-inference-$VERSION . docker build --platform linux/amd64 -f ./scripts/beaker/Dockerfile-inference -t olmocr-inference-$VERSION .
beaker image create --workspace ai2/oe-data-pdf --name pdelfin-inference-$VERSION pdelfin-inference-$VERSION beaker image create --workspace ai2/oe-data-pdf --name olmocr-inference-$VERSION olmocr-inference-$VERSION

View File

@ -10,7 +10,7 @@ then
fi fi
EXTRA_ARGS="-c pdelfin/train/config/molmo-o-lora-8192.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/molmo-pdf/v1/models/\${BEAKER_USER_ID}\"" EXTRA_ARGS="-c olmocr/train/config/molmo-o-lora-8192.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/molmo-pdf/v1/models/\${BEAKER_USER_ID}\""
run_name=$(basename "$0" .sh) run_name=$(basename "$0" .sh)
@ -45,4 +45,4 @@ gantry run \
--env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \ --env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \
--shared-memory 10GiB \ --shared-memory 10GiB \
--yes \ --yes \
-- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m pdelfin.train.loaddataset ${EXTRA_ARGS} && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m pdelfin.train.train ${EXTRA_ARGS}" -- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m olmocr.train.loaddataset ${EXTRA_ARGS} && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m olmocr.train.train ${EXTRA_ARGS}"

View File

@ -1,7 +1,7 @@
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from pdelfin.version import VERSION from olmocr.version import VERSION
def main(): def main():
@ -27,7 +27,7 @@ def main():
lines.insert(insert_index, "\n") lines.insert(insert_index, "\n")
lines.insert( lines.insert(
insert_index + 1, insert_index + 1,
f"## [v{VERSION}](https://github.com/allenai/pdelfin/releases/tag/v{VERSION}) - " f"## [v{VERSION}](https://github.com/allenai/olmocr/releases/tag/v{VERSION}) - "
f"{datetime.now().strftime('%Y-%m-%d')}\n", f"{datetime.now().strftime('%Y-%m-%d')}\n",
) )

View File

@ -10,7 +10,7 @@ then
fi fi
EXTRA_ARGS="-c pdelfin/train/config/qwen2vl-2b.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\"" EXTRA_ARGS="-c olmocr/train/config/qwen2vl-2b.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\""
run_name=$(basename "$0" .sh) run_name=$(basename "$0" .sh)
@ -43,4 +43,4 @@ gantry run \
--env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \ --env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \
--shared-memory 10GiB \ --shared-memory 10GiB \
--yes \ --yes \
-- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m pdelfin.train.train ${EXTRA_ARGS}" -- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m olmocr.train.train ${EXTRA_ARGS}"

View File

@ -10,7 +10,7 @@ then
fi fi
EXTRA_ARGS="-c pdelfin/train/config/qwen2vl-7b.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\"" EXTRA_ARGS="-c olmocr/train/config/qwen2vl-7b.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\""
run_name=$(basename "$0" .sh) run_name=$(basename "$0" .sh)
@ -44,4 +44,4 @@ gantry run \
--env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \ --env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \
--shared-memory 10GiB \ --shared-memory 10GiB \
--yes \ --yes \
-- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m pdelfin.train.loaddataset ${EXTRA_ARGS} && accelerate launch --use_fsdp --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --fsdp_offload_params false --fsdp_sharding_strategy FULL_SHARD --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP --mixed_precision bf16 -m pdelfin.train.train ${EXTRA_ARGS}" -- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m olmocr.train.loaddataset ${EXTRA_ARGS} && accelerate launch --use_fsdp --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --fsdp_offload_params false --fsdp_sharding_strategy FULL_SHARD --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP --mixed_precision bf16 -m olmocr.train.train ${EXTRA_ARGS}"

View File

@ -10,7 +10,7 @@ then
fi fi
EXTRA_ARGS="-c pdelfin/train/config/qwen2vl-7b-lora.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\"" EXTRA_ARGS="-c olmocr/train/config/qwen2vl-7b-lora.yaml --num_proc 64 --save.path \"s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/\${BEAKER_USER_ID}\""
run_name=$(basename "$0" .sh) run_name=$(basename "$0" .sh)
@ -46,4 +46,4 @@ gantry run \
--env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \ --env-secret WANDB_API_KEY=JAKE_WANDB_API_KEY \
--shared-memory 10GiB \ --shared-memory 10GiB \
--yes \ --yes \
-- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m pdelfin.train.loaddataset ${EXTRA_ARGS} && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m pdelfin.train.train ${EXTRA_ARGS}" -- /bin/bash -c "source scripts/beaker/${CLUSTER}-ib.sh && python -m olmocr.train.loaddataset ${EXTRA_ARGS} && accelerate launch --multi_gpu --num_processes \${BEAKER_ASSIGNED_GPU_COUNT} --mixed_precision bf16 -m olmocr.train.train ${EXTRA_ARGS}"

View File

@ -2,7 +2,7 @@
set -e set -e
TAG=$(python -c 'from pdelfin.version import VERSION; print("v" + VERSION)') TAG=$(python -c 'from olmocr.version import VERSION; print("v" + VERSION)')
read -p "Creating new release for $TAG. Do you want to continue? [Y/n] " prompt read -p "Creating new release for $TAG. Do you want to continue? [Y/n] " prompt

View File

@ -6,8 +6,8 @@ import glob
from pypdf import PdfReader from pypdf import PdfReader
from pdelfin.prompts.anchor import _pdf_report, _linearize_pdf_report, get_anchor_text from olmocr.prompts.anchor import _pdf_report, _linearize_pdf_report, get_anchor_text
from pdelfin.data.renderpdf import get_pdf_media_box_width_height from olmocr.data.renderpdf import get_pdf_media_box_width_height
class AnchorTest(unittest.TestCase): class AnchorTest(unittest.TestCase):
def testExtractText(self): def testExtractText(self):
@ -168,11 +168,11 @@ class BuildSilverTest(unittest.TestCase):
def testSmallPage(self): def testSmallPage(self):
local_pdf_path = os.path.join(os.path.dirname(__file__), "gnarly_pdfs", "small_page_size.pdf") local_pdf_path = os.path.join(os.path.dirname(__file__), "gnarly_pdfs", "small_page_size.pdf")
from pdelfin.data.buildsilver import build_page_query from olmocr.data.buildsilver import build_page_query
result = build_page_query(local_pdf_path, "s3://test.pdf", 1) result = build_page_query(local_pdf_path, "s3://test.pdf", 1)
from pdelfin.data.renderpdf import get_png_dimensions_from_base64 from olmocr.data.renderpdf import get_png_dimensions_from_base64
base64data = result["body"]["messages"][0]["content"][1]["image_url"]["url"] base64data = result["body"]["messages"][0]["content"][1]["image_url"]["url"]

View File

@ -4,15 +4,15 @@ from io import BytesIO
from PIL import Image from PIL import Image
from transformers import AutoProcessor from transformers import AutoProcessor
from pdelfin.data.renderpdf import render_pdf_to_base64png from olmocr.data.renderpdf import render_pdf_to_base64png
from pdelfin.train.dataprep import ( from olmocr.train.dataprep import (
prepare_data_for_qwen2_training, build_finetuning_prompt prepare_data_for_qwen2_training, build_finetuning_prompt
) )
import numpy as np import numpy as np
from tqdm import tqdm from tqdm import tqdm
from torch.utils.data import DataLoader from torch.utils.data import DataLoader
from pdelfin.train.utils import make_dataset from olmocr.train.utils import make_dataset
from pdelfin.train.core.config import TrainConfig, DataConfig, SourceConfig from olmocr.train.core.config import TrainConfig, DataConfig, SourceConfig
import math import math
@ -138,7 +138,7 @@ class TestBirrTokenization(unittest.TestCase):
from birr.core.config import FormatConfig, LLMModelConfig from birr.core.config import FormatConfig, LLMModelConfig
from birr.batch_inference.data_models import RawInputItem from birr.batch_inference.data_models import RawInputItem
from pdelfin.birrpipeline import build_page_query from olmocr.birrpipeline import build_page_query
original_query = build_page_query(os.path.join( original_query = build_page_query(os.path.join(
os.path.dirname(__file__), os.path.dirname(__file__),

View File

@@ -7,11 +7,11 @@ import base64
 from PIL import Image
 # Adjust the import path to match where your code resides
-from pdelfin.birrpipeline import build_dolma_doc, DatabaseManager, build_finetuning_prompt, build_page_query
+from olmocr.birrpipeline import build_dolma_doc, DatabaseManager, build_finetuning_prompt, build_page_query
 class TestBuildDolmaDoc(unittest.TestCase):
-    @patch('pdelfin.birrpipeline.DatabaseManager')
+    @patch('olmocr.birrpipeline.DatabaseManager')
-    @patch('pdelfin.birrpipeline.get_s3_bytes')
+    @patch('olmocr.birrpipeline.get_s3_bytes')
     def test_build_dolma_doc_with_multiple_page_entries(self, mock_get_s3_bytes, mock_DatabaseManager):
         # Mock DatabaseManager instance
         mock_db_instance = MagicMock()

View File

@@ -4,7 +4,7 @@ import concurrent.futures
 from concurrent.futures import TimeoutError
 # Assuming the CappedProcessPoolExecutor code is in a module named 'capped_executor'
-from pdelfin.cappedpool import CappedProcessPoolExecutor
+from olmocr.cappedpool import CappedProcessPoolExecutor
 # Define functions at the top level to ensure they are picklable by multiprocessing

View File

@@ -5,9 +5,9 @@ import time
 import unittest
-from pdelfin.filter.coherency import get_document_coherency
+from olmocr.filter.coherency import get_document_coherency
-from pdelfin.prompts.anchor import get_anchor_text
+from olmocr.prompts.anchor import get_anchor_text
 class TestCoherencyScores(unittest.TestCase):
     def testBadOcr1(self):

View File

@@ -5,14 +5,14 @@ from functools import partial
 from transformers import AutoProcessor
-from pdelfin.train.dataloader import (
+from olmocr.train.dataloader import (
     build_finetuning_dataset,
     extract_openai_batch_response,
     load_jsonl_into_ds,
     list_dataset_files
 )
-from pdelfin.train.dataprep import batch_prepare_data_for_qwen2_training
+from olmocr.train.dataprep import batch_prepare_data_for_qwen2_training
 class TestBatchQueryResponseDataset(unittest.TestCase):

View File

@@ -10,19 +10,19 @@ from PIL import Image
 from transformers import AutoProcessor
 from unittest.mock import patch
-from pdelfin.train.dataloader import (
+from olmocr.train.dataloader import (
     build_finetuning_dataset,
 )
-from pdelfin.train.dataprep import (
+from olmocr.train.dataprep import (
     prepare_data_for_qwen2_training, build_finetuning_prompt,
     prepare_data_for_molmo_training, batch_prepare_data_for_molmo_training
 )
 import numpy as np
 from tqdm import tqdm
 from torch.utils.data import DataLoader
-from pdelfin.train.utils import make_dataset
+from olmocr.train.utils import make_dataset
-from pdelfin.train.core.config import TrainConfig, DataConfig, SourceConfig
+from olmocr.train.core.config import TrainConfig, DataConfig, SourceConfig
 class TestDataprep(unittest.TestCase):
     def testFullDataloader(self):
@@ -215,8 +215,8 @@ class TestMolmoDataPrep(unittest.TestCase):
         random.seed(42)
         # Mock the functions that require actual PDF files
-        with patch('pdelfin.prompts.anchor.get_anchor_text') as mock_get_anchor_text, \
+        with patch('olmocr.prompts.anchor.get_anchor_text') as mock_get_anchor_text, \
-             patch('pdelfin.data.renderpdf.render_pdf_to_base64png') as mock_render_pdf_to_base64png:
+             patch('olmocr.data.renderpdf.render_pdf_to_base64png') as mock_render_pdf_to_base64png:
             # Set return values for the mocked functions
             mock_get_anchor_text.return_value = "This is the anchor text."
@@ -323,8 +323,8 @@ class TestMolmoDataPrep(unittest.TestCase):
         target_anchor_text_len = [0, 6000]
         # Mock the necessary functions
-        with patch('pdelfin.prompts.anchor.get_anchor_text') as mock_get_anchor_text, \
+        with patch('olmocr.prompts.anchor.get_anchor_text') as mock_get_anchor_text, \
-             patch('pdelfin.data.renderpdf.render_pdf_to_base64png') as mock_render_pdf_to_base64png:
+             patch('olmocr.data.renderpdf.render_pdf_to_base64png') as mock_render_pdf_to_base64png:
             mock_get_anchor_text.return_value = "This is the anchor text."
             img = Image.new('RGB', (100, 100), color='red')

View File

@@ -3,7 +3,7 @@ import unittest
 from pypdf import PdfReader
-from pdelfin.filter import PdfFilter
+from olmocr.filter import PdfFilter
 class PdfFilterTest(unittest.TestCase):

View File

@@ -7,7 +7,7 @@ import hashlib
 from typing import List, Dict
 # Import the classes we're testing
-from pdelfin.s3_queue import S3WorkQueue, WorkItem
+from olmocr.s3_queue import S3WorkQueue, WorkItem
 class TestS3WorkQueue(unittest.TestCase):
     def setUp(self):
@@ -70,8 +70,8 @@ class TestS3WorkQueue(unittest.TestCase):
     async def test_populate_queue_new_items(self):
         """Test populating queue with new items"""
         # Mock empty existing index
-        with patch('pdelfin.s3_queue.download_zstd_csv', return_value=[]):
+        with patch('olmocr.s3_queue.download_zstd_csv', return_value=[]):
-            with patch('pdelfin.s3_queue.upload_zstd_csv') as mock_upload:
+            with patch('olmocr.s3_queue.upload_zstd_csv') as mock_upload:
                 await self.work_queue.populate_queue(self.sample_paths, items_per_group=2)
         # Verify upload was called with correct data
@@ -97,8 +97,8 @@ class TestS3WorkQueue(unittest.TestCase):
         existing_hash = S3WorkQueue._compute_workgroup_hash(existing_paths)
         existing_line = f"{existing_hash},{existing_paths[0]}"
-        with patch('pdelfin.s3_queue.download_zstd_csv', return_value=[existing_line]):
+        with patch('olmocr.s3_queue.download_zstd_csv', return_value=[existing_line]):
-            with patch('pdelfin.s3_queue.upload_zstd_csv') as mock_upload:
+            with patch('olmocr.s3_queue.upload_zstd_csv') as mock_upload:
                 await self.work_queue.populate_queue(existing_paths + new_paths, items_per_group=1)
         # Verify upload called with both existing and new items
@@ -116,8 +116,8 @@ class TestS3WorkQueue(unittest.TestCase):
         completed_items = [f"s3://test-bucket/workspace/results/output_{work_hash}.jsonl"]
-        with patch('pdelfin.s3_queue.download_zstd_csv', return_value=[work_line]):
+        with patch('olmocr.s3_queue.download_zstd_csv', return_value=[work_line]):
-            with patch('pdelfin.s3_queue.expand_s3_glob', return_value=completed_items):
+            with patch('olmocr.s3_queue.expand_s3_glob', return_value=completed_items):
                 await self.work_queue.initialize_queue()
         # Queue should be empty since all work is completed

View File

@@ -17,8 +17,8 @@ from io import BytesIO
 from PIL import Image
 from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration
 from pathlib import Path
-from pdelfin.beakerpipeline import sglang_server_task, sglang_server_ready, build_page_query, SGLANG_SERVER_PORT, render_pdf_to_base64png, get_anchor_text, download_directory
+from olmocr.beakerpipeline import sglang_server_task, sglang_server_ready, build_page_query, SGLANG_SERVER_PORT, render_pdf_to_base64png, get_anchor_text, download_directory
-from pdelfin.prompts import PageResponse
+from olmocr.prompts import PageResponse
 from httpx import AsyncClient
 import torch.nn.functional as F
 MODEL_FINETUNED_PATH = "s3://ai2-oe-data/jakep/experiments/qwen2vl-pdf/v1/models/jakep/Qwen_Qwen2-VL-7B-Instruct-e4ecf8-01JAH8GMWHTJ376S2N7ETXRXH4/checkpoint-9500/bf16/"
@@ -103,7 +103,7 @@ class TestSglangServer(unittest.IsolatedAsyncioTestCase):
 class TestHuggingFaceModel(unittest.IsolatedAsyncioTestCase):
     async def asyncSetUp(self):
         # Set up the Hugging Face model and tokenizer
-        model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'pdelfin', 'model')
+        model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'olmocr', 'model')
         download_directory([MODEL_FINETUNED_PATH], model_cache_dir)
         # Check the rope config and make sure it's got the proper key
@@ -249,7 +249,7 @@ class TestHuggingFaceModel(unittest.IsolatedAsyncioTestCase):
 class RawSGLangTest(unittest.IsolatedAsyncioTestCase):
     def setUp(self):
         # Set up the Hugging Face model and tokenizer
-        model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'pdelfin', 'model')
+        model_cache_dir = os.path.join(os.path.expanduser('~'), '.cache', 'olmocr', 'model')
         download_directory([MODEL_FINETUNED_PATH], model_cache_dir)
         # Check the rope config and make sure it's got the proper key