<p align="center">
  <a href="https://github.com/docling-project/docling">
    <img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling/raw/main/docs/assets/docling_processing.png" width="100%"/>
  </a>
</p>

# Docling

<p align="center">
  <a href="https://trendshift.io/repositories/12132" target="_blank"><img src="https://trendshift.io/api/badge/repositories/12132" alt="DS4SD%2Fdocling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

[](https://arxiv.org/abs/2408.09869)
[](https://docling-project.github.io/docling/)
[](https://pypi.org/project/docling/)
[](https://python-poetry.org/)
[](https://github.com/psf/black)
[](https://pycqa.github.io/isort/)
[](https://pydantic.dev)
[](https://github.com/pre-commit/pre-commit)
[](https://opensource.org/licenses/MIT)
[](https://pepy.tech/projects/docling)
[](https://apify.com/vancura/docling)
[](https://lfaidata.foundation/projects/)

Docling simplifies document processing by parsing diverse formats, including advanced PDF understanding, and providing seamless integrations with the gen AI ecosystem.

## Features
* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, XLSX, HTML, images, and more
* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, and lossless JSON
* 🔒 Local execution capabilities for sensitive data and air-gapped environments
* 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 Extensive OCR support for scanned PDFs and images
* 🥚 Support of Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview)) 🆕
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Metadata extraction, including title, authors, references & language
* 📝 Chart understanding (Barchart, Piechart, LinePlot, etc.)
* 📝 Complex chemistry understanding (molecular structures)

## Installation

To use Docling, simply install `docling` from your package manager, e.g. pip:
```bash
pip install docling
```

Docling works on macOS, Linux, and Windows, on both x86_64 and arm64 architectures.

More [detailed installation instructions](https://docling-project.github.io/docling/installation/) are available in the docs.
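
To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch that relies only on the Python standard library; the helper name `installed_version` is hypothetical and not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "docling") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it only queries package metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("docling is not installed; run `pip install docling`")
else:
    print(f"docling {found} is installed")
```

Because the check reads package metadata only, it is cheap to run and does not load Docling or any of its dependencies.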
## Getting started

To convert individual documents with Python, use `convert()`, for example:

```python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "## Docling Technical Report[...]"
```

More [advanced usage options](https://docling-project.github.io/docling/usage/) are available in
the docs.
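
When several documents need converting, the same `convert()` call can be wrapped in a small loop. The sketch below is only an illustration: `convert_all_to_markdown` is a hypothetical helper, not part of Docling, and it assumes nothing beyond the `convert(source)` and `document.export_to_markdown()` calls shown above.

```python
from typing import Any, Dict, Iterable


def convert_all_to_markdown(converter: Any, sources: Iterable[str]) -> Dict[str, str]:
    """Convert each source and return a mapping of source -> Markdown text.

    Hypothetical helper: `converter` is expected to behave like the
    DocumentConverter shown above, i.e. to support
    `converter.convert(source).document.export_to_markdown()`.
    """
    markdown_by_source: Dict[str, str] = {}
    for source in sources:
        result = converter.convert(source)
        markdown_by_source[source] = result.document.export_to_markdown()
    return markdown_by_source
```

With Docling installed, calling `convert_all_to_markdown(DocumentConverter(), ["https://arxiv.org/pdf/2408.09869"])` would return a dictionary with one Markdown string per source. Passing the converter in as an argument lets a single instance be reused across documents.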
## CLI
Docling has a built-in CLI to run conversions.
```bash
docling https://arxiv.org/pdf/2206.01062
```
You can also use 🥚[SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) and other VLMs via the Docling CLI:
```bash
docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
```
This will use MLX acceleration on supported Apple Silicon hardware.
Read more in the [usage documentation](https://docling-project.github.io/docling/usage/).

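
If you want to drive the CLI from a script, one possible approach is to assemble the argument list in Python and hand it to `subprocess`. This is a rough sketch: the helper names `build_docling_command` and `run_docling` are hypothetical, and only the flags shown above (`--pipeline vlm` and `--vlm-model`) are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_docling_command(source: str, vlm_model: Optional[str] = None) -> List[str]:
    """Build the argument list for a `docling` CLI call.

    Hypothetical helper: it only knows the flags shown in this README.
    """
    command = ["docling"]
    if vlm_model is not None:
        # Switch to the VLM pipeline and select the model by name.
        command += ["--pipeline", "vlm", "--vlm-model", vlm_model]
    command.append(source)
    return command


def run_docling(source: str, vlm_model: Optional[str] = None) -> int:
    """Run the CLI if it is available on PATH and return its exit code."""
    if shutil.which("docling") is None:
        raise FileNotFoundError("the `docling` executable was not found on PATH")
    completed = subprocess.run(build_docling_command(source, vlm_model), check=False)
    return completed.returncode
```

For example, `build_docling_command("https://arxiv.org/pdf/2206.01062", vlm_model="smoldocling")` produces the same arguments as the SmolDocling command above. Passing a list rather than a shell string avoids quoting problems with URLs and file paths.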
## Documentation

Check out Docling's [documentation](https://docling-project.github.io/docling/) for details on
installation, usage, concepts, recipes, extensions, and more.

## Examples

Go hands-on with our [examples](https://docling-project.github.io/docling/examples/),
demonstrating how to address different application use cases with Docling.

## Integrations

To further accelerate your AI application development, check out Docling's native
[integrations](https://docling-project.github.io/docling/integrations/) with popular frameworks
and tools.

## Get help and support

Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).

## Technical report
For more details on Docling's inner workings, check out the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
## Contributing

Please read [Contributing to Docling](https://github.com/docling-project/docling/blob/main/CONTRIBUTING.md) for details.

## References

If you use Docling in your projects, please consider citing the following:

```bib
@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}
```

## License

The Docling codebase is under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.

## LF AI & Data

Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).

### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.
[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
[integrations]: https://docling-project.github.io/docling/integrations/