
* updated the base-model and added the asciidoc_backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the asciidoc backend Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Ensure all models work only on valid pages (#158) Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * ci: run ci also on forks (#160) --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * fix: fix legacy doc ref (#162) Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * docs: typo fix (#155) * Docs: Typo fix - Corrected spelling of invidual to automatic Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com> * add synchronize event for forks Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> * feat: add coverage_threshold to skip OCR for small images (#161) * feat: add coverage_threshold to skip OCR for small images Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * filter individual boxes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * rename option Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * chore: bump version to 2.1.0 [skip ci] * adding tests for asciidocs Signed-off-by: Peter Staar <taa@zurich.ibm.com> * first working asciidoc parser Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted the code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * adding test_02.asciidoc Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Drafting Markdown backend via Marko library Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * work in progress on MD backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * md_backend produces docling document with headers, paragraphs, lists Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Improvements in md parsing Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Detecting and assembling tables in markdown in temporary buffers Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added initial docling table support to md_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Cleaned code, improved logging for MD Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixes MyPy requirements, and rest of pre-commit Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixed example run_md, added origin info to md_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * working on asciidocs, struggling with ImageRef Signed-off-by: Peter Staar <taa@zurich.ibm.com> * able to parse the captions and image uri's Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fixed the mypy Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Update all backends with proper filename in DocumentOrigin Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Update to docling-core v2.1.0 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fixes for MD Backend, to avoid duplicated text inserts into docling doc Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fix styling Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Added support for code blocks and fenced code in MD Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * cleaned prints Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Added proper processing of in-line textual elements for MD backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixed issues with duplicated paragraphs and incorrect lists in pptx Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> * Fixed issue with group ordeering in pptx backend, added gebug log into run with formats Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com> Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Peter Staar <taa@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
3.7 KiB
Docling
Docling parses documents and exports them to the desired format with ease and speed.
Features
- 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
- 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
- 📝 Metadata extraction, including title, authors, references & language
- 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
- 🔍 OCR support for scanned PDFs
- 💻 Simple and convenient CLI
Explore the documentation to discover plenty examples and unlock the full power of Docling!
Installation
To use Docling, simply install docling
from your package manager, e.g. pip:
pip install docling
Works on macOS, Linux and Windows environments. Both x86_64 and arm64 architectures.
More detailed installation instructions are available in the docs.
Getting started
To convert individual documents, use convert()
, for example:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "## Docling Technical Report[...]"
Check out Getting started. You will find lots of tuning options to leverage all the advanced capabilities.
Get help and support
Please feel free to connect with us using the discussion section.
Technical report
For more details on Docling's inner workings, check out the Docling Technical Report.
Contributing
Please read Contributing to Docling for details.
References
If you use Docling in your projects, please consider citing the following:
@techreport{Docling,
author = {Deep Search Team},
month = {8},
title = {Docling Technical Report},
url = {https://arxiv.org/abs/2408.09869},
eprint = {2408.09869},
doi = {10.48550/arXiv.2408.09869},
version = {1.0.0},
year = {2024}
}
License
The Docling codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.