docling/CHANGELOG.md
2025-06-25 16:27:46 +00:00

v2.38.1 - 2025-06-25

Fix

  • Updated granite vision model version for picture description (#1852) (d337825)
  • markdown: Fix single-formatted headings & list items (#1820) (7c5614a)
  • Fix response type of ollama (#1850) (41e8cae)
  • Handle missing runs to avoid out of range exception (#1844) (4002de1)

v2.38.0 - 2025-06-23

Feature

Fix

  • docx: Ensure list items have a list parent (#1827) (d26dac6)
  • msword_backend: Identify text in the same line after an image #1425 (#1610) (1350a8d)
  • Ensure uninitialized pages are removed before assembling document (#1812) (dd7f64f)
  • Formula conversion with page_range param set (#1791) (dbab30e)

Documentation

v2.37.0 - 2025-06-16

Feature

  • Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) (7d3302c)
  • Support xlsm files (#1520) (df14022)

Fix

  • Pptx line break and space handling (#1664) (f28d23c)
  • asciidoc: Set default size when missing in image directive (#1769) (b886e4d)
  • Handle NoneType error in MsPowerpointDocumentBackend (#1747) (7a275c7)
  • Prov for merged-elems (#1728) (6613b9e)
  • tesseract: Initialize df_osd to avoid uninitialized variable error (#1718) (e979750)
  • Allow custom torch_dtype in vlm models (#1735) (f7f3113)
  • Improve extraction from textboxes in Word docs (#1701) (9dbcb3d)
  • Add WEBP to the list of image file extensions (#1711) (a2b83fe)

Documentation

v2.36.1 - 2025-06-04

Fix

Documentation

v2.36.0 - 2025-06-03

Feature

v2.35.0 - 2025-06-02

Feature

  • Add visualization of bbox on page with html export. (#1663) (b356b33)

Fix

  • Guess HTML content starting with script tag (#1673) (984cb13)
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665) (51d3450)

Documentation

v2.34.0 - 2025-05-22

Feature

  • ocr: Auto-detect rotated pages in Tesseract (#1167) (45265bf)
  • Establish confidence estimation for document and pages (#1313) (9087524)

Fix

  • Fix ZeroDivisionError for cell_bbox.area() (#1636) (c2f595d)
  • integration: Update the Apify Actor integration (#1619) (14d4f5b)
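The ZeroDivisionError fix for cell_bbox.area() listed above concerns dividing by the area of a degenerate (zero-width or zero-height) box. The sketch below shows one common way to guard such a division. It is an illustration only, not docling's actual code: the `Box` class, the `overlap_ratio` helper, and the top-left-origin coordinate convention are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; assumes left <= right and top <= bottom."""

    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # Clamp each side at zero so inverted or empty boxes have area 0.
        width = max(0.0, self.right - self.left)
        height = max(0.0, self.bottom - self.top)
        return width * height


def overlap_ratio(cell: Box, other: Box) -> float:
    """Return the fraction of `cell` covered by `other`.

    A degenerate cell (area 0) yields 0.0 instead of raising ZeroDivisionError.
    """
    cell_area = cell.area()
    if cell_area == 0.0:
        return 0.0
    intersection = Box(
        max(cell.left, other.left),
        max(cell.top, other.top),
        min(cell.right, other.right),
        min(cell.bottom, other.bottom),
    )
    return intersection.area() / cell_area
```

Returning 0.0 for an empty cell is a design choice for this sketch; a caller that needs to distinguish "no overlap" from "no area" could return `None` instead.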

v2.33.0 - 2025-05-20

Feature

  • Add textbox content extraction in msword_backend (#1538) (12a0e64)

Fix

  • Fix issue with detecting docx files, and files with upper case extensions (#1609) (f4d9d41)
  • Load_from_doctags static usage (#1617) (0e00a26)
  • Incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) (f2e9c07)
  • pypdfium: Resolve overlapping text when merging bounding boxes (#1549) (98b5eeb)
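The fix above for files with upper case extensions (#1609) addresses a familiar pitfall: comparing a file suffix against a lowercase table without normalizing it first. A minimal sketch of case-insensitive detection follows. The `guess_format` function and the `_EXT_TO_FORMAT` table are hypothetical names for illustration and do not reflect docling's internal format detection.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table; a real converter would cover many more formats.
_EXT_TO_FORMAT = {
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".pdf": "pdf",
}


def guess_format(filename: str) -> Optional[str]:
    """Guess a format from the file extension, ignoring letter case."""
    # Lowercase the suffix so "REPORT.DOCX" and "report.docx" match alike.
    suffix = Path(filename).suffix.lower()
    return _EXT_TO_FORMAT.get(suffix)
```

With this normalization, `guess_format("REPORT.DOCX")` and `guess_format("report.docx")` resolve to the same entry, and an unknown or missing extension returns `None`.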

v2.32.0 - 2025-05-14

Feature

  • Improve parallelization for remote services API calls (#1548) (3a04f2a)
  • Support image/webp file type (#1415) (12dab0a)

Fix

  • ocr: Orig field in TesseractOcrCliModel as str (#1553) (9f8b479)
  • settings: Fix nested settings load via environment variables (#1551) (2efb7a7)

Documentation

  • Add advanced chunking & serialization example (#1589) (9f28abf)

v2.31.2 - 2025-05-13

Fix

  • AsciiDoc header identification (#1562) (#1563) (4046d0b)
  • Restrict click version and update lock file (#1582) (8baa85a)

v2.31.1 - 2025-05-12

Fix

  • Add smoldocling in download utils (#1577) (127e386)
  • HTML: Handle row spans in header rows (#1536) (776e7ec)
  • Mime error in document streams (#1523) (f1658ed)
  • Usage of hashlib for FIPS (#1512) (7c70573)
  • Guard against attribute errors in TesseractOcrModel del (#1494) (4ab7e9d)
  • Enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496) (cc45396)
  • Updated the time-recorder label for reading order (#1490) (976e92e)
  • Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459) (94d66a0)

Documentation

v2.31.0 - 2025-04-25

Feature

  • Add tutorial using Milvus and Docling for RAG pipeline (#1449) (a2fbbba)

Fix

  • html: Handle address, details, and summary tags (#1436) (ed20124)
  • Treat overflowing -v flags as DEBUG (#1419) (8012a3e)
  • codecov: Fix codecov argument and yaml file (#1399) (fa7fc9e)
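The entry above on overflowing -v flags (#1419) describes clamping: passing more -v flags than there are verbosity levels should select the most verbose level rather than fail. One way to express that idea is sketched below, assuming a three-step scale of WARNING, INFO, DEBUG. The `verbosity_to_level` helper and the `_LEVELS` list are illustrative names, not the docling CLI's actual implementation.

```python
import logging

# Assumed scale: no flag -> WARNING, -v -> INFO, -vv (or more) -> DEBUG.
_LEVELS = [logging.WARNING, logging.INFO, logging.DEBUG]


def verbosity_to_level(count: int) -> int:
    """Map a -v count to a logging level, clamping out-of-range counts."""
    # Clamp into the valid index range so extra flags fall back to DEBUG.
    index = max(0, min(count, len(_LEVELS) - 1))
    return _LEVELS[index]
```

Under these assumptions, `verbosity_to_level(5)` returns `logging.DEBUG`, the same as `verbosity_to_level(2)`.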

Documentation

v2.30.0 - 2025-04-14

Feature

  • cli: Add option for html with split-page mode (#1355) (c0ba88e)
  • xlsx: Create a page for each worksheet in XLSX backend (#1332) (eef2bde)
  • OllamaVlmModel for Granite Vision 3.2 (#1337) (c605edd)

Fix

  • deps: Widen typer upper bound (#1375) (7e40ad3)
  • Auto-recognize .xlsx, .docx and .pptx files (#1340) (0de70e7)
  • docx: Declare image_data variable when handling pictures (#1359) (415b877)
  • Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248) (2503999)
  • Properly address page in pipeline _assemble_document when page_range is provided (#1334) (6b696b5)

v2.29.0 - 2025-04-10

Feature

Fix

  • docx: Adding new latex symbols, simplifying how equations are added to text (#1295) (14e9c0c)
  • pptx: Check if picture shape has an image attached (#1316) (dc3bf9c)
  • docx: Improve text parsing (#1268) (d2d6874)
  • Tesseract OCR CLI can't process images composed of numbers only (#1201) (b3d111a)

Documentation

v2.28.4 - 2025-03-29

Fix

v2.28.3 - 2025-03-28

Fix

v2.28.2 - 2025-03-26

Fix

  • Improve HTML layer detection, various MD fixes (#1241) (9210812)
  • html: Fix HTML parsed heading level (#1244) (85c4df8)

v2.28.1 - 2025-03-25

Fix

  • converter: Cache same pipeline class with different options (#1152) (825b226)
  • debug: Missing translation of bbox to to_bounding_box (#1220) (6df8827)
  • docx: Identifying numbered headers (#1231) (f739d0e)

Documentation

  • examples: Batch conversion doc raises_on_error (#1147) (0974ba4)

v2.28.0 - 2025-03-19

Feature

  • SmolDocling: Support MLX acceleration in VLM pipeline (#1199) (1c26769)
  • Add PPTX notes slides (#474) (b454aa1)
  • Updated vlm pipeline (with latest changes from docling-core) (#1158) (2f72167)

Fix

  • Determine correct page size in DoclingParseV4Backend (#1196) (f5adfb9)
  • msword: Fixing function return in equations handling (#1194) (0b707d0)

Documentation

v2.27.0 - 2025-03-18

Feature

  • Add factory for ocr engines via plugins (#1010) (6eaae3c)
  • Add DoclingParseV4 backend, using high-level docling-parse API (#905) (3960b19)
  • actor: Docling Actor on Apify infrastructure (#875) (772487f)
  • Equations to latex in MSWord backend (with inline groups) (#1114) (6eb718f)

Fix

Documentation

v2.26.0 - 2025-03-11

Feature

  • Use new TableFormer model weights and default to accurate model version (#1100) (eb97357)

Fix

Documentation

  • Add description of DOCLING_ARTIFACTS_PATH env var (#1124) (e1c49ad)

Performance

  • New revision code formula model and document picture classifier (#1140) (5e30381)

v2.25.2 - 2025-03-05

Fix

  • Proper handling of orphan IDs in layout postprocessing (#1118) (c56ab3a)

Documentation

v2.25.1 - 2025-03-03

Fix

  • Enable locks for threadsafe pdfium (#1052) (8dc0562)
  • html: Use 'start' attribute when parsing ordered lists from HTML docs (#1062) (de7b963)

Documentation

  • Improve docs on token limit warning triggered by HybridChunker (#1077) (db3ceef)

v2.25.0 - 2025-02-26

Feature

  • [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054) (3c9fe76)
  • cli: Add option for downloading all models, refine help messages (#1061) (ab683e4)

Fix

Documentation

  • Extend chunking docs, add FAQ on token limit (#1053) (c84b973)

v2.24.0 - 2025-02-20

Feature

v2.23.1 - 2025-02-20

Fix

  • Runtime error when Pandas Series is not always of string type (#1024) (6796f0a)

Documentation

v2.23.0 - 2025-02-17

Feature

Fix

  • Revise DocTags, fix iterate_items to output content_layer in items (#965) (6e75f0b)

v2.22.0 - 2025-02-14

Feature

  • Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) (00d9405)
  • Introduce the enable_remote_services option to allow remote connections while processing (#941) (2716c7d)
  • Allow artifacts_path to be defined as ENV (#940) (5101e25)

Fix

Documentation

  • Update example Dockerfile with download CLI (#929) (7493d5b)
  • Examples for picture descriptions (#951) (2d66e99)

v2.21.0 - 2025-02-10

Feature

  • Add content_layer property to items to address body, furniture and other roles (#735) (cf78d5b)

v2.20.0 - 2025-02-07

Feature

  • Describe pictures using vision models (#259) (4cc6e3e)

Fix

v2.19.0 - 2025-02-07

Feature

Fix

  • markdown: Handle nested lists (#910) (90b766e)
  • Test cases for RTL programmatic PDFs and fixes for the formula model (#903) (9114ada)
  • msword_backend: Handle conversion error in label parsing (#896) (722a6eb)
  • Enrichment models batch size and expose picture classifier (#878) (5ad6de0)

Documentation

  • Introduce example with custom models for RapidOCR (#874) (6d3fea0)

v2.18.0 - 2025-02-03

Feature

Fix

  • markdown: Fix parsing when the document ends with a table (#873) (5ac2887)
  • markdown: Add support for HTML content (#855) (94751a7)
  • docx: Merged table cells not properly converted (#857) (0cd81a8)
  • Processing of placeholder shapes in pptx that have text but no bbox (#868) (eff16b6)
  • KeyError in tableformer prediction (#854) (b1cf796)
  • Fixed docx import with headers that are also lists (#842) (2c037ae)
  • Use new add_code in html backend and add more typing hints (#850) (2a1f8af)
  • markdown: Fix empty block handling (#843) (bccb022)
  • Fix for the crash when encountering WMF images in pptx and docx (#837) (fea0a99)

Documentation

  • Updated the readme with upcoming features (#831) (d7c0828)
  • Add example for inspection of picture content (#624) (f9144f2)

v2.17.0 - 2025-01-28

Feature

  • CLI: Expose code and formula models in the CLI (#820) (6882e6c)
  • Add platform info to CLI version printout (#816) (95b293a)
  • ocr: Expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) (5332755)
  • Introduce automatic language detection in TesseractOcrCliModel (#800) (3be2fb5)

Fix

  • Fix single newline handling in MD backend (#824) (5aed9f8)
  • Use file extension if filetype fails with PDF (#827) (adf6353)
  • Parse html with omitted body tag (#818) (a112d7a)

Documentation

v2.16.0 - 2025-01-24

Feature

  • New document picture classifier (#805) (16a218d)
  • Add Docling JSON ingestion (#783) (88a0e66)
  • Code and equation model for PDF and code blocks in markdown (#752) (3213b24)
  • Add "auto" language for TesseractOcr (#759) (8543c22)

Fix

  • Added extraction of byte-images in excel (#804) (a458e29)
  • Update docling-parse-v2 backend version with new parsing fixes (#769) (670a08b)

Documentation

v2.15.1 - 2025-01-10

Fix

  • Improve OCR results, tighten criteria before dropping bitmap areas (#719) (5a060f2)
  • Allow earlier requests versions (#716) (e64b5a2)

Documentation

v2.15.0 - 2025-01-08

Feature

  • Added http header support for document converter and cli (#642) (0ee849e)

Fix

  • Correct scaling of debug visualizations, tune OCR (#700) (5cb4cf6)
  • Let BeautifulSoup detect the HTML encoding (#695) (42856fd)
  • mspowerpoint: Handle invalid images in PowerPoint slides (#650) (d49650c)

Documentation

v2.14.0 - 2024-12-18

Feature

  • Create a backend to transform PubMed XML files to DoclingDocument (#557) (fd03480)

v2.13.0 - 2024-12-17

Feature

  • Updated Layout processing with forms and key-value areas (#530) (60dc852)
  • Create a backend to parse USPTO patents into DoclingDocument (#606) (4e08750)
  • Add Easyocr parameter recog_network (#613) (3b53bd3)

Documentation

  • Add Haystack RAG example (#615) (3e599c7)
  • Fix the path to the run_with_accelerator.py example (#608) (3bb3bf5)

v2.12.0 - 2024-12-13

Feature

  • Introduce support for GPU Accelerators (#593) (19fad92)

v2.11.0 - 2024-12-12

Feature

  • Add timeout limit to document parsing job. DS4SD#270 (#552) (3da166e)

Fix

  • Do not import python modules from deepsearch-glm (#569) (aee9c0b)
  • Handle no result from RapidOcr reader (#558) (f45499c)
  • Make enum serializable with human-readable value (#555) (a7df337)

Documentation

  • Update chunking usage docs, minor reorg (#550) (d0c9e8e)

v2.10.0 - 2024-12-09

Feature

  • Docling-parse v2 as default PDF backend (#549) (aca57f0)

Fix

  • Call into docling-core for legacy document transform (#551) (7972d47)
  • Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544) (78f61a8)

v2.9.0 - 2024-12-09

Feature

  • Expose new hybrid chunker, update docs (#384) (c8ecdd9)
  • MS Word backend: Make detection of headers and other styles localization agnostic (#534) (3e073df)

Fix

  • Correcting DefaultText ID for MS Word backend (#537) (eb7ffcd)
  • Add py.typed marker file (#531) (9102fe1)
  • Enable HTML export in CLI and add options for image mode (#513) (0d11e30)
  • Missing text in docx (t tag) when embedded in a table (#528) (b730b2d)
  • Restore pydantic version pin after fixes (#512) (c830b92)
  • Folder input in cli (#511) (8ada0bc)

Documentation

v2.8.3 - 2024-12-03

Fix

  • Improve handling of disallowed formats (#429) (34c7c79)

v2.8.2 - 2024-12-03

Fix

  • ParserError EOF inside string (#470) (#472) (c90c41c)
  • PermissionError when using tesseract_ocr_cli_model (#496) (d3f84b2)

Documentation

Performance

  • Prevent temp file leftovers, reuse core type (#487) (051789d)

v2.8.1 - 2024-11-29

Fix

Documentation

v2.8.0 - 2024-11-27

Feature

  • ocr: Added support for RapidOCR engine (#415) (85b2999)

Fix

  • Use correct image index in word backend (#442) (767563b)
  • Update tests and examples for docling-core 2.5.1 (#449) (29807a2)

v2.7.1 - 2024-11-26

Fix

Documentation

  • Add DocETL, Kotaemon, spaCy integrations; minor docs improvements (#408) (7a45b92)

v2.7.0 - 2024-11-20

Feature

  • Add support for ocrmac OCR engine on macOS (#276) (6efa96c)

Fix

v2.6.0 - 2024-11-19

Feature

  • Added support for exporting DocItem to an image when page image is available (#379) (3f91e7d)
  • Expose ocr-lang in CLI (#375) (ed785ea)
  • Added excel backend (#334) (926dfd2)
  • Extracting picture data for raster images found in PPTX (#349) (7a97d71)

Fix

  • Fixing images in the input Word files (#330) (8533039)
  • Reduce logging while keeping an option for more verbose output (#323) (8b437ad)

Documentation

v2.5.2 - 2024-11-13

Fix

v2.5.1 - 2024-11-12

Fix

  • Handling of single-cell tables in DOCX backend (#314) (fb8ba86)

Documentation

v2.5.0 - 2024-11-12

Feature

  • OCR: Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) (c6b3763)

Fix

  • Configure env prefix for docling settings (#315) (5d4a10b)
  • Added handling of grouped elements in pptx backend (#307) (81c8243)
  • Allow mps usage for easyocr (#286) (97f214e)

Documentation

v2.4.2 - 2024-11-08

Fix

  • EasyOcrModel: Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282) (0eb065e)

v2.4.1 - 2024-11-08

Fix

  • tesserocr: Raise Exception if tesserocr has not loaded any languages (#279) (704d792)
  • Dockerfile example copy command (#234) (90836db)

Documentation

v2.4.0 - 2024-11-04

Feature

  • Pdf backend, table mode as options and artifacts path (#203) (40ad987)

Documentation

v2.3.1 - 2024-10-30

Fix

  • Simplify torch dependencies and update pinned docling deps (#190) (eb679cc)
  • Allow explicit initialization of the pipeline (#189) (904d24d)

v2.3.0 - 2024-10-30

Feature

  • Add pipeline timings and toggle visualization, establish debug settings (#183) (2a2c65b)

Fix

  • Fix duplicate title and heading + add e2e tests for html and docx (#186) (f542460)

v2.2.1 - 2024-10-28

Fix

  • Fix header levels for DOCX & HTML (#184) (b9f5c74)
  • Handling of long sequence of unescaped underscore chars in markdown (#173) (94d0729)
  • HTML backend, fixes for Lists and nested texts (#180) (7d19418)
  • MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178) (88c1673)

Documentation

v2.2.0 - 2024-10-23

Feature

  • Update to docling-parse v2 without history (#170) (4116819)
  • Support AsciiDoc and Markdown input format (#168) (3023f18)

Fix

  • Set valid=false for invalid backends (#171) (3496b48)

v2.1.0 - 2024-10-18

Feature

  • Add coverage_threshold to skip OCR for small images (#161) (b346faf)

Fix

Documentation

v2.0.0 - 2024-10-16

Feature

Breaking

Documentation

v1.20.0 - 2024-10-11

Feature

  • New experimental docling-parse v2 backend (#131) (5e4944f)

v1.19.1 - 2024-10-11

Fix

  • Remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) (dae2a3b)

Documentation

  • Simplify LlamaIndex example using Docling extension (#135) (5f1bd9e)

v1.19.0 - 2024-10-08

Feature

  • Add options for choosing OCR engines (#118) (f96ea86)

v1.18.0 - 2024-10-03

Feature

v1.17.0 - 2024-10-03

Feature

v1.16.1 - 2024-09-27

Fix

Documentation

v1.16.0 - 2024-09-27

Feature

  • Support tableformer model choice (#90) (d6df76f)

v1.15.0 - 2024-09-24

Feature

v1.14.0 - 2024-09-24

Feature

Fix

  • Fix OCR setting for pypdfium, minor refactor (#102) (d96b96c)

Documentation

v1.13.1 - 2024-09-23

Fix

  • Updated the render_as_doctags with the new arguments from docling-core (#93) (4794ce4)

v1.13.0 - 2024-09-18

Feature

Fix

  • Bumped the glm version and adjusted the tests (#83) (442443a)

Documentation

  • Updated Docling logo.png with transparent background (#88) (0da7519)

v1.12.2 - 2024-09-17

Fix

  • tests: Adjust the test data to match the new version of LayoutPredictor (#82) (fa9699f)

v1.12.1 - 2024-09-16

Fix

  • CLI compatibility with python 3.10 and 3.11 (#79) (2870fdc)

v1.12.0 - 2024-09-13

Feature

Documentation

  • Showcase RAG with LlamaIndex and LangChain (#71) (53569a1)

v1.11.0 - 2024-09-10

Feature

v1.10.0 - 2024-09-10

Feature

  • Linux arm64 support and reducing dependencies (#69) (27a7a15)

v1.9.0 - 2024-09-03

Feature

  • Export document pages as multimodal output (#54) (1de2e4f)

Documentation

v1.8.5 - 2024-08-30

Fix

v1.8.4 - 2024-08-30

Fix

Documentation

  • Add instructions for cpu-only installation (#56) (a8a60d5)

v1.8.3 - 2024-08-28

Fix

  • Table cells overlap and model warnings (#53) (f49ee82)

v1.8.2 - 2024-08-27

Fix

Documentation

v1.8.1 - 2024-08-26

Fix

v1.8.0 - 2024-08-23

Feature

  • Page-level error reporting from PDF backend, introduce PARTIAL_SUCCESS status (#47) (a294b7e)

v1.7.1 - 2024-08-23

Fix

  • Improve exception raising when a page fails to parse (#46) (8808463)
  • Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45) (7e84533)

v1.7.0 - 2024-08-22

Feature

  • Upgrade docling-parse PDF backend and interface to use page-by-page parsing (#44) (a8c6b29)

v1.6.3 - 2024-08-22

Fix

  • Usage of bytesio with docling-parse (#43) (fac5745)

v1.6.2 - 2024-08-22

Fix

  • Remove [ocr] extra to fix wheel install (#42) (6995268)

v1.6.1 - 2024-08-21

Fix

v1.6.0 - 2024-08-20

Feature

  • Add adaptive OCR, factor out treatment of OCR areas and cell filtering (#38) (e94d317)

v1.5.0 - 2024-08-20

Feature

  • Allow computing page images on-demand with scale and cache them (#36) (78347bf)

Documentation

v1.4.0 - 2024-08-14

Feature

  • Update parser with bytesio interface and set as new default backend (#32) (90dd676)

Fix

v1.3.0 - 2024-08-12

Feature

  • Output page images and extracted bbox (#31) (63d80ed)

v1.2.1 - 2024-08-07

Fix

Documentation

v1.2.0 - 2024-08-07

Feature

v1.1.2 - 2024-07-31

Fix

  • Set page number using 1-based indexing (#22) (d2d9543)

v1.1.1 - 2024-07-30

Fix

  • Correct text extraction for table cells (#21) (f4bf3d2)

v1.1.0 - 2024-07-26

Feature

  • Add simplified single-doc conversion (#20) (d603137)

v1.0.2 - 2024-07-24

Fix

  • Add easyocr to main deps for valid extra (#19) (54b3dda)

v1.0.1 - 2024-07-24

Fix

v1.0.0 - 2024-07-18

Feature

Breaking

v0.4.0 - 2024-07-17

Feature

  • Optimize table extraction quality, add configuration options (#11) (e9526bb)

v0.3.1 - 2024-07-17

Fix

Documentation

  • Reflect supported Python versions, add badges (#10) (2baa35c)

v0.3.0 - 2024-07-17

Feature

  • Enable python 3.12 support by updating glm (#8) (fb72688)

Documentation

  • Add setup with pypi to Readme (#7) (2803222)

v0.2.0 - 2024-07-16

Feature