Mirror of https://github.com/docling-project/docling.git (synced 2025-06-27 05:20:05 +00:00)
v2.38.1 - 2025-06-25
Fix
- Updated granite vision model version for picture description (#1852) (d337825)
- markdown: Fix single-formatted headings & list items (#1820) (7c5614a)
- Fix response type of ollama (#1850) (41e8cae)
- Handle missing runs to avoid out of range exception (#1844) (4002de1)
v2.38.0 - 2025-06-23
Feature
- Support audio input (#1763) (1557e7c)
- markdown: Add formatting & improve inline support (#1804) (861abcd)
- Maximum image size for Vlm models (#1802) (215b540)
Fix
- docx: Ensure list items have a list parent (#1827) (d26dac6)
- msword_backend: Identify text in the same line after an image #1425 (#1610) (1350a8d)
- Ensure uninitialized pages are removed before assembling document (#1812) (dd7f64f)
- Formula conversion with page_range param set (#1791) (dbab30e)
Documentation
- Update readme and add ASR example (#1836) (f3ae302)
- Support running examples from root or subfolder (#1816) (64ac043)
v2.37.0 - 2025-06-16
Feature
- Make Page.parsed_page the only source of truth for text cells, add OCR cells to it (#1745) (7d3302c)
- Support xlsm files (#1520) (df14022)
Fix
- Pptx line break and space handling (#1664) (f28d23c)
- asciidoc: Set default size when missing in image directive (#1769) (b886e4d)
- Handle NoneType error in MsPowerpointDocumentBackend (#1747) (7a275c7)
- Prov for merged-elems (#1728) (6613b9e)
- tesseract: Initialize df_osd to avoid uninitialized variable error (#1718) (e979750)
- Allow custom torch_dtype in vlm models (#1735) (f7f3113)
- Improve extraction from textboxes in Word docs (#1701) (9dbcb3d)
- Add WEBP to the list of image file extensions (#1711) (a2b83fe)
Documentation
v2.36.1 - 2025-06-04
Fix
Documentation
v2.36.0 - 2025-06-03
Feature
v2.35.0 - 2025-06-02
Feature
Fix
- Guess HTML content starting with script tag (#1673) (984cb13)
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665) (51d3450)
Documentation
v2.34.0 - 2025-05-22
Feature
- ocr: Auto-detect rotated pages in Tesseract (#1167) (45265bf)
- Establish confidence estimation for document and pages (#1313) (9087524)
Fix
- Fix ZeroDivisionError for cell_bbox.area() (#1636) (c2f595d)
- integration: Update the Apify Actor integration (#1619) (14d4f5b)
v2.33.0 - 2025-05-20
Feature
Fix
- Fix issue with detecting docx files, and files with upper case extensions (#1609) (f4d9d41)
- Load_from_doctags static usage (#1617) (0e00a26)
- Incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371) (f2e9c07)
- pypdfium: Resolve overlapping text when merging bounding boxes (#1549) (98b5eeb)
v2.32.0 - 2025-05-14
Feature
- Improve parallelization for remote services API calls (#1548) (3a04f2a)
- Support image/webp file type (#1415) (12dab0a)
Fix
- ocr: Orig field in TesseractOcrCliModel as str (#1553) (9f8b479)
- settings: Fix nested settings load via environment variables (#1551) (2efb7a7)
Documentation
v2.31.2 - 2025-05-13
Fix
- AsciiDoc header identification (#1562) (#1563) (4046d0b)
- Restrict click version and update lock file (#1582) (8baa85a)
v2.31.1 - 2025-05-12
Fix
- Add smoldocling in download utils (#1577) (127e386)
- HTML: Handle row spans in header rows (#1536) (776e7ec)
- Mime error in document streams (#1523) (f1658ed)
- Usage of hashlib for FIPS (#1512) (7c70573)
- Guard against attribute errors in TesseractOcrModel __del__ (#1494) (4ab7e9d)
- Enable cuda_use_flash_attention2 for PictureDescriptionVlmModel (#1496) (cc45396)
- Updated the time-recorder label for reading order (#1490) (976e92e)
- Incorrect scaling of TableModel bboxes when do_cell_matching is False (#1459) (94d66a0)
Documentation
- Update links in data_prep_kit (#1559) (844babb)
- Add serialization docs, update chunking docs (#1556) (3220a59)
- Update supported formats guide (#1463) (3afbe6c)
v2.31.0 - 2025-04-25
Feature
Fix
- html: Handle address, details, and summary tags (#1436) (ed20124)
- Treat overflowing -v flags as DEBUG (#1419) (8012a3e)
- codecov: Fix codecov argument and yaml file (#1399) (fa7fc9e)
Documentation
- Fix wrong output format in example code (#1427) (c2470ed)
- Add OpenSSF Best Practices badge (#1430) (64918a8)
- Typo fixes in docling_document.md (#1400) (995b3b0)
- Updated the [Usage] link in architecture.md (#1416) (88948b0)
- ocr: Add docs entry for OnnxTR OCR plugin (#1382) (a7dd59c)
- security: More statements about secure development (#1381) (293c28c)
- Add testing in the docs (#1379) (01fbfd5)
- Add Notes for Installing in Intel macOS (#1377) (a026b4e)
v2.30.0 - 2025-04-14
Feature
- cli: Add option for html with split-page mode (#1355) (c0ba88e)
- xlsx: Create a page for each worksheet in XLSX backend (#1332) (eef2bde)
- OllamaVlmModel for Granite Vision 3.2 (#1337) (c605edd)
Fix
- deps: Widen typer upper bound (#1375) (7e40ad3)
- Auto-recognize .xlsx, .docx and .pptx files (#1340) (0de70e7)
- docx: Declare image_data variable when handling pictures (#1359) (415b877)
- Implement PictureDescriptionApiOptions.bitmap_area_threshold (#1248) (2503999)
- Properly address page in pipeline _assemble_document when page_range is provided (#1334) (6b696b5)
v2.29.0 - 2025-04-10
Feature
- Handle tags as code blocks (#1320) (0499cd1)
- docx: Add text formatting and hyperlink support (#630) (bfcab3d)
Fix
- docx: Adding new latex symbols, simplifying how equations are added to text (#1295) (14e9c0c)
- pptx: Check if picture shape has an image attached (#1316) (dc3bf9c)
- docx: Improve text parsing (#1268) (d2d6874)
- Tesseract OCR CLI can't process images composed with numbers only (#1201) (b3d111a)
Documentation
v2.28.4 - 2025-03-29
Fix
v2.28.3 - 2025-03-28
Fix
v2.28.2 - 2025-03-26
Fix
- Improve HTML layer detection, various MD fixes (#1241) (9210812)
- html: Fix HTML parsed heading level (#1244) (85c4df8)
v2.28.1 - 2025-03-25
Fix
- converter: Cache same pipeline class with different options (#1152) (825b226)
- debug: Missing translation of bbox to to_bounding_box (#1220) (6df8827)
- docx: Identifying numbered headers (#1231) (f739d0e)
Documentation
v2.28.0 - 2025-03-19
Feature
- SmolDocling: Support MLX acceleration in VLM pipeline (#1199) (1c26769)
- Add PPTX notes slides (#474) (b454aa1)
- Updated vlm pipeline (with latest changes from docling-core) (#1158) (2f72167)
Fix
- Determine correct page size in DoclingParseV4Backend (#1196) (f5adfb9)
- msword: Fixing function return in equations handling (#1194) (0b707d0)
Documentation
v2.27.0 - 2025-03-18
Feature
- Add factory for ocr engines via plugins (#1010) (6eaae3c)
- Add DoclingParseV4 backend, using high-level docling-parse API (#905) (3960b19)
- actor: Docling Actor on Apify infrastructure (#875) (772487f)
- Equations to latex in MSWord backend (with inline groups) (#1114) (6eb718f)
Fix
- html: Handle nested empty lists (#1154) (f94da44)
- Use first table row as col headers (#1156) (0945973)
- Pass tests, update docling-core to 2.22.0 (#1150) (aa92a57)
Documentation
v2.26.0 - 2025-03-11
Feature
Fix
Documentation
Performance
v2.25.2 - 2025-03-05
Fix
Documentation
v2.25.1 - 2025-03-03
Fix
- Enable locks for threadsafe pdfium (#1052) (8dc0562)
- html: Use 'start' attribute when parsing ordered lists from HTML docs (#1062) (de7b963)
Documentation
v2.25.0 - 2025-02-26
Feature
- [Experimental] Introduce VLM pipeline using HF AutoModelForVision2Seq, featuring SmolDocling model (#1054) (3c9fe76)
- cli: Add option for downloading all models, refine help messages (#1061) (ab683e4)
Fix
- Vlm using artifacts path (#1057) (e197225)
- html: Parse text in div elements as TextItem (#1041) (1b0ead6)
Documentation
v2.24.0 - 2025-02-20
Feature
v2.23.1 - 2025-02-20
Fix
Documentation
v2.23.0 - 2025-02-17
Feature
- Support cuda:n GPU device allocation (#694) (77eb77b)
- xml-jats: Parse XML JATS documents (#967) (428b656)
Fix
v2.22.0 - 2025-02-14
Feature
- Add support for CSV input with new backend to transform CSV files to DoclingDocument (#945) (00d9405)
- Introduce the enable_remote_services option to allow remote connections while processing (#941) (2716c7d)
- Allow artifacts_path to be defined as ENV (#940) (5101e25)
Fix
- Update Pillow constraints (#958) (af19c03)
- Fix the initialization of the TesseractOcrModel (#935) (c47ae70)
Documentation
- Update example Dockerfile with download CLI (#929) (7493d5b)
- Examples for picture descriptions (#951) (2d66e99)
v2.21.0 - 2025-02-10
Feature
v2.20.0 - 2025-02-07
Feature
Fix
v2.19.0 - 2025-02-07
Feature
Fix
- markdown: Handle nested lists (#910) (90b766e)
- Test cases for RTL programmatic PDFs and fixes for the formula model (#903) (9114ada)
- msword_backend: Handle conversion error in label parsing (#896) (722a6eb)
- Enrichment models batch size and expose picture classifier (#878) (5ad6de0)
Documentation
v2.18.0 - 2025-02-03
Feature
- Expose equation exports (#869) (6a76b49)
- Add option to define page range (#852) (70d68b6)
- docx: Support of SDTs in docx backend (#853) (d727b04)
- Python 3.13 support (#841) (4df085a)
Fix
- markdown: Fix parsing if doc ending with table (#873) (5ac2887)
- markdown: Add support for HTML content (#855) (94751a7)
- docx: Merged table cells not properly converted (#857) (0cd81a8)
- Processing of placeholder shapes in pptx that have text but no bbox (#868) (eff16b6)
- KeyError in tableformer prediction (#854) (b1cf796)
- Fixed docx import with headers that are also lists (#842) (2c037ae)
- Use new add_code in html backend and add more typing hints (#850) (2a1f8af)
- markdown: Fix empty block handling (#843) (bccb022)
- Fix for the crash when encountering WMF images in pptx and docx (#837) (fea0a99)
Documentation
- Updated the readme with upcoming features (#831) (d7c0828)
- Add example for inspection of picture content (#624) (f9144f2)
v2.17.0 - 2025-01-28
Feature
- CLI: Expose code and formula models in the CLI (#820) (6882e6c)
- Add platform info to CLI version printout (#816) (95b293a)
- ocr: Expose rec_keys_path in RapidOcrOptions to support custom dictionaries (#786) (5332755)
- Introduce automatic language detection in TesseractOcrCliModel (#800) (3be2fb5)
Fix
- Fix single newline handling in MD backend (#824) (5aed9f8)
- Use file extension if filetype fails with PDF (#827) (adf6353)
- Parse html with omitted body tag (#818) (a112d7a)
Documentation
- Document Docling JSON parsing (#819) (6875913)
- Add SSL verification error mitigation (#821) (5139b48)
- backend XML: Do not delete temp file in notebook (#817) (4d41db3)
- Typo (#814) (8a4ec77)
- Added markdown headings to enable TOC in github pages (#808) (b885b2f)
- Description of supported formats and backends (#788) (c2ae1cc)
v2.16.0 - 2025-01-24
Feature
- New document picture classifier (#805) (16a218d)
- Add Docling JSON ingestion (#783) (88a0e66)
- Code and equation model for PDF and code blocks in markdown (#752) (3213b24)
- Add "auto" language for TesseractOcr (#759) (8543c22)
Fix
- Added extraction of byte-images in excel (#804) (a458e29)
- Update docling-parse-v2 backend version with new parsing fixes (#769) (670a08b)
Documentation
- Fix minor typos (#801) (c58f75d)
- Add Azure RAG example (#675) (9020a93)
- Fix links between docs pages (#697) (c49b352)
- Fix correct Accelerator pipeline options in docs/examples/custom_convert.py (#733) (7686083)
- Example to translate documents (#739) (f7e1cbf)
v2.15.1 - 2025-01-10
Fix
- Improve OCR results, stricten criteria before dropping bitmap areas (#719) (5a060f2)
- Allow earlier requests versions (#716) (e64b5a2)
Documentation
v2.15.0 - 2025-01-08
Feature
Fix
- Correct scaling of debug visualizations, tune OCR (#700) (5cb4cf6)
- Let BeautifulSoup detect the HTML encoding (#695) (42856fd)
- mspowerpoint: Handle invalid images in PowerPoint slides (#650) (d49650c)
Documentation
- Specify docstring types (#702) (ead396a)
- Add link to rag with granite (#698) (6701f34)
- Add integrations, revamp docs (#693) (2d24fae)
- Add OpenContracts as an integration (#679) (569038d)
- Add Weaviate RAG recipe notebook (#451) (2b591f9)
- Document Haystack & Vectara support (#628) (fc645ea)
v2.14.0 - 2024-12-18
Feature
v2.13.0 - 2024-12-17
Feature
- Updated Layout processing with forms and key-value areas (#530) (60dc852)
- Create a backend to parse USPTO patents into DoclingDocument (#606) (4e08750)
- Add Easyocr parameter recog_network (#613) (3b53bd3)
Documentation
- Add Haystack RAG example (#615) (3e599c7)
- Fix the path to the run_with_accelerator.py example (#608) (3bb3bf5)
v2.12.0 - 2024-12-13
Feature
v2.11.0 - 2024-12-12
Feature
Fix
- Do not import python modules from deepsearch-glm (#569) (aee9c0b)
- Handle no result from RapidOcr reader (#558) (f45499c)
- Make enum serializable with human-readable value (#555) (a7df337)
Documentation
v2.10.0 - 2024-12-09
Feature
Fix
- Call into docling-core for legacy document transform (#551) (7972d47)
- Introduce Image format options in CLI. Silence the tqdm downloading messages. (#544) (78f61a8)
v2.9.0 - 2024-12-09
Feature
- Expose new hybrid chunker, update docs (#384) (c8ecdd9)
- MS Word backend: Make detection of headers and other styles localization agnostic (#534) (3e073df)
Fix
- Correcting DefaultText ID for MS Word backend (#537) (eb7ffcd)
- Add py.typed marker file (#531) (9102fe1)
- Enable HTML export in CLI and add options for image mode (#513) (0d11e30)
- Missing text in docx (t tag) when embedded in a table (#528) (b730b2d)
- Restore pydantic version pin after fixes (#512) (c830b92)
- Folder input in cli (#511) (8ada0bc)
Documentation
v2.8.3 - 2024-12-03
Fix
v2.8.2 - 2024-12-03
Fix
- ParserError EOF inside string (#470) (#472) (c90c41c)
- PermissionError when using tesseract_ocr_cli_model (#496) (d3f84b2)
Documentation
- Add styling for faq (#502) (5ba3807)
- Typo in faq (#484) (33cff98)
- Add automatic api reference (#475) (d487210)
- Introduce faq section (#468) (8ccb3c6)
Performance
v2.8.1 - 2024-11-29
Fix
Documentation
v2.8.0 - 2024-11-27
Feature
Fix
- Use correct image index in word backend (#442) (767563b)
- Update tests and examples for docling-core 2.5.1 (#449) (29807a2)
v2.7.1 - 2024-11-26
Fix
Documentation
v2.7.0 - 2024-11-20
Feature
Fix
v2.6.0 - 2024-11-19
Feature
- Added support for exporting DocItem to an image when page image is available (#379) (3f91e7d)
- Expose ocr-lang in CLI (#375) (ed785ea)
- Added excel backend (#334) (926dfd2)
- Extracting picture data for raster images found in PPTX (#349) (7a97d71)
Fix
- Fixing images in the input Word files (#330) (8533039)
- Reduce logging by keeping option for more verbose (#323) (8b437ad)
Documentation
- Fixed typo in v2 example v2 (#378) (911c3bd)
- Add automatic generation of CLI reference (#325) (ca8524e)
- Add architecture outline (#341) (25fd149)
- Fix parameter in usage.md (#332) (835e077)
v2.5.2 - 2024-11-13
Fix
v2.5.1 - 2024-11-12
Fix
Documentation
v2.5.0 - 2024-11-12
Feature
- OCR: Introduce the OcrOptions.force_full_page_ocr parameter that forces a full page OCR scanning (#290) (c6b3763)
Fix
- Configure env prefix for docling settings (#315) (5d4a10b)
- Added handling of grouped elements in pptx backend (#307) (81c8243)
- Allow mps usage for easyocr (#286) (97f214e)
Documentation
v2.4.2 - 2024-11-08
Fix
- EasyOcrModel: Support the use_gpu pipeline parameter in EasyOcrModel. Initialize easyocr (#282) (0eb065e)
v2.4.1 - 2024-11-08
Fix
- tesserocr: Raise Exception if tesserocr has not loaded any languages (#279) (704d792)
- Dockerfile example copy command (#234) (90836db)
Documentation
- Update badges & credits (#248) (a84ec27)
- Add coming-soon section (#235) (5ce02c5)
- Add artifacts-path param to CLI (#233) (d5e65ae)
v2.4.0 - 2024-11-04
Feature
Documentation
- Add explicit artifacts path example (#224) (eeee3b4)
- Update custom convert and dockerfile (#226) (5f5fea9)
- Correct spelling of 'individual' (#219) (41acaa9)
- Update LlamaIndex docs (#196) (244ca69)
v2.3.1 - 2024-10-30
Fix
- Simplify torch dependencies and update pinned docling deps (#190) (eb679cc)
- Allow to explicitly initialize the pipeline (#189) (904d24d)
v2.3.0 - 2024-10-30
Feature
Fix
v2.2.1 - 2024-10-28
Fix
- Fix header levels for DOCX & HTML (#184) (b9f5c74)
- Handling of long sequence of unescaped underscore chars in markdown (#173) (94d0729)
- HTML backend, fixes for Lists and nested texts (#180) (7d19418)
- MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178) (88c1673)
Documentation
- Update LlamaIndex docs for Docling v2 (#182) (2cece27)
- Fix batch convert (#177) (189d3c2)
- Add export with embedded images (#175) (8d356aa)
v2.2.0 - 2024-10-23
Feature
- Update to docling-parse v2 without history (#170) (4116819)
- Support AsciiDoc and Markdown input format (#168) (3023f18)
Fix
v2.1.0 - 2024-10-18
Feature
Fix
Documentation
- Typo fix (#155) (f799e77)
- Add graphical band in readme (#154) (034a411)
- Add use docling (#150) (61c092f)
v2.0.0 - 2024-10-16
Feature
Breaking
Documentation
v1.20.0 - 2024-10-11
Feature
v1.19.1 - 2024-10-11
Fix
- Remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138) (dae2a3b)
Documentation
v1.19.0 - 2024-10-08
Feature
v1.18.0 - 2024-10-03
Feature
v1.17.0 - 2024-10-03
Feature
v1.16.1 - 2024-09-27
Fix
Documentation
v1.16.0 - 2024-09-27
Feature
v1.15.0 - 2024-09-24
Feature
v1.14.0 - 2024-09-24
Feature
Fix
Documentation
v1.13.1 - 2024-09-23
Fix
v1.13.0 - 2024-09-18
Feature
Fix
Documentation
v1.12.2 - 2024-09-17
Fix
v1.12.1 - 2024-09-16
Fix
v1.12.0 - 2024-09-13
Feature
Documentation
v1.11.0 - 2024-09-10
Feature
v1.10.0 - 2024-09-10
Feature
v1.9.0 - 2024-09-03
Feature
Documentation
v1.8.5 - 2024-08-30
Fix
v1.8.4 - 2024-08-30
Fix
Documentation
v1.8.3 - 2024-08-28
Fix
v1.8.2 - 2024-08-27
Fix
Documentation
v1.8.1 - 2024-08-26
Fix
v1.8.0 - 2024-08-23
Feature
v1.7.1 - 2024-08-23
Fix
- Better raise exception when a page fails to parse (#46) (8808463)
- Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45) (7e84533)
v1.7.0 - 2024-08-22
Feature
v1.6.3 - 2024-08-22
Fix
v1.6.2 - 2024-08-22
Fix
v1.6.1 - 2024-08-21
Fix
v1.6.0 - 2024-08-20
Feature
v1.5.0 - 2024-08-20
Feature
Documentation
v1.4.0 - 2024-08-14
Feature
Fix
v1.3.0 - 2024-08-12
Feature
v1.2.1 - 2024-08-07
Fix
Documentation
v1.2.0 - 2024-08-07
Feature
v1.1.2 - 2024-07-31
Fix
v1.1.1 - 2024-07-30
Fix
v1.1.0 - 2024-07-26
Feature
v1.0.2 - 2024-07-24
Fix
v1.0.1 - 2024-07-24
Fix
v1.0.0 - 2024-07-18
Feature
Breaking
v0.4.0 - 2024-07-17
Feature
v0.3.1 - 2024-07-17
Fix
Documentation
v0.3.0 - 2024-07-17
Feature
Documentation
v0.2.0 - 2024-07-16
Feature