528 Commits

Author SHA1 Message Date
Bruno Rigal
7a275c7637
fix: Handle NoneType error in MsPowerpointDocumentBackend (#1747)
fix: NoneType error in pptx backend

Signed-off-by: Bruno Rigal <bruno.rigal@probayes.com>
Co-authored-by: Bruno Rigal <bruno.rigal@probayes.com>
2025-06-10 19:43:20 +02:00
Ayraf
df140227c3
feat: support xlsm files (#1520)
* code for xlsm support

* updated support for xlsm

* updated code for xlsm support

* Update docling_parse_v4_backend.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update docling_parse_v4_backend.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel_xlsm.py

Updated tests/test_backend_msexcel_xlsm.py:

- have a function starting with test
- removed all print statements
- to add an explicit assert {test}=={pred}

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update base_models.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update test_backend_msexcel_xlsm.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Update document_converter.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* Delete tests/test_backend_msexcel_xlsm.py

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* xlsm file

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>

* run tests

* ran tests

* Fix tests, upgrade XLSM example to a valid file

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: ShiroYasha18 <85089952+ShiroYasha18@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 16:55:59 +02:00
Peter W. J. Staar
6613b9e98b
fix: prov for merged-elems (#1728)
* fix: prov for merged-elems

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Reset pyproject.toml

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 11:22:42 +02:00
Maras Ioannis
e979750ce9
fix(tesseract): initialize df_osd to avoid uninitialized variable error (#1718)
* fix: initialize df_osd to avoid uninitialized variable error

Signed-off-by: IoannisMaras <maras2002@gmail.com>

* Fix formatting

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Satisfy mypy, regenerate OCR tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: IoannisMaras <maras2002@gmail.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-06-10 10:57:45 +02:00
Michele Dolfi
f7f31137f1
fix: allow custom torch_dtype in vlm models (#1735)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-10 10:52:15 +02:00
Michele Dolfi
49b10e7419
docs: add open webui (#1734)
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-10 09:35:20 +02:00
AndrewTsai0406
9dbcb3d7d4
fix: Improve extraction from textboxes in Word docs (#1701)
* fix/docx_text_box_extraction

Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local>

* fix/docx_text_box_extraction

Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local>

---------

Signed-off-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local>
Co-authored-by: JiunAn Tsai <andrew@JiunAns-Mac-mini.local>
2025-06-06 11:37:46 +02:00
Eugene
a2b83fe4ae
fix: Add WEBP to the list of image file extensions (#1711)
feat: Add WEBP to the list of image file extensions

Signed-off-by: Eugene <fogaprod@gmail.com>
2025-06-05 09:09:27 +02:00
github-actions[bot]
40df0d74ad
chore: bump version to 2.36.1 [skip ci]
v2.36.1
2025-06-04 11:43:13 +00:00
Michele Dolfi
8846f1a393
fix: remove typer and click constraints (#1707)
release typer and click constraints

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-04 13:06:23 +02:00
Michele Dolfi
be42b03f9b
docs: flash-attn usage and install (#1706)
* docs: flash-attn usage and install

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix link

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-04 11:09:54 +02:00
github-actions[bot]
96c54dba91
chore: bump version to 2.36.0 [skip ci]
v2.36.0
2025-06-03 13:54:25 +00:00
Michele Dolfi
cdd401847a
feat: simplify dependencies, switch to uv (#1700)
* refactor with uv

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* constraints for onnxruntime

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more constraints

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-03 15:18:54 +02:00
Panos Vagenas
61d0d6c755
test: mark flaky test (#1698)
* test: cleanse Word test file

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* mark textbox file test as flaky

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix path usage

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-03 13:13:44 +02:00
Peter W. J. Staar
cfdf4cea25
feat: new vlm-models support (#1570)
* feat: adding new vlm-models support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* got microsoft/Phi-4-multimodal-instruct to work

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working on vlm's

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the VLM part

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* all working, now serious refactoring necessary

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the download_model

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the formulate_prompt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* pixtral 12b runs via MLX and native transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the VlmPredictionToken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring minimal_vlm_pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the MyPy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added pipeline_model_specializations file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to get Phi4 working again ...

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* finalising last points for vlms support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pipeline for Phi4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* streamlining all code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the html backend to the VLM pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the static load_from_doctags

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restore stable imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use AutoModelForVision2Seq for Pixtral and review example (including rename)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unused value

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* refactor instances of VLM models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip compare example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use lowercase and uppercase only

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename pipeline_vlm_model_spec

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move more arguments to options and simplify model init

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add supported_devices

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove not-needed function

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* exclude minimal_vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* missing file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add message for transformers version

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename to specs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use module import and remove MLX from non-darwin

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove hf_vlm_model and add extra_generation_args

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use single HF VLM model class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for vision models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 17:01:06 +02:00
github-actions[bot]
08dcacc5cb
chore: bump version to 2.35.0 [skip ci]
v2.35.0
2025-06-02 12:30:26 +00:00
Edgar Hipp
11ca4f7a7b
docs: fix typo in index.md (#1676)
Signed-off-by: Edgar Hipp <hipp.edg@gmail.com>
2025-06-02 12:35:59 +02:00
Panos Vagenas
1c8a1283c4
test: ensure utf-8 in test data utils (#1691)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-06-02 12:13:19 +02:00
Cesar Berrospi Ramis
984cb137f6
fix: guess HTML content starting with script tag (#1673)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-06-02 08:43:24 +02:00
Cesar Berrospi Ramis
3942923125
chore: fix or ignore runtime and deprecation warnings (#1660)
* chore: fix or catch deprecation warnings

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* chore: update poetry lock with latest docling-core

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 17:55:31 +02:00
Panos Vagenas
b3e0042813
chore: exclude data from GH Linguist (#1671)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-28 15:42:34 +02:00
Cesar Berrospi Ramis
106951e71e
test: add missing ground truth files (#1667)
Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-28 13:26:49 +02:00
Peter W. J. Staar
b356b33059
feat: Add visualization of bbox on page with html export. (#1663)
* feat: Add visualization of bbox on page with html export.

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli argument to show_layout

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-05-28 13:10:38 +02:00
DavidLee
51d3450915
fix: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte (#1665)
Update document.py

fix: when the MIME type is not "application/xml" or "text/plain", decoding raises
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Signed-off-by: DavidLee <yongsheng_li@foxmail.com>
2025-05-27 14:06:05 +02:00
github-actions[bot]
2579d89510
chore: bump version to 2.34.0 [skip ci]
v2.34.0
2025-05-22 18:44:45 +00:00
Said Gürbüz
c2f595d283
fix: fix ZeroDivisionError for cell_bbox.area() (#1636)
fix ZeroDivisionError for cell_bbox.area()

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-22 13:43:33 +02:00
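The fix above concerns a division by a bounding-box area that can be zero. A minimal sketch of that kind of guard follows; the `BBox` class and the `overlap_ratio` helper are hypothetical illustrations, not docling's actual code, and a top-left coordinate origin is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical axis-aligned box (top-left origin); not docling's class."""

    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        # Degenerate boxes (zero width or height) have area 0.
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area(self, other: "BBox") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, w) * max(0.0, h)


def overlap_ratio(cell: BBox, region: BBox) -> float:
    """Fraction of `cell` covered by `region`; 0.0 for zero-area cells."""
    cell_area = cell.area()
    if cell_area <= 0.0:
        # Guard: a zero-area cell would otherwise raise ZeroDivisionError.
        return 0.0
    return cell.intersection_area(region) / cell_area
```

Returning 0.0 for a degenerate cell is one possible design choice; skipping such cells entirely would be another.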
Clément Doumouro
45265bf8b1
feat(ocr): auto-detect rotated pages in Tesseract (#1167)
* fix(ocr): tesseract support mis-oriented documents

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): update missing test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): rotate image to the natural orientation before layout prediction

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): move bounding box rotation util to orientation.py

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): refactor rotation utilities

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* fix(ocr): avoid swallowing tesseract errors causing orientation detection failures

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): revert layout updates

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>

* chore(ocr): update e2e OCR test data

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrCliModel`

* chore(ocr): proceed to OCR without rotation when OSD fails in `TesseractOcrModel`

* chore(ocr): default `TesseractOcrCliModel._is_auto` to `False`

* fix(ocr): fix `TesseractOcrCliModel._is_auto` computation

* chore(ocr): improve logging in case of OSD failure in `TesseractOcrCliModel` and `TesseractOcrModel`

---------

Signed-off-by: Clément Doumouro <clement.doumouro@gmail.com>
2025-05-21 18:12:33 +02:00
Christoph Auer
90875247e5
feat: Establish confidence estimation for document and pages (#1313)
* Establish confidence field, propagate layout confidence through

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add OCR confidence and parse confidence (stub)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add parse quality rules, use 5th percentile for overall and parse scores

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Heuristic updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix garbage regex

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move grade to page

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Introduce mean_score and low_score, consistent aggregate computations

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add confidence test

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-21 12:32:49 +02:00
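The commit above mentions a mean score, a low score, and a 5% percentile used for aggregation. A minimal sketch of how per-page scores could be aggregated that way follows; the `aggregate` function, its dictionary keys, and the linear-interpolation quantile are assumptions made for illustration, not the project's implementation.

```python
import math
from statistics import fmean


def low_quantile(scores: list[float], q: float = 0.05) -> float:
    """q-quantile using linear interpolation between order statistics."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def aggregate(scores: list[float]) -> dict[str, float]:
    """Hypothetical aggregate: mean of all scores plus a pessimistic low score."""
    return {
        "mean_score": fmean(scores),
        "low_score": low_quantile(scores, 0.05),
    }
```

A low quantile is more sensitive to a few bad pages than the mean, which is presumably why both values are reported.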
Václav Vančura
14d4f5b109
fix(integration): update the Apify Actor integration (#1619)
* fix(actor): remove references to missing docling_processor.py

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): update Actor README.md with recent repo URL changes

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): improve the Actor README.md local header link

Signed-off-by: Václav Vančura <commit@vancura.dev>

* chore(actor): bump the Actor version number

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Update .actor/actor.json

Co-authored-by: Marek Trunkát <marek@trunkat.eu>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>

---------

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Jan Čurn <jan.curn@gmail.com>
Co-authored-by: Marek Trunkát <marek@trunkat.eu>
2025-05-21 02:47:55 +02:00
github-actions[bot]
84d0889829
chore: bump version to 2.33.0 [skip ci]
v2.33.0
2025-05-20 19:54:51 +00:00
MoheyElDin Badr
f4d9d4111b
fix: Fix issue with detecting docx files, and files with upper case extensions (#1609)
fix detecting files with uppercase extensions

Signed-off-by: MoheyElDin Badr <moheyeldin.badr@gmail.com>
2025-05-20 19:42:37 +02:00
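Detecting files with uppercase extensions usually comes down to normalizing the suffix before looking it up. A minimal sketch, assuming a hypothetical `EXT_TO_FORMAT` table and `guess_format` helper (neither is docling's actual API):

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table, for illustration only.
EXT_TO_FORMAT = {
    "docx": "docx",
    "pptx": "pptx",
    "pdf": "pdf",
    "webp": "image",
}


def guess_format(path: str) -> Optional[str]:
    """Return a format label for `path`, ignoring the case of the extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    return EXT_TO_FORMAT.get(ext)
```

With this normalization, `REPORT.DOCX` and `report.docx` resolve to the same format label.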
Said Gürbüz
0e00a263fa
fix: load_from_doctags static usage (#1617)
* fix load_from_doctags usage

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update dependencies

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* fix lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* revert lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* update lock file

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

---------

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
2025-05-20 15:06:12 +02:00
Krishnan
f2e9c0784c
fix: incorrect force_backend_text behaviour for VLM DocTag pipelines (#1371)
* Fix force_backend_text

Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>

* empty commit to retrigger CI

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Co-authored-by: Krishnan Raghavan <krishnanraghavan@Krishnans-MacBook-Air.local>
Co-authored-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-20 09:59:38 +02:00
Pedro Ribeiro
98b5eeb844
fix(pypdfium): resolve overlapping text when merging bounding boxes (#1549)
get merged_text from the bounding box instead of merging it, to prevent overlaps

Signed-off-by: Pedro Ribeiro <pedro_ribeiro_93@hotmail.com>
2025-05-19 15:26:00 +02:00
AndrewTsai0406
12a0e64892
feat: add textbox content extraction in msword_backend (#1538)
* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

* feat: add textbox content extraction in msword_backend

Signed-off-by: Andrew <tsai247365@gmail.com>

---------

Signed-off-by: Andrew <tsai247365@gmail.com>
2025-05-19 15:01:36 +02:00
Panos Vagenas
7c4c356e76
chore: fix chunking example data link (#1596)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-16 08:44:47 +02:00
github-actions[bot]
aeb0716bbb
chore: bump version to 2.32.0 [skip ci]
v2.32.0
2025-05-14 14:28:21 +00:00
Vinay R Damodaran
3a04f2a367
feat: Improve parallelization for remote services API calls (#1548)
* Provide the option to make remote services call concurrent

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

* Use yield from correctly?

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

* not do amateur hour stuff

Signed-off-by: Vinay Damodaran <vrdn@hey.com>

---------

Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2025-05-14 15:47:55 +02:00
jimkarag02
9f8b479f17
fix(ocr): orig field in TesseractOcrCliModel as str (#1553)
fix: ensure orig and text are both strings in TesseractOcrCliModel

Signed-off-by: Dimitris Karagatslis <dimo9.dk@gmail.com>
2025-05-14 15:05:52 +02:00
Panos Vagenas
9f28abf061
docs: add advanced chunking & serialization example (#1589)
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-14 14:35:07 +02:00
Alex Sokolov
2efb7a7c06
fix(settings): fix nested settings load via environment variables (#1551)
Signed-off-by: Alexander Sokolov <alsokoloff@gmail.com>
2025-05-14 13:42:10 +02:00
Elwin
12dab0a1e8
feat: support image/webp file type (#1415)
* support image/webp file type

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>

* docs: add webp image format in supported_formats.md

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>

* test: add a test case for `image/webp` file

Signed-off-by: Elwin <hzywong@gmail.com>

* style: apply styling

Signed-off-by: Elwin <hzywong@gmail.com>

* test: update test case of converting `image/webp` file with more ocr engines

Signed-off-by: Elwin <hzywong@gmail.com>

* style: apply styling

Signed-off-by: Elwin <hzywong@gmail.com>

* rename test file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Elwin <61868295+hzhaoy@users.noreply.github.com>
Signed-off-by: Elwin <hzywong@gmail.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-14 09:47:28 +02:00
github-actions[bot]
23238c241f
chore: bump version to 2.31.2 [skip ci]
v2.31.2
2025-05-13 10:09:19 +00:00
Marco Fargetta
4046d0b2f3
fix: AsciiDoc header identification (#1562) (#1563)
Fix the regular expression that identifies header lines in AsciiDoc so that it
no longer matches defined blocks.

Signed-off-by: Marco Fargetta <mfargett@redhat.com>
2025-05-13 11:17:26 +02:00
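One way to tell AsciiDoc header lines from block delimiter lines is to require whitespace and title text after the run of `=` characters. The sketch below illustrates that idea; the pattern and the `parse_header` helper are assumptions, not the regular expression used in the actual fix.

```python
import re
from typing import Optional, Tuple

# One to six '=' characters, then whitespace, then non-empty title text.
# A delimiter line such as "====" has no whitespace or title, so it fails.
HEADER_RE = re.compile(r"^(={1,6})\s+(\S.*)$")


def parse_header(line: str) -> Optional[Tuple[int, str]]:
    """Return (number of '=' markers, title) for a header line, else None."""
    match = HEADER_RE.match(line.rstrip())
    if match is None:
        return None
    return len(match.group(1)), match.group(2)
```

Stripping trailing whitespace first keeps a delimiter line followed by stray spaces from being mistaken for a header with an empty title.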
Michele Dolfi
8baa85a49d
fix: restrict click version and update lock file (#1582)
* fix click dependency and update lock file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update test GT

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-05-13 10:40:08 +02:00
github-actions[bot]
0d0fa6cbe3
chore: bump version to 2.31.1 [skip ci]
v2.31.1
2025-05-12 09:44:26 +00:00
Michele Dolfi
127e38646f
fix: add smoldocling in download utils (#1577)
add smoldocling in download utils

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-05-12 10:48:07 +02:00
Oleg Lavrovsky
844babb390
docs: update links in data_prep_kit (#1559)
Update data_prep_kit.md

The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet.

Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com>
2025-05-11 20:38:25 +02:00
Cesar Berrospi Ramis
776e7ecf9a
fix(HTML): handle row spans in header rows (#1536)
* chore(HTML): log the stacktrace of errors

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

* fix(HTML): handle row headers like in pivot tables

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>

---------

Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
2025-05-09 15:14:32 +02:00
Panos Vagenas
3220a592e7
docs: add serialization docs, update chunking docs (#1556)
* docs: add serializers docs, update chunking docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update notebook to improve MD table rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-08 21:43:01 +02:00