Panos Vagenas
70865b4c7d
fix: make CLI JSON export more human-readable
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-29 08:54:41 +01:00
github-actions[bot]
dda2645d4c
chore: bump version to 2.2.1 [skip ci]
v2.2.1
2024-10-28 17:18:41 +00:00
Panos Vagenas
b9f5c74a7d
fix: fix header levels for DOCX & HTML (#184)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-28 17:02:52 +01:00
Maxim Lysak
94d0729c50
fix: handling of long sequence of unescaped underscore chars in markdown (#173)
...
* Fix for md hanging when encountering long sequence of unescaped underscore chars
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added comment explaining reason for fix
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed trailing inline text handling (at the end of a file), and corrected underscore sequence shortening
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* making fix more rare
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-28 16:34:48 +01:00
Panos Vagenas
2cece27208
docs: update LlamaIndex docs for Docling v2 (#182)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-28 14:28:26 +01:00
Michele Dolfi
189d3c2d44
docs: fix batch convert (#177)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-26 05:50:34 +02:00
Maxim Lysak
7d19418b77
fix: HTML backend, fixes for Lists and nested texts (#180)
...
* Fixes for HTML backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaning up
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-25 20:14:04 +02:00
Maxim Lysak
88c1673057
fix: MD Backend, fixes to properly handle trailing inline text and emphasis in headers (#178)
...
* Small fix to properly handle trailing inline text in the md backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added proper handling of headers with bold, italic or emphasis
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* removed print
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Made smarter processing of headers, with arbitrary styling
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated docling-core to 2.2.1
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Updated tests because of the change in Markdown export in docling-core
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-25 18:02:20 +02:00
Michele Dolfi
77a89c3334
chore: make auto-release on request (#179)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-25 10:47:25 +02:00
Michele Dolfi
8d356aa247
docs: add export with embedded images (#175)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-24 20:19:41 +02:00
github-actions[bot]
8208c93e3a
chore: bump version to 2.2.0 [skip ci]
v2.2.0
2024-10-23 16:04:55 +00:00
Peter W. J. Staar
4116819b51
feat: Update to docling-parse v2 without history (#170)
...
* updated the pyproject (still need to run poetry lock after docling-parse is accepted)
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Update imports for docling_parse.pdf_parser_v1
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Lock docling-parse 2.0.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* repin poetry.lock
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-23 17:20:11 +02:00
Christoph Auer
3023f18ba0
feat: Support AsciiDoc and Markdown input format (#168)
...
* updated the base-model and added the asciidoc_backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* updated the asciidoc backend
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Ensure all models work only on valid pages (#158)
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* ci: run ci also on forks (#160)
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
* fix: fix legacy doc ref (#162)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
* docs: typo fix (#155)
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: add coverage_threshold to skip OCR for small images (#161)
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* chore: bump version to 2.1.0 [skip ci]
* adding tests for asciidocs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* first working asciidoc parser
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding test_02.asciidoc
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Drafting Markdown backend via Marko library
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* work in progress on MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* md_backend produces docling document with headers, paragraphs, lists
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Improvements in md parsing
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Detecting and assembling tables in markdown in temporary buffers
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added initial docling table support to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Cleaned code, improved logging for MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixes MyPy requirements, and rest of pre-commit
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed example run_md, added origin info to md_backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* working on asciidocs, struggling with ImageRef
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* able to parse the captions and image URIs
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the mypy
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Update all backends with proper filename in DocumentOrigin
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to docling-core v2.1.0
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for MD Backend, to avoid duplicated text inserts into docling doc
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fix styling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Added support for code blocks and fenced code in MD
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* cleaned prints
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Added proper processing of in-line textual elements for MD backend
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issues with duplicated paragraphs and incorrect lists in pptx
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
* Fixed issue with group ordering in pptx backend, added debug log into run with formats
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00
Michele Dolfi
3496b4838f
fix: set valid=false for invalid backends (#171)
...
* fix: set valid=false for invalid backends
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Add test case for InputDocument
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-23 15:52:30 +02:00
Panos Vagenas
b8d2286dd1
chore: various minor docs fixes (#169)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-22 15:29:36 +02:00
Mohamed Ali
fa5f94ec10
Fix Typo errors in CONTRIBUTING.md file (#164)
2024-10-22 07:01:48 +02:00
github-actions[bot]
d5460e2d1f
chore: bump version to 2.1.0 [skip ci]
v2.1.0
2024-10-18 13:21:15 +00:00
Michele Dolfi
b346faf622
feat: add coverage_threshold to skip OCR for small images (#161)
...
* feat: add coverage_threshold to skip OCR for small images
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* filter individual boxes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename option
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:58:23 +02:00
ABHISHEK FADAKE
f799e777c1
docs: typo fix (#155)
...
* Docs: Typo fix
- Corrected spelling of invidual to automatic
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
* add synchronize event for forks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:56:48 +02:00
Panos Vagenas
63bef59d9e
fix: fix legacy doc ref (#162)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-18 13:11:20 +02:00
Michele Dolfi
bb7a58d45d
ci: run ci also on forks (#160)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-18 12:32:27 +02:00
Christoph Auer
a00c937e19
Ensure all models work only on valid pages (#158)
...
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-18 08:54:06 +02:00
Maxim Lysak
034a411057
docs: add graphical band in readme (#154)
...
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 18:15:40 +02:00
Michele Dolfi
61c092f445
docs: add use docling (#150)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
Michele Dolfi
24f949ada2
chore: run apt-get update before install (#156)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 17:27:16 +02:00
github-actions[bot]
a29c256041
chore: bump version to 2.0.0 [skip ci]
v2.0.0
2024-10-16 19:48:06 +00:00
Christoph Auer
7d3be0edeb
feat!: Docling v2 (#117)
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas
d504432c1e
docs: introduce docs site (#141)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
Michele Dolfi
2b1e72d327
refactor: fix type of tesseractocr options (#140)
...
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
2024-10-14 08:40:22 +02:00
github-actions[bot]
4672b24c1a
chore: bump version to 1.20.0 [skip ci]
v1.20.0
2024-10-11 13:48:02 +00:00
Christoph Auer
5e4944f15f
feat: new experimental docling-parse v2 backend (#131)
...
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-11 15:12:49 +02:00
github-actions[bot]
2ec39636f0
chore: bump version to 1.19.1 [skip ci]
v1.19.1
2024-10-11 08:52:09 +00:00
Nikos Livathinos
dae2a3b667
fix: remove stderr from tesseract cli and introduce fuzziness in the text validation of OCR tests (#138)
...
* feat(OCR tests): Introduce fuzziness in the text validation of OCR tests
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix(TesseractOcrCliModel): Send the stderr to devnull to avoid polluting the console with messages from tesseract cmd
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
---------
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
2024-10-11 10:21:19 +02:00
Panos Vagenas
5f1bd9e9c8
docs: simplify LlamaIndex example using Docling extension (#135)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Panos Vagenas
6924999f1f
chore: explicitly manage pandas dependency (#134)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 14:50:39 +02:00
github-actions[bot]
0ffc1708d2
chore: bump version to 1.19.0 [skip ci]
v1.19.0
2024-10-08 17:42:29 +00:00
Michele Dolfi
f96ea86a00
feat: add options for choosing OCR engines (#118)
...
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Fasal Shah
d412c363d7
fixed unload pdf backend resources (#129)
...
Signed-off-by: faisal shah <fashah@redhat.com>
Co-authored-by: faisal shah <fashah@redhat.com>
2024-10-08 10:46:43 +02:00
github-actions[bot]
9b82ae3324
chore: bump version to 1.18.0 [skip ci]
v1.18.0
2024-10-03 17:16:00 +00:00
Maxim Lysak
2422f706a1
feat: new torch-based docling models (#120)
...
---------
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
2024-10-03 18:42:33 +02:00
github-actions[bot]
9ebbbc1245
chore: bump version to 1.17.0 [skip ci]
v1.17.0
2024-10-03 13:44:52 +00:00
Rui Dias Gomes
dde0aff8bd
update examples (#123)
...
Signed-off-by: rmdg88 <rmdg88@gmail.com>
2024-10-03 14:28:25 +02:00
Michele Dolfi
d44c62d7ce
feat: windows support (#122)
...
* feat: windows support
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add Windows in README
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
github-actions[bot]
cde671cf34
chore: bump version to 1.16.1 [skip ci]
v1.16.1
2024-09-27 14:36:40 +00:00
Michele Dolfi
34bd887a7f
fix: allow usage of opencv 4.6.x (#110)
...
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-09-27 15:51:43 +02:00
Panos Vagenas
c05b692d69
docs: document chunking (#111)
...
[skip ci]
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
github-actions[bot]
6760571fe1
chore: bump version to 1.16.0 [skip ci]
v1.16.0
2024-09-27 06:21:15 +00:00
Christoph Auer
d6df76f90b
feat: Support tableformer model choice (#90)
...
* Support tableformer model choice
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update datamodel structure
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update docs
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add test unit for table options
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure import backwards-compatibility for PipelineOptions
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update README
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Adjust parameters on custom_convert
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
* Update Dockerfile
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Panos Vagenas
39977b5631
chore: move examples extras to respective group (#103)
...
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-25 15:47:48 +02:00
github-actions[bot]
3dfd02a7e9
chore: bump version to 1.15.0 [skip ci]
v1.15.0
2024-09-24 15:58:16 +00:00