chore: update the docs with input formats and DoclingDocument (#188)

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
Peter W. J. Staar 2024-10-30 15:02:28 +01:00 committed by GitHub
parent f542460af3
commit 94a5290789
7 changed files with 38 additions and 14 deletions

.github/workflows/cd-docs.yml (new file, 14 lines)

@@ -0,0 +1,14 @@
+name: "Run Docs CD"
+on:
+  push:
+    branches:
+      - "main"
+jobs:
+  build-deploy-docs:
+    uses: ./.github/workflows/docs.yml
+    with:
+      deploy: true
+    permissions:
+      contents: write

View File

@@ -10,12 +10,6 @@ env:
 jobs:
   code-checks:
     uses: ./.github/workflows/checks.yml
-  build-deploy-docs:
-    uses: ./.github/workflows/docs.yml
-    with:
-      deploy: true
-    permissions:
-      contents: write
   pre-release-check:
     runs-on: ubuntu-latest
     outputs:

.github/workflows/ci-docs.yml (new file, 16 lines)

@@ -0,0 +1,16 @@
+name: "Run Docs CI"
+on:
+  pull_request:
+    types: [opened, reopened, synchronize]
+  push:
+    branches:
+      - "**"
+      - "!gh-pages"
+jobs:
+  build-docs:
+    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
+    uses: ./.github/workflows/docs.yml
+    with:
+      deploy: false
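
The `if:` guard on `build-docs` can be read as a small predicate: run on every push, and on pull requests only when the head branch lives in a fork, presumably so that a branch in the main repository is not built twice (once for the push, once for the pull request). The Python sketch below mirrors that expression. `should_run` and its arguments are hypothetical names used for illustration only and are not part of GitHub Actions. It lowercases before comparing because GitHub's expression documentation states that string comparisons ignore case; treating a missing repository name as "not upstream" is an assumption of the sketch.

```python
from typing import Optional

# Upstream repository named in the workflow's `if:` guard (lowercased).
UPSTREAM = "ds4sd/docling"


def should_run(event_name: str, head_repo_full_name: Optional[str] = None) -> bool:
    """Approximate the guard: any push, or a pull request opened from a fork."""
    if event_name == "push":
        return True
    # Assumption: a missing head repository name counts as "not upstream".
    return (head_repo_full_name or "").lower() != UPSTREAM
```

Under this reading, `should_run("pull_request", "DS4SD/docling")` is false, while the same call with a fork's full name is true.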

View File

@@ -6,6 +6,7 @@ on:
   push:
     branches:
       - "**"
+      - "!main"
       - "!gh-pages"
 env:
@@ -16,8 +17,3 @@ jobs:
   code-checks:
     if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
     uses: ./.github/workflows/checks.yml
-  build-docs:
-    if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'DS4SD/docling' && github.event.pull_request.head.repo.full_name != 'ds4sd/docling') }}
-    uses: ./.github/workflows/docs.yml
-    with:
-      deploy: false
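
The `branches` filter in this workflow combines a catch-all pattern with negations, and the order matters: a later `!` pattern removes a branch that an earlier pattern selected. The Python sketch below approximates that rule for the patterns in this commit. `branch_selected` is a hypothetical helper, not a GitHub API, and `fnmatchcase` is only a stand-in for GitHub's glob matching (for example, its `*` also crosses `/`, which GitHub's single `*` does not).

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Push filters as they appear in the two workflows touched by this commit.
CI_PUSH_BRANCHES = ["**", "!main", "!gh-pages"]
CI_DOCS_PUSH_BRANCHES = ["**", "!gh-pages"]


def branch_selected(branch: str, patterns: Sequence[str]) -> bool:
    """Apply the patterns in order; the last matching pattern decides."""
    selected = False
    for pattern in patterns:
        if pattern.startswith("!"):
            # A matching negation excludes a previously selected branch.
            if fnmatchcase(branch, pattern[1:]):
                selected = False
        elif fnmatchcase(branch, pattern):
            selected = True
    return selected
```

With these lists, a push to `main` no longer selects the code-checks CI workflow but still selects the docs CI workflow, which matches the intent of adding `"!main"` in the hunk above.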

View File

@@ -22,8 +22,9 @@ Docling parses documents and exports them to the desired format with ease and sp
 ## Features
-* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
-* 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
+* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
+* 📑 Advanced PDF document understanding including page layout, reading order & table structures
+* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
 * 📝 Metadata extraction, including title, authors, references & language
 * 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
 * 🔍 OCR support for scanned PDFs

View File

@@ -7,6 +7,8 @@ pydantic datatype, which can express several features common to documents, such
* Layout information (i.e. bounding boxes) for all items, if available
* Provenance information
The definition of the Pydantic types is implemented in the module `docling_core.types.doc`, more details in [source code definitions](https://github.com/DS4SD/docling-core/tree/main/docling_core/types/doc).
It also brings a set of document construction APIs to build up a `DoclingDocument` from scratch.
## Example document structures

View File

@@ -19,8 +19,9 @@ Docling parses documents and exports them to the desired format with ease and sp
 ## Features
-* 🗂️ Multi-format support for input (PDF, DOCX etc.) & output (Markdown, JSON etc.)
+* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON
 * 📑 Advanced PDF document understanding incl. page layout, reading order & table structures
+* 🧩 Unified, expressive [DoclingDocument](./concepts/docling_document.md) representation format
 * 📝 Metadata extraction, including title, authors, references & language
 * 🤖 Seamless LlamaIndex 🦙 & LangChain 🦜🔗 integration for powerful RAG / QA applications
 * 🔍 OCR support for scanned PDFs