
<p align="center">
  <a href="https://github.com/ds4sd/docling"> <img loading="lazy" alt="Docling" src="https://github.com/DS4SD/docling/raw/main/logo.png" width="150" /> </a>
</p>

# Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

## Features
* ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
* 📑 Understands detailed page layout, reading order and recovers table structures
* 📝 Extracts metadata from the document, such as title, authors, references and language
* 🔍 Optionally applies OCR (use with scanned PDFs)

## Setup

You need Python 3.11 and poetry. Install poetry from [here](https://python-poetry.org/docs/#installing-with-the-official-installer).

Once you have `poetry` installed, create an environment and install the package:

```bash
poetry env use $(which python3.11)
poetry shell
poetry install
```

**Notes**:
* Works on macOS and Linux environments. Windows platforms are currently not tested.


## Usage

For basic usage, see the [convert.py](examples/convert.py) example module. Run with:

```
python examples/convert.py
```
The output of the above command will be written to `./scratch`.

### Enable or disable pipeline features

You can control whether table structure recognition or OCR is performed through the arguments passed to `DocumentConverter`:
```python
doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # Controls whether table structure is recovered.
        do_ocr=True,  # Controls whether OCR is applied (ignores programmatic content).
    ),
)
```

### Impose limits on the document size

You can limit the file size and the number of pages accepted for processing per document:
```python
paths = [Path("./test/data/2206.01062.pdf")]

input = DocumentConversionInput.from_paths(
    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)
)
```

### Convert from binary PDF streams 

You can convert PDFs from a binary stream instead of from the filesystem as follows:
```python
buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)
```
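
The snippet above leaves `your_binary_stream` undefined. As a minimal sketch using only the standard library, one way to obtain such a buffer is to read a file's bytes into a `BytesIO`. `load_pdf_bytes` is a hypothetical helper, and the temporary file only stands in for a real PDF:

```python
import tempfile
from io import BytesIO
from pathlib import Path


# Hypothetical helper (not part of docling): load a file into memory.
def load_pdf_bytes(path: Path) -> BytesIO:
    """Read the whole file into an in-memory binary stream."""
    return BytesIO(path.read_bytes())


# Stand-in file for illustration; replace with the path to a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "my_doc.pdf"
    sample.write_bytes(b"%PDF-1.4\n")
    buf = load_pdf_bytes(sample)

print(buf.getvalue()[:5])  # b'%PDF-'
```

The resulting `buf` plays the same role as `buf` in the example above. Bytes from other sources, such as an HTTP response body, can be wrapped in `BytesIO` the same way.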

### Limit resource usage

You can limit the number of CPU threads used by `docling` by setting the environment variable `OMP_NUM_THREADS`. By default, it uses 4 CPU threads.
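
As a minimal sketch, the variable can also be set from Python instead of the shell. OpenMP runtimes typically read `OMP_NUM_THREADS` when they initialize, so this sketch assumes it has to be set before any `docling` module is imported. `limit_omp_threads` is a hypothetical helper, not part of `docling`:

```python
import os


# Hypothetical helper (not part of docling): cap OpenMP threads for this process.
def limit_omp_threads(n: int) -> None:
    if n < 1:
        raise ValueError("thread count must be at least 1")
    os.environ["OMP_NUM_THREADS"] = str(n)


# Assumption: call this before importing docling, so the setting is
# in place when the underlying libraries start up.
limit_omp_threads(2)
print(os.environ["OMP_NUM_THREADS"])  # 2
```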


## Contributing

Please read [Contributing to Docling](https://github.com/DS4SD/docling/blob/main/CONTRIBUTING.md) for details.


## References

If you use `Docling` in your projects, please consider citing the following:

```bib
@software{Docling,
author = {Deep Search Team},
month = {7},
title = {{Docling}},
url = {https://github.com/DS4SD/docling},
version = {main},
year = {2024}
}
```

## License

The `Docling` codebase is released under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.