mirror of
https://github.com/docling-project/docling.git
synced 2025-06-27 05:20:05 +00:00

Update data_prep_kit.md The links were broken, since the repository was renamed. I also noticed that PDF2Parquet is now referred to as Docling2Parquet. Signed-off-by: Oleg Lavrovsky <31819+loleg@users.noreply.github.com>
Docling is used by the [Data Prep Kit](https://data-prep-kit.github.io/data-prep-kit/) open-source toolkit for preparing unstructured data for LLM application development, ranging from laptop scale to datacenter scale.

## Components

### PDF ingestion to Parquet

- 💻 [Docling2Parquet source](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/docling2parquet)
- 📖 [Docling2Parquet docs](https://data-prep-kit.github.io/data-prep-kit/transforms/language/pdf2parquet/)

### Document chunking

- 💻 [Doc Chunking source](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/doc_chunk)
- 📖 [Doc Chunking docs](https://data-prep-kit.github.io/data-prep-kit/transforms/language/doc_chunk/)