mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-08-03 14:29:23 +00:00)

The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.
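To make the overlap option concrete, here is a minimal sketch of character-window chunking in which each chunk repeats the tail of the previous one. It is an illustration only, not the library's implementation: the function name `chunk_with_overlap` and its parameters are hypothetical, and the real chunker operates on document elements rather than raw strings, so its boundaries will differ.

```python
def chunk_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split `text` into windows of at most `max_characters` characters.

    Each window after the first starts with the last `overlap` characters
    of the previous window. Hypothetical helper for illustration only.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")

    chunks: list[str] = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With `overlap=0` the windows are disjoint; a larger overlap trades more repeated text for more context carried across chunk boundaries.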
19 lines
979 B
Plaintext
filename                                                doctype  connector                                         cct-accuracy  cct-%missing
fake-text.txt                                           txt      Sharepoint                                        1.0           0.0
ideas-page.html                                         html     Sharepoint                                        0.93          0.033
stanley-cups.xlsx                                       xlsx     Sharepoint                                        0.778         0.0
Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf  pdf      azure                                             0.981         0.007
IRS-form-1987.pdf                                       pdf      azure                                             0.783         0.135
spring-weather.html                                     html     azure                                             0.0           0.018
example-10k.html                                        html     local                                             0.727         0.037
fake-html-cp1252.html                                   html     local                                             0.659         0.0
ideas-page.html                                         html     local                                             0.93          0.033
UDHR_first_article_all.txt                              txt      local-single-file                                 0.995         0.0
handbook-1p.docx                                        docx     local-single-file-basic-chunking                  0.858         0.029
fake-html-cp1252.html                                   html     local-single-file-with-encoding                   0.659         0.0
layout-parser-paper-with-table.jpg                      jpg      local-single-file-with-pdf-infer-table-structure  0.716         0.032
layout-parser-paper.pdf                                 pdf      local-single-file-with-pdf-infer-table-structure  0.949         0.029
2023-Jan-economic-outlook.pdf                           pdf      s3                                                0.845         0.039
page-with-formula.pdf                                   pdf      s3                                                0.971         0.021
recalibrating-risk-report.pdf                           pdf      s3                                                0.968         0.008
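A table like the one above can be summarized per connector with a few lines of standard-library Python. The sketch below is only an illustration: the function name `mean_accuracy_by_connector` is hypothetical, the sample is a subset of the rows shown above, and it assumes whitespace-separated columns with no spaces inside any field.

```python
from collections import defaultdict

# A few rows copied from the table above; columns are whitespace-separated.
SAMPLE = """\
filename doctype connector cct-accuracy cct-%missing
Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf pdf azure 0.981 0.007
IRS-form-1987.pdf pdf azure 0.783 0.135
spring-weather.html html azure 0.0 0.018
page-with-formula.pdf pdf s3 0.971 0.021
"""


def mean_accuracy_by_connector(table: str) -> dict[str, float]:
    """Return the mean cct-accuracy for each connector in `table`."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    header = lines[0].split()
    connector_idx = header.index("connector")
    accuracy_idx = header.index("cct-accuracy")

    scores: dict[str, list[float]] = defaultdict(list)
    for ln in lines[1:]:
        fields = ln.split()
        scores[fields[connector_idx]].append(float(fields[accuracy_idx]))

    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


if __name__ == "__main__":
    for connector, mean in mean_accuracy_by_connector(SAMPLE).items():
        print(f"{connector}: {mean:.3f}")
```

Looking columns up by header name rather than by position keeps the sketch working if columns are reordered.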