Steve Canny 2f2c48acd5
feat(ingest): add basic chunking to ingest (#2380)
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.

Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
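
To illustrate what a chunking strategy with overlap does in general, here is a minimal sketch. It is not the library's implementation: the function name `basic_chunks` and its parameters are hypothetical, and it works on raw characters for simplicity, whereas the real strategy may operate on document elements and respect word boundaries.

```python
def basic_chunks(text: str, max_characters: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_characters characters.

    Hypothetical helper for illustration only. Each chunk after the first
    begins with the last `overlap` characters of the chunk before it.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")

    step = max_characters - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in basic_chunks("abcdefghij", max_characters=4, overlap=1):
        print(repr(chunk))
```

A regression test in this spirit would check that each chunk stays within the size limit and that consecutive chunks share the expected overlapping text.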
2024-01-12 20:27:34 +00:00

filename	doctype	connector	cct-accuracy	cct-%missing
fake-text.txt	txt	Sharepoint	1.0	0.0
ideas-page.html	html	Sharepoint	0.93	0.033
stanley-cups.xlsx	xlsx	Sharepoint	0.778	0.0
Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf	pdf	azure	0.981	0.007
IRS-form-1987.pdf	pdf	azure	0.783	0.135
spring-weather.html	html	azure	0.0	0.018
example-10k.html	html	local	0.727	0.037
fake-html-cp1252.html	html	local	0.659	0.0
ideas-page.html	html	local	0.93	0.033
UDHR_first_article_all.txt	txt	local-single-file	0.995	0.0
handbook-1p.docx	docx	local-single-file-basic-chunking	0.858	0.029
fake-html-cp1252.html	html	local-single-file-with-encoding	0.659	0.0
layout-parser-paper-with-table.jpg	jpg	local-single-file-with-pdf-infer-table-structure	0.716	0.032
layout-parser-paper.pdf	pdf	local-single-file-with-pdf-infer-table-structure	0.949	0.029
2023-Jan-economic-outlook.pdf	pdf	s3	0.845	0.039
page-with-formula.pdf	pdf	s3	0.971	0.021
recalibrating-risk-report.pdf	pdf	s3	0.968	0.008