Mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2025-10-30 17:38:13 +00:00

2f2c48acd5

			The new "basic" chunking strategy and overlap options need to be available from the ingest CLI. An ingest test of those features is also welcome, both to verify the ingest feature and to defend against regressions in the chunking code. Add a local ingest test exercising both the "basic" chunking strategy and intra-chunk overlap. Since there is no new source connector involved, use the local ingest source and destination. Update documentation to suit, filling in some details that hadn't made it into the docs yet.
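
As a rough illustration of what chunk overlap means in general, the sketch below splits one oversized text into pieces of bounded length and starts each later piece with the tail of the previous one. It is not the library's implementation and does not show the ingest CLI. The function name `split_with_overlap` and its parameters `max_characters` and `overlap` are hypothetical names chosen for this example, and the sketch splits at fixed character offsets for simplicity.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split `text` into chunks of at most `max_characters` characters.

    Every chunk after the first begins with the last `overlap` characters
    of the chunk before it. Hypothetical helper for illustration only.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")

    chunks: List[str] = []
    step = max_characters - overlap  # new characters contributed by each chunk
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", max_characters=4, overlap=1):
        print(repr(chunk))
```

With `overlap=0` the pieces simply tile the text. A larger overlap repeats more context at each boundary, at the cost of producing more chunks.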
		
			
				
	
	
	
		
| filename | doctype | connector | cct-accuracy | cct-%missing |
|---|---|---|---|---|
| fake-text.txt | txt | Sharepoint | 1.0 | 0.0 |
| ideas-page.html | html | Sharepoint | 0.93 | 0.033 |
| stanley-cups.xlsx | xlsx | Sharepoint | 0.778 | 0.0 |
| Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf | pdf | azure | 0.981 | 0.007 |
| IRS-form-1987.pdf | pdf | azure | 0.783 | 0.135 |
| spring-weather.html | html | azure | 0.0 | 0.018 |
| example-10k.html | html | local | 0.727 | 0.037 |
| fake-html-cp1252.html | html | local | 0.659 | 0.0 |
| ideas-page.html | html | local | 0.93 | 0.033 |
| UDHR_first_article_all.txt | txt | local-single-file | 0.995 | 0.0 |
| handbook-1p.docx | docx | local-single-file-basic-chunking | 0.858 | 0.029 |
| fake-html-cp1252.html | html | local-single-file-with-encoding | 0.659 | 0.0 |
| layout-parser-paper-with-table.jpg | jpg | local-single-file-with-pdf-infer-table-structure | 0.716 | 0.032 |
| layout-parser-paper.pdf | pdf | local-single-file-with-pdf-infer-table-structure | 0.949 | 0.029 |
| 2023-Jan-economic-outlook.pdf | pdf | s3 | 0.845 | 0.039 |
| page-with-formula.pdf | pdf | s3 | 0.971 | 0.021 |
| recalibrating-risk-report.pdf | pdf | s3 | 0.968 | 0.008 |