2023-05-01 18:17:52 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
Unstructured Core Library
=========================
 
							 
						 
					
						
							
								
									
										
										
										
The ``unstructured`` library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed
using the ``unstructured`` library include PDFs, XML and HTML documents.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
Library Documentation
---------------------
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
:doc:`installing`
  Instructions on how to install the ``unstructured`` library on your system.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
:doc:`api`
  Access all the power of ``unstructured`` through the ``unstructured-api`` or learn to host it locally.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
:doc:`platform`
  Explore the enterprise-grade platform for enterprises and high-growth companies with large data volumes that want to automatically retrieve, transform, and stage their data for LLMs.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
:doc:`core`
  Learn more about the core partitioning, chunking, cleaning, and staging functionality within the
  Unstructured library.
 
							 
						 
					
						
							
								
									
										
										
										
:doc:`ingest/index`
  Connect to your favorite data storage platforms for effortless batch processing of your files.
 
							 
						 
					
						
							
								
									
										
										
										
:doc:`metadata`
  Learn more about how metadata is tracked in the ``unstructured`` library.
 
							 
						 
					
						
							
								
									
										
										
										
:doc:`examples`
  Examples of other types of workflows within the ``unstructured`` package.
 
							 
						 
					
						
							
								
									
										
										
										
:doc:`integrations`
  We make it easy for you to connect your output with other popular ML services.
 
							 
						 
					
						
							
								
									
										
										
										
:doc:`best_practices`
  Learn best practices to optimize document information extraction using the ``unstructured`` library.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
.. Hidden TOCs
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
.. toctree::
   :caption: Documentation
   :maxdepth: 2
   :hidden:

   introduction
   installing
   api
   platform
   core
   ingest/index
   metadata
   examples
   integrations
   best_practices