								Quick Start
								===========
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								Installation
								------------
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								This guide offers concise steps to install and validate your ``unstructured`` installation quickly. For a more comprehensive installation guide, please refer to `this page <http://localhost:63342/CHANGELOG.md/docs/build/html/installing.html>`__.
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								1. **Installing the Python SDK**:
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   You can install the core SDK using pip:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   .. code-block:: bash

								      pip install unstructured
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   Plain text files, HTML, XML, JSON, and emails are supported immediately, without any additional dependencies.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   If you need to process other document types, you can install the required extras by following the :doc:`../installation/full_installation` guide.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								2. **System Dependencies**:
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   Ensure the following system dependencies are installed. Your requirements may vary with the document types you're handling:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   -  `libmagic-dev` 
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   -  `poppler-utils` 
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   -  `tesseract-ocr` 
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								   -  `libreoffice` 
 
							 
						 
					
						
							
								
									
										
											 
										
											
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								   - `pandoc`: For EPUBs, RTFs, and Open Office documents. To handle RTF files, you need version `2.14.2` or newer. Running `this script <https://github.com/Unstructured-IO/unstructured/blob/main/scripts/install-pandoc.sh>`__ will install a suitable version for you.
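
To check whether the ``pandoc`` already on your machine meets the ``2.14.2`` requirement for RTF input, a minimal sketch along these lines may help. The helper names ``parse_pandoc_version`` and ``pandoc_supports_rtf`` are hypothetical, not part of ``unstructured``, and the sketch assumes ``pandoc --version`` prints the version number on its first line.

```python
import re
import shutil
import subprocess

# Minimum pandoc version for RTF input, taken from the note above.
MIN_PANDOC = (2, 14, 2)


def parse_pandoc_version(version_output: str) -> tuple:
    """Extract a version tuple such as (3, 1, 2) from `pandoc --version` output."""
    match = re.search(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("could not find a pandoc version string")
    return tuple(int(part) for part in match.group(1).split("."))


def pandoc_supports_rtf() -> bool:
    """Return True if a pandoc binary on PATH is at least MIN_PANDOC."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not installed or not on PATH
    result = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=True
    )
    return parse_pandoc_version(result.stdout) >= MIN_PANDOC
```

Calling ``pandoc_supports_rtf()`` before processing RTF files gives an early, readable signal instead of an "Invalid input format" error at conversion time.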
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								Validating Installation
								-----------------------
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								After installation, confirm the setup by running the following Python code:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								.. code-block:: python

								   from unstructured.partition.auto import partition

								   # Emails need no extra dependencies, so this works with the core install
								   elements = partition(filename="example-docs/eml/fake-email.eml")
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								If you've opted for the "local-inference" installation, you should also be able to execute:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								.. code-block:: python

								   from unstructured.partition.auto import partition

								   # Requires the "local-inference" installation
								   elements = partition("example-docs/layout-parser-paper.pdf")
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								If these code snippets run without errors, congratulations! Your ``unstructured`` installation is successful and ready for use.
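
If you want a quick preliminary check before running the snippets above, the following sketch uses only the standard library to see whether the package can be found on the import path. The helper name ``is_installed`` is hypothetical. Finding the package does not prove that optional extras or system dependencies are present.

```python
import importlib.util


def is_installed(module_name: str = "unstructured") -> bool:
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("unstructured is importable")
    else:
        print("unstructured was not found; try `pip install unstructured`")
```

This only confirms that the core package is discoverable; the partitioning snippets above remain the real validation.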
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								The following section will cover basic concepts and usage patterns in ``unstructured``.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								After reading this section, you should be able to:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								* Partition a document with the ``partition`` function.
								* Understand how documents are structured in ``unstructured``.
								* Convert a document to a dictionary and/or save it as JSON.
  
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								The example documents in this section come from the
								`example-docs <https://github.com/Unstructured-IO/unstructured/tree/main/example-docs>`_
								directory in the ``unstructured`` repo.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								Before running the code in this section, make sure you've installed the ``unstructured`` library
								and all dependencies using the instructions in the `Quick Start <https://unstructured-io.github.io/unstructured/installing.html#quick-start>`_ section.
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								Partitioning a document
								-----------------------
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								In this section, we'll cut right to the chase and get to the most important part of the library: partitioning a document.
								The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections,
								and extract the text associated with those sections. Depending on the document type, ``unstructured`` uses different methods for
								partitioning a document. We'll cover those in a later section. For now, we'll use the simplest API in the library,
								the ``partition`` function. The ``partition`` function detects the filetype of the source document and routes it to the appropriate
								partitioning function. You can try out the ``partition`` function by running the code below.
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								.. code:: python

								    from unstructured.partition.auto import partition

								    # Detect the file type and route to the matching partitioning function
								    elements = partition(filename="example-10k.html")
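
To get a feel for what ``partition`` returned, a small helper like the one below can tally the elements and preview their text. This is a sketch rather than library API: ``summarize`` is a hypothetical name, and it assumes that each element's class name reflects its category and that ``str(element)`` yields the element's text.

```python
from collections import Counter


def summarize(elements, preview_chars: int = 60):
    """Count elements by class name and collect a short text preview of each."""
    # Assumption: the class name (e.g. Title, NarrativeText) is the category.
    counts = Counter(type(el).__name__ for el in elements)
    # Assumption: str(el) returns the element's extracted text.
    previews = [str(el)[:preview_chars] for el in elements]
    return counts, previews
```

With the ``elements`` list from the example above, ``counts, previews = summarize(elements)`` would show how many elements of each kind were found and the first few characters of each.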
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								You can also pass in a file as a file-like object using the following workflow:
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
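								.. code:: python

								    from unstructured.partition.auto import partition

								    # Open in binary mode and pass the file object instead of a filename
								    with open("example-10k.html", "rb") as f:
								        elements = partition(file=f)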
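
The same workflow should apply when the document is already in memory, for example after a download, since ``io.BytesIO`` provides a binary file-like object. The sketch below only builds the buffer with the standard library; ``to_file_like`` is a hypothetical helper, and the ``partition`` call is left as a comment because it assumes ``unstructured`` is installed.

```python
import io


def to_file_like(data: bytes) -> io.BytesIO:
    """Wrap in-memory bytes in a seekable binary stream positioned at the start."""
    return io.BytesIO(data)


html_bytes = b"<html><body><h1>Title</h1><p>Some text.</p></body></html>"
buffer = to_file_like(html_bytes)

# With unstructured installed, the buffer could then be passed as:
#     elements = partition(file=buffer)
```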
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								The ``partition`` function uses `libmagic <https://formulae.brew.sh/formula/libmagic>`_ for filetype detection. If ``libmagic`` is
								not present and the user passes a filename, ``partition`` falls back to detecting the filetype using the file extension.
								``libmagic`` is required if you'd like to pass a file-like object to ``partition``.
								We highly recommend installing ``libmagic``; you may observe different file detection behaviors
								if ``libmagic`` is not installed.
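
To illustrate why extension-based detection is the weaker fallback, here is a sketch using Python's standard ``mimetypes`` module. This is not how ``unstructured`` detects filetypes internally, and ``guess_type_from_extension`` is a hypothetical helper. It only shows that an extension-only guess never looks at the file's contents.

```python
import mimetypes
from typing import Optional


def guess_type_from_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the filename's extension alone (contents are not read)."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

A file with a missing or misleading extension yields ``None`` or a wrong guess here, and a file-like object has no filename to inspect at all. Content-based detection with ``libmagic`` avoids both problems.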
 
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								Quickstart Tutorial
								-------------------
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								If you're eager to dive in, head over to `Getting Started <https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW#scrollTo=jZp37lfueaeZ>`__ on Google Colab for a hands-on introduction to the ``unstructured`` library. In a few minutes, you'll have a basic workflow set up and running!
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
								
							 
							
								 
							
							
								For more detailed information about specific components or advanced features, explore the rest of the documentation.