mirror of
				https://github.com/Unstructured-IO/unstructured.git
				synced 2025-10-21 13:03:15 +00:00 
			
		
		
		
	
		
			
	
	
		
			28 lines
		
	
	
		
			889 B
		
	
	
	
		
			Org Mode
		
	
	
	
	
	
		
		
			
		
	
	
			28 lines
		
	
	
		
			889 B
		
	
	
	
		
			Org Mode
		
	
	
	
	
	
|   | * Example Docs | ||
|  | 
 | ||
|  | The sample docs directory contains the following files: | ||
|  | 
 | ||
|  | -  ~example-10k.html~ - A 10-K SEC filing in HTML format | ||
|  | -  ~layout-parser-paper.pdf~ - A PDF copy of the layout parser paper | ||
|  | -  ~factbook.xml~ / ~factbook.xsl~ - Example XML/XLS files that you | ||
|  |    can use to test stylesheets | ||
|  | 
 | ||
|  | These documents can be used to test out the parsers in the library. In | ||
|  | addition, here are instructions for pulling in some sample docs that are | ||
|  | too big to store in the repo. | ||
|  | 
 | ||
|  | ** XBRL 10-K | ||
|  | 
 | ||
|  | You can get an example 10-K in inline XBRL format using the following | ||
|  | ~curl~. Note, you need to have the user agent set in the header or the | ||
|  | SEC site will reject your request. | ||
|  | 
 | ||
|  | #+BEGIN_SRC bash | ||
|  | 
 | ||
|  |    curl -O \ | ||
|  |      -A '${organization} ${email}' | ||
|  |      https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt | ||
|  | #+END_SRC | ||
|  | 
 | ||
|  | You can parse this document using the HTML parser. |