Jerry Liu e631266036
cr (#56)
Co-authored-by: Jerry Liu <jerry@robustintelligence.com>
2023-02-22 09:43:29 -08:00

Unstructured.io File Loader

This loader extracts text from a variety of unstructured files using Unstructured.io. The currently supported file extensions are .txt, .docx, .pptx, .jpg, .png, .eml, .html, and .pdf. Each call to load_data takes a single local file.
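Since each call handles one file with one of the extensions above, it can help to filter paths before loading. The following is only a minimal sketch: SUPPORTED_EXTENSIONS and is_supported are hypothetical names that are not part of the loader, and the set simply mirrors the list in this README rather than anything the loader enforces.

```python
from pathlib import Path

# Extensions listed in this README. This set is an illustration only;
# the loader itself may accept more or fewer file types.
SUPPORTED_EXTENSIONS = {
    ".txt", ".docx", ".pptx", ".jpg", ".png", ".eml", ".html", ".pdf",
}


def is_supported(file: Path) -> bool:
    """Return True if the file's extension appears in the README's list."""
    # Compare case-insensitively so "report.PDF" is treated like "report.pdf".
    return file.suffix.lower() in SUPPORTED_EXTENSIONS
```

A caller could skip any path where is_supported returns False instead of passing it to load_data.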

Check out their documentation for more details. Notably, this lets you parse unstructured data for many use-cases. For example, you can download the 10-K SEC filings of public companies (e.g. Coinbase) and feed them directly into this loader without worrying about cleaning up the formatting or HTML tags.

Usage

To use this loader, pass in a Path to a local file. Optionally, set split_documents to True if you want each element generated by Unstructured.io to be placed in a separate document. This guarantees that those elements stay separate when an index is created in LlamaIndex, which, depending on your use-case, could be a smarter form of text-splitting. split_documents defaults to False.

from pathlib import Path
from llama_index import download_loader

# Fetch the loader class by name
UnstructuredReader = download_loader("UnstructuredReader")

loader = UnstructuredReader()
# Parse a single local file into documents
documents = loader.load_data(file=Path('./10k_filing.html'))
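To make the effect of split_documents concrete, here is a minimal sketch of the idea. It is not the loader's actual implementation: build_documents is a hypothetical helper, documents are represented as plain strings, and joining elements with a blank line is an assumption made for illustration.

```python
from typing import List


def build_documents(elements: List[str], split_documents: bool = False) -> List[str]:
    """Illustrate the two modes described above (hypothetical helper).

    elements stands in for the text of each element produced by
    Unstructured.io for one file.
    """
    if split_documents:
        # One document per element, so elements are never merged together.
        return list(elements)
    # Default: all elements combined into a single document.
    # The blank-line separator is an assumption for this sketch.
    return ["\n\n".join(elements)]
```

With the default, a file yields one document. With split_documents=True, it yields as many documents as there are elements.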

You can also use this loader together with SimpleDirectoryReader to parse certain file types throughout a directory with Unstructured.io.

from pathlib import Path
from llama_index import download_loader

SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

# Map each file extension to the name of the loader that should parse it
loader = SimpleDirectoryReader('./data', file_extractor={
  ".pdf": "UnstructuredReader",
  ".html": "UnstructuredReader",
  ".eml": "UnstructuredReader",
  ".pptx": "PptxReader"
})
documents = loader.load_data()
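The file_extractor mapping above amounts to a lookup by file extension. The sketch below illustrates that idea only. pick_reader is a hypothetical function, not part of SimpleDirectoryReader, and the real reader's fallback behavior for unlisted extensions is not described here, so the default argument is an assumption.

```python
from pathlib import Path
from typing import Dict, Optional

# Same mapping as in the example above.
FILE_EXTRACTOR: Dict[str, str] = {
    ".pdf": "UnstructuredReader",
    ".html": "UnstructuredReader",
    ".eml": "UnstructuredReader",
    ".pptx": "PptxReader",
}


def pick_reader(
    file: Path,
    file_extractor: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the loader name registered for this file's extension."""
    # Fall back to `default` when the extension has no entry (assumed behavior).
    return file_extractor.get(file.suffix.lower(), default)
```

Here a .pdf, .html, or .eml file maps to UnstructuredReader, a .pptx file maps to PptxReader, and anything else gets the default.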

This loader is designed to load data into LlamaIndex and/or to be used subsequently as a Tool in a LangChain Agent. See here for examples.

Troubleshooting

"failed to find libmagic" error: Try pip install python-magic-bin==0.4.14. Solution documented here. On macOS, you may also try brew install libmagic.
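To check for this problem before running the loader, a quick import test can help. This is a minimal sketch, assuming the error comes from the magic module provided by python-magic or python-magic-bin. libmagic_available is a hypothetical helper, not part of the loader.

```python
def libmagic_available() -> bool:
    """Return True if the `magic` module imports cleanly in this environment."""
    try:
        # With python-magic installed but no libmagic library on the system,
        # this import is expected to fail with the "failed to find libmagic" error.
        import magic  # noqa: F401
    except (ImportError, OSError):
        return False
    return True


if __name__ == "__main__":
    if libmagic_available():
        print("libmagic looks usable")
    else:
        print("libmagic not found; see the troubleshooting steps above")
```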