# LlamaHub 🦙
This is a simple library of all the data loaders / readers that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sources. These are general-purpose utilities that are meant to be used in [LlamaIndex](https://github.com/jerryjliu/gpt_index/tree/main/gpt_index) (e.g. when building an index) and [LangChain](https://github.com/hwchase17/langchain) (e.g. when building the different tools an agent can use). For example, there are loaders to parse Google Docs, SQL databases, PDF files, PowerPoints, Notion, Slack, Obsidian, and many more. Note that because different loaders produce the same types of Documents, you can easily use them together in the same index.
Check out our website here: https://llamahub.ai/.
![Website screenshot](https://scrabble-dictionary.s3.us-west-2.amazonaws.com/Screen+Shot+2023-02-11+at+12.45.44+PM.png)
## Usage
These general-purpose loaders are designed to load data into [LlamaIndex](https://github.com/jerryjliu/gpt_index/tree/main/gpt_index) and/or to be used as a Tool in a [LangChain](https://github.com/hwchase17/langchain) Agent. **You can use them with `download_loader` from LlamaIndex in a single line of code!** For example, see the code snippets below using the Google Docs Loader.
### LlamaIndex
```python
from llama_index import GPTSimpleVectorIndex, download_loader
GoogleDocsReader = download_loader('GoogleDocsReader')
gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = GPTSimpleVectorIndex(documents)
index.query('Where did the author go to school?')
```
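Rebuilding the index re-embeds every document, which costs time and tokens. Continuing from the snippet above, you may want to persist the index between runs. This is a sketch assuming the `save_to_disk`/`load_from_disk` methods available on index classes in early LlamaIndex versions; check the docs for your installed version if the API differs:

```python
from llama_index import GPTSimpleVectorIndex

# Persist the index so documents don't need to be re-embedded next time.
index.save_to_disk('index.json')

# Later, restore it and query as usual.
index = GPTSimpleVectorIndex.load_from_disk('index.json')
index.query('Where did the author go to school?')
```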
### LangChain
Note: Make sure you change the description of the `Tool` to match your use case.
```python
from llama_index import GPTSimpleVectorIndex, download_loader
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
# load documents
GoogleDocsReader = download_loader('GoogleDocsReader')
gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
langchain_documents = [d.to_langchain_format() for d in documents]
# initialize sample QA chain
llm = OpenAI(temperature=0)
qa_chain = load_qa_chain(llm)
question = "<query here>"
answer = qa_chain.run(input_documents=langchain_documents, question=question)
```
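The note above mentions customizing the `Tool` description, but the snippet stops short of showing one. Here is a hedged sketch of wrapping a LlamaIndex query as a Tool for a LangChain agent, using the agent API current at the time of writing (`initialize_agent` with the zero-shot ReAct agent). The description string is an assumption you should adapt to your own data:

```python
from langchain.agents import Tool, initialize_agent
from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, download_loader

# Build an index from a Google Doc, as in the snippets above.
GoogleDocsReader = download_loader('GoogleDocsReader')
documents = GoogleDocsReader().load_data(
    document_ids=['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec'])
index = GPTSimpleVectorIndex(documents)

tools = [
    Tool(
        name="Google Docs Index",
        func=lambda q: str(index.query(q)),
        # The agent uses this description to decide when to call the tool,
        # so change it to match your use case.
        description="Useful for answering questions about the loaded Google Docs.",
    ),
]
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
agent.run("Where did the author go to school?")
```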
## How to add a loader
Adding a loader simply requires forking this repo and making a Pull Request. The Loader Hub website will update automatically. However, please keep in mind the following guidelines when making your PR.
### Step 1: Create a new directory
In `loader_hub`, create a new directory for your new loader. It can be nested within another directory, but give it a unique name, because the name of the directory becomes the identifier for your loader (e.g. `google_docs`). Inside your new directory, create an `__init__.py` file (which can be empty), a `base.py` file that will contain your loader implementation, and, if needed, a `requirements.txt` file listing your loader's package dependencies. Those packages are installed automatically when your loader is used, so you don't need to worry about dependency management!
If you'd like, you can create the new directory and files by running the following script in the `loader_hub` directory. Just remember to put your dependencies into a `requirements.txt` file.
```bash
./add_loader.sh [NAME_OF_NEW_DIRECTORY]
```
### Step 2: Write your README
Inside your new directory, create a `README.md` that mirrors the existing ones. It should summarize what your loader does, its inputs, and how it's used in the context of LlamaIndex and LangChain.
### Step 3: Add your loader to the library.json file
Finally, add your loader to the `loader_hub/library.json` file so that others can use it. As exemplified by the current file, add the class name of your loader along with its id, author, etc. This file is referenced by the Loader Hub website and the download function within LlamaIndex.
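For illustration, an entry for a hypothetical Google Docs loader might look like the fragment below. The exact field names are an assumption based on the description above, so mirror an existing entry in `library.json` rather than copying this verbatim:

```json
{
  "GoogleDocsReader": {
    "id": "google_docs",
    "author": "jerryjliu"
  }
}
```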
### Step 4: Make a Pull Request!
Create a PR against the main branch. We typically review the PR within a day. To help expedite the process, it may be helpful to provide screenshots (either in the PR or in the README directly) showing your data loader in action!
## FAQ
### How do I test my loader before it's merged?
There is an argument called `loader_hub_url` in [`download_loader`](https://github.com/jerryjliu/gpt_index/blob/main/gpt_index/readers/download.py) that defaults to the main branch of this repo. You can set it to your branch or fork to test your new loader.
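As a sketch, pointing the loader at a fork might look like this. The URL below is hypothetical; check the default value of `loader_hub_url` in `download.py` for the exact URL format, then substitute your own username and branch:

```python
from llama_index import download_loader

GoogleDocsReader = download_loader(
    'GoogleDocsReader',
    loader_hub_url='https://raw.githubusercontent.com/<your-username>/llama-hub/<your-branch>/loader_hub',
)
```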
### Should I create a PR against LlamaHub or the LlamaIndex repo directly?
If you have a data loader PR, please create it against LlamaHub by default! We will make exceptions in certain cases (for instance, if we think the data loader should be core to the LlamaIndex repo).
For all other PRs relevant to LlamaIndex, please create them directly against the [LlamaIndex repo](https://github.com/jerryjliu/gpt_index).
### Other questions?
Feel free to hop into the [community Discord](https://discord.gg/dGcwcsnxhU) or tag the official [Twitter account](https://twitter.com/gpt_index)!