GitHub Repository Loader
This loader takes in owner, repo, branch, commit_sha and other optional parameters, such as filters that include or exclude directories or that allow only files with given extensions. It then fetches all the contents of the GitHub repository.
As a prerequisite, you will need to generate a "classic" personal access token with the repo and read:org scopes. See GitHub's documentation on personal access tokens for instructions.
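Because the examples in this README read the token from the GITHUB_TOKEN environment variable, a small guard can surface a missing token before any request is made. This is a minimal sketch; the helper name require_github_token is hypothetical and not part of the loader.

```python
import os


def require_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is missing or blank.

    Hypothetical helper for illustration; it is not part of the loader.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; export a classic personal access token first."
        )
    return token
```

The returned string could then be passed to GithubClient in place of os.getenv("GITHUB_TOKEN").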
Usage
To use this loader, pass the owner and repo to the reader, then pass either branch or commit_sha to load_data. For example, you can use owner = jerryjliu and repo = llama_index, together with either branch = main or commit_sha = a6c89159bf8e7086bea2f4305cff3f0a4102e370.
export GITHUB_TOKEN='...'
import os
from llama_index import download_loader
download_loader("GithubRepositoryReader")
from llama_index.readers.llamahub_modules.github_repo import GithubRepositoryReader, GithubClient
github_client = GithubClient(os.getenv("GITHUB_TOKEN"))
loader = GithubRepositoryReader(
    github_client,
    owner="jerryjliu",
    repo="llama_index",
    filter_directories=(["gpt_index", "docs"], GithubRepositoryReader.FilterType.INCLUDE),
    filter_file_extensions=([".py"], GithubRepositoryReader.FilterType.INCLUDE),
    verbose=True,
    concurrent_requests=10,
)
docs = loader.load_data(branch="main")
# alternatively, load from a specific commit:
# docs = loader.load_data(commit_sha="a6c89159bf8e7086bea2f4305cff3f0a4102e370")
for doc in docs:
    print(doc.extra_info)
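The filter_directories and filter_file_extensions arguments above each pair a list with a FilterType. To make the intended effect concrete, here is a rough sketch of how such filters might be applied to a file path. It is not the loader's actual implementation: the FilterType enum below is a local stand-in, EXCLUDE is assumed to be the counterpart of INCLUDE, and keep_file is a hypothetical function.

```python
import enum
import os
from typing import List, Optional, Tuple


class FilterType(enum.Enum):
    # Local stand-in for GithubRepositoryReader.FilterType (assumed members).
    EXCLUDE = enum.auto()
    INCLUDE = enum.auto()


Filter = Tuple[List[str], FilterType]


def keep_file(
    path: str,
    filter_directories: Optional[Filter] = None,
    filter_file_extensions: Optional[Filter] = None,
) -> bool:
    """Return True if path passes both filters (assumed semantics)."""
    if filter_directories is not None:
        directories, mode = filter_directories
        # A path matches when it is the directory itself or lives beneath it.
        in_listed_dir = any(
            path == d or path.startswith(d.rstrip("/") + "/") for d in directories
        )
        # INCLUDE keeps only matches; EXCLUDE keeps only non-matches.
        if in_listed_dir != (mode is FilterType.INCLUDE):
            return False
    if filter_file_extensions is not None:
        extensions, mode = filter_file_extensions
        has_listed_ext = os.path.splitext(path)[1] in extensions
        if has_listed_ext != (mode is FilterType.INCLUDE):
            return False
    return True


dirs = (["gpt_index", "docs"], FilterType.INCLUDE)
exts = ([".py"], FilterType.INCLUDE)
print(keep_file("gpt_index/readers/base.py", dirs, exts))  # True
print(keep_file("docs/index.md", dirs, exts))  # False: wrong extension
print(keep_file("tests/test_x.py", dirs, exts))  # False: directory not listed
```

Under these assumptions, a file must pass both filters to be loaded, so combining the two narrows the result rather than widening it.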
Examples
This loader is designed to be used as a way to load data into Llama Index and/or subsequently used as a Tool in a LangChain Agent.
Llama Index
export OPENAI_API_KEY='...'
export GITHUB_TOKEN='...'
import pickle
import os
from llama_index import download_loader, GPTVectorStoreIndex
download_loader("GithubRepositoryReader")
from llama_index.readers.llamahub_modules.github_repo import GithubClient, GithubRepositoryReader
# Reuse previously fetched documents if a local cache exists.
docs = None
if os.path.exists("docs.pkl"):
    with open("docs.pkl", "rb") as f:
        docs = pickle.load(f)

# Otherwise fetch the repository contents and cache them for next time.
if docs is None:
    github_client = GithubClient(os.getenv("GITHUB_TOKEN"))
    loader = GithubRepositoryReader(
        github_client,
        owner="jerryjliu",
        repo="llama_index",
        filter_directories=(["gpt_index", "docs"], GithubRepositoryReader.FilterType.INCLUDE),
        filter_file_extensions=([".py"], GithubRepositoryReader.FilterType.INCLUDE),
        verbose=True,
        concurrent_requests=10,
    )
    docs = loader.load_data(branch="main")
    with open("docs.pkl", "wb") as f:
        pickle.dump(docs, f)

index = GPTVectorStoreIndex.from_documents(docs)
index.query("Explain each LlamaIndex class?")
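The pickle caching in the example above avoids refetching the repository on every run. The same pattern can be factored into a small reusable function, sketched below. The name load_or_build is hypothetical and not part of the loader or of Llama Index.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(cache_path: str, build: Callable[[], Any]) -> Any:
    """Return the unpickled cache if it exists, else call build() and cache it.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With this helper, the fetch step might be written as docs = load_or_build("docs.pkl", lambda: loader.load_data(branch="main")). The cache is never invalidated automatically, so delete docs.pkl to pick up new commits, and only unpickle files you created yourself.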