llama-hub/loader_hub/file/epub/base.py

"""Epub Reader.

A parser for epub files.
"""

from pathlib import Path
from typing import Dict, List, Optional

from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document


class EpubReader(BaseReader):
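    """Epub reader.

    Extracts the text of an EPUB file by converting each HTML document
    item in the book to plain text.
    """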

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None
    ) -> List[Document]:
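        """Parse an EPUB file into a single Document.

        Args:
            file: Path to the EPUB file.
            extra_info: Optional metadata to attach to the returned Document.

        Returns:
            A one-element list holding the text of all document items,
            converted from HTML and joined with newlines.
        """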
        # Imported here so ebooklib and html2text are only needed when
        # load_data runs.
        import ebooklib
        import html2text
        from ebooklib import epub

        text_list = []
        book = epub.read_epub(file, options={"ignore_ncx": True})

        # Iterate over every item in the book, not only chapters.
        for item in book.get_items():
            # Chapter text typically lives in items of type ITEM_DOCUMENT.
            if item.get_type() == ebooklib.ITEM_DOCUMENT:
                text_list.append(
                    html2text.html2text(item.get_content().decode("utf-8"))
                )

        text = "\n".join(text_list)
        return [Document(text, extra_info=extra_info)]