mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00
feat: add partition_epub function (#364)

* add pypandoc dependency
* added epub partitioner and file conversion
* test for partition_epub
* tests for file conversion
* add epub to filetype detection
* added epub to auto partition
* update bricks docs
* updated installing docs
* changelog and version
* add pandoc to dependencies
* add pandoc to debian dependencies
* linting, linting, linting
* typo fix
* typo fix
* file conversion type hints
* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
parent
aa494623a2
commit
e43cb0e6e0
2
.github/workflows/ci.yml
vendored
@ -105,7 +105,7 @@ jobs:
|
|||||||
source .venv/bin/activate
|
source .venv/bin/activate
|
||||||
make install-detectron2
|
make install-detectron2
|
||||||
sudo apt-get update
|
sudo apt-get update
|
||||||
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
|
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc
|
||||||
make test
|
make test
|
||||||
make check-coverage
|
make check-coverage
|
||||||
make install-ingest-s3
|
make install-ingest-s3
|
||||||
|
@ -1,4 +1,4 @@
|
|||||||
## 0.5.4-dev7
|
## 0.5.4
|
||||||
|
|
||||||
### Enhancements
|
### Enhancements
|
||||||
|
|
||||||
@ -21,6 +21,7 @@
|
|||||||
|
|
||||||
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
|
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
|
||||||
from `FsspecConnector`
|
from `FsspecConnector`
|
||||||
|
* Add `partition_epub` for partitioning e-books in EPUB3 format.
|
||||||
|
|
||||||
### Fixes
|
### Fixes
|
||||||
|
|
||||||
|
@ -110,7 +110,7 @@ file to ensure your code matches the formatting and linting standards used in `u
|
|||||||
If you'd prefer not having code changes auto-tidied before every commit, you can use `make check` to see
|
If you'd prefer not having code changes auto-tidied before every commit, you can use `make check` to see
|
||||||
whether any linting or formatting changes should be applied, and `make tidy` to apply them.
|
whether any linting or formatting changes should be applied, and `make tidy` to apply them.
|
||||||
|
|
||||||
If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
|
If using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the
|
||||||
`pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit`
|
`pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decided to use `pre-commit`
|
||||||
you can also uninstall the hooks with `pre-commit uninstall`.
|
you can also uninstall the hooks with `pre-commit uninstall`.
|
||||||
|
|
||||||
@ -119,7 +119,7 @@ you can also uninstall the hooks with `pre-commit uninstall`.
|
|||||||
You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.
|
You can run this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the examples below.
|
||||||
|
|
||||||
The following examples show how to get started with the `unstructured` library.
|
The following examples show how to get started with the `unstructured` library.
|
||||||
You can parse **TXT**, **HTML**, **PDF**, **EML**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
|
You can parse **TXT**, **HTML**, **PDF**, **EML**, **EPUB**, **DOC**, **DOCX**, **PPT**, **PPTX**, **JPG**,
|
||||||
and **PNG** documents with one line of code!
|
and **PNG** documents with one line of code!
|
||||||
<br></br>
|
<br></br>
|
||||||
See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
|
See our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
|
||||||
|
@ -82,7 +82,7 @@ If you call the ``partition`` function, ``unstructured`` will attempt to detect
|
|||||||
file type and route it to the appropriate partitioning brick. All partitioning bricks
|
file type and route it to the appropriate partitioning brick. All partitioning bricks
|
||||||
called within ``partition`` are called using the default kwargs. Use the document-type
|
called within ``partition`` are called using the default kwargs. Use the document-type
|
||||||
specific bricks if you need to apply non-default settings.
|
specific bricks if you need to apply non-default settings.
|
||||||
``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
|
``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.epub``, ``.html``, ``.pdf``,
|
||||||
``.png``, ``.jpg``, and ``.txt`` files.
|
``.png``, ``.jpg``, and ``.txt`` files.
|
||||||
If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
|
If you set the ``include_page_breaks`` kwarg to ``True``, the output will include page breaks. This is only supported for ``.pptx``, ``.html``, ``.pdf``,
|
||||||
``.png``, and ``.jpg``.
|
``.png``, and ``.jpg``.
|
||||||
@ -306,6 +306,41 @@ Examples:
|
|||||||
elements = partition_email(text=text, include_headers=True)
|
elements = partition_email(text=text, include_headers=True)
|
||||||
|
|
||||||
|
|
||||||
``partition_epub``
---------------------

The ``partition_epub`` function processes e-books in EPUB3 format. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_epub``.


Examples:

.. code:: python

  from unstructured.partition.epub import partition_epub

  elements = partition_epub(filename="example-docs/winter-sports.epub")

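Because ``partition_epub`` depends on a system-level ``pandoc`` install, it can help to check for the executable before partitioning. The following is a minimal sketch using only the standard library; the helper names ``pandoc_available`` and ``require_pandoc`` are hypothetical and are not part of ``unstructured``.

```python
import shutil


def pandoc_available() -> bool:
    """Return True if a `pandoc` executable can be found on the PATH."""
    return shutil.which("pandoc") is not None


def require_pandoc() -> None:
    """Raise a descriptive error early instead of failing mid-conversion."""
    if not pandoc_available():
        raise FileNotFoundError(
            "pandoc was not found on PATH; see https://pandoc.org/installing.html"
        )
```

Calling ``require_pandoc()`` before ``partition_epub`` would surface a missing install up front, assuming ``pandoc`` is expected on the ``PATH`` rather than bundled some other way.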
|
``partition_md``
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
The ``partition_md`` function provides the ability to parse markdown files. The
|
||||||
|
following workflow shows how to use ``partition_md``.
|
||||||
|
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
|
||||||
|
.. code:: python
|
||||||
|
|
||||||
|
from unstructured.partition.md import partition_md
|
||||||
|
|
||||||
|
elements = partition_md(filename="README.md")
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
``partition_text``
|
``partition_text``
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
|
@ -15,6 +15,7 @@ installation.
|
|||||||
* ``poppler-utils`` (images and PDFs)
|
* ``poppler-utils`` (images and PDFs)
|
||||||
* ``tesseract-ocr`` (images and PDFs)
|
* ``tesseract-ocr`` (images and PDFs)
|
||||||
* ``libreoffice`` (MS Office docs)
|
* ``libreoffice`` (MS Office docs)
|
||||||
|
* ``pandoc`` (EPUBs)
|
||||||
|
|
||||||
* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
|
* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
|
||||||
* ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``
|
* ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"``
|
||||||
|
BIN
example-docs/winter-sports.epub
Normal file
Binary file not shown.
@ -4,6 +4,9 @@
|
|||||||
#
|
#
|
||||||
# pip-compile --output-file=requirements/base.txt
|
# pip-compile --output-file=requirements/base.txt
|
||||||
#
|
#
|
||||||
|
--extra-index-url https://pypi.ngc.nvidia.com
|
||||||
|
--trusted-host pypi.ngc.nvidia.com
|
||||||
|
|
||||||
anyio==3.6.2
|
anyio==3.6.2
|
||||||
# via httpcore
|
# via httpcore
|
||||||
argilla==1.4.0
|
argilla==1.4.0
|
||||||
@ -72,6 +75,8 @@ pydantic==1.10.6
|
|||||||
# via argilla
|
# via argilla
|
||||||
pygments==2.14.0
|
pygments==2.14.0
|
||||||
# via rich
|
# via rich
|
||||||
|
pypandoc==1.11
|
||||||
|
# via unstructured (setup.py)
|
||||||
python-dateutil==2.8.2
|
python-dateutil==2.8.2
|
||||||
# via pandas
|
# via pandas
|
||||||
python-docx==0.8.11
|
python-docx==0.8.11
|
||||||
|
@ -84,7 +84,7 @@ $sudo $pac install -y poppler-utils
|
|||||||
|
|
||||||
#### Tesseract
|
#### Tesseract
|
||||||
# Install tesseract as well as Russian language
|
# Install tesseract as well as Russian language
|
||||||
$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice
|
$sudo $pac install -y tesseract-ocr libtesseract-dev tesseract-ocr-rus libreoffice pandoc
|
||||||
|
|
||||||
#### libmagic
|
#### libmagic
|
||||||
$sudo $pac install -y libmagic-dev
|
$sudo $pac install -y libmagic-dev
|
||||||
|
1
setup.py
@ -56,6 +56,7 @@ setup(
|
|||||||
"openpyxl",
|
"openpyxl",
|
||||||
"pandas",
|
"pandas",
|
||||||
"pillow",
|
"pillow",
|
||||||
|
"pypandoc",
|
||||||
"python-docx",
|
"python-docx",
|
||||||
"python-pptx",
|
"python-pptx",
|
||||||
"python-magic",
|
"python-magic",
|
||||||
|
23
test_unstructured/file_utils/test_file_conversion.py
Normal file
@ -0,0 +1,23 @@
import os
import pathlib
from unittest.mock import patch

import pypandoc
import pytest

from unstructured.file_utils.file_conversion import convert_file_to_text

DIRECTORY = pathlib.Path(__file__).parent.resolve()


def test_convert_file_to_text():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    html_text = convert_file_to_text(filename, source_format="epub", target_format="html")
    assert html_text.startswith("<p>")


def test_convert_to_file_raises_if_pandoc_not_available():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    with patch.object(pypandoc, "convert_file", side_effect=FileNotFoundError):
        with pytest.raises(FileNotFoundError):
            convert_file_to_text(filename, source_format="epub", target_format="html")
|
@ -30,6 +30,7 @@ EXAMPLE_DOCS_DIRECTORY = os.path.join(FILE_DIRECTORY, "..", "..", "example-docs"
|
|||||||
("fake-html.html", FileType.HTML),
|
("fake-html.html", FileType.HTML),
|
||||||
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
||||||
("fake-power-point.pptx", FileType.PPTX),
|
("fake-power-point.pptx", FileType.PPTX),
|
||||||
|
("winter-sports.epub", FileType.EPUB),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_detect_filetype_from_filename(file, expected):
|
def test_detect_filetype_from_filename(file, expected):
|
||||||
@ -50,6 +51,7 @@ def test_detect_filetype_from_filename(file, expected):
|
|||||||
("fake-html.html", FileType.HTML),
|
("fake-html.html", FileType.HTML),
|
||||||
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
||||||
("fake-power-point.pptx", FileType.PPTX),
|
("fake-power-point.pptx", FileType.PPTX),
|
||||||
|
("winter-sports.epub", FileType.EPUB),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expected):
|
def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expected):
|
||||||
@ -73,6 +75,7 @@ def test_detect_filetype_from_filename_with_extension(monkeypatch, file, expecte
|
|||||||
("fake-html.html", FileType.HTML),
|
("fake-html.html", FileType.HTML),
|
||||||
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
("unsupported/fake-excel.xlsx", FileType.XLSX),
|
||||||
("fake-power-point.pptx", FileType.PPTX),
|
("fake-power-point.pptx", FileType.PPTX),
|
||||||
|
("winter-sports.epub", FileType.EPUB),
|
||||||
],
|
],
|
||||||
)
|
)
|
||||||
def test_detect_filetype_from_file(file, expected):
|
def test_detect_filetype_from_file(file, expected):
|
||||||
|
@ -277,3 +277,18 @@ def test_auto_with_page_breaks():
|
|||||||
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
|
filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
|
||||||
elements = partition(filename=filename, include_page_breaks=True)
|
elements = partition(filename=filename, include_page_breaks=True)
|
||||||
assert PageBreak() in elements
|
assert PageBreak() in elements
|
||||||
|
|
||||||
|
|
||||||
|
def test_auto_partition_epub_from_filename():
|
||||||
|
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
|
||||||
|
elements = partition(filename=filename)
|
||||||
|
assert len(elements) > 0
|
||||||
|
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
|
||||||
|
|
||||||
|
|
||||||
|
def test_auto_partition_epub_from_file():
|
||||||
|
filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
|
||||||
|
with open(filename, "rb") as f:
|
||||||
|
elements = partition(file=f)
|
||||||
|
assert len(elements) > 0
|
||||||
|
assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
|
||||||
|
21
test_unstructured/partition/test_epub.py
Normal file
@ -0,0 +1,21 @@
import os
import pathlib

from unstructured.partition.epub import partition_epub

DIRECTORY = pathlib.Path(__file__).parent.resolve()


def test_partition_epub_from_filename():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    elements = partition_epub(filename=filename)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")


def test_partition_epub_from_file():
    filename = os.path.join(DIRECTORY, "..", "..", "example-docs", "winter-sports.epub")
    with open(filename, "rb") as f:
        elements = partition_epub(file=f)
    assert len(elements) > 0
    assert elements[0].text.startswith("The Project Gutenberg eBook of Winter Sports")
|
@ -1 +1 @@
|
|||||||
__version__ = "0.5.4-dev7" # pragma: no cover
|
__version__ = "0.5.4" # pragma: no cover
|
||||||
|
49
unstructured/file_utils/file_conversion.py
Normal file
@ -0,0 +1,49 @@
import tempfile
from typing import IO, Optional

import pypandoc

from unstructured.partition.common import exactly_one


def convert_file_to_text(filename: str, source_format: str, target_format: str) -> str:
    """Uses pandoc to convert the source document to a raw text string."""
    try:
        # Pass the caller's formats through instead of hardcoding epub -> html.
        text = pypandoc.convert_file(filename, target_format, format=source_format)
    except FileNotFoundError as err:
        msg = (
            "Error converting the file to text. Ensure you have the pandoc "
            "package installed on your system. Install instructions are available at "
            "https://pandoc.org/installing.html. The original exception text was:\n"
            f"{err}"
        )
        raise FileNotFoundError(msg)

    return text


def convert_epub_to_html(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
) -> str:
    """Converts an EPUB document to HTML raw text. Enables an EPUB document to be
    processed using the partition_html function."""
    exactly_one(filename=filename, file=file)

    if file is not None:
        # pandoc needs a path on disk, so write the file-like object to a temp file.
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.write(file.read())
        tmp.close()
        html_text = convert_file_to_text(
            filename=tmp.name,
            source_format="epub",
            target_format="html",
        )
    elif filename is not None:
        html_text = convert_file_to_text(
            filename=filename,
            source_format="epub",
            target_format="html",
        )

    return html_text
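
In the file-like branch of `convert_epub_to_html`, the temporary file is created with `delete=False` and is not removed after the conversion. The following is a minimal sketch of one way to clean it up, using only the standard library. The helper name `spooled_to_disk` and its `suffix` parameter are hypothetical and are not part of `unstructured`.

```python
import contextlib
import os
import tempfile
from typing import IO, Iterator


@contextlib.contextmanager
def spooled_to_disk(file: IO[bytes], suffix: str = "") -> Iterator[str]:
    """Copy a binary file-like object to a named temp file and yield its path.

    The temp file is removed when the `with` block exits, even on error.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(file.read())
        tmp.close()  # close so another process (e.g. pandoc) can open the path
        yield tmp.name
    finally:
        tmp.close()  # safe to call again if already closed
        os.remove(tmp.name)
```

Under these assumptions, the conversion call would sit inside `with spooled_to_disk(file) as path:` and receive `path` as its `filename` argument.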
|
@ -47,6 +47,11 @@ MD_MIME_TYPES = [
|
|||||||
"text/x-markdown",
|
"text/x-markdown",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
EPUB_MIME_TYPES = [
|
||||||
|
"application/epub",
|
||||||
|
"application/epub+zip",
|
||||||
|
]
|
||||||
|
|
||||||
# NOTE(robinson) - .docx.xlsx files are actually zip file with a .docx/.xslx extension.
|
# NOTE(robinson) - .docx.xlsx files are actually zip file with a .docx/.xslx extension.
|
||||||
# If the MIME type is application/octet-stream, we check if it's a .docx/.xlsx file by
|
# If the MIME type is application/octet-stream, we check if it's a .docx/.xlsx file by
|
||||||
# looking for expected filenames within the zip file.
|
# looking for expected filenames within the zip file.
|
||||||
@ -94,6 +99,7 @@ class FileType(Enum):
|
|||||||
HTML = 50
|
HTML = 50
|
||||||
XML = 51
|
XML = 51
|
||||||
MD = 52
|
MD = 52
|
||||||
|
EPUB = 53
|
||||||
|
|
||||||
# Compressed Types
|
# Compressed Types
|
||||||
ZIP = 60
|
ZIP = 60
|
||||||
@ -123,6 +129,7 @@ EXT_TO_FILETYPE = {
|
|||||||
".ppt": FileType.PPT,
|
".ppt": FileType.PPT,
|
||||||
".rtf": FileType.RTF,
|
".rtf": FileType.RTF,
|
||||||
".json": FileType.JSON,
|
".json": FileType.JSON,
|
||||||
|
".epub": FileType.EPUB,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@ -180,6 +187,9 @@ def detect_filetype(
|
|||||||
# NOTE - I am not sure whether libmagic ever returns these mimetypes.
|
# NOTE - I am not sure whether libmagic ever returns these mimetypes.
|
||||||
return FileType.MD
|
return FileType.MD
|
||||||
|
|
||||||
|
elif mime_type in EPUB_MIME_TYPES:
|
||||||
|
return FileType.EPUB
|
||||||
|
|
||||||
elif mime_type in TXT_MIME_TYPES:
|
elif mime_type in TXT_MIME_TYPES:
|
||||||
if extension and extension == ".eml":
|
if extension and extension == ".eml":
|
||||||
return FileType.EML
|
return FileType.EML
|
||||||
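
The hunks above register EPUB in three places: a MIME type list, the `FileType` enum, and the extension map. The following self-contained sketch shows how those pieces could fit together. It is a simplification and not the library's `detect_filetype`: the function name `detect_filetype_sketch`, the `UNK` fallback member, and the MIME-then-extension order are assumptions made for illustration.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class FileType(Enum):
    UNK = 0  # assumed fallback member for unrecognized files
    EPUB = 53


EPUB_MIME_TYPES = ["application/epub", "application/epub+zip"]
EXT_TO_FILETYPE = {".epub": FileType.EPUB}


def detect_filetype_sketch(filename: str, mime_type: Optional[str] = None) -> FileType:
    """Use the MIME type when it is recognized; otherwise fall back to the extension."""
    if mime_type in EPUB_MIME_TYPES:
        return FileType.EPUB
    extension = Path(filename).suffix.lower()
    return EXT_TO_FILETYPE.get(extension, FileType.UNK)
```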
|
@ -4,6 +4,7 @@ from unstructured.file_utils.filetype import FileType, detect_filetype
|
|||||||
from unstructured.partition.doc import partition_doc
|
from unstructured.partition.doc import partition_doc
|
||||||
from unstructured.partition.docx import partition_docx
|
from unstructured.partition.docx import partition_docx
|
||||||
from unstructured.partition.email import partition_email
|
from unstructured.partition.email import partition_email
|
||||||
|
from unstructured.partition.epub import partition_epub
|
||||||
from unstructured.partition.html import partition_html
|
from unstructured.partition.html import partition_html
|
||||||
from unstructured.partition.image import partition_image
|
from unstructured.partition.image import partition_image
|
||||||
from unstructured.partition.json import partition_json
|
from unstructured.partition.json import partition_json
|
||||||
@ -59,6 +60,8 @@ def partition(
|
|||||||
include_page_breaks=include_page_breaks,
|
include_page_breaks=include_page_breaks,
|
||||||
encoding=encoding,
|
encoding=encoding,
|
||||||
)
|
)
|
||||||
|
elif filetype == FileType.EPUB:
|
||||||
|
return partition_epub(filename=filename, file=file, include_page_breaks=include_page_breaks)
|
||||||
elif filetype == FileType.MD:
|
elif filetype == FileType.MD:
|
||||||
return partition_md(filename=filename, file=file, include_page_breaks=include_page_breaks)
|
return partition_md(filename=filename, file=file, include_page_breaks=include_page_breaks)
|
||||||
elif filetype == FileType.PDF:
|
elif filetype == FileType.PDF:
|
||||||
|
32
unstructured/partition/epub.py
Normal file
@ -0,0 +1,32 @@
from typing import IO, List, Optional

from unstructured.documents.elements import Element
from unstructured.file_utils.file_conversion import convert_epub_to_html
from unstructured.partition.html import partition_html


def partition_epub(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    include_page_breaks: bool = False,
) -> List[Element]:
    """Partitions an EPUB document. The document is first converted to HTML and then
    partitioned using partition_html.

    Parameters
    ----------
    filename
        A string defining the target filename path.
    file
        A file-like object using "rb" mode --> open(filename, "rb").
    include_page_breaks
        If True, the output will include page breaks if the filetype supports it
    """
    html_text = convert_epub_to_html(filename=filename, file=file)
    # NOTE(robinson) - pypandoc returns a text string with unicode encoding
    # ref: https://github.com/JessicaTegner/pypandoc#usage
    return partition_html(
        text=html_text,
        include_page_breaks=include_page_breaks,
        encoding="unicode",
    )
|