mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-11-03 11:19:57 +00:00
Updated the example code in readme for Indexing PDF / Docx files (#502)
* Updated the example code to Indexing PDF / Docx files The example code was referencing a structure haystack.indexing which does not exist anymore. Modified this and the function "extract_pages" with "convert" * Update converter example in readme Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
This commit is contained in:
parent
3434d5205d
commit
dc16258dab
22
README.rst
22
README.rst
@ -284,15 +284,15 @@ Example:
|
||||
.. code-block:: python
|
||||
|
||||
#PDF
|
||||
from haystack.indexing.file_converters.pdf import PDFToTextConverter
|
||||
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
|
||||
pages = converter.extract_pages(file_path=file)
|
||||
# => list of str, one per page
|
||||
from haystack.file_converter.pdf import PDFToTextConverter
|
||||
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
|
||||
doc = converter.convert(file_path=file, meta=None)
|
||||
# => {"text": "text first page \f text second page ...", "meta": None}
|
||||
#DOCX
|
||||
from haystack.indexing.file_converters.docx import DocxToTextConverter
|
||||
converter = DocxToTextConverter()
|
||||
paragraphs = converter.extract_pages(file_path=file)
|
||||
# => list of str, one per paragraph (as docx has no direct notion of pages)
|
||||
from haystack.file_converter.docx import DocxToTextConverter
|
||||
converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
|
||||
doc = converter.convert(file_path=file, meta=None)
|
||||
# => {"text": "some text", "meta": None}
|
||||
|
||||
Advanced document convertion is enabled by leveraging mature text extraction library `Apache Tika <https://tika.apache.org/>`_, which is mostly written in Java. Although it's possible to call Tika API from Python, the current :code:`TikaConverter` only supports RESTful call to a Tika server running at localhost. One may either run Tika as a REST service at port 9998 (default), or to start a `docker container for Tika <https://hub.docker.com/r/apache/tika/tags>`_. The latter is recommended, as it's easily scalable, and does not require setting up any Java runtime environment. What's more, future update is also taken care of by docker.
|
||||
Either way, TikaConverter makes RESTful calls to convert any document format supported by Tika. Example code can be found at :code:`indexing/file_converters/utils.py`'s :code:`tika_convert)_files_to_dicts` function:
|
||||
@ -312,9 +312,9 @@ If you feel adventurous, Tika even supports some image OCR with Tesseract, or ob
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
converter = TikaConverter(remove_header_footer=True)
|
||||
pages = converter.extract_pages(file_path=path)
|
||||
pages, meta = converter.extract_pages(file_path=path, return_meta=True)
|
||||
converter = TikaConverter(tika_url: str = "http://localhost:9998/tika")
|
||||
doc = converter.convert(file_path=path)
|
||||
# => {"text": "text first page \f text second page ...", "meta": {"Content-Type": 'application/pdf', "Last-Modified":...}}
|
||||
|
||||
Contributing
|
||||
=============
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user