From 84a25c73b3e3d80a0dc02f97876c9ef51f4a1c95 Mon Sep 17 00:00:00 2001
From: Malte Pietsch
Date: Thu, 2 Jul 2020 09:15:03 +0200
Subject: [PATCH] Update README.rst

---
 README.rst | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index 87b8c92d4..651bb15b3 100644
--- a/README.rst
+++ b/README.rst
@@ -245,8 +245,15 @@ You will find the Swagger API documentation at http://127.0.0.1:80/docs
 7. Indexing PDF files
 ---------------------
 
-Haystack has a customizable PDF text extraction pipeline with cleaning functions for header, footers, and tables. It supports complex document layouts with multi-column text.
+Haystack has basic converters to extract text from PDFs. While it's almost impossible to cover all types, layouts and special cases in PDFs, the implementation covers the most common formats and provides basic cleaning functions to remove header, footers, and tables. Multi-Column text layouts are also supported.
+The converters are easily extendable, so that you can customize them for your PDFs if needed.
 
-8. Development
+Example::
+
+    from haystack.indexing.file_converters.pdf import PDFToTextConverter
+    converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
+    pages = converter.extract_pages(file_path=file)
+
+8. Tests
 -------------------
 * Unit tests can be executed by running :code:`tox`.
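
Note on the cleaning step mentioned in the new README text: the patch says the converter can remove headers and footers but does not show how. The sketch below illustrates one plausible approach and is not Haystack's implementation. It assumes that a header or footer is a first or last line repeated verbatim on most pages. The function name ``strip_repeated_header_footer`` and the ``min_share`` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List


def strip_repeated_header_footer(pages: List[str], min_share: float = 0.6) -> List[str]:
    """Drop a first/last line that repeats on at least `min_share` of the pages.

    Hypothetical helper for illustration; not part of Haystack.
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return list(pages)

    split_pages = [page.splitlines() for page in pages]
    threshold = min_share * len(pages)

    # Count how often each first and last line occurs across all pages.
    first_lines = Counter(lines[0].strip() for lines in split_pages if lines)
    last_lines = Counter(lines[-1].strip() for lines in split_pages if lines)

    # Non-empty lines that occur often enough are treated as header/footer.
    headers = {text for text, n in first_lines.items() if text and n >= threshold}
    footers = {text for text, n in last_lines.items() if text and n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

Because the comparison is an exact string match, footers that vary per page (for example, ones containing a page number) are left untouched by this sketch. A real converter would likely need fuzzier matching for those.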