mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-19 06:52:56 +00:00

425 lines
20 KiB
Markdown
<a name="base"></a>

# Module base

<a name="base.BaseConverter"></a>

## BaseConverter

```python
class BaseConverter(BaseComponent)
```

Base class for implementing file converters that transform input documents into a text format for ingestion into a DocumentStore.

<a name="base.BaseConverter.__init__"></a>

#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
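
The `remove_numeric_tables` heuristic described above can be sketched roughly as follows. This is an illustration only, not the actual implementation; the function name and the share-of-digits threshold are assumptions:

```python
def remove_numeric_rows(table_rows, digit_share_threshold=0.5):
    """Drop table rows dominated by numeric cells; keep string-heavy rows.

    Illustrative sketch -- the real heuristic may differ in detail.
    """
    kept = []
    for row in table_rows:
        if not row:
            continue
        numeric = sum(
            1 for cell in row if cell.replace(".", "", 1).lstrip("-").isdigit()
        )
        if numeric / len(row) < digit_share_threshold:
            kept.append(row)  # rows containing mostly strings are retained
    return kept

filtered = remove_numeric_rows([["name", "alice"], ["1", "2"]])
```

Here the all-numeric row is dropped while the string row survives, matching the rationale given for the parameter.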

<a name="base.BaseConverter.convert"></a>

#### convert

```python
| @abstractmethod
| convert(file_path: Path, meta: Optional[Dict[str, str]], remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8") -> List[Dict[str, Any]]
```

Convert a file to a dictionary containing the text and any associated metadata.

File converters may extract file metadata like name or size. In addition, user-supplied
metadata like author, URL, or external IDs can be passed as a dictionary.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)
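
The shape of the returned documents and the merging of extracted and user-supplied metadata can be illustrated with plain dictionaries (a sketch; `build_document` is a hypothetical helper, and the exact keys depend on the concrete converter):

```python
def build_document(text, file_meta, user_meta=None):
    """Illustrative: merge converter-extracted meta with user-supplied meta."""
    meta = {**(file_meta or {}), **(user_meta or {})}
    return {"text": text, "meta": meta}

# a converter returns a list of such dictionaries
doc = build_document("Page one", {"name": "report.pdf"}, {"author": "Jane"})
```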

<a name="base.BaseConverter.validate_language"></a>

#### validate\_language

```python
| validate_language(text: str, valid_languages: Optional[List[str]] = None) -> bool
```

Validate whether the language of the text is one of the valid languages.
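
The logic behind such a check can be sketched as follows. `detect` stands in for a language-detection call (e.g. from a library such as `langdetect`) and is injected here purely for illustration; this is not the actual implementation:

```python
def is_valid_language(text, valid_languages, detect):
    """Return True when no constraint is given or the detected language is allowed."""
    if not valid_languages:
        return True  # nothing to validate against
    try:
        return detect(text) in valid_languages
    except Exception:
        return False  # treat detection failure as a validation failure
```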

<a name="docx"></a>

# Module docx

<a name="docx.DocxToTextConverter"></a>

## DocxToTextConverter

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = None) -> List[Dict[str, Any]]
```

Extract text from a .docx file.
Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here.
For consistency with the other converters, we nevertheless opted to keep the method's name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable
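
To see why paragraphs, not pages, are the natural unit here: a .docx file is a zip archive whose text lives in `word/document.xml` as `w:p` (paragraph) elements. The following stdlib-only sketch is not how `DocxToTextConverter` is implemented (it relies on a dedicated library), but it illustrates the file structure:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(file_like):
    """Read paragraph texts from word/document.xml inside a .docx zip."""
    with zipfile.ZipFile(file_like) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W}p"):  # w:p = one paragraph
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))  # w:t = text runs
        if text:
            paragraphs.append(text)
    return paragraphs

# build a minimal in-memory .docx to demonstrate (illustration only)
_xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>World</w:t></w:r></w:p></w:body></w:document>')
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as z:
    z.writestr("word/document.xml", _xml)
paras = docx_paragraphs(_buf)
```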

<a name="image"></a>

# Module image

<a name="image.ImageToTextConverter"></a>

## ImageToTextConverter

```python
class ImageToTextConverter(BaseConverter)
```

<a name="image.ImageToTextConverter.__init__"></a>

#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = ["eng"])
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages supported by Tesseract
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text. Run the following line of code to check the available language packs:

    # List of available languages
    print(pytesseract.get_languages(config=''))

<a name="image.ImageToTextConverter.convert"></a>

#### convert

```python
| convert(file_path: Union[Path, str], meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8") -> List[Dict[str, Any]]
```

Extract text from an image file using the pytesseract library (https://github.com/madmaze/pytesseract)

**Arguments**:

- `file_path`: path to the image file
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages supported by Tesseract
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.

<a name="markdown"></a>

# Module markdown

<a name="markdown.MarkdownConverter"></a>

## MarkdownConverter

```python
class MarkdownConverter(BaseConverter)
```

<a name="markdown.MarkdownConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8") -> List[Dict[str, Any]]
```

Reads text from a Markdown file and executes optional preprocessing steps.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `encoding`: Select the file encoding (default is `utf-8`)
- `remove_numeric_tables`: Not applicable
- `valid_languages`: Not applicable

**Returns**:

Dict of format {"text": "The text from file", "meta": meta}

<a name="markdown.MarkdownConverter.markdown_to_text"></a>

#### markdown\_to\_text

```python
| @staticmethod
| markdown_to_text(markdown_string: str) -> str
```

Converts a Markdown string to plaintext

**Arguments**:

- `markdown_string`: String in Markdown format
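
A very rough picture of what such a conversion does can be given with a few regular expressions. This sketch is not the library's implementation (which typically renders the Markdown and strips the resulting HTML); it only approximates the effect on common syntax:

```python
import re

def markdown_to_text_sketch(markdown_string):
    """Strip common Markdown syntax to approximate plaintext (illustration only)."""
    text = re.sub(r"`{3}.*?`{3}", "", markdown_string, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)           # heading markers
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)                 # links -> link text
    text = re.sub(r"[*_`]", "", text)                                    # emphasis / inline code
    return text
```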

<a name="pdf"></a>

# Module pdf

<a name="pdf.PDFToTextConverter"></a>

## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>

#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.

<a name="pdf.PDFToTextConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "Latin1") -> List[Dict[str, Any]]
```

Extract text from a .pdf file using the pdftotext library (https://www.xpdfreader.com/pdftotext-man.html)

**Arguments**:

- `file_path`: Path to the .pdf file you want to convert
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Encoding that will be passed as the -enc parameter to pdftotext. "Latin 1" is the default
encoding of pdftotext. While this works well on many PDFs, you might need to switch to "UTF-8" or
others if your document contains special characters (e.g. German umlauts, Cyrillic characters, ...).
Note: With "UTF-8" we experienced cases where a simple "fi" gets wrongly parsed as
"xef\xac\x81c" (see test cases). That's why we keep "Latin 1" as the default here.
(See the list of available encodings by running `pdftotext -listenc` in the terminal)
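
The "fi" artifact mentioned in the note comes from the typographic ligature ﬁ (U+FB01), which many PDF fonts use as a single glyph. A quick illustration (not converter code) of why those bytes appear and how Unicode normalization relates the ligature to the two plain letters:

```python
import unicodedata

ligature = "\ufb01"                              # "fi" as a single glyph (U+FB01)
utf8_bytes = ligature.encode("utf-8")            # the byte sequence from the note above
plain = unicodedata.normalize("NFKC", ligature)  # compatibility-decomposes to "fi"
```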

<a name="pdf.PDFToTextOCRConverter"></a>

## PDFToTextOCRConverter

```python
class PDFToTextOCRConverter(BaseConverter)
```

<a name="pdf.PDFToTextOCRConverter.__init__"></a>

#### \_\_init\_\_

```python
| __init__(remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = ["eng"])
```

Extract text from image files using the pytesseract library (https://github.com/madmaze/pytesseract)

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages supported by Tesseract
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.

<a name="pdf.PDFToTextOCRConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8") -> List[Dict[str, Any]]
```

Convert a file to a dictionary containing the text and any associated metadata.

File converters may extract file metadata like name or size. In addition, user-supplied
metadata like author, URL, or external IDs can be passed as a dictionary.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)

<a name="tika"></a>

# Module tika

<a name="tika.TikaConverter"></a>

## TikaConverter

```python
class TikaConverter(BaseConverter)
```

<a name="tika.TikaConverter.__init__"></a>

#### \_\_init\_\_

```python
| __init__(tika_url: str = "http://localhost:9998/tika", remove_numeric_tables: bool = False, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `tika_url`: URL of the Tika server
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
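
The default `tika_url` implies a Tika server listening locally on port 9998. As a setup fragment (an assumption to adapt to your environment; verify the image name and tag for your deployment), one common way to start such a server is via the official Docker image:

```shell
# start a local Apache Tika server on the port TikaConverter expects by default
docker run -d -p 9998:9998 apache/tika
```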

<a name="tika.TikaConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = None) -> List[Dict[str, Any]]
```

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable

**Returns**:

A list of pages and the extracted metadata of the file.

<a name="txt"></a>

# Module txt

<a name="txt.TextConverter"></a>

## TextConverter

```python
class TextConverter(BaseConverter)
```

<a name="txt.TextConverter.convert"></a>

#### convert

```python
| convert(file_path: Path, meta: Optional[Dict[str, str]] = None, remove_numeric_tables: Optional[bool] = None, valid_languages: Optional[List[str]] = None, encoding: Optional[str] = "utf-8") -> List[Dict[str, Any]]
```

Reads text from a .txt file and executes optional preprocessing steps.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained with this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add tests for encoding errors. If the extracted text is
not one of the valid languages, it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`)

**Returns**:

Dict of format {"text": "The text from file", "meta": meta}
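
The encoding-error failure mode that `valid_languages` is meant to catch can be illustrated directly (illustration only, not converter code): a UTF-8 file read with the wrong codec decodes "successfully" but yields mojibake, which a language check can then flag.

```python
raw = "café".encode("utf-8")      # UTF-8 bytes on disk: b'caf\xc3\xa9'
garbled = raw.decode("latin-1")   # wrong codec: no error raised, just garbled text
correct = raw.decode("utf-8")     # right codec recovers the original string
```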