haystack/docs/file_converter.md

<a name="txt"></a>
# txt

<a name="txt.TextConverter"></a>
## TextConverter

```python
class TextConverter(BaseConverter)
```

<a name="txt.TextConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
Tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `remove_whitespace`: strip whitespace before and after each line in the text.
- `remove_empty_lines`: remove more than two empty lines in the text.
- `remove_header_footer`: use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: validate languages against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely that an encoding error has resulted
in garbled text.
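
As a rough illustration of what `remove_whitespace` and `remove_empty_lines` describe, the sketch below cleans a string with the standard library only. `clean_text` is a hypothetical helper, not part of the documented API, and it assumes that runs of blank lines are collapsed to a single blank line; the exact threshold the converters use is not stated here.

```python
import re


def clean_text(text: str, remove_whitespace: bool = True, remove_empty_lines: bool = True) -> str:
    """Hypothetical helper illustrating the two cleaning options."""
    if remove_whitespace:
        # Strip leading and trailing whitespace from every line.
        text = "\n".join(line.strip() for line in text.split("\n"))
    if remove_empty_lines:
        # Assumption: collapse three or more newlines into one blank line.
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text


raw = "  Title  \n\n\n\n  First paragraph.  \n"
cleaned = clean_text(raw)
print(repr(cleaned))
```

Applying the whitespace step first means that lines containing only spaces become truly empty before the blank-line step runs.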

<a name="docx"></a>
# docx

<a name="docx.DocxToTextConverter"></a>
## DocxToTextConverter

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.extract_pages"></a>
#### extract\_pages

```python
 | extract_pages(file_path: Path) -> Tuple[List[str], Optional[Dict[str, Any]]]
```

Extract text from a .docx file.
Note: As docx files don't contain "page" information, this method actually extracts and returns a list of paragraphs.
For consistency with the other converters, the method name is nevertheless kept.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
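
To make the paragraph-based extraction concrete, here is a minimal sketch that reads paragraphs from a .docx file using only the standard library. It is not the converter's implementation: `extract_paragraphs` is a hypothetical helper that relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(file_path: Path) -> List[str]:
    """Hypothetical helper: return the text of each paragraph in a .docx file."""
    with zipfile.ZipFile(file_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{W_NS}t")))
    return paragraphs
```

The file format stores paragraphs and runs but no page breaks as layout results, which is why a list of paragraphs is the natural unit to return. Nested structures such as text boxes are not handled specially in this sketch.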

<a name="__init__"></a>
# \_\_init\_\_

<a name="tika"></a>
# tika

<a name="tika.TikaConverter"></a>
## TikaConverter

```python
class TikaConverter(BaseConverter)
```

<a name="tika.TikaConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(tika_url: str = "http://localhost:9998/tika", remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `tika_url`: URL of the Tika server
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
Tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `remove_whitespace`: strip whitespace before and after each line in the text.
- `remove_empty_lines`: remove more than two empty lines in the text.
- `remove_header_footer`: use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: validate languages against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely that an encoding error has resulted
in garbled text.

<a name="tika.TikaConverter.extract_pages"></a>
#### extract\_pages

```python
 | extract_pages(file_path: Path) -> Tuple[List[str], Optional[Dict[str, Any]]]
```

**Arguments**:

- `file_path`: Path of file to be converted.

**Returns**:

A list of pages and the extracted metadata of the file.

<a name="base"></a>
# base

<a name="base.BaseConverter"></a>
## BaseConverter

```python
class BaseConverter()
```

Base class for implementing file converters that transform input documents to text format for ingestion into a DocumentStore.

<a name="base.BaseConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = None, remove_header_footer: Optional[bool] = None, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
Tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `remove_header_footer`: use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `remove_whitespace`: strip whitespace before and after each line in the text.
- `remove_empty_lines`: remove more than two empty lines in the text.
- `valid_languages`: validate languages against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely that an encoding error has resulted
in garbled text.
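
The `remove_numeric_tables` description leaves the heuristic itself open. One plausible version, shown here purely as a sketch, drops a line when most of its tokens contain digits and keeps lines that are mostly text. The function name `drop_numeric_rows` and the `max_numeric_ratio` threshold are assumptions made for illustration, not values taken from the library.

```python
def drop_numeric_rows(text: str, max_numeric_ratio: float = 0.4) -> str:
    """Hypothetical helper: drop lines in which most tokens contain digits."""
    kept = []
    for line in text.split("\n"):
        words = line.split()
        numeric = [w for w in words if any(ch.isdigit() for ch in w)]
        # Treat a line as a numeric table row when the share of tokens
        # containing digits exceeds the (assumed) threshold.
        if words and len(numeric) / len(words) > max_numeric_ratio:
            continue
        kept.append(line)
    return "\n".join(kept)


table = "Region Sales Returns\nNorth 10 12\nSouth 7 9\nNotes: figures are provisional"
print(drop_numeric_rows(table))
```

In this example the header row and the notes line survive because they contain no digits, while the two data rows are removed.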

<a name="base.BaseConverter.validate_language"></a>
#### validate\_language

```python
 | validate_language(text: str) -> bool
```

Validate whether the language of the text is one of the valid languages.
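
The check can be pictured as follows. This is a standalone sketch under stated assumptions: the language detector is passed in as a callable named `detect` (hypothetical; in practice a language detection library would fill this role), and an unset `valid_languages` list is assumed to accept every text.

```python
from typing import Callable, List, Optional


def validate_language(text: str,
                      valid_languages: Optional[List[str]],
                      detect: Callable[[str], str]) -> bool:
    """Hypothetical standalone version; `detect` returns an ISO 639-1 code."""
    if not valid_languages:
        # Assumption: no restriction configured means every text is accepted.
        return True
    try:
        language = detect(text)
    except Exception:
        # Detection can fail on empty or garbled input; treat that as invalid.
        return False
    return language in valid_languages


def toy_detect(text: str) -> str:
    """Toy detector for the example only; not a real language model."""
    return "de" if " und " in f" {text} " else "en"


print(validate_language("Katzen und Hunde", ["de"], toy_detect))
print(validate_language("cats and dogs", ["de"], toy_detect))
```

A `False` result is a hint that the extracted text may be garbled, for example because of an encoding error.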

<a name="base.BaseConverter.find_and_remove_header_footer"></a>
#### find\_and\_remove\_header\_footer

```python
 | find_and_remove_header_footer(pages: List[str], n_chars: int, n_first_pages_to_ignore: int, n_last_pages_to_ignore: int) -> Tuple[List[str], Optional[str], Optional[str]]
```

Heuristic to find footers and headers across different pages by searching for the longest common string.
For headers, only the first `n_chars` characters of each page are searched (for footers, the last `n_chars`).
Note: This heuristic uses exact matches and therefore works well for footers like "Copyright 2019 by XXX",
but won't detect "Page 3 of 4" or similar.

**Arguments**:

- `pages`: list of strings, one string per page
- `n_chars`: number of characters at the start/end of each page in which to search for the header/footer
- `n_first_pages_to_ignore`: number of first pages to ignore (e.g. TOCs often don't contain footer/header)
- `n_last_pages_to_ignore`: number of last pages to ignore

**Returns**:

(cleaned pages, found_header_str, found_footer_str)
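
The longest-common-string idea can be sketched with a brute-force search. This is not the library's implementation: `longest_common_string`, `strip_header_footer`, and the `min_length` cut-off are hypothetical names and values chosen for the example, and the search is only practical for small `n_chars`.

```python
from typing import List, Optional, Tuple


def longest_common_string(snippets: List[str], min_length: int = 5) -> Optional[str]:
    """Return the longest substring present in every snippet, or None."""
    if not snippets:
        return None
    shortest = min(snippets, key=len)
    # Try candidate substrings of the shortest snippet, longest first.
    for length in range(len(shortest), min_length - 1, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if candidate.strip() and all(candidate in s for s in snippets):
                return candidate
    return None


def strip_header_footer(pages: List[str], n_chars: int,
                        n_first_pages_to_ignore: int,
                        n_last_pages_to_ignore: int
                        ) -> Tuple[List[str], Optional[str], Optional[str]]:
    # Search only the pages that are not ignored at either end.
    sample = pages[n_first_pages_to_ignore:len(pages) - n_last_pages_to_ignore]
    header = longest_common_string([p[:n_chars] for p in sample])
    footer = longest_common_string([p[-n_chars:] for p in sample])
    cleaned = []
    for page in pages:
        if header:
            # Remove the first occurrence of the header.
            page = page.replace(header, "", 1)
        if footer:
            # Remove the last occurrence of the footer, if present.
            idx = page.rfind(footer)
            if idx != -1:
                page = page[:idx] + page[idx + len(footer):]
        cleaned.append(page)
    return cleaned, header, footer


pages = [
    "ACME Annual Report\nIntroduction text on page one\nCopyright 2019 by XXX",
    "ACME Annual Report\nResults are discussed on page two\nCopyright 2019 by XXX",
    "ACME Annual Report\nConclusions appear on page three\nCopyright 2019 by XXX",
]
cleaned, header, footer = strip_header_footer(pages, 25, 0, 0)
print(cleaned)
```

Because matching is exact, the common string also absorbs the newline next to the header and footer in this example.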

<a name="pdf"></a>
# pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
Tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `remove_whitespace`: strip whitespace before and after each line in the text.
- `remove_empty_lines`: remove more than two empty lines in the text.
- `remove_header_footer`: use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: validate languages against a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely that an encoding error has resulted
in garbled text.