mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-09-06 23:03:54 +00:00

<a name="txt"></a>
# txt

<a name="txt.TextConverter"></a>
## TextConverter

```python
class TextConverter(BaseConverter)
```

<a name="txt.TextConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table-parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained by this option.
- `remove_whitespace`: Strip whitespace before and after each line in the text.
- `remove_empty_lines`: Remove runs of more than two consecutive empty lines in the text.
- `remove_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely an encoding error that resulted
in garbled text.
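The whitespace and empty-line options above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Haystack's actual implementation; the function name `clean_text` is an assumption for this example:

```python
import re


def clean_text(text: str, remove_whitespace: bool = True, remove_empty_lines: bool = True) -> str:
    """Illustrative sketch of the cleaning options documented above."""
    if remove_whitespace:
        # strip whitespace before and after each line
        text = "\n".join(line.strip() for line in text.split("\n"))
    if remove_empty_lines:
        # collapse runs of more than two consecutive newlines down to a single blank line
        text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

The `remove_numeric_tables` and `remove_header_footer` options are heuristic rather than rule-based, so they are discussed with their own methods below.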
<a name="docx"></a>
# docx

<a name="docx.DocxToTextConverter"></a>
## DocxToTextConverter

```python
class DocxToTextConverter(BaseConverter)
```

<a name="docx.DocxToTextConverter.extract_pages"></a>
#### extract\_pages

```python
 | extract_pages(file_path: Path) -> Tuple[List[str], Optional[Dict[str, Any]]]
```

Extract text from a .docx file.
Note: As docx files don't contain "page" information, we actually extract and return a list of paragraphs here.
For consistency with the other converters, we nevertheless kept the method name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
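Since a .docx file is a ZIP archive of XML, the paragraph extraction described above can be sketched with the standard library alone. This is a simplified illustration of the idea, not the converter's implementation; the helper name `extract_paragraphs` is an assumption, and the namespace URI is the standard WordprocessingML one:

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(file_path: str) -> List[str]:
    """Read word/document.xml from the .docx ZIP and return one string per paragraph."""
    with zipfile.ZipFile(file_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # a paragraph's text is spread over one or more <w:t> text runs
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```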
<a name="__init__"></a>
# \_\_init\_\_

<a name="tika"></a>
# tika

<a name="tika.TikaConverter"></a>
## TikaConverter

```python
class TikaConverter(BaseConverter)
```

<a name="tika.TikaConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(tika_url: str = "http://localhost:9998/tika", remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `tika_url`: URL of the Tika server
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table-parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained by this option.
- `remove_whitespace`: Strip whitespace before and after each line in the text.
- `remove_empty_lines`: Remove runs of more than two consecutive empty lines in the text.
- `remove_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely an encoding error that resulted
in garbled text.
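Under the hood, a Tika server extracts text via a plain HTTP call: the file bytes are sent with PUT to the `/tika` endpoint, and the `Accept` header selects the output format. A minimal sketch with the standard library (the endpoint and header follow Tika's documented server API; the helper name and sample bytes are assumptions, and nothing here is Haystack code):

```python
import urllib.request


def build_tika_request(tika_url: str, file_bytes: bytes) -> urllib.request.Request:
    """Build (but don't send) a PUT request asking the Tika server for plain-text output."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # request plain text rather than XHTML
    )


# illustrative only -- real file bytes would come from reading the input file
req = build_tika_request("http://localhost:9998/tika", b"sample file bytes")
```

Sending the request with `urllib.request.urlopen(req)` would return the extracted text; that step is omitted here since it requires a running Tika server.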
<a name="tika.TikaConverter.extract_pages"></a>
#### extract\_pages

```python
 | extract_pages(file_path: Path) -> Tuple[List[str], Optional[Dict[str, Any]]]
```

**Arguments**:

- `file_path`: Path of the file to be converted.

**Returns**:

A list of pages and the extracted metadata of the file.
<a name="base"></a>
# base

<a name="base.BaseConverter"></a>
## BaseConverter

```python
class BaseConverter()
```

Base class for implementing file converters that transform input documents into text format for ingestion into a DocumentStore.

<a name="base.BaseConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = None, remove_header_footer: Optional[bool] = None, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table-parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained by this option.
- `remove_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `remove_whitespace`: Strip whitespace before and after each line in the text.
- `remove_empty_lines`: Remove runs of more than two consecutive empty lines in the text.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely an encoding error that resulted
in garbled text.
<a name="base.BaseConverter.validate_language"></a>
#### validate\_language

```python
 | validate_language(text: str) -> bool
```

Validate whether the language of the text is one of the configured valid languages.
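The shape of `validate_language` can be illustrated with a toy detector. Haystack's real implementation relies on a language-detection library; the stopword-counting `detect_language` below is a deliberately simplified stand-in for such a library, and every name and threshold in it is hypothetical:

```python
from typing import List, Optional

# crude stand-in for a real detector such as langdetect
_EN_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is"}


def detect_language(text: str) -> Optional[str]:
    """Guess 'en' if enough common English function words appear, else None."""
    tokens = text.lower().split()
    if not tokens:
        return None
    hits = sum(1 for t in tokens if t in _EN_STOPWORDS)
    return "en" if hits / len(tokens) > 0.1 else None


def validate_language(text: str, valid_languages: List[str]) -> bool:
    """Return True if the detected language is in the configured list."""
    return detect_language(text) in valid_languages
```

If the text fails validation despite being a real document, the likely cause is the garbled-text encoding error described under `valid_languages` above.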
<a name="base.BaseConverter.find_and_remove_header_footer"></a>
#### find\_and\_remove\_header\_footer

```python
 | find_and_remove_header_footer(pages: List[str], n_chars: int, n_first_pages_to_ignore: int, n_last_pages_to_ignore: int) -> Tuple[List[str], Optional[str], Optional[str]]
```

Heuristic to find footers and headers across different pages by searching for the longest common string.
For headers, we only search within the first n_chars characters (for footers: the last n_chars).
Note: This heuristic uses exact matches and therefore works well for footers like "Copyright 2019 by XXX",
but won't detect "Page 3 of 4" or similar.

**Arguments**:

- `pages`: list of strings, one string per page
- `n_chars`: number of first/last characters in which to search for the header/footer
- `n_first_pages_to_ignore`: number of first pages to ignore (e.g. TOCs often don't contain a footer/header)
- `n_last_pages_to_ignore`: number of last pages to ignore

**Returns**:

(cleaned pages, found_header_str, found_footer_str)
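The longest-common-string heuristic is straightforward to sketch for the header case: the candidate is the longest common prefix of the first `n_chars` of every page, stripped only if it is long enough to plausibly be a header. A minimal stdlib version (the function name, `min_len` threshold, and defaults are illustrative assumptions, not the library's code):

```python
from os.path import commonprefix
from typing import List, Optional, Tuple


def find_and_remove_header(
    pages: List[str], n_chars: int = 50, min_len: int = 5
) -> Tuple[List[str], Optional[str]]:
    """Find the longest common prefix of all pages and strip it if long enough."""
    header = commonprefix([p[:n_chars] for p in pages])
    if len(header) < min_len:  # too short to be a real header
        return pages, None
    cleaned = [p[len(header):] if p.startswith(header) else p for p in pages]
    return cleaned, header
```

Footer detection works the same way on the reversed last `n_chars` of each page, and pages skipped via `n_first_pages_to_ignore` / `n_last_pages_to_ignore` would simply be excluded from the `commonprefix` input.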
<a name="pdf"></a>
# pdf

<a name="pdf.PDFToTextConverter"></a>
## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a name="pdf.PDFToTextConverter.__init__"></a>
#### \_\_init\_\_

```python
 | __init__(remove_numeric_tables: Optional[bool] = False, remove_whitespace: Optional[bool] = None, remove_empty_lines: Optional[bool] = None, remove_header_footer: Optional[bool] = None, valid_languages: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from tables.
The tabular structures in documents might be noise for the reader model if it
does not have table-parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained by this option.
- `remove_whitespace`: Strip whitespace before and after each line in the text.
- `remove_empty_lines`: Remove runs of more than two consecutive empty lines in the text.
- `remove_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to test for encoding errors. If the extracted text is
not in one of the valid languages, it is likely an encoding error that resulted
in garbled text.
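The `remove_numeric_tables` heuristic described above (drop digit-heavy rows, but keep rows that contain longer strings that could hold answers) can be approximated as follows. The digit-ratio threshold and word-length cutoff are assumptions chosen for illustration, not Haystack's actual values:

```python
from typing import List


def remove_numeric_rows(lines: List[str], digit_ratio: float = 0.4, min_word_len: int = 10) -> List[str]:
    """Drop lines that are mostly digits unless they contain a long string token."""
    kept = []
    for line in lines:
        words = line.split()
        if not words:
            kept.append(line)
            continue
        digits = sum(c.isdigit() for c in line)
        chars = sum(not c.isspace() for c in line)
        has_long_string = any(len(w) >= min_word_len and not w.isdigit() for w in words)
        # keep the row if it isn't digit-heavy, or if it holds a long answer-candidate string
        if digits / chars <= digit_ratio or has_long_string:
            kept.append(line)
    return kept
```

A purely numeric table row such as `"12 34 56 78 90"` is dropped, while a prose line with a stray figure survives because its digit ratio stays low.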