Mirror of https://github.com/deepset-ai/haystack.git (synced 2025-07-23 17:00:41 +00:00)

<a id="base"></a>

# Module base

<a id="base.BaseConverter"></a>

## BaseConverter

```python
class BaseConverter(BaseComponent)
```

Base class for implementing file converters that transform input documents to text format for ingestion into a DocumentStore.

<a id="base.BaseConverter.__init__"></a>

#### BaseConverter.\_\_init\_\_

```python
def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `progress_bar`: Show a progress bar for the conversion.

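To make the `remove_numeric_tables` heuristic more concrete, here is a minimal sketch of how numeric rows might be filtered while rows containing prose are kept. The helper name `drop_numeric_rows`, the 40% threshold, and the rule that lines ending in a period count as prose are assumptions made for this illustration; they are not taken from this reference.

```python
from typing import List


def drop_numeric_rows(text: str, max_numeric_ratio: float = 0.4) -> str:
    """Remove lines that look like numeric table rows (hypothetical helper)."""
    kept: List[str] = []
    for line in text.splitlines():
        words = line.split()
        # A word counts as numeric if it contains at least one digit.
        numeric = [w for w in words if any(ch.isdigit() for ch in w)]
        mostly_numeric = bool(words) and len(numeric) / len(words) > max_numeric_ratio
        # Assumption: lines ending with a period are prose and are kept.
        if mostly_numeric and not line.strip().endswith("."):
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "Revenue by region\n2019 120.5 98.1 77.0\nEurope grew steadily in 2020."
cleaned = drop_numeric_rows(sample)
print(cleaned)
```

The row made up only of numbers is dropped, while the heading and the sentence that merely mentions a year are retained.
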
<a id="base.BaseConverter.convert"></a>

#### BaseConverter.convert

```python
@abstractmethod
def convert(file_path: Path,
            meta: Optional[Dict[str, Any]],
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "UTF-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Convert a file to a list of `Document` objects containing the text and any associated meta data.

File converters may extract file meta like name or size. In addition, user-supplied
meta data such as author, URL, or external IDs can be passed as a dictionary.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of meta data key-value pairs to append in the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `UTF-8`)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

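The effect of `id_hash_keys` can be illustrated with a small sketch. It derives an id by hashing only the attributes named in the list, so two documents with identical text get distinct ids once `"meta"` is included. The function `make_document_id` and the use of SHA-256 over a JSON payload are assumptions for this example; the hashing scheme the library actually uses is not described here.

```python
import hashlib
import json
from typing import Any, Dict, List


def make_document_id(content: str, meta: Dict[str, Any], id_hash_keys: List[str]) -> str:
    """Derive a deterministic id from the selected attributes (illustrative only)."""
    attributes = {"content": content, "meta": meta}
    # Serialize only the attributes named in id_hash_keys, with stable key order.
    payload = json.dumps([attributes[key] for key in id_hash_keys], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "Quarterly report"
id_a = make_document_id(text, {"source": "a.pdf"}, ["content"])
id_b = make_document_id(text, {"source": "b.pdf"}, ["content"])
id_c = make_document_id(text, {"source": "a.pdf"}, ["content", "meta"])
id_d = make_document_id(text, {"source": "b.pdf"}, ["content", "meta"])

print(id_a == id_b)  # same content and only "content" hashed: ids collide
print(id_c == id_d)  # metadata differs and is hashed too: ids differ
```
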
<a id="base.BaseConverter.validate_language"></a>

#### BaseConverter.validate\_language

```python
def validate_language(text: str,
                      valid_languages: Optional[List[str]] = None) -> bool
```

Validate that the language of the text is one of the valid languages.

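As a rough sketch of this check, the function below accepts any text when no restriction is set and otherwise compares a detected language code against the allowed list. The detector is passed in as a callable so the example stays self-contained; `is_valid_language` and `toy_detect` are hypothetical names, and the toy detector stands in for a real language-detection library.

```python
from typing import Callable, List, Optional


def is_valid_language(text: str,
                      valid_languages: Optional[List[str]],
                      detect: Callable[[str], str]) -> bool:
    """Return True if no restriction is set or the detected language is allowed."""
    if not valid_languages:
        return True
    try:
        language = detect(text)
    except Exception:
        # Detection can fail on empty or garbled input; treat that as invalid.
        return False
    return language in valid_languages


def toy_detect(text: str) -> str:
    """Stand-in detector for this example only: looks for one German stop word."""
    return "de" if " und " in f" {text.lower()} " else "en"


print(is_valid_language("Katzen und Hunde", ["de"], toy_detect))
print(is_valid_language("Cats and dogs", ["de"], toy_detect))
```
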
<a id="base.BaseConverter.run"></a>

#### BaseConverter.run

```python
def run(file_paths: Union[Path, List[Path]],
        meta: Optional[Union[Dict[str, str],
                             List[Optional[Dict[str, str]]]]] = None,
        remove_numeric_tables: Optional[bool] = None,
        known_ligatures: Dict[str, str] = KNOWN_LIGATURES,
        valid_languages: Optional[List[str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None)
```

Extract text from a file.

**Arguments**:

- `file_paths`: Path to the files you want to convert
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `known_ligatures`: Some converters tend to recognize clusters of letters as ligatures, such as "ﬀ" (double f).
Such ligatures however make text hard to compare with the content of other files,
which are generally ligature free. Therefore we automatically find and replace the most
common ligatures with their split counterparts. The default mapping is in
`haystack.nodes.file_converter.base.KNOWN_LIGATURES`: it is rather biased towards Latin alphabets
but excludes all ligatures that are known to be used in IPA.
You can use this parameter to provide your own set of ligatures to clean up from the documents.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Select the file encoding (default is `UTF-8`)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

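The find-and-replace step described for `known_ligatures` can be sketched as below. `SAMPLE_LIGATURES` is a small, hand-picked mapping assumed for this example; it is not the content of `KNOWN_LIGATURES`, and `replace_ligatures` is a hypothetical helper name.

```python
from typing import Dict

# Hypothetical subset of ligature mappings, written as Unicode escapes:
# U+FB00 "ff", U+FB01 "fi", U+FB02 "fl", U+FB03 "ffi", U+FB04 "ffl".
SAMPLE_LIGATURES: Dict[str, str] = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def replace_ligatures(text: str, ligatures: Dict[str, str]) -> str:
    """Replace each ligature character with its split counterpart."""
    for ligature, letters in ligatures.items():
        text = text.replace(ligature, letters)
    return text


raw = "The o\ufb03ce \ufb02oor is \ufb01nished"
print(replace_ligatures(raw, SAMPLE_LIGATURES))
```

Passing a custom dictionary of the same shape is how you would extend or narrow the clean-up for your own documents.
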
<a id="docx"></a>

# Module docx

<a id="docx.DocxToTextConverter"></a>

## DocxToTextConverter

```python
class DocxToTextConverter(BaseConverter)
```

<a id="docx.DocxToTextConverter.convert"></a>

#### DocxToTextConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text from a .docx file.

Note: As docx doesn't contain "page" information, we actually extract and return a list of paragraphs here.
For compliance with other converters we nevertheless opted for keeping the method's name.

**Arguments**:

- `file_path`: Path to the .docx file you want to convert
- `meta`: dictionary of meta data key-value pairs to append in the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="image"></a>

# Module image

<a id="image.ImageToTextConverter"></a>

## ImageToTextConverter

```python
class ImageToTextConverter(BaseConverter)
```

<a id="image.ImageToTextConverter.__init__"></a>

#### ImageToTextConverter.\_\_init\_\_

```python
def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = ["eng"],
             id_hash_keys: Optional[List[str]] = None)
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified here
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text. To list the available language packs, run
`print(pytesseract.get_languages(config=''))`.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="image.ImageToTextConverter.convert"></a>

#### ImageToTextConverter.convert

```python
def convert(file_path: Union[Path, str],
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text from an image file using the pytesseract library (https://github.com/madmaze/pytesseract)

**Arguments**:

- `file_path`: path to image file
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages supported by Tesseract
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="markdown"></a>

# Module markdown

<a id="markdown.MarkdownConverter"></a>

## MarkdownConverter

```python
class MarkdownConverter(BaseConverter)
```

<a id="markdown.MarkdownConverter.convert"></a>

#### MarkdownConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Reads text from a Markdown file and executes optional preprocessing steps.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of meta data key-value pairs to append in the returned document.
- `encoding`: Select the file encoding (default is `utf-8`)
- `remove_numeric_tables`: Not applicable
- `valid_languages`: Not applicable
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="markdown.MarkdownConverter.markdown_to_text"></a>

#### MarkdownConverter.markdown\_to\_text

```python
@staticmethod
def markdown_to_text(markdown_string: str) -> str
```

Converts a markdown string to plaintext

**Arguments**:

- `markdown_string`: String in markdown format

<a id="pdf"></a>

# Module pdf

<a id="pdf.PDFToTextConverter"></a>

## PDFToTextConverter

```python
class PDFToTextConverter(BaseConverter)
```

<a id="pdf.PDFToTextConverter.__init__"></a>

#### PDFToTextConverter.\_\_init\_\_

```python
def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             encoding: Optional[str] = "UTF-8")
```

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `encoding`: Encoding that will be passed as the `-enc` parameter to `pdftotext`.
Defaults to "UTF-8" in order to support special characters (e.g. German Umlauts, Cyrillic ...).
(See the list of available encodings, such as "Latin1", by running `pdftotext -listenc` in the terminal)

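To show how the `encoding` value might end up on the `pdftotext` command line, here is a minimal sketch that only assembles the argument list and does not run the tool. The helper `build_pdftotext_command`, the use of `-layout`, and the choice to write to stdout via `-` are assumptions for this example; only the `-enc` flag is taken from the description above.

```python
from pathlib import Path
from typing import List, Optional


def build_pdftotext_command(file_path: Path, encoding: Optional[str] = "UTF-8") -> List[str]:
    """Assemble a pdftotext invocation that writes the extracted text to stdout."""
    command = ["pdftotext"]
    if encoding:
        # Forward the requested output encoding, e.g. "UTF-8" or "Latin1".
        command += ["-enc", encoding]
    # Assumption: "-layout" keeps the physical layout; "-" sends output to stdout.
    command += ["-layout", str(file_path), "-"]
    return command


print(build_pdftotext_command(Path("report.pdf"), encoding="Latin1"))
```

The resulting list could be handed to `subprocess.run` on a machine where `pdftotext` is installed.
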
<a id="pdf.PDFToTextConverter.convert"></a>

#### PDFToTextConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text from a .pdf file using the pdftotext library (https://www.xpdfreader.com/pdftotext-man.html)

**Arguments**:

- `file_path`: Path to the .pdf file you want to convert
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Encoding that overrides `self.encoding` and will be passed as the `-enc` parameter to `pdftotext`.
(See the list of available encodings by running `pdftotext -listenc` in the terminal)
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="pdf.PDFToTextOCRConverter"></a>

## PDFToTextOCRConverter

```python
class PDFToTextOCRConverter(BaseConverter)
```

<a id="pdf.PDFToTextOCRConverter.__init__"></a>

#### PDFToTextOCRConverter.\_\_init\_\_

```python
def __init__(remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = ["eng"],
             id_hash_keys: Optional[List[str]] = None)
```

Extract text from an image file using the pytesseract library (https://github.com/madmaze/pytesseract)

**Arguments**:

- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages supported by Tesseract
(https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="pdf.PDFToTextOCRConverter.convert"></a>

#### PDFToTextOCRConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Convert a file to a list of `Document` objects containing the text and any associated meta data.

File converters may extract file meta like name or size. In addition, user-supplied
meta data such as author, URL, or external IDs can be passed as a dictionary.

**Arguments**:

- `file_path`: path of the file to convert
- `meta`: dictionary of meta data key-value pairs to append in the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also have long strings that could be possible candidates for searching answers.
The rows containing strings are thus retained in this option.
- `valid_languages`: validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="parsr"></a>

# Module parsr

<a id="parsr.ParsrConverter"></a>

## ParsrConverter

```python
class ParsrConverter(BaseConverter)
```

File converter that makes use of the open-source Parsr tool by axa-group
(https://github.com/axa-group/Parsr).
This Converter extracts both text and tables.
Supported file formats are: PDF, DOCX

<a id="parsr.ParsrConverter.__init__"></a>

#### ParsrConverter.\_\_init\_\_

```python
def __init__(parsr_url: str = "http://localhost:3001",
             extractor: Literal["pdfminer", "pdfjs"] = "pdfminer",
             table_detection_mode: Literal["lattice", "stream"] = "lattice",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             remove_page_headers: bool = False,
             remove_page_footers: bool = False,
             remove_table_of_contents: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None,
             add_page_number: bool = True)
```

**Arguments**:

- `parsr_url`: URL endpoint to Parsr's REST API.
- `extractor`: Backend used to extract textual structure from PDFs. ("pdfminer" or "pdfjs")
- `table_detection_mode`: Parsing method used to detect tables and their cells.
"lattice" detects tables and their cells by demarcated lines between cells.
"stream" detects tables and their cells by looking at whitespace between cells.
- `preceding_context_len`: Number of lines before a table to extract as preceding context
(will be returned as part of meta data).
- `following_context_len`: Number of lines after a table to extract as subsequent context
(will be returned as part of meta data).
- `remove_page_headers`: Whether to remove text that Parsr detected as a page header.
- `remove_page_footers`: Whether to remove text that Parsr detected as a page footer.
- `remove_table_of_contents`: Whether to remove text that Parsr detected as a table of contents.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `add_page_number`: Adds the number of the page a table occurs in to the Document's meta field
`"page"`.

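The roles of `preceding_context_len` and `following_context_len` can be pictured with a short sketch that slices the lines around a table. It assumes the page is available as a list of lines and that the table occupies a known index range; `table_context` and its `table_start`/`table_end` parameters are hypothetical and only serve the illustration.

```python
from typing import List, Tuple


def table_context(lines: List[str],
                  table_start: int,
                  table_end: int,
                  preceding_context_len: int = 3,
                  following_context_len: int = 3) -> Tuple[List[str], List[str]]:
    """Return the lines just before and just after a table at lines[table_start:table_end]."""
    # Clamp at zero so a table near the top of the page does not wrap around.
    before = lines[max(0, table_start - preceding_context_len):table_start]
    after = lines[table_end:table_end + following_context_len]
    return before, after


page = ["Intro", "Sales figures", "See table:", "| a | b |", "| 1 | 2 |", "Source: internal", "End"]
before, after = table_context(page, table_start=3, table_end=5,
                              preceding_context_len=2, following_context_len=1)
print(before)
print(after)
```
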
<a id="parsr.ParsrConverter.convert"></a>

#### ParsrConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, Any]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text and tables from a PDF or DOCX using the open-source Parsr tool.

**Arguments**:

- `file_path`: Path to the file you want to convert.
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: Not applicable.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `encoding`: Not applicable.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

<a id="azure"></a>

# Module azure

<a id="azure.AzureConverter"></a>

## AzureConverter

```python
class AzureConverter(BaseConverter)
```

File converter that makes use of Microsoft Azure's Form Recognizer service
(https://azure.microsoft.com/en-us/services/form-recognizer/).
This Converter extracts both text and tables.
Supported file formats are: PDF, JPEG, PNG, BMP and TIFF.

In order to be able to use this Converter, you need an active Azure account
and a Form Recognizer or Cognitive Services resource.
(Here you can find information on how to set this up:
https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/quickstarts/try-v3-python-sdk#prerequisites)

<a id="azure.AzureConverter.__init__"></a>

#### AzureConverter.\_\_init\_\_

```python
def __init__(endpoint: str,
             credential_key: str,
             model_id: str = "prebuilt-document",
             valid_languages: Optional[List[str]] = None,
             save_json: bool = False,
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             id_hash_keys: Optional[List[str]] = None,
             add_page_number: bool = True)
```

**Arguments**:

- `endpoint`: Your Form Recognizer or Cognitive Services resource's endpoint.
- `credential_key`: Your Form Recognizer or Cognitive Services resource's subscription key.
- `model_id`: The identifier of the model you want to use to extract information out of your file.
Default: "prebuilt-document". General purpose models are "prebuilt-document"
and "prebuilt-layout".
List of available prebuilt models:
https://azuresdkdocs.blob.core.windows.net/$web/python/azure-ai-formrecognizer/3.2.0b1/index.html#documentanalysisclient
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, then it is likely an encoding error resulting
in garbled text.
- `save_json`: Whether to save the output of the Form Recognizer to a JSON file.
- `preceding_context_len`: Number of lines before a table to extract as preceding context (will be returned as part of meta data).
- `following_context_len`: Number of lines after a table to extract as subsequent context (will be returned as part of meta data).
- `merge_multiple_column_headers`: Some tables contain more than one row as a column header (i.e., column description).
This parameter lets you choose whether to merge multiple column header
rows into a single row.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `add_page_number`: Adds the number of the page a table occurs in to the Document's meta field
`"page"`.

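What merging multiple column header rows could look like is sketched below: the cells of stacked header rows are joined column by column into one header row. The helper `merge_header_rows` and the single-space separator are assumptions for this example and may differ from what the converter actually produces.

```python
from typing import List


def merge_header_rows(header_rows: List[List[str]], separator: str = " ") -> List[str]:
    """Join the cells of stacked header rows column by column (illustrative only)."""
    merged: List[str] = []
    for column_cells in zip(*header_rows):
        # Skip empty cells so columns with a single header line get no stray separator.
        merged.append(separator.join(cell for cell in column_cells if cell))
    return merged


headers = [
    ["Revenue", "Revenue", "Costs"],
    ["2020", "2021", ""],
]
print(merge_header_rows(headers))
```
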
<a id="azure.AzureConverter.convert"></a>
|
|
|
|
#### AzureConverter.convert
|
|
|
|
```python
|
|
def convert(file_path: Path,
|
|
meta: Optional[Dict[str, Any]] = None,
|
|
remove_numeric_tables: Optional[bool] = None,
|
|
valid_languages: Optional[List[str]] = None,
|
|
encoding: Optional[str] = "utf-8",
|
|
id_hash_keys: Optional[List[str]] = None,
|
|
pages: Optional[str] = None,
|
|
known_language: Optional[str] = None) -> List[Document]
|
|
```
|
|
|
|
Extract text and tables from a PDF, JPEG, PNG, BMP or TIFF file using Azure's Form Recognizer service.

**Arguments**:

- `file_path`: Path to the file you want to convert.
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `remove_numeric_tables`: Not applicable.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, an encoding error has likely occurred, resulting
in garbled text.
- `encoding`: Not applicable.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.
- `pages`: Custom page numbers for multi-page documents (PDF/TIFF). Input the page numbers and/or ranges
of pages you want to get in the result. For a range of pages, use a hyphen,
like `pages="1-3, 5-6"`. Separate each page number or range with a comma.
- `known_language`: Locale hint of the input document.
See supported locales here: https://aka.ms/azsdk/formrecognizer/supportedlocales.

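
To make the `pages` string format concrete, here is a minimal sketch that expands such a string into explicit page numbers. The function `parse_pages` is hypothetical and only illustrates the format described above. How the string is actually interpreted is up to the Form Recognizer service.

```python
from typing import List


def parse_pages(pages: str) -> List[int]:
    """Expand a string such as "1-3, 5-6" into a list of page numbers."""
    result: List[int] = []
    for part in pages.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A hyphen marks an inclusive range of pages.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid page range: {part!r}")
            result.extend(range(start, end + 1))
        else:
            result.append(int(part))
    return result


selected = parse_pages("1-3, 5-6")
print(selected)  # [1, 2, 3, 5, 6]
```
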
<a id="azure.AzureConverter.convert_azure_json"></a>

#### AzureConverter.convert\_azure\_json

```python
def convert_azure_json(
        file_path: Path,
        meta: Optional[Dict[str, Any]] = None,
        valid_languages: Optional[List[str]] = None,
        id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Extract text and tables from the JSON output of Azure's Form Recognizer service.

**Arguments**:

- `file_path`: Path to the JSON file you want to convert.
- `meta`: Optional dictionary with metadata that shall be attached to all resulting documents.
Can be any custom keys and values.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, an encoding error has likely occurred, resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

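
The role of `id_hash_keys` can be illustrated with a short sketch: the id is a hash over the selected document attributes, so adding `"meta"` separates documents whose text is identical. The function `make_document_id` is hypothetical, and SHA-256 from the standard library is used here only as a stand-in; the library's real hashing scheme may differ.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_document_id(content: str,
                     meta: Dict[str, Any],
                     id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive an id from the attributes named in `id_hash_keys` (assumed default: content only)."""
    keys = id_hash_keys or ["content"]
    attributes = {"content": content, "meta": meta}
    unknown = [key for key in keys if key not in attributes]
    if unknown:
        raise ValueError(f"Unknown id_hash_keys: {unknown}")
    # Serialize deterministically so that equal inputs always give equal ids.
    payload = json.dumps({key: attributes[key] for key in keys}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "Total: 42"
id_a = make_document_id(text, {"name": "invoice_a.pdf"})
id_b = make_document_id(text, {"name": "invoice_b.pdf"})
id_a_meta = make_document_id(text, {"name": "invoice_a.pdf"}, ["content", "meta"])
id_b_meta = make_document_id(text, {"name": "invoice_b.pdf"}, ["content", "meta"])

print(id_a == id_b)            # True: same text, so the ids collide
print(id_a_meta == id_b_meta)  # False: the metadata now tells them apart
```
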
<a id="tika"></a>

# Module tika

<a id="tika.TikaConverter"></a>

## TikaConverter

```python
class TikaConverter(BaseConverter)
```

<a id="tika.TikaConverter.__init__"></a>

#### TikaConverter.\_\_init\_\_

```python
def __init__(tika_url: str = "http://localhost:9998/tika",
             remove_numeric_tables: bool = False,
             valid_languages: Optional[List[str]] = None,
             id_hash_keys: Optional[List[str]] = None)
```

**Arguments**:

- `tika_url`: URL of the Tika server.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, an encoding error has likely occurred, resulting
in garbled text.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

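
A heuristic of the kind described for `remove_numeric_tables` might look like the sketch below. It is an assumption for illustration, not the converter's confirmed logic: the function name `drop_numeric_rows`, the 40% threshold, and the rule that lines ending in a period are kept are all hypothetical choices.

```python
from typing import List


def drop_numeric_rows(lines: List[str], max_numeric_ratio: float = 0.4) -> List[str]:
    """Drop lines that consist mostly of number-like tokens."""
    kept = []
    for line in lines:
        words = line.split()
        # A word counts as numeric if it contains at least one digit.
        numeric = [word for word in words if any(ch.isdigit() for ch in word)]
        is_numeric_row = (
            bool(words)
            and len(numeric) / len(words) > max_numeric_ratio
            and not line.strip().endswith(".")  # sentences are kept
        )
        if not is_numeric_row:
            kept.append(line)
    return kept


page = [
    "Quarterly results by region",
    "2019 1.2 3.4 5.6",
    "Revenue of the northern region in millions 12",
    "Sales grew in 3 of 4 regions.",
]
cleaned = drop_numeric_rows(page)
print(cleaned)
```

Only the purely numeric second line is removed here; rows that are mostly text survive, which matches the intent of retaining rows that contain strings.
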
<a id="tika.TikaConverter.convert"></a>

#### TikaConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

**Arguments**:

- `file_path`: Path of the file to convert.
- `meta`: Dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, an encoding error has likely occurred, resulting
in garbled text.
- `encoding`: Not applicable.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

**Returns**:

A list of pages and the extracted metadata of the file.

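
The `valid_languages` check described in the arguments above amounts to detecting the language of the extracted text and comparing it with the allowed ISO 639-1 codes. The sketch below shows that idea with a pluggable detector. Both `validate_language` as written here and the deliberately naive `toy_detect` are assumptions for illustration; a real setup would plug in a proper language-detection library.

```python
from typing import Callable, List, Optional


def validate_language(text: str,
                      valid_languages: Optional[List[str]],
                      detect: Callable[[str], str]) -> bool:
    """Return True if no restriction is set or the detected language is allowed."""
    if not valid_languages:
        return True
    try:
        language = detect(text)
    except Exception:
        # Garbled text often cannot be classified at all.
        return False
    return language in valid_languages


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector based on a few stop words."""
    words = set(text.lower().split())
    if words & {"the", "and", "of"}:
        return "en"
    if words & {"der", "und", "die"}:
        return "de"
    raise ValueError("language could not be detected")


english_ok = validate_language("The quality of the scan", ["en"], toy_detect)
german_ok = validate_language("Der Hund und die Katze", ["en"], toy_detect)
garbled_ok = validate_language("Ã¤Ã¶ â€œ Ã¼", ["en", "de"], toy_detect)
print(english_ok, german_ok, garbled_ok)
```

A failed check is a hint that the file was decoded with the wrong encoding, not proof of it, so treating it as a warning is a reasonable design choice.
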
<a id="txt"></a>

# Module txt

<a id="txt.TextConverter"></a>

## TextConverter

```python
class TextConverter(BaseConverter)
```

<a id="txt.TextConverter.convert"></a>

#### TextConverter.convert

```python
def convert(file_path: Path,
            meta: Optional[Dict[str, str]] = None,
            remove_numeric_tables: Optional[bool] = None,
            valid_languages: Optional[List[str]] = None,
            encoding: Optional[str] = "utf-8",
            id_hash_keys: Optional[List[str]] = None) -> List[Document]
```

Reads text from a txt file and executes optional preprocessing steps.

**Arguments**:

- `file_path`: Path of the file to convert.
- `meta`: Dictionary of metadata key-value pairs to append to the returned document.
- `remove_numeric_tables`: This option uses heuristics to remove numeric rows from the tables.
The tabular structures in documents might be noise for the reader model if it
does not have table parsing capability for finding answers. However, tables
may also contain long strings that could be possible candidates for answers.
Rows containing strings are therefore retained when this option is enabled.
- `valid_languages`: Validate languages from a list of languages specified in the ISO 639-1
(https://en.wikipedia.org/wiki/ISO_639-1) format.
This option can be used to add a test for encoding errors. If the extracted text is
not in one of the valid languages, an encoding error has likely occurred, resulting
in garbled text.
- `encoding`: Select the file encoding (default is `utf-8`).
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the id will be generated by using the content and the defined metadata.

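
Why the `encoding` argument matters can be seen in a small standard-library sketch. The helper `read_pages` is hypothetical; ignoring undecodable bytes and splitting pages on the form feed character are assumptions made for this illustration, not confirmed behavior of the converter.

```python
import tempfile
from pathlib import Path
from typing import List


def read_pages(file_path: Path, encoding: str = "utf-8") -> List[str]:
    """Read a text file and split it into pages on form feed characters."""
    # errors="ignore" silently drops bytes that are invalid in the chosen encoding.
    text = file_path.read_text(encoding=encoding, errors="ignore")
    return text.split("\f")


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    # Write a two-page file in Latin-1, not UTF-8.
    sample.write_bytes("Café menu\fSecond page".encode("latin-1"))
    pages_latin1 = read_pages(sample, encoding="latin-1")
    pages_utf8 = read_pages(sample, encoding="utf-8")

print(pages_latin1)
print(pages_utf8)
```

Reading with the matching encoding preserves the accented character, while the mismatched read loses it. This is the kind of damage the `valid_languages` check is meant to surface.
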