haystack/docs/preprocessor.md

<a name="cleaning"></a>
# cleaning

<a name="__init__"></a>
# \_\_init\_\_

<a name="utils"></a>
# utils

<a name="utils.eval_data_from_file"></a>
#### eval\_data\_from\_file

```python
eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]
```

Read Documents and Labels from a SQuAD-style file.
The Documents and Labels can then be indexed in the DocumentStore and used for evaluation.

**Arguments**:

- `filename`: Path to a file in SQuAD format.

**Returns**:

(List of Documents, List of Labels)
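As a minimal sketch of the input this function expects, the snippet below writes a tiny SQuAD-style JSON file and wraps the call in a helper. The file contents, the `write_squad_file` and `load_eval_data` helpers, and the import path `haystack.preprocessor.utils` are assumptions made for illustration; adjust them to your installed version.

```python
import json
import tempfile
from pathlib import Path

# Minimal SQuAD-style payload: data -> paragraphs -> qas -> answers.
# The content is made up for illustration.
squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Berlin",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Germany?",
                            "is_impossible": False,
                            # answer_start is the character offset into context
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}


def write_squad_file(payload: dict, directory: str) -> str:
    """Hypothetical helper: dump the payload to <directory>/eval.json."""
    path = Path(directory) / "eval.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return str(path)


def load_eval_data(filename: str):
    """Hypothetical wrapper; the import path is an assumption."""
    from haystack.preprocessor.utils import eval_data_from_file

    # Returns (list of Documents, list of Labels) per the signature above.
    return eval_data_from_file(filename)


filename = write_squad_file(squad, tempfile.mkdtemp())
```

Calling `load_eval_data(filename)` should then yield the Documents and Labels, which can be written to a DocumentStore for evaluation.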

<a name="utils.convert_files_to_dicts"></a>
#### convert\_files\_to\_dicts

```python
convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False) -> List[dict]
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a
DocumentStore.

**Arguments**:

- `dir_path`: Path to the directory containing the files to convert.
- `clean_func`: A custom cleaning function that is applied to each document (input: str, output: str).
- `split_paragraphs`: Whether to split the text into paragraphs.

**Returns**:

A list of dicts that can be written to a DocumentStore.
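The `clean_func` argument accepts any callable that maps a string to a string. The sketch below shows one possible cleaning function that normalizes whitespace. The names `clean_whitespace` and `build_dicts` are hypothetical, and the import path `haystack.preprocessor.utils` is an assumption.

```python
import re
from typing import List


def clean_whitespace(text: str) -> str:
    """Example clean_func (str in, str out): normalize whitespace."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def build_dicts(dir_path: str) -> List[dict]:
    """Hypothetical wrapper; the import path is an assumption."""
    from haystack.preprocessor.utils import convert_files_to_dicts

    return convert_files_to_dicts(
        dir_path=dir_path,
        clean_func=clean_whitespace,
        split_paragraphs=True,
    )
```

The returned list can then be passed on to a DocumentStore for indexing.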

<a name="utils.tika_convert_files_to_dicts"></a>
#### tika\_convert\_files\_to\_dicts

```python
tika_convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True) -> List[dict]
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a
DocumentStore.

**Arguments**:

- `dir_path`: Path to the directory containing the files to convert.
- `clean_func`: A custom cleaning function that is applied to each document (input: str, output: str).
- `split_paragraphs`: Whether to split the text into paragraphs.

**Returns**:

A list of dicts that can be written to a DocumentStore.

<a name="utils.fetch_archive_from_http"></a>
#### fetch\_archive\_from\_http

```python
fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None)
```

Fetch an archive (zip or tar.gz) from a URL via HTTP and extract its contents to an output directory.

**Arguments**:

- `url` (str): HTTP address of the archive.
- `output_dir` (str): Local path to extract the archive to.
- `proxies` (dict): Proxy details in the format required by the requests library.

**Returns**:

bool indicating whether anything was fetched.
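As a rough sketch of how the `proxies` argument might be used, the snippet below builds a proxy mapping in the scheme-to-URL form that the requests library expects and passes it through. The proxy host, the default output directory, the `fetch_example` helper, and the import path `haystack.preprocessor.utils` are all assumptions for illustration.

```python
from typing import Optional

# Proxy mapping in the form requests expects: URL scheme -> proxy URL.
# The host and port are placeholders.
PROXIES = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}


def fetch_example(
    url: str,
    output_dir: str = "data/example",
    proxies: Optional[dict] = None,
) -> bool:
    """Hypothetical wrapper; the import path is an assumption."""
    from haystack.preprocessor.utils import fetch_archive_from_http

    # Per the docs above, the result says whether anything was fetched.
    return bool(fetch_archive_from_http(url=url, output_dir=output_dir, proxies=proxies))
```

Pass `proxies=PROXIES` when the machine reaches the internet through a proxy, and leave it as `None` otherwise.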