haystack/docs/v0.4.0/_src/api/api/preprocessor.md
Markus Paff · Add versioning docs (#495) · 2020-10-19

utils

eval_data_from_file

```python
eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]
```

Read Documents and Labels from a SQuAD-style file. The Documents and Labels can then be indexed to a DocumentStore and used for evaluation.

Arguments:

  • filename: Path to file in SQuAD format

Returns:

(List of Documents, List of Labels)
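The SQuAD format the function expects nests answers inside question–answer pairs inside paragraphs. A minimal, library-free sketch of that structure and of how such a file is typically traversed (the `read_squad` helper and the dict shapes are illustrative, not Haystack's actual Document/Label classes):

```python
import json
import tempfile
from pathlib import Path

# Illustrative SQuAD-style content: articles -> paragraphs -> question/answer pairs.
squad = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Haystack is a framework for question answering.",
            "qas": [{
                "id": "q1",
                "question": "What is Haystack?",
                "answers": [{"text": "a framework for question answering",
                             "answer_start": 12}],
            }],
        }],
    }]
}

def read_squad(filename: str):
    """Traverse a SQuAD-style file into flat document/label lists (sketch only)."""
    docs, labels = [], []
    for article in json.loads(Path(filename).read_text())["data"]:
        for para in article["paragraphs"]:
            docs.append({"text": para["context"]})
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    labels.append({"question": qa["question"],
                                   "answer": ans["text"],
                                   "offset": ans["answer_start"]})
    return docs, labels

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(squad, f)
docs, labels = read_squad(f.name)
```

Note that `answer_start` is a character offset into the paragraph's `context`, which is what makes the labels usable for span-based evaluation.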

convert_files_to_dicts

```python
convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False) -> List[dict]
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a DocumentStore.

Arguments:

  • dir_path: path to the directory containing the files to be converted
  • clean_func: a custom cleaning function that gets applied to each extracted document text (input: str, output: str)
  • split_paragraphs: whether to split the text into paragraphs, producing one dict per paragraph

Returns:

List of dicts that can be written to a DocumentStore
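For plain .txt files the conversion can be sketched with the standard library alone. The dict shape used here (`{"text": ..., "meta": ...}`) mirrors what a DocumentStore expects, though the helper name and exact keys are illustrative:

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

def txt_files_to_dicts(dir_path: str,
                       clean_func: Optional[Callable[[str], str]] = None,
                       split_paragraphs: bool = False) -> List[dict]:
    """Sketch of the conversion for .txt files only (no PDF handling)."""
    dicts = []
    for path in sorted(Path(dir_path).glob("**/*.txt")):
        text = path.read_text()
        if clean_func:
            text = clean_func(text)
        # With split_paragraphs, blank lines delimit paragraphs -> one dict each.
        if split_paragraphs:
            chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        else:
            chunks = [text]
        for chunk in chunks:
            dicts.append({"text": chunk, "meta": {"name": path.name}})
    return dicts

tmp = Path(tempfile.mkdtemp())
(tmp / "doc.txt").write_text("First paragraph.\n\nsecond paragraph.")
dicts = txt_files_to_dicts(str(tmp), clean_func=str.strip, split_paragraphs=True)
```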

tika_convert_files_to_dicts

```python
tika_convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True) -> List[dict]
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a DocumentStore, using Apache Tika for text extraction.

Arguments:

  • dir_path: path to the directory containing the files to be converted
  • clean_func: a custom cleaning function that gets applied to each extracted document text (input: str, output: str)
  • split_paragraphs: whether to split the text into paragraphs, producing one dict per paragraph
  • merge_short: whether to merge short paragraphs into the previous one
  • merge_lowercase: whether to merge paragraphs starting with a lowercase letter into the previous one

Returns:

List of dicts that can be written to a DocumentStore
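The two merge flags are easiest to see in isolation. A minimal sketch of such a merging heuristic (the 40-character threshold for "short" is an assumed value for illustration, not the library's):

```python
from typing import List

def merge_paragraphs(paragraphs: List[str],
                     merge_short: bool = True,
                     merge_lowercase: bool = True,
                     short_len: int = 40) -> List[str]:
    """Merge a paragraph into its predecessor when the predecessor is short,
    or when the paragraph starts with a lowercase letter (sketch only)."""
    merged: List[str] = []
    for para in paragraphs:
        join = bool(merged) and (
            (merge_short and len(merged[-1]) < short_len)
            or (merge_lowercase and para[:1].islower())
        )
        if join:
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged

paras = ["A heading", "continues in lowercase after a bad line break.",
         "This paragraph is clearly long enough to stand on its own as one unit."]
result = merge_paragraphs(paras)
```

Heuristics like this repair the spurious paragraph breaks that PDF extraction often produces.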

fetch_archive_from_http

```python
fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None)
```

Fetch an archive (zip or tar.gz) from a URL via HTTP and extract its content to an output directory.

Arguments:

  • url: http address of the archive
  • output_dir: local path to extract the content to
  • proxies: proxy details as required by the requests library

Returns:

bool indicating whether anything was fetched
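A library-free sketch of the fetch-and-extract flow, using urllib instead of requests (so the proxies argument is omitted); the helper name and return convention here are illustrative:

```python
import tarfile
import zipfile
from pathlib import Path
from urllib.request import urlopen

def fetch_archive(url: str, output_dir: str) -> bool:
    """Download a .zip or .tar.gz archive and extract it into output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    archive = out / Path(url).name
    # Download the raw archive bytes.
    with urlopen(url) as resp, open(archive, "wb") as f:
        f.write(resp.read())
    # Extract according to the archive type.
    if archive.suffix == ".zip":
        with zipfile.ZipFile(archive) as z:
            z.extractall(out)
    elif archive.name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(archive, "r:gz") as t:
            t.extractall(out)
    else:
        return False  # unsupported archive type
    return True
```

Since `urlopen` also accepts `file://` URLs, the same helper can be exercised against a local archive without network access.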