utils
eval_data_from_file
eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]
Read Documents and Labels from a SQuAD-style file. The Documents and Labels can then be indexed in the DocumentStore and used for evaluation.
Arguments:
filename
: Path to file in SQuAD format
Returns:
(List of Documents, List of Labels)
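A minimal usage sketch: load a SQuAD-formatted eval file and inspect the result. The file path is a placeholder, and the import path is an assumption, as the module location has moved between Haystack releases.

from haystack.preprocessor.utils import eval_data_from_file  # exact module path varies by release

# placeholder path; any SQuAD-formatted JSON file works
docs, labels = eval_data_from_file("data/dev-v2.0.json")
print(f"Loaded {len(docs)} documents and {len(labels)} labels")

The returned Documents and Labels can then be written to a DocumentStore for evaluation.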
convert_files_to_dicts
convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False) -> List[dict]
Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a Document Store.
Arguments:
dir_path
: path to the directory containing the files to be converted and written to the Document Store
clean_func
: a custom cleaning function that gets applied to each doc (input: str, output: str)
split_paragraphs
: whether to split each text into paragraphs
Returns:
List of dicts, one per document, that can be written to a Document Store
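A short sketch showing the clean_func contract (a plain str-to-str function). The directory path is a placeholder, and the import path is assumed, as above.

from haystack.preprocessor.utils import convert_files_to_dicts  # exact module path varies by release

def strip_whitespace(text: str) -> str:
    # clean_func contract: takes the raw text (str), returns the cleaned text (str)
    return text.strip()

# "data/my_docs" is a placeholder directory containing .txt/.pdf files
dicts = convert_files_to_dicts(dir_path="data/my_docs",
                               clean_func=strip_whitespace,
                               split_paragraphs=True)
# the resulting dicts can be passed to DocumentStore.write_documents()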
tika_convert_files_to_dicts
tika_convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True) -> List[dict]
Convert all files (.txt, .pdf) in the sub-directories of the given path to Python dicts that can be written to a Document Store, using Apache Tika to parse the files.
Arguments:
dir_path
: path to the directory containing the files to be converted and written to the Document Store
clean_func
: a custom cleaning function that gets applied to each doc (input: str, output: str)
split_paragraphs
: whether to split each text into paragraphs
merge_short
: whether to merge short paragraphs into the previous one
merge_lowercase
: whether to merge paragraphs starting with a lowercase letter into the previous one
Returns:
List of dicts, one per document, that can be written to a Document Store
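Usage mirrors convert_files_to_dicts, but a running Apache Tika server is required. The docker command, directory path, and the comments on the merge flags below are illustrative assumptions, not guaranteed semantics.

# start a Tika server first, e.g.: docker run -d -p 9998:9998 apache/tika
from haystack.preprocessor.utils import tika_convert_files_to_dicts  # exact module path varies by release

dicts = tika_convert_files_to_dicts(dir_path="data/my_docs",
                                    split_paragraphs=True,
                                    merge_short=True,      # assumed: fold short paragraphs into their neighbor
                                    merge_lowercase=True)  # assumed: join lowercase-starting paragraphs with the previous one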
fetch_archive_from_http
fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None)
Fetch an archive (zip or tar.gz) from a URL via HTTP and extract its content to an output directory.
Arguments:
url
: http address (str)
output_dir
: local path (str)
proxies
: proxies details as required by the requests library (dict)
Returns:
bool indicating whether anything was fetched
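A typical pattern is to fetch a dataset archive once and use the returned bool to detect when nothing new was downloaded. The URL and output directory are placeholders; the import path is assumed, as above.

from haystack.preprocessor.utils import fetch_archive_from_http  # exact module path varies by release

url = "https://example.com/some_dataset.tar.gz"  # placeholder archive URL
fetched = fetch_archive_from_http(url=url, output_dir="data/some_dataset")
if not fetched:
    print("Nothing fetched (the output directory may already contain the data)")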