haystack/docs/preprocessor.md
Branden Chan 7fdb85d63a
Create documentation website (#272)

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
Co-authored-by: Bogdan Kostić <bogdankostic@web.de>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00

## cleaning

## __init__

## utils

#### eval_data_from_file

`eval_data_from_file(filename: str) -> Tuple[List[Document], List[Label]]`

Read Documents and Labels from a SQuAD-style JSON file. The Documents and Labels can then be indexed into the DocumentStore and used for evaluation.

Arguments:

  • filename: path to a file in SQuAD format

Returns:

(List of Documents, List of Labels)
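The SQuAD format nests paragraphs inside articles, with question/answer pairs attached to each paragraph. The sketch below parses a minimal hypothetical record with plain Python to show how documents and labels fall out of that structure; the dict fields used here are illustrative stand-ins for Haystack's `Document` and `Label` objects, not the library's actual return types.

```python
# A minimal SQuAD-style record: articles contain paragraphs ("context"),
# each with question/answer pairs ("qas"). The data is hypothetical.
squad = {
    "data": [
        {
            "title": "Berlin",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "question": "What is the capital of Germany?",
                            "id": "q1",
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ]
}

# Flatten into (documents, labels) — the same split that
# eval_data_from_file performs, here using plain dicts.
documents, labels = [], []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        documents.append({"text": paragraph["context"], "meta": {"name": article["title"]}})
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                labels.append({
                    "question": qa["question"],
                    "answer": answer["text"],
                    "offset_start": answer["answer_start"],
                })

print(len(documents), len(labels))  # 1 1
```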

#### convert_files_to_dicts

`convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False) -> List[dict]`

Convert all files (.txt, .pdf) in the given path and its subdirectories to Python dicts that can be written to a Document Store.

Arguments:

  • dir_path: path to the directory containing the files to convert
  • clean_func: a custom cleaning function that gets applied to each document (input: str, output: str)
  • split_paragraphs: whether to split the text into paragraphs, yielding one dict per paragraph

Returns:

List of dicts, one per document (or per paragraph if split_paragraphs=True)
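The dicts produced have the shape `{"text": ..., "meta": {"name": <filename>}}`. The following is a simplified pure-Python sketch of the .txt path only, to make that shape and the `clean_func`/`split_paragraphs` hooks concrete; `txt_files_to_dicts` is a hypothetical helper, not the library function (which additionally handles .pdf and validates file suffixes).

```python
from pathlib import Path
from typing import Callable, List, Optional
import tempfile

def txt_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None,
                       split_paragraphs: bool = False) -> List[dict]:
    """Simplified sketch: convert .txt files under dir_path to Haystack-style dicts."""
    dicts = []
    for path in sorted(Path(dir_path).glob("**/*.txt")):
        text = path.read_text()
        if clean_func:
            text = clean_func(text)
        if split_paragraphs:
            # one dict per non-empty, blank-line-separated paragraph
            for para in text.split("\n\n"):
                if para.strip():
                    dicts.append({"text": para, "meta": {"name": path.name}})
        else:
            dicts.append({"text": text, "meta": {"name": path.name}})
    return dicts

# Usage: write a two-paragraph file and convert it
with tempfile.TemporaryDirectory() as d:
    Path(d, "doc1.txt").write_text("First paragraph.\n\nSecond paragraph.")
    docs = txt_files_to_dicts(d, clean_func=str.strip, split_paragraphs=True)
    print(docs)
```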

#### tika_convert_files_to_dicts

`tika_convert_files_to_dicts(dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, merge_short: bool = True, merge_lowercase: bool = True) -> List[dict]`

Convert all files (.txt, .pdf) in the given path and its subdirectories to Python dicts that can be written to a Document Store, using Apache Tika for text extraction.

Arguments:

  • dir_path: path to the directory containing the files to convert
  • clean_func: a custom cleaning function that gets applied to each document (input: str, output: str)
  • split_paragraphs: whether to split the text into paragraphs, yielding one dict per paragraph
  • merge_short: whether to merge short paragraphs into the previous one
  • merge_lowercase: whether to merge paragraphs that start with a lowercase letter into the previous one, treating them as continuations

Returns:

List of dicts, one per document (or per paragraph if split_paragraphs=True)
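The merge_short / merge_lowercase behaviour can be pictured as a post-processing pass over the extracted paragraph list. The sketch below is a rough, hypothetical version of that heuristic (`merge_paragraphs` and the `short_len` threshold are illustrative assumptions; the library's exact rules may differ):

```python
from typing import List

def merge_paragraphs(paragraphs: List[str], merge_short: bool = True,
                     merge_lowercase: bool = True, short_len: int = 60) -> List[str]:
    """Merge paragraphs that look like fragments into their predecessor.

    short_len is a hypothetical threshold, not the library's value.
    """
    merged: List[str] = []
    for para in paragraphs:
        is_short = merge_short and len(para) < short_len
        starts_lower = merge_lowercase and para[:1].islower()
        if merged and (is_short or starts_lower):
            # looks like a continuation of the previous paragraph
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged

paragraphs = [
    "Apache Tika extracts text from many file formats.",
    "it also detects the MIME type of each file.",  # lowercase start -> continuation
    "See the docs.",                                # short -> merged as well
]
result = merge_paragraphs(paragraphs)
print(result)
```

Both flags default to True because PDF extraction often breaks sentences across page or column boundaries, producing exactly these fragment patterns.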

#### fetch_archive_from_http

`fetch_archive_from_http(url: str, output_dir: str, proxies: Optional[dict] = None)`

Fetch an archive (zip or tar.gz) from a URL via HTTP and extract its content to an output directory.

Arguments:

  • url: http address (str)
  • output_dir: local path to extract to (str)
  • proxies: proxy details, as required by the requests library (dict)

Returns:

bool indicating whether anything was fetched
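The extraction step dispatches on the archive suffix. Below is a self-contained sketch of that dispatch with the download omitted: a locally built tar.gz stands in for the fetched file, and `extract_archive` is a hypothetical helper, not the library function (which first streams the URL, e.g. via requests, before extracting).

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path

def extract_archive(archive_path: str, output_dir: str) -> bool:
    """Extract a .zip or .tar.gz/.tgz archive into output_dir; return True on success."""
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as z:
            z.extractall(output_dir)
    elif archive_path.endswith((".tar.gz", ".tgz")):
        with tarfile.open(archive_path, "r:gz") as t:
            t.extractall(output_dir)
    else:
        raise ValueError(f"Unsupported archive type: {archive_path}")
    return True

# Usage: build a small tar.gz locally, then extract it
with tempfile.TemporaryDirectory() as d:
    src = Path(d, "doc.txt")
    src.write_text("hello")
    archive = Path(d, "docs.tar.gz")
    with tarfile.open(archive, "w:gz") as t:
        t.add(src, arcname="doc.txt")
    out = Path(d, "out")
    fetched = extract_archive(str(archive), str(out))
    content = Path(out, "doc.txt").read_text()
    print(fetched, content)  # True hello
```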