mirror of https://github.com/deepset-ai/haystack.git synced 2025-10-02 03:26:55 +00:00

Redesign primitives - Document, Answer, Label (#1398 )

* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* adress review feedback

* Add latest docstring and tutorial changes

* fix mypy

Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2021-10-13 14:23:23 +02:00

4.2 KiB

Raw Blame History

Module base

BaseTranslator Objects

class BaseTranslator(BaseComponent)

Abstract class for a Translator component that translates either a query or a doc from language A to language B.

translate

 | @abstractmethod
 | translate(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]

Translate the passed query or a list of documents from language A to B.

run

 | run(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, answers: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None)

Method that gets executed when this class is used as a Node in a Haystack Pipeline

Module transformers

TransformersTranslator Objects

class TransformersTranslator(BaseTranslator)

Translator component based on Seq2Seq models from Huggingface's transformers library. Exemplary use cases:

Translate a query from Language A to B (e.g. if you only have good models + documents in language B)
Translate a document from Language A to B (e.g. if you want to return results in the native language of the user)

We currently recommend using OPUS models (see init() for details)

Example:

|    DOCS = [
|        Document(text="Heinz von Foerster was an Austrian American scientist combining physics and philosophy,
|                       and widely attributed as the originator of Second-order cybernetics.")
|    ]
|    translator = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-en-de")
|    res = translator.translate(documents=DOCS, query=None)

init

 | __init__(model_name_or_path: str, tokenizer_name: Optional[str] = None, max_seq_len: Optional[int] = None, clean_up_tokenization_spaces: Optional[bool] = True)

Initialize the translator with a model that fits your targeted languages. While we support all seq2seq models from Hugging Face's model hub, we recommend using the OPUS models from Helsiniki NLP. They provide plenty of different models, usually one model per language pair and translation direction. They have a pretty standardized naming that should help you find the right model:

"Helsinki-NLP/opus-mt-en-de" => translating from English to German
"Helsinki-NLP/opus-mt-de-en" => translating from German to English
"Helsinki-NLP/opus-mt-fr-en" => translating from French to English
"Helsinki-NLP/opus-mt-hi-en"=> translating from Hindi to English ...

They also have a few multilingual models that support multiple languages at once.

Arguments:

model_name_or_path: Name of the seq2seq model that shall be used for translation. Can be a remote name from Huggingface's modelhub or a local path.
tokenizer_name: Optional tokenizer name. If not supplied, model_name_or_path will also be used for the tokenizer.
max_seq_len: The maximum sentence length the model accepts. (Optional)
clean_up_tokenization_spaces: Whether or not to clean up the tokenization spaces. (default True)

translate

 | translate(query: Optional[str] = None, documents: Optional[Union[List[Document], List[Answer], List[str], List[Dict[str, Any]]]] = None, dict_key: Optional[str] = None) -> Union[str, List[Document], List[Answer], List[str], List[Dict[str, Any]]]

Run the actual translation. You can supply a query or a list of documents. Whatever is supplied will be translated.

Arguments:

query: The query string to translate
documents: The documents to translate
dict_key: If you pass a dictionary in documents, you can specify here the field which shall be translated.

4.2 KiB Raw Blame History

Module base

BaseTranslator Objects

translate

run

Module transformers

TransformersTranslator Objects

__init__

translate

4.2 KiB

Raw Blame History

init