Sara Zan
13510aa753
Refactoring of the haystack
package ( #1624 )
...
* Files moved, imports all broken
* Fix most imports and docstrings into
* Fix the paths to the modules in the API docs
* Add latest docstring and tutorial changes
* Add a few pipelines that were lost in the inports
* Fix a bunch of mypy warnings
* Add latest docstring and tutorial changes
* Create a file_classifier module
* Add docs for file_classifier
* Fixed most circular imports, now the REST API can start
* Add latest docstring and tutorial changes
* Tackling more mypy issues
* Reintroduce from FARM and fix last mypy issues hopefully
* Re-enable old-style imports
* Fix some more import from the top-level package in an attempt to sort out circular imports
* Fix some imports in tests to new-style to prevent failed class equalities from breaking tests
* Change document_store into document_stores
* Update imports in tutorials
* Add latest docstring and tutorial changes
* Probably fixes summarizer tests
* Improve the old-style import allowing module imports (should work)
* Try to fix the docs
* Remove dedicated KnowledgeGraph page from autodocs
* Remove dedicated GraphRetriever page from autodocs
* Fix generate_docstrings.sh with an updated list of yaml files to look for
* Fix some more modules in the docs
* Fix the document stores docs too
* Fix a small issue on Tutorial14
* Add latest docstring and tutorial changes
* Add deprecation warning to old-style imports
* Remove stray folder and import Dict into dense.py
* Change import path for MLFlowLogger
* Add old loggers path to the import path aliases
* Fix debug output of convert_ipynb.py
* Fix circular import on BaseRetriever
* Missed one merge block
* re-run tutorial 5
* Fix imports in tutorial 5
* Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base
* Add latest docstring and tutorial changes
* Fix typo in utils __init__
* Fix a few more imports
* Fix benchmarks too
* New-style imports in test_knowledge_graph
* Rollback setup.py
* Rollback squad_to_dpr too
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 15:50:23 +02:00
Julian Risch
6033319cfe
Fix parameter names in tutorial 5 and 12 ( #1639 )
...
* Fix parameter names in tutorial 5
* Update parameters in tutorial notebook
* Add latest docstring and tutorial changes
* Add latest docstring and tutorial changes
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-22 17:22:51 +02:00
Timo Moeller
9dc125df9d
Bugfix Tutorial 5 parameters, adjust default split length ( #1635 )
...
Bugfix parameters, adjust default split length, add sentencetransformers
2021-10-22 16:03:12 +02:00
Malte Pietsch
4a6c9302b3
Redesign primitives - Document
, Answer
, Label
( #1398 )
...
* first draft / notes on new primitives
* wip label / feedback refactor
* rename doc.text -> doc.content. add doc.content_type
* add datatype for content
* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field
* update converters for . Add warning for empty
* renam label.question -> label.query. Allow sorting of Answers.
* WIP primitives
* update ui/reader for new Answer format
* Improve Label. First refactoring of MultiLabel. Adjust eval code
* fixed workflow conflict with introducing new one (#1472 )
* Add latest docstring and tutorial changes
* make add_eval_data() work again
* fix reader formats. WIP fix _extract_docs_and_labels_from_dict
* fix test reader
* Add latest docstring and tutorial changes
* fix another test case for reader
* fix mypy in farm reader.eval()
* fix mypy in farm reader.eval()
* WIP ORM refactor
* Add latest docstring and tutorial changes
* fix mypy weaviate
* make label and multilabel dataclasses
* bump mypy env in CI to python 3.8
* WIP refactor Label ORM
* WIP refactor Label ORM
* simplify tests for individual doc stores
* WIP refactoring markers of tests
* test alternative approach for tests with existing parametrization
* WIP refactor ORMs
* fix skip logic of already parametrized tests
* fix weaviate behaviour in tests - not parametrizing it in our general test cases.
* Add latest docstring and tutorial changes
* fix some tests
* remove sql from document_store_types
* fix markers for generator and pipeline test
* remove inmemory marker
* remove unneeded elasticsearch markers
* add dataclasses-json dependency. adjust ORM to just store JSON repr
* ignore type as dataclasses_json seems to miss functionality here
* update readme and contributing.md
* update contributing
* adjust example
* fix duplicate doc handling for custom index
* Add latest docstring and tutorial changes
* fix some ORM issues. fix get_all_labels_aggregated.
* update drop flags where get_all_labels_aggregated() was used before
* Add latest docstring and tutorial changes
* add to_json(). add + fix tests
* fix no_answer handling in label / multilabel
* fix duplicate docs in memory doc store. change primary key for sql doc table
* fix mypy issues
* fix mypy issues
* haystack/retriever/base.py
* fix test_write_document_meta[elastic]
* fix test_elasticsearch_custom_fields
* fix test_labels[elastic]
* fix crawler
* fix converter
* fix docx converter
* fix preprocessor
* fix test_utils
* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations
* Add latest docstring and tutorial changes
* fix crawler test. fix ocrconverter attribute
* fix test_elasticsearch_custom_query
* fix generator pipeline
* fix ocr converter
* fix ragenerator
* Add latest docstring and tutorial changes
* fix test_load_and_save_yaml for elasticsearch
* fixes for pipeline tests
* fix faq pipeline
* fix pipeline tests
* Add latest docstring and tutorial changes
* fix weaviate
* Add latest docstring and tutorial changes
* trigger CI
* satisfy mypy
* Add latest docstring and tutorial changes
* satisfy mypy
* Add latest docstring and tutorial changes
* trigger CI
* fix question generation test
* fix ray. fix Q-generation
* fix translator test
* satisfy mypy
* wip refactor feedback rest api
* fix rest api feedback endpoint
* fix doc classifier
* remove relation of Labels -> Docs in SQL ORM
* fix faiss/milvus tests
* fix doc classifier test
* fix eval test
* fixing eval issues
* Add latest docstring and tutorial changes
* fix mypy
* WIP replace dataclasses-json with manual serialization
* Add latest docstring and tutorial changes
* revert to dataclass-json serialization for now. remove debug prints.
* update docstrings
* fix extractor. fix Answer Span init
* fix api test
* keep meta data of answers in reader.run()
* fix meta handling
* adress review feedback
* Add latest docstring and tutorial changes
* make document=None for open domain labels
* add import
* fix print utils
* fix rest api
* adress review feedback
* Add latest docstring and tutorial changes
* fix mypy
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00
Julian Risch
f9d2f786ca
Replace FARM import statements; add dependencies ( #1492 )
...
* Replace FARM import statements; add dependencies
* Add InferenceProc., TextCl.Proc., TextPairCl.Proc.
* Remove FARMRanker, add type annotations, rename max_sample
* Add sample_to_features_text for InferenceProc.
* Fix type annotations: model_name_or_path is str not Path
* Fix mypy errors: implement _create_dataset in TextCl.Proc.
* Add task_type "embeddings" in Inferencer
* Allow loading AdaptiveModel for embedding task
* Add SQuAD eval metrics; enable InferenceProc for embedding task
* Add baskets as param to log_samples and handle empty basket list in log_samples
* Remove unused dependencies
* Remove FARMClassifier (doc classificer) due to ref to TextClassificationHead
* Remove FARMRanker and Classifier from doc generation scripts
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-28 16:34:24 +02:00
ju-gu
05da7f71dd
changed delete_all_documents to delete_documents ( #1477 )
2021-09-20 14:29:33 +02:00
oryx1729
9dd7c74f4f
Refactor communication between Pipeline Components ( #1321 )
2021-09-10 11:41:16 +02:00
Timo Moeller
07bd3c50ea
Add new QA eval metric: Semantic Answer Similarity (SAS) ( #1338 )
...
* init
* Add type annotation
* Add test case, fix mypy
* Add german model to docstring
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-08-12 14:31:48 +02:00
Branden Chan
da97d81305
Change variable names ( #1286 )
2021-07-14 14:03:34 +02:00
Branden Chan
783893c3d2
Tutorial update ( #1166 )
...
* Add header / footer
* Add Milvus example
* Generate md files
* Fix mypy CI
2021-06-11 11:09:15 +02:00
Branden Chan
aa6f768efa
Prevent merge of same questions on different documents during evaluation ( #1119 )
...
* Fix duplicate question in Reader.eval()
* Add duplicate question support in document store
* Support duplicate questions in retriever eval
* Update tutorial
* Rename key_tuple
* Change error message
* Add warning when more than 6 labels
* Allow for label grouping options
* Add support for aggregating by label meta
* Satisfy mypy
* Fix duplicate question in Reader.eval()
* Add duplicate question support in document store
* Support duplicate questions in retriever eval
* Update tutorial
* Rename key_tuple
* Change error message
* Add warning when more than 6 labels
* Allow for label grouping options
* Add support for aggregating by label meta
* Satisfy mypy
* Make label field flexible, add docstrings
* Satisfy mypy
* Fix failing tests
* Adjust docstring
* Fix tutorial
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-02 12:09:03 +02:00
Julian Risch
84c34295a1
Re-ranking component for document search without QA ( #1025 )
...
* Adding ranker similar to retriever and reader
* Sort documents according to query-document similarity scores
* Reranking and model training runs for small example
* Added EvalRanker node
* Calculate recall@k in EvalRetriever and EvalRanker nodes
* Renaming EvalRetriever to EvalDocuments and EvalReader to EvalAnswers
* Added mean reciprocal rank as metric for EvalDocuments
* Fix bug that appeared when ranking documents with same score
* Remove commented code for unimplmented eval() of Ranker node
* Add documentation of k parameter in EvalDocuments
* Add Ranker docu and renaming top_k param
2021-05-31 15:31:36 +02:00
Branden Chan
d77152c469
WIP: Add evaluation nodes for Pipelines ( #904 )
...
* Add main eval fns
* WIP: make pipeline_eval.py run
* Fix typo
* Add support for no_answers
* Add latest docstring and tutorial changes
* Working pipeline eval
* Add timing of nodes
* Add latest docstring and tutorial changes
* Refactor and clean
* Update tutorial script
* Set default params
* Update tutorials
* Fix indent
* Add latest docstring and tutorial changes
* Address mypy issues
* Add test
* Fix mypy error
* Clear outputs
* Add doc strings
* Incorporate reviewer feedback
* Add latest docstring and tutorial changes
* Revert query counting
* Fix typo
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-01 17:35:18 +02:00
Julian Risch
3331608e03
Adding a guard that prevents the tutorial code from being executed in every subprocess when using multiprocessing on windows ( #729 )
2021-01-13 18:17:54 +01:00
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product ( #517 )
2020-10-23 17:50:49 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) ( #379 )
...
* rename database to documentstore
* move document, label, multilabel to haystack/schema.py
* rename documentstore -> document_store
* split indexing modules -> file_converter + preprocessor
* fix order of imports
* Update tutorial notebooks
* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
brandenchan
b44b1ac6ec
Set top_k_per_candidate
2020-08-26 12:03:56 +02:00
kolk
f2b6cc761b
Refactor DPR from FB to Transformers codebase ( #308 )
...
* change_HFBertEncoder to transformers DPREncoder
* Removed BertTensorizer
* model download relative path
* Refactor model load
* Tutorial5 DPR updated
* fix print_eval_results typo
* copy transformers DPR modules in dpr_utils and test
* transformer v3.0.2 import errors fixed
* remove dependency of DPRConfig on attribute use_return_tuple
* Adjust transformers 302 locally to work with dpr
* projection layer removed from DPR encoders
* fixed mypy errors
* transformers DPR compatible code added
* transformers DPR compatibility added
* bug fix in tutorial 6 notebook
* Docstring update and variable naming issues fix
* tutorial modified to reflect DPR variable naming change
* title addition to passage use-cases handled
* modified handling untitled batch
* resolved mypy errors
* typos in docstrings and comments fixed
* cleaned DPR code and added new test cases
* warnings added for non-bert model [SEP] token removal
* changed warning to logger warning
* title mask creation refactored
* bug fix on cuda issues
* tutorial 6 instantiates modified DPR
* tutorial 5 modified
* tutorial 5 ipython notebook modified: DPR instantiation
* batch_size added to DPR instantiation
* tutorial 5 jupyter notebook typos fixed
* improved docstrings, fixed typos
* Update docstring
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-08-25 20:16:00 +05:30
brandenchan
8a3eca05c3
Change to retriever eval top_k to match notebook
2020-08-18 11:39:49 +02:00
Timo Moeller
72e6867278
Aggregate label objects for same questions ( #292 )
...
* Add aggregate labels obj, use in retriever eval function
* Change launch ES param
* Move aggregation from ES document store to base class
* Fix type annotations
2020-08-07 11:24:41 +02:00
Malte Pietsch
29a15c0d59
Add eval for Dense Passage Retriever & Refactor handling of labels/feedback ( #243 )
2020-07-31 11:34:06 +02:00
Branden Chan
64721d3196
One more update
2020-07-15 16:24:10 +02:00
Branden Chan
c55477e0ce
update eval dataset
2020-07-15 16:14:52 +02:00
Malte Pietsch
99a6a34047
Upgrade to new FARM / Transformers / PyTorch versions ( #212 )
2020-07-14 18:53:15 +02:00
Malte Pietsch
07ecfb60b9
Dense Passage Retriever (Inference) ( #167 )
2020-06-30 19:05:45 +02:00
Timo Moeller
c53aaddb78
Fix document id missing in farm inference output ( #174 )
2020-06-26 11:01:10 +02:00
Yaser Martinez Palenzuela
97bbb4280c
Correct field in evaluation tutorial ( #139 )
2020-06-08 16:38:09 +02:00
Tanay Soni
ef9e4f4467
Add PDF text extraction ( #109 )
2020-06-08 11:07:19 +02:00
bogdankostic
479fcb1ace
Fix evaluation ( #132 )
...
* Fix bugs in Tutorial 5
* Adapt tutorials to new metrics
2020-06-05 18:33:50 +02:00
bogdankostic
bbfccf5cf6
Add Evaluation of Reader, Retriever and Finder ( #92 )
2020-05-29 15:57:07 +02:00