135 Commits

Author SHA1 Message Date
bogdankostic
c00b32cf67
Fix Tutorial 11 on Google Colab (#1795)
* Remove installation of latest release

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-23 15:35:23 +01:00
bogdankostic
a19a9f548b
Upgrade torch to v1.10.0 (#1789)
* Upgrade torch to v1.10.0

* Adapt torch version for torch-scatter in TableQA tutorial

* Add latest docstring and tutorial changes

* Make torch version more flexible

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-23 11:49:46 +01:00
Julian Risch
9211c4c64d
initialize doc store with doc and label index in tutorial 5 (#1730)
* initialize doc store with doc and label index

* change ipynb according to py for tutorial 5

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-22 15:18:02 +01:00
Sara Zan
ea3abd305b
Fix a few details of some tutorials (#1733)
* Make Tutorial10 use print instead of logs and fix a typo in Tutoria15

* Add a type check in 'print_answers'

* Add same checks to print_documents and print_questions

* Make RAGenerator return Answers instead of dictionaries

* Fix RAGenerator tests

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-12 16:44:28 +01:00
Sara Zan
85a08d671a
Fix another self.device/s typo (#1734)
* Fix yet another self.device(s) typo

* Add typing to 'initialize_device_settings' to try prevent future issues

* Fix bug in Tutorial5

* Fix the same bug in the notebook

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-11 17:18:06 +01:00
Branden Chan
8082549663
Add debugging example to tutorial (#1731)
* Add debugging example to tutorial

* Add latest docstring and tutorial changes

* Remove Objects suffix

* Add latest docstring and tutorial changes

* Revert "Remove Objects suffix"

This reverts commit 6681cb06510b080775994effe6a50bae42254be4.

* Revert unintentional commit

* Add third debugging option

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-11 14:45:06 +01:00
tstadel
14515a861b
Tutorial for DocumentClassifier at Index Time (#1697)
* basic example of document classifier in preprocessing logic

* add batch_size to TransformersDocumentClassifier

* complete tutorial16

* Add latest docstring and tutorial changes

* fix missing batch_size

* add notebook

* test for batch_size use added

* add tutorial 16 to headers.py

* Add latest docstring and tutorial changes

* make DocumentClassifier indexing pipeline rdy

* Add latest docstring and tutorial changes

* flexibility improvements for DocumentClassifier in Pipelines

* Add latest docstring and tutorial changes

* fix index time usage

* remove query from documentclassifier tests

* improve classification_field resolving + minor fixes

* Add latest docstring and tutorial changes

* tutorial 16 extended with zero shot and pipelines

* Add latest docstring and tutorial changes

* install graphviz in notebook

* Add latest docstring and tutorial changes

* remove convert_to_dicts

* Add latest docstring and tutorial changes

* Fix typo

* Add latest docstring and tutorial changes

* remove retriever from indexing pipeline

* Add latest docstring and tutorial changes

* fix save_to_yaml when using FileTypeClassifier

* emphasize the impact with zero shot classification

* Add latest docstring and tutorial changes

* adjust use_gpu to boolean in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-11-09 18:43:00 +01:00
Sara Zan
91cafb49bb
Improve tutorials' output (#1694)
* Modify __str__ and __repr__ for Document and Answer

* Rename QueryClassifier in Tutorial11

* Improve the output of tutorial1

* Make the output of Tutorial8 a bit less dense

* Add a print_questions util to print the output of question generating pipelines

* Replace custom printing with the new utility in Tutorial13

* Ensure all output is printed with minimal details in Tutorial14 and add some titles

* Minor change to print_answers

* Make tutorial3's output the same as tutorial1

* Add __repr__ to Answer and fix to_dict()

* Fix a bug in the Document and Answer's __str__ method

* Improve print_answers, print_documents and print_questions

* Using print_answers in Tutorial7 and fixing typo in the utils

* Remove duplicate line in Tutorial12

* Use print_answers in Tutorial4

* Add explanation of what the documents in the output of the basic QA pipeline are

* Move the fields constant into print_answers

* Normalize all 'minimal' to 'minimum' (they were mixed up)

* Improve the sample output to include all fields from Document and Answer

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-09 15:09:26 +01:00
Timo Moeller
5ac89332a2
Fix Typo in TableQA Tutorial (#1690) 2021-11-08 17:14:04 +01:00
Julian Risch
efdcd24d70
fixed typo (#1680)
* fixed typo

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-11-01 10:38:39 +01:00
bogdankostic
9025615be7
Add TableQA tutorial (#1670)
* Add TableQA tutorial

* Add tutorial header

* Add latest docstring and tutorial changes

* Add more details

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-29 11:07:13 +02:00
Julian Risch
e106eb41a7
Remove trailing comma in import statement (#1655) 2021-10-27 10:11:22 +02:00
Sara Zan
13510aa753
Refactoring of the haystack package (#1624)
* Files moved, imports all broken

* Fix most imports and docstrings into

* Fix the paths to the modules in the API docs

* Add latest docstring and tutorial changes

* Add a few pipelines that were lost in the inports

* Fix a bunch of mypy warnings

* Add latest docstring and tutorial changes

* Create a file_classifier module

* Add docs for file_classifier

* Fixed most circular imports, now the REST API can start

* Add latest docstring and tutorial changes

* Tackling more mypy issues

* Reintroduce  from FARM and fix last mypy issues hopefully

* Re-enable old-style imports

* Fix some more import from the top-level  package in an attempt to sort out circular imports

* Fix some imports in tests to new-style to prevent failed class equalities from breaking tests

* Change document_store into document_stores

* Update imports in tutorials

* Add latest docstring and tutorial changes

* Probably fixes summarizer tests

* Improve the old-style import allowing module imports (should work)

* Try to fix the docs

* Remove dedicated KnowledgeGraph page from autodocs

* Remove dedicated GraphRetriever page from autodocs

* Fix generate_docstrings.sh with an updated list of yaml files to look for

* Fix some more modules in the docs

* Fix the document stores docs too

* Fix a small issue on Tutorial14

* Add latest docstring and tutorial changes

* Add deprecation warning to old-style imports

* Remove stray folder and import Dict into dense.py

* Change import path for MLFlowLogger

* Add old loggers path to the import path aliases

* Fix debug output of convert_ipynb.py

* Fix circular import on BaseRetriever

* Missed one merge block

* re-run tutorial 5

* Fix imports in tutorial 5

* Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base

* Add latest docstring and tutorial changes

* Fix typo in utils __init__

* Fix a few more imports

* Fix benchmarks too

* New-style imports in test_knowledge_graph

* Rollback setup.py

* Rollback squad_to_dpr too

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-25 15:50:23 +02:00
Julian Risch
6033319cfe
Fix parameter names in tutorial 5 and 12 (#1639)
* Fix parameter names in tutorial 5

* Update parameters in tutorial notebook

* Add latest docstring and tutorial changes

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-22 17:22:51 +02:00
Timo Moeller
9dc125df9d
Bugfix Tutorial 5 parameters, adjust default split length (#1635)
Bugfix parameters, adjust default split length, add sentencetransformers
2021-10-22 16:03:12 +02:00
Julian Risch
52e1fc991e
Update jobs link to personio (#1611)
* Update jobs link to personio

* Add latest docstring and tutorial changes

* Change jobs link to main website

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-21 11:42:32 +02:00
Julian Risch
f2a3f95ab6
add note on gpu runtime to tutorial 13 (#1614)
* add note on gpu runtime to tutorial 13

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-20 09:55:56 +02:00
Malte Pietsch
99c8046367
Fix Tutorials (#1594)
* fix response format of DocumentSearchPipeline

* Add latest docstring and tutorial changes

* fix typos

* change prints in tutorial 4

* Add latest docstring and tutorial changes

* fix tutorial 13

* Add latest docstring and tutorial changes

* remove unused import

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-14 11:49:35 +02:00
Malte Pietsch
82c2cdf7cd Merge branch 'master' of github.com:deepset-ai/haystack 2021-10-13 14:45:17 +02:00
Malte Pietsch
db2b5d913b Fix param in tutorial 8 2021-10-13 14:45:09 +02:00
Malte Pietsch
4a6c9302b3
Redesign primitives - Document, Answer, Label (#1398)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* adress review feedback

* Add latest docstring and tutorial changes

* fix mypy

Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00
Markus Sagen
69a0c9f2ed
Clarify docs for PDF conversion, languages and encodings (#1570)
* Clarify PDF conversion, languages and encodings

The parameter name `valid_languages` may be a bit miss-leading from
reading only the tutorials. Users may, incorrectly assume that it
enforces that the conversions only works for those languages, then it's
more of a check.

- Provided clarifications in the tutorials to highlight what
valid_languages does and that changing the encoding may give better
results for their language of choice
- Updated the command for `pdftotext` to the correct one

* Allow encodings for `convert_files_to_dicts`

- Set option of passing encoding to the converters. Trying even for some
Latin1 languages, the converter does not do it in a good way.

Potential issues is that the encoding defaults to None, which is default
for the other converters, but not for the PDFToTextConverter. Could add
a check and change the ending to Latin1 for pdf if set to None.

Was considering adding it to **kwargs, but since it may be a commonly
used feature to be documented, I added it as a keyword argument instead.
Would love to hear your input and feedback on in.

* Set back PDF default encoding

* Update documentation
2021-10-11 09:30:12 +02:00
Julian Risch
f9d2f786ca
Replace FARM import statements; add dependencies (#1492)
* Replace FARM import statements; add dependencies

* Add InferenceProc., TextCl.Proc., TextPairCl.Proc.

* Remove FARMRanker, add type annotations, rename max_sample

* Add sample_to_features_text for InferenceProc.

* Fix type annotations: model_name_or_path is str not Path

* Fix mypy errors: implement _create_dataset in TextCl.Proc.

* Add task_type "embeddings" in Inferencer

* Allow loading AdaptiveModel for embedding task

* Add SQuAD eval metrics; enable InferenceProc for embedding task

* Add baskets as param to log_samples and handle empty basket list in log_samples

* Remove unused dependencies

* Remove FARMClassifier (doc classificer) due to ref to TextClassificationHead

* Remove FARMRanker and Classifier from doc generation scripts

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-28 16:34:24 +02:00
bogdankostic
c644e2b4d0
Add comment to tutorial notebooks about restarting runtime in colab (#1486)
* Add comment to tutorial notebooks about restarting runtime in colab

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-23 14:36:20 +02:00
Julian Risch
d569e66bc7
Update Tutorial1_Basic_QA_Pipeline.ipynb (#1489)
* Update Tutorial1_Basic_QA_Pipeline.ipynb

passing params to pipeline as dict

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-22 16:35:20 +02:00
Branden Chan
bddee2def4
Define SAS model in notebook (#1485)
* Define SAS model in notebook

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-21 17:05:16 +02:00
Branden Chan
2c4baa7f4e
Regenerate API and Tutorial md files (#1480)
* Change punctuation

* Add latest docstring and tutorial changes

* Change punctuation

* Add documentation for Docs2Answer

* Add latest docstring and tutorial changes

* Generate new API docs

* Replace Finder with Pipeline

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-21 14:42:18 +02:00
ju-gu
05da7f71dd
changed delete_all_documents to delete_documents (#1477) 2021-09-20 14:29:33 +02:00
oryx1729
9dd7c74f4f
Refactor communication between Pipeline Components (#1321) 2021-09-10 11:41:16 +02:00
Branden Chan
980d88a0f2
Update faq model (#1401) 2021-09-01 18:39:06 +02:00
Branden Chan
1938fb001b
Add support for no Docker envs in Tutorial 13 (#1365)
* Add support for no docker envs e.g. colab

* Generate md
2021-08-31 15:22:51 +02:00
Jeff Hammerbacher
1c8a03aaa2
Rag tutorial fixes (#1375)
* Update Tutorial7_RAG_Generator.ipynb

`delete_all_documents` --> `delete_documents` (cf. #1045)

* Update Tutorial7_RAG_Generator.py

`delete_all_documents` --> `delete_documents` (cf. #1045)
2021-08-30 15:27:18 +02:00
ramgarg102
51f0a56e5d
delete_all_documents() replaced by delete_documents() (#1377)
* [UPDT] delete_all_documents() replaced by delete_documents()

* [UPDT] warning logs to be fixed

* [UPDT] delete_all_documents() renamed and the same method added

Co-authored-by: Ram Garg <ramgarg102@gmai.com>
2021-08-30 15:18:28 +02:00
Timo Moeller
07bd3c50ea
Add new QA eval metric: Semantic Answer Similarity (SAS) (#1338)
* init

* Add type annotation

* Add test case, fix mypy

* Add german model to docstring

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-08-12 14:31:48 +02:00
Malte Pietsch
be9d19afa5
Remove Finder from tutorials (#1329) 2021-08-10 11:50:59 +02:00
Ikram Ali
d94674c5b6
Remove finder class from tutorial 1 (#1328) 2021-08-10 11:41:07 +02:00
Malte Pietsch
5e16ec4d76
Fix installation in Colab Tutorial 11 2021-08-10 08:50:04 +02:00
Shahrukh Khan
cc43502e7e
Add Tutorial about Query Classifier (#1324)
* add query classifier colab and jupyter notebook

* Delete Tutorial13_Query_Classifier.ipynb

* add query classifier tutorial with updated number

* add query classifier tutorial script

* Rename tutorial14_query_classifier.py to Tutorial14_Query_Classifier.py
2021-08-09 10:43:50 +02:00
Branden Chan
937247d628
Add QuestionGenerator (#1267)
* Create basic Question Generation

* Split texts into 50 word chunks

* Allow prompt to be changed

* Implement iteration functionality in DS

* Add docstrings, create pipelines

* Make pipelines work

* Add comments

* Add tests

* Add tutorials and docs

* Add doc string
2021-07-26 17:20:43 +02:00
Branden Chan
da97d81305
Change variable names (#1286) 2021-07-14 14:03:34 +02:00
Julian Risch
2a90471c73
Encapsulate tutorial code in method (#1266) 2021-07-09 17:08:19 +02:00
Branden Chan
efc03f72db
Make PreProcessor.process() work on lists of documents (#1163)
* Add process_batch method

* Rename methods

* Fix doc string, satisfy mypy

* Fix mypy CI

* Fix typp

* Update tutorial

* Fix argument name

* Change arg name

* Incorporate reviewer feedback
2021-06-23 18:13:51 +02:00
Branden Chan
7dbd58f6be
Add about sections (#1195) 2021-06-14 18:37:00 +02:00
vblagoje
2a5882578a
Add Longform-QA (LFQA), Seq2SeqGenerator for generative QA and Retribert Retriever (#1086)
* Integrate LFQA with Haystack

* Integrate LFQA with Haystack - unit tests

* Properly initialize conftest default value for vector_dim

* Update PR after inital feedback

* Fix conftest.py import

* Seq2SeqGenerator uses Callables instead of subclasses for custom model input

* Update docstring

* Fix Callable use

* Add LFQA tutorials

* Improve type error reporting for invalid input converter Callable

* Generate docstrings

* Format comments in tutorial script

* Generate tutorial md

* Add usage page

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-06-14 17:53:43 +02:00
Branden Chan
783893c3d2
Tutorial update (#1166)
* Add header / footer

* Add Milvus example

* Generate md files

* Fix mypy CI
2021-06-11 11:09:15 +02:00
Branden Chan
aa6f768efa
Prevent merge of same questions on different documents during evaluation (#1119)
* Fix duplicate question in Reader.eval()

* Add duplicate question support in document store

* Support duplicate questions in retriever eval

* Update tutorial

* Rename key_tuple

* Change error message

* Add warning when more than 6 labels

* Allow for label grouping options

* Add support for aggregating by label meta

* Satisfy mypy

* Fix duplicate question in Reader.eval()

* Add duplicate question support in document store

* Support duplicate questions in retriever eval

* Update tutorial

* Rename key_tuple

* Change error message

* Add warning when more than 6 labels

* Allow for label grouping options

* Add support for aggregating by label meta

* Satisfy mypy

* Make label field flexible, add docstrings

* Satisfy mypy

* Fix failing tests

* Adjust docstring

* Fix tutorial

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-02 12:09:03 +02:00
Julian Risch
a7ba146246
Removed comma from last item in json list (#1114) 2021-06-01 12:32:21 +02:00
Julian Risch
40ceaf418a
Fixing grpcio-tools to version of colab's pre-installed grpcio (#1113) 2021-05-31 19:09:10 +02:00
Julian Risch
84c34295a1
Re-ranking component for document search without QA (#1025)
* Adding ranker similar to retriever and reader

* Sort documents according to query-document similarity scores

* Reranking and model training runs for small example

* Added EvalRanker node

* Calculate recall@k in EvalRetriever and EvalRanker nodes

* Renaming EvalRetriever to EvalDocuments and EvalReader to EvalAnswers

* Added mean reciprocal rank as metric for EvalDocuments

* Fix bug that appeared when ranking documents with same score

* Remove commented code for unimplmented eval() of Ranker node

* Add documentation of k parameter in EvalDocuments

* Add Ranker docu and renaming top_k param
2021-05-31 15:31:36 +02:00
Branden Chan
9827b3652e
Pipelines tutorial (#991)
* Start Pipelines tutorial

* Make Tutorial 11 run locally

* Add colab compatibility

* Fix pip install

* Add ES install from source

* Add ES install from source

* Add pygraphviz installation

* Incorporate reviewer feedback

* Ensure print_answers() works for Generator output

* Fix typo
2021-04-29 17:31:28 +02:00