3803 Commits

Author SHA1 Message Date
Branden Chan
325a4e4d14
Add Milvus Documentation (#838)
* First commit

* Add latest docstring and tutorial changes

* Add DocStore external setup info

* fixed tabs

* Add Milvus recommendation

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
2021-02-24 11:43:40 +01:00
venuraja79
e930d8a717
Annotation Tool: data is not persisted when using local version #853 (#855) 2021-02-21 15:35:45 +01:00
Tu NGUYEN
ba91a90dd6
Fix download nltk preprocessor (#852) 2021-02-21 10:17:50 +01:00
Malte Pietsch
e641bff7a6
Allow more options for elasticsearch client (auth, multiple hosts) (#845)
* allow more options for elasticsearch client (auth, multiple hosts)

* Add latest docstring and tutorial changes

* fix mypy

* Add latest docstring and tutorial changes

* test client connection via ping()

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-19 14:29:59 +01:00
Divya Yeruva
6c3ec540a4
Add crawler to get texts from websites (#775)
* add fetch_data_from_url to extract data and store as files

* corrected a typo

* corrected variable name error

* correction of urlparse error

* type error

* added selenium, urllib to requirements

* removed urllib

* minor changes and added function to find out inpage navigation links

* quick duplicate links fix

* quick type annotation fix

* created separate module for crawler

* type error fix

* type error fix

* import fix

* quick type error fix

* added return description

* updated include type to list

* refactor modules. Add Crawler class. rename params.

* add basic pipeline compatibility

* update docstrings

* fix mypy issues

* update args, docstrings, return filepaths

* fix mypy

* make urls optional in init

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-18 12:00:49 +01:00
Malte Pietsch
d700592c9a
Update GPU Docker image (Cuda 11, Fix faiss) (#836) 2021-02-17 12:40:00 +01:00
Malte Pietsch
abf2d63c92
Upgrade FAISS to 1.7.0 (#834) 2021-02-17 10:00:33 +01:00
Branden Chan
a6a3b74199
Fix image in README 2021-02-16 17:05:15 +01:00
Andrey A
e0be5639ef
Update README.md 2021-02-16 18:47:14 +03:00
Andrey A
ab89fac76a
Update README.md 2021-02-16 18:45:20 +03:00
Andrey A
5c9f7d493c
Fix link to Quick Demo in ToC. (#831) 2021-02-16 16:38:04 +01:00
Tanay Soni
07907f9eac
Add support for indexing pipelines (#816) 2021-02-16 16:24:28 +01:00
Branden Chan
7030c94325
Revamp Readme (#820)
* Text changes

* Add new images

* First improvements

* Next iteration

* Resize gif

* Add bold

* Update key concepts diagram

* Center image

* Initial import of a more detailed README.md

* Slight changes to ToC, requirements and across the text.

* Grammar and Streamlit UI png.

* Unfix size of gif for mobile

* Remove requirements, add formatting to numbered lists.

* Formatting, remove img size options.

* Another iteration of phrasing the note about open ports.

* Rephrase the note about the docker ports.

Co-authored-by: Andrey A <56412611+aantti@users.noreply.github.com>
2021-02-16 15:32:43 +01:00
Malte Pietsch
47aae14efa relax assert precision of arrays 2021-02-15 14:52:13 +01:00
Malte Pietsch
9b1924a54a
Revert TOP_K_PER_CANDIDATE value to 3 2021-02-15 14:30:04 +01:00
Malte Pietsch
0eaae3c0dd
Fix UI when API returns fewer answers than expected (#828)
* fix ui for few answers from api. add top_k_per_sample env

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-15 14:27:17 +01:00
brandenchan
fe47e3a45e Fix link in documentation 2021-02-15 11:15:54 +01:00
Malte Pietsch
6798192d40
Add API endpoint to export accuracy metrics from user feedback + created_at timestamp (#803)
* WIP feedback metrics

* fix filters and zero division

* add created_at and model_name fields to labels

* add created_at value

* remove debug log level

* fix attribute init

* move timestamp creation down to docstore / db level

* fix import
2021-02-15 10:48:59 +01:00
brandenchan
03cda26d85 Fix link in Tutorial 8 2021-02-15 10:45:27 +01:00
Lalit Pagaria
5bd94ac5f7
Adding Translator (standalone component & wrapper for pipelines) (#782)
* Adding translator with many generic input parameter support

* Making dict_key as generic

* Fixing mypy issue

* Adding pipeline and using opus models

* Add latest docstring and tutorial changes

* Adding test cases for end-to-end translation for generator, summarizer, etc.

* raise error join and merge nodes

* Fix test failure

* add docstrings. add usage documentation. rm skip_special_tokens param

* Add latest docstring and tutorial changes

* fix code snippets in md

* Adding few extra configuration parameters and fixing tests

* Fixing mypy issue and updating usage document

* fix for mypy issue in pipeline.py

* reverting renaming of pytest_collection_modifyitems method

* Addressing review comments

* setting skip_special_tokens to True

* removing model_max_length argument as None type is not supported by many models

* Removing padding parameter. Better to leave it as default, otherwise it causes a tensor size mismatch error. If this option is required by users, it can be added later.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-12 15:58:26 +01:00
oryx1729
4059805d89
Fix ElasticsearchDocumentStore.query_by_embedding() (#823) 2021-02-12 14:57:06 +01:00
Pavel Soriano
8adf5b4737
Allow non-standard Tokenizers (e.g. CamemBERT) for DPR via new arg (#811)
* added parameter to infer DPR tokenizers class

* Add latest docstring and tutorial changes

* Update docstring. fix mypy

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-12 14:17:55 +01:00
oryx1729
c4607cbd98
Revamp CI (#825) 2021-02-12 13:38:54 +01:00
Branden Chan
c807f0d050 Add key concepts diagram 2021-02-12 12:49:22 +01:00
Tanay Soni
8b0031bfc1
Remove conditional import of FAISS for Windows (#819) 2021-02-12 12:15:23 +01:00
Branden Chan
a1983ad84e Add new images 2021-02-11 15:10:00 +01:00
Branden Chan
db0364c728
Fix uvloop version to maintain Python<3.7 support
uvloop released v0.15 which requires Python >=3.7. This commit fixes the version so that Haystack can be directly installed in colab using pip
2021-02-10 19:16:53 +01:00
Tanay Soni
fd5c5dd23c
Introduce incremental updates for embeddings in document stores (#812) 2021-02-09 21:25:01 +01:00
Malte Pietsch
e91518ee00
Update tutorials (torch versions, ES version, replace Finder with Pipeline) (#814)
* remove manual torch install on colab

* update elasticsearch version everywhere to 7.9.2

* fix FAQPipeline

* update tutorials with new pipelines

* Add latest docstring and tutorial changes

* revert faqpipeline change. fix field names in tutorial 4

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 14:56:54 +01:00
Malte Pietsch
ac9f92466f
Allow custom encoding for pdftotext (Russian characters, German umlauts etc). Fix version in download instructions (#813)
* fix encoding of pdftotext. fix version in download instructions

* fix test

* Add latest docstring and tutorial changes

* make latin-1 default encoding again

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 13:42:43 +01:00
Tanay Soni
f95b70df38
Fix file upload API (#808) 2021-02-05 12:17:38 +01:00
Tanay Soni
7b18e324f2
Fix building Pipeline with YAML (#800) 2021-02-04 11:53:51 +01:00
Branden Chan
f3a3b73d9b
Choose correct similarity fns during benchmark runs & re-run benchmarks (#773)
* Adapt to new dataset_from_dicts return signature

* rename fn

* Align similarity fn in benchmark doc store

* Better choice of similarity fn

* Increase postgres wait time

* Add more expected returned variables

* update benchmark results

* Fix typo

* update all benchmark runs

* multiply stats by 100

* Specify similarity fns for website

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-02-03 11:45:18 +01:00
Tanay Soni
8a5dc8f826
Load Pipeline with YAML config file (#785) 2021-02-02 17:32:17 +01:00
Malte Pietsch
1318b55eec
Make tqdm progress bars optional (less verbose prod logs) (#796)
* make dpr queries less verbose

* add progress bar flag to more components

* Add latest docstring and tutorial changes

* add type

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-01 20:51:55 +01:00
Timo Moeller
f3ccd59045
Improve preprocessing and adding of eval data (#780)
* Remove empty document when splitting text

* Move error message of problematic ids to a higher level
2021-02-01 17:08:27 +01:00
Tanay Soni
b87dd244c1
Get metadata values for a key from Elasticsearch (#776) 2021-02-01 16:13:26 +01:00
brandenchan
5665d55ab4 Remove duplicate file 2021-02-01 15:43:53 +01:00
Pavel Soriano
16b8291091
SQuAD to DPR dataset converter (#765)
* Create squad_to_dpr.py

First commit of the squad2dpr script.

* adding review corrections/improvements

* Merge master 5bf351e

* Move script, add docstring

* Add type hints

Co-authored-by: brandenchan <brandenchan@icloud.com>
2021-02-01 15:40:43 +01:00
Tanay Soni
5bf351ea7b
Fix refresh behaviour for Elasticsearch delete (#794) 2021-02-01 14:07:55 +01:00
Tanay Soni
d62355ca88
Fix mypy typing (#792) 2021-02-01 12:15:36 +01:00
Branden Chan
1dc74c7067
Add model versioning support (#784)
* Add model versioning support

* Add latest docstring and tutorial changes

* Support DPR versioning

* Add RAG versioning support

* Add latest docstring and tutorial changes

* Add summarizer support

* Add Embedding Retriever support

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-01 11:42:36 +01:00
Malte Pietsch
2b05e801c3
Fix pdftotext dependency in CI (#788)
* Fix pdftotext dependency in CI

* update xpdf version

* Fix version
2021-01-29 16:07:37 +01:00
Lalit Pagaria
9f7f95221f
Milvus integration (#771)
* Initial commit for Milvus integration

* Add latest docstring and tutorial changes

* Updating implementation of Milvus document store

* Add latest docstring and tutorial changes

* Adding tests and updating doc string

* Add latest docstring and tutorial changes

* Fixing issue caught by tests

* Addressing review comments

* Fixing mypy detected issue

* Fixing issue caught in test about sorting of vector ids

* fixing test

* Fixing generator test failure

* update docstrings

* Addressing review comments about multiple network call while fetching embedding from milvus server

* Add latest docstring and tutorial changes

* Ignoring mypy issue while converting vector_id to int

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-29 13:29:12 +01:00
brandenchan
6efa4f06c1 Add Streamlit UI Image 2021-01-27 17:01:29 +01:00
Timo Moeller
f94bd96ddf
Remove RAG todos after transformers update (#781) 2021-01-27 16:50:02 +01:00
Tanay Soni
d9f011da9a
Add flag for use of window queries in SQLDocumentStore (#768) 2021-01-25 12:54:34 +01:00
Tanay Soni
46307d1571
Remove quotes around placeholders in Elasticsearch custom query (#762) 2021-01-25 12:46:43 +01:00
Tanay Soni
f0aa879a1c
Fix delete_all_documents for the SQLDocumentStore (#761) 2021-01-22 14:39:24 +01:00
Markus Paff
aee90c5df9
Docs v0.7.0 (#757)
* new docs version

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-22 10:28:33 +01:00