3803 Commits

Author SHA1 Message Date
Zenahr Barzani
955e6f7b3a
Add explicit encoding mode to file_converter/txt.py (#478)
* add explicit encoding mode

* parameterize encoding

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-12 17:32:04 +02:00
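The commit above (#478) concerns passing an explicit, parameterized encoding when reading text files. The following is a minimal sketch of that idea only, not the actual code in file_converter/txt.py: the helper name `read_text_file` and the `utf-8` default are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Hypothetical helper: read a text file with an explicit encoding.

    Without the encoding argument, open() falls back to a
    platform-dependent default, so results can differ between machines.
    """
    with open(path, encoding=encoding) as f:
        return f.read()


# Usage: write a Latin-1 encoded file and read it back with the matching encoding.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("café".encode("latin-1"))
    text = read_text_file(sample, encoding="latin-1")
```

Exposing the encoding as a parameter lets callers handle non-UTF-8 input without changing the helper itself.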
Branden Chan
1cebcb7dda
Create time and performance benchmarks for all readers and retrievers (#339)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as it is not a valid JSON token

* fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches

* update benchmarks for hnsw 128,20,80

* don't delete full index in delete_all_documents()

* update texts for charts

* update recall column for retriever

* change scale and add units to desc

* add units to legend

* add axis titles. update desc

* add html tags

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-10-12 13:34:42 +02:00
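One item in the commit above (#339) notes that NaN is not a valid JSON token. The log does not say how the benchmark results were cleaned, so the sketch below shows just one possible approach using the Python standard library. The function name `nan_to_none` and the sample `results` dictionary are hypothetical.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so it serializes as JSON null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(v) for v in value]
    return value


# Made-up sample data, not real benchmark results.
results = {"model": "example", "recall": float("nan"), "scores": [0.5, float("nan")]}

# allow_nan=False makes json.dumps raise ValueError if any NaN slipped through.
payload = json.dumps(nan_to_none(results), allow_nan=False)
```

By default `json.dumps` emits the bare token `NaN`, which strict JSON parsers reject, so converting to `null` before serializing keeps the output portable.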
Malte Pietsch
8edeb844f7
Remove phi normalization from FAISS, support more index types, 3x speedup (#467)
* remove phi normalization

* add special case for hnsw

* rename vector_size to vector_dim

* fix loading. fix extra dim in tests

* switch to new ES syntax for vector similarity

* 3x sql speed up. cascade deletes. add train_index()

* add docstrings. remove vector_dim from load()

* delete docs from faiss and sql

* fix delete of docs in test

* relax type hint for faiss index

* rename metric to metric_type

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-10-06 16:09:56 +02:00
Markus Paff
56852f820b
README for Docstring Generation and remove unneeded files (#468) 2020-10-06 15:16:56 +02:00
Markus Paff
25f34babce
Separate data and view for benchmarks (#451)
* separate data and view for benchmarks

* fixed typo
2020-10-06 10:30:19 +02:00
Lalit Pagaria
465ccbc12e
Allow multiple write calls to existing FAISS index. (#422)
- Fix an issue where update_embeddings always created a new FAISS index instead of clearing the existing one. Creating a new index may not free the memory used by the existing one and can cause a memory leak.

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-10-05 12:01:20 +02:00
Futurne
072e32b38a
Fix filters in query_embedding for ElasticsearchDocumentStore (#464)
Co-authored-by: Pierre Pereira <pierre.pereira@lexistems.com>
2020-10-05 11:25:07 +02:00
Tanay Soni
669c72d538
Enable bulk operations on vector IDs for FAISSDocumentStore (#460) 2020-10-02 14:43:25 +02:00
Malte Pietsch
029d1b75f2
Update docstring in DPR for embed_title (#459) 2020-10-02 13:41:33 +02:00
Lalit Pagaria
9b58374b7c
Skip file conversion if file type is not supported (#456)
* Skip file converter if file type is not supported. Refer to https://github.com/deepset-ai/haystack/issues/453

* Fixing issue reported by mypy

* Addressing review comments
2020-10-01 14:47:45 +02:00
Malte Pietsch
a92ca04648
Update GPU docker & fix race condition with multiple workers (#436)
* fix gpu CMD and set tag to latest

* update dockerfiles. resolve race condition of index creation with multiple workers

* update dockerfiles for preload. remove try catch for elastic index creation

* add back try/catch. disable multiproc in default config to comply with --preload of gunicorn

* change to pip3 for GPU dockerfile

* remove --preload for gpu
2020-09-29 21:12:44 +02:00
Markus Paff
5d1e208186
Create deploy_website.yml (#450)
Creates a dispatch event on push to master so that we can trigger a build in haystack-website. The website should always have the latest docs version
2020-09-29 19:49:04 +02:00
Tanay Soni
52000ff678
Add Docker setup for the annotation tool (#444) 2020-09-29 14:09:45 +02:00
Tanay Soni
93fd4aa72f
Update ONNX conversion for FARMReader (#438) 2020-09-28 16:10:32 +02:00
Malte Pietsch
bb4802ae6a
Make sentence-transformers usage more user-friendly (#439)
Co-authored-by: guillim <guigloo@msn.com>
2020-09-28 15:34:23 +02:00
Malte Pietsch
cd19d65f1a
Update README.rst 2020-09-27 12:31:11 +02:00
Malte Pietsch
8a21843167
Update README.rst 2020-09-27 12:30:25 +02:00
Malte Pietsch
dfe244e287
Fix typos in roadmap (#434) 2020-09-25 11:28:46 +02:00
Malte Pietsch
0a123707e4
Fix typos in roadmap (#433) 2020-09-25 07:38:48 +02:00
Malte Pietsch
15c0064498
add roadmap section to docs (#432) 2020-09-24 23:43:40 +02:00
Markus Paff
6b35e38e12
Fixed tabs for haystack-website issue (#427) 2020-09-24 10:36:18 +02:00
Markus Paff
66a1893f79
Moved files to api directory (#418) 2020-09-22 11:48:26 +02:00
Guillim
29cbd1e4a1
Add embedding_field to existing index in ES (#415) 2020-09-22 10:25:58 +02:00
Guillim
fb5db59590
Remove useless line from Tutorial4_FAQ_style_QA (#416)
* Update Tutorial4_FAQ_style_QA.py

Used to be useful when `.apply()` was necessary, but not any longer

* Update Tutorial4_FAQ_style_QA.ipynb
2020-09-22 09:01:04 +02:00
Markus Paff
8e044dc16f
Fix typo in documentation (#406)
Co-authored-by: Antonio Lanza <antoniolanza1996@gmail.com>
2020-09-21 13:31:00 +02:00
Malte Pietsch
c5f1f9aa87
Update README.rst v0.4.0 2020-09-21 10:31:25 +02:00
Malte Pietsch
747e0c0046
Bump FARM to 0.4.9. Remove custom torch installation from colab tutorials (#404) 2020-09-21 10:26:12 +02:00
Malte Pietsch
271ff30262
fix type casting of embeddings for tutorial 4 (#402) 2020-09-18 18:10:50 +02:00
Malte Pietsch
0c5750fae0 Bump version to 0.4.0 2020-09-18 17:12:29 +02:00
Malte Pietsch
db6864d159
Fix type casting for vectors in FAISS (#399)
* Fix type casting for vectors in FAISS

Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>

* add type casts for elastic. refactor embedding retriever tests

* fix case: empty embedding field

* fix faiss tolerance

* add assert in test_faiss_retrieving

Co-authored-by: philipp-bode <philipp.bode@student.hpi.de>
2020-09-18 17:08:13 +02:00
Branden Chan
4ea4cfd282
Merge pull request #400 from deepset-ai/fix_imgs
Fix images in readme
2020-09-18 15:01:20 +02:00
brandenchan
f4a1682570 Fix images 2020-09-18 14:58:03 +02:00
Malte Pietsch
d69133966d Fix faiss test tolerance 2020-09-18 13:57:29 +02:00
Branden Chan
7fdb85d63a
Create documentation website (#272)
* Skeleton of doc website

* Flesh out documentation pages

* Split concepts into their own rst files

* add tutorial rsts

* Consistent level 1 markdown headers in tutorials

* Change theme to readthedocs

* Turn bullet points into prose

* Populate sections

* Add more text

* Add more sphinx files

* Add more retriever documentation

* combined all documentation in one structure

* rename of src to _src as it was ignored by git

* Incorporate MP2's changes

* add benchmark bar charts

* Adapt docstrings in Readers

* Improvements to intro, creation of glossary

* Adapt docstrings in Retrievers

* Adapt docstrings in Finder

* Adapt Docstrings of Finder

* Updates to text

* Edit text

* update doc strings

* proof read tutorials

* Edit text

* Edit text

* Add stacked chart

* populate graph with data

* Switch Documentation to markdown (#386)

* add way to generate markdown files to sphinx

* changed from rst to markdown and extended sphinx for it

* fix spelling

* Clean titles

* delete file

* change spelling

* add sections to document store usage

* add basic rest api docs

* fix readme in setup.py

* Update Tutorials

* Change section names

* add windows note to pip install

* update intro

* new renderer for markdown files

* Fix typos

* delete dpr_utils.py

* fix windows note in get started

* Fix docstrings

* deleted rest api docs in api

* fixed typo

* Fix docstring

* revert readme to rst

* Fix readme

* Update setup.py

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
Co-authored-by: Bogdan Kostić <bogdankostic@web.de>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
Malte Pietsch
4c503158a7
Fix duplicate vector ids in FAISS (#395)
* fix duplicate vector ids in faiss

* Add test

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>

* revert score change

* switch to faiss_index.ntotal for ids. add tests

Co-authored-by: lalitpagaria <19303690+lalitpagaria@users.noreply.github.com>
2020-09-18 12:52:22 +02:00
Tanay Soni
0859da8f74
Fix document filtering in SQLDocumentStore (#396) 2020-09-18 12:22:52 +02:00
Tanay Soni
3399fc784d
Refactor file converter interface (#393) 2020-09-18 10:42:13 +02:00
Malte Pietsch
4e46d9d176 remove dpr_utils.py 2020-09-17 17:17:19 +02:00
Tanay Soni
06243dbda4
Move retriever probability calculations to document_store (#389) 2020-09-17 16:25:46 +02:00
Tanay Soni
03fa4a8740
Exclude embedding fields from the REST API (#390) 2020-09-17 14:37:01 +02:00
Malte Pietsch
3782646948
Add logo to readme (#384)
* add logo image

* add logo to readme

* change img path to master

* Update README.rst
2020-09-16 18:36:22 +02:00
Malte Pietsch
9727829cc6
Rename and restructure modules (database, indexing, schemas) (#379)
* rename database to documentstore

* move document, label, multilabel to haystack/schema.py

* rename documentstore -> document_store

* split indexing modules -> file_converter + preprocessor

* fix order of imports

* Update tutorial notebooks

* fix torch version in tutorial 4
2020-09-16 18:33:23 +02:00
Malte Pietsch
bde33ddaaa
Bump FARM version to 0.4.8 and PyTorch >=1.5.1, <= 1.6.0 (#376)
* bump farm version to 0.4.8

* move back to original transformers pipeline

* remove dpr_utils and use transformers implementation

* update tutorial notebooks
2020-09-16 17:24:40 +02:00
Lalit P
de5ad42e46
Adjust tests for MacOS (#374) 2020-09-15 15:04:46 +02:00
Tanay Soni
c0c2865e58
Add FAISS query scores (#368) 2020-09-11 13:59:38 +02:00
Tanay Soni
9d93ffbe54
Add Gunicorn timeout (#364) 2020-09-10 09:20:39 +02:00
maxupp
06e8be30ea
Add index arg to Finder.get_answers() and _via_similar_questions() (#362)
Co-authored-by: Max Uppenkamp <max.uppenkamp@inform-software.com>
2020-09-09 12:39:13 +02:00
Malte Pietsch
b1cdc68d6c
Update README.rst 2020-09-09 11:47:17 +02:00
Malte Pietsch
d821e8d260
Bump FARM version to 0.4.7 (#340) 2020-09-04 17:29:14 +02:00
Tanay Soni
26e4e7ad7a
Use port 8000 in docs (#357) 2020-09-04 09:54:24 +02:00