542 Commits

Author SHA1 Message Date
Markus Paff
39845c0624
Automate updates docstrings tutorials (#1461)
* remove not needed githab actions and reactivate docstrings and tutorial generation

* test workflow

* update pydoc version

* update python version

* update watchdog

* move to latest version pydoc-markdown

* remove version check

* Add latest docstring and tutorial changes

* remove test workflow

* test for param docstrings

* pin pydoc-markdown version

* add test workflow

* pin watchdog version

* Add latest docstring and tutorial changes

* update original workflow and delete test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-17 13:44:31 +02:00
Bob van Luijt
c0cc8bc80f
Bump Weaviate version to 1.7.0 (#1412)
* Bump Weaviate

* Bump Weaviate

* Bump Weaviate client

* Bump Weaviate

* Revert client version

There is a change in the client API that needs to be addressed before bumping its version
2021-09-05 09:28:55 +02:00
Malte Pietsch
f3e7074c13
Remove stale bot 2021-09-03 17:39:24 +02:00
Malte Pietsch
6093bf9ff6
Fix Github action 2021-09-01 16:50:29 +02:00
Shahrukh Khan
4822536886
Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR (#1349)
* add image.py converter

* add PDFtoImageConverter

* add init to PDFtoImageConverter and classes to __init__

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* update imagetotext pipeline

* revert change in base.py in file_conv

* Update base.py

* Update pdf.py

* add ocr file_converter testcase & update dockerfile

* fix tesseract exception message typo

* fix _image_to_text doctstring

* add tesseract installation to CI

* add tesseract installation to CI

* add content test for PDF OCR converter

* update PDFToTextOCRConverter constructor doctsring

* replace image files with tmp paths for image.py convert

* replace image files with tmp paths for image.py convert

* Update README.md

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-01 16:42:25 +02:00
Markus Paff
be8d305190
Editing docs read.me for new docs website workflow (#1372)
* editing docs read.me for new docs website workflow

* added new links to docs
2021-08-30 14:59:40 +02:00
oryx1729
bafa1b46de
Add Ray integration for Pipelines (#1255) 2021-08-02 14:51:24 +02:00
Bob van Luijt
8dae844447
Bump Weaviate version to 1.5 (#1287)
* bump Weaviate version to 1.5

* bump Weaviate version to 1.5
2021-07-15 08:26:22 +02:00
Malte Pietsch
5e23e72f31 Update issue templates 2021-06-30 12:12:07 +02:00
venuraja79
49886f88f0
Integrate Weaviate as another DocumentStore (#1064)
* Annotation Tool: data is not persisted when using local version #853

* First version of weaviate

* First version of weaviate

* First version of weaviate

* Updated comments

* Updated comments

* ran query, get and write tests

* update embeddings, dynamic schema and filters implemented

* Initial set of tests and fixes

* Tests added for update_embeddings and delete documents

* introduced duplicate documents fix

* fixed mypy errors

* Added Weaviate to requirements

* Fix the weaviate docker env variables

* Fixing test dependencies for now

* Created weaviate test marker and fixed query

* Update docstring

* Add documentation

* Bump up weaviate version

* Bump up weaviate version in documentation

* Bump up weaviate version in documentation

* Updgrade weaviate version

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-06-10 09:43:53 +02:00
Lalit Pagaria
db17d73a82
Fixing issues caused due to mypy upgrade (#1165) 2021-06-09 16:24:39 +02:00
Malte Pietsch
022f8586f6
Remove Python 3.6 support (#1059)
* Remove Python 3.6 support

* change cache key for CI
2021-06-01 15:24:44 +02:00
Malte Pietsch
25d1122773
Upgrade milvus to 1.1.0 (#1066)
* upgrade milvus in CI to 1.1

* fix pymilvus

* loose pymilvus requirement again

* add date to cache keys

* fix date var in action
2021-05-17 17:27:34 +02:00
Malte Pietsch
b1e8ebf81a
Create pull_request_template.md 2021-04-22 15:48:39 +02:00
oryx1729
8c1e411380
Fix update_embeddings() for FAISSDocumentStore (#978) 2021-04-21 09:56:35 +02:00
Julian Risch
d38c07e0ee
knowledge graph example (#934)
* Add knowledge graph module

* Fix type hint

* Add graph retriver module

* Change type annotations, change return format

* Add graph retriever that executes questions as sparql queries

* Linking only those entities that are in the knowledge graph

* Added logging and using relations extracted from Knowledge graph for linking

* Preventing entity linking from linking the same token to multiple entities

* Pruning triples that have no variables for select and count queries

* Support knowledge graphs with Pipelines

* Add text2sparql

* Entity linking and relation linking consider more special cases now based on evaluation on labelled data

* Separating example code from KGQA implementation

* Add eval on combined extarctive and kg questions

* Remove references to hp-test

* Add fields sparql_query and long_answer_list to metadata

* Removing modular Question2SPARQL approach

* Removing additional classes used for modular kgqa approach

* preparing lcquad data

* change graph db

* Translating namespaces in knowledge graph queries

* Creating graphdb index and loading triples from .ttl file

* Fetching graph config files, triples and model from S3

* Fix incompatibility issues with BaseGraphRetriever and BaseComponent

* Removing unused utility functions

* Adding doc strings and tutorial header

* Adding sparqlwrapper dependency

* Moving tutorial header

* Sorting tutorials by number within name of notebook

* Add latest docstring and tutorial changes

* Creating test cases for knowledge graph

* Changing knowledge graph example to harry potter

* Add latest docstring and tutorial changes

* Adapting the tutorial notebook to harry potter example

* Add GraphDB fixture for tests

* Add latest docstring and tutorial changes

* Added GraphDB docker launch to CI

* Use correct GraphDB fixture

* Check if GraphDB instance is already running

* Renaming question/query and incorporating other feedback from Timo and Tanay

* Removed type annotation

* Add latest docstring and tutorial changes

Co-authored-by: oryx1729 <oryx1729@protonmail.com>
Co-authored-by: Timo Moeller <timo.moeller@deepset.ai>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-04-08 14:05:33 +02:00
Lalit Pagaria
e904deefa7
Add Markdown file convertor (#875) 2021-03-23 16:31:26 +01:00
oryx1729
c4607cbd98
Revamp CI (#825) 2021-02-12 13:38:54 +01:00
Malte Pietsch
2b05e801c3
Fix pdftotext dependency in CI (#788)
* Fix pdftotext dependency in CI

* udpate xpdf version

* Fix version
2021-01-29 16:07:37 +01:00
Lalit Pagaria
9f7f95221f
Milvus integration (#771)
* Initial commit for Milvus integration

* Add latest docstring and tutorial changes

* Updating implementation of Milvus document store

* Add latest docstring and tutorial changes

* Adding tests and updating doc string

* Add latest docstring and tutorial changes

* Fixing issue caught by tests

* Addressing review comments

* Fixing mypy detected issue

* Fixing issue caught in test about sorting of vector ids

* fixing test

* Fixing generator test failure

* update docstrings

* Addressing review comments about multiple network call while fetching embedding from milvus server

* Add latest docstring and tutorial changes

* Ignoring mypy issue while converting vector_id to int

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-29 13:29:12 +01:00
Markus Paff
0b583b8972
Generate docstrings and deploy to branches to Staging (Website) (#731)
* test pre commit hook

* test status

* test on this branch

* push generated docstrings and tutorials to branch

* fixed syntax error

* Add latest docstring and tutorial changes

* add files before commit

* catch commit error

* separate generation from deployment

* add deployment process for staging

* add current branch to payload

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-01-21 11:01:09 +01:00
Markus Paff
3af3ee1a12
Automate docstring and tutorial generation with every push to master (#718)
* automate docstring and tutorial generation with every push to master

* test CI for current branch

* fixed yaml syntax

* add setupttools to install process

* checkout repo

* fixed command for shell script

* install wheel as it is needed for CI

* install mkdocs

* test without shell script

* use package from github actions

* test other configuration

* back to right config

* cleaning script
2021-01-11 16:25:43 +01:00
Lalit Pagaria
75d0ebd076
Add Summarizer (standalone + node in custom pipelines + SearchSummarizationPipeline) (#698)
* Integration of SummarizationQAPipeline with Haystack.

* Moving summarizer tests because of OOM issue

* Fixing typo

* Splitting summarizer test in separate ci step

* Removing sysctl configuration as we already running elastic search in docker container

* fixing mypy issue

* update parameter names and docstrings

* update parameter names in BaseSummarizer

* rename pipeline

* change return type of summarizer from answer to document

* change scope of doc store fixture

* revert scope

* temp. disable test_faiss_index_save_and_load()

* fix mypy. change order for mypy in CI

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-01-08 14:29:46 +01:00
Malte Pietsch
5db73d4107
Update stale bot 2021-01-05 08:29:24 +01:00
Tanay Soni
369e237fd4
Add DocumentStore for Open Distro Elasticsearch (#676) 2020-12-15 09:28:40 +01:00
Tanay Soni
33fe597949
Cleanup Pytest Fixtures (#639) 2020-12-14 18:15:44 +01:00
Tanay Soni
8e52b48e1d
Add pipelines for GenerativeQA & FAQs (#645) 2020-12-03 10:27:06 +01:00
Ky-Anh Huynh
0edd127f35
Add formatting checks for shell scripts (#627) 2020-11-26 14:36:35 +01:00
Malte Pietsch
0acafc403a
Automate benchmarks via CML (#518)
* initial test cml

* Update cml.yaml

* WIP test workflow

* switch to general ubuntu ami

* switch to general ubuntu ami

* disable gpu for tests

* rm gpu infos

* rm gpu infos

* update token env

* switch github token

* add postgres

* test db connection

* fix typo

* remove tty

* add sleep for db

* debug runner

* debug removal postgres

* debug: reset to working commit

* debug: change github token

* switch to new bot token

* debug token

* add back postgres

* adjust network runner docker

* add elastic

* fix typo

* adjust working dir

* fix benchmark execution

* enable s3 downloads

* add query benchmark. fix path

* add saving of markdown files

* cat md files. add faiss+dpr. increase n_queries

* switch to GPU instance

* switch availability zone

* switch to public aws DL ami

* increase volume size

* rm faiss. fix error logging

* save markdown files

* add reader benchmarks

* add download of squad data

* correct reader metric normalization

* fix newlines between reports

* fix max_docs for reader eval data. remove max_docs from ci run config

* fix mypy. switch workflow trigger

* try trigger for label

* try trigger for label

* change trigger syntax

* debug machine shutdown with test workflow

* add es and postgres to test workflow

* Revert "add es and postgres to test workflow"

This reverts commit 6f038d3d7f12eea924b54529e61b192858eaa9d5.

* Revert "debug machine shutdown with test workflow"

This reverts commit db70eabae8850b88e1d61fd79b04d4f49d54990a.

* fix typo in action. set benchmark config back to original
2020-11-18 18:28:17 +01:00
Lalit Pagaria
f13443054a
[RAG] Integrate "Retrieval-Augmented Generation" with Haystack (#484)
* Adding dummy generator implementation

* Adding tutorial to try the model

* Committing current non working code

* Committing current update where we need to call generate function directly and need to convert embedding to tensor way

* Addressing review comments.

* Refactoring finder, and implementing rag_generator class.

* Refined the implementation of RAGGenerator and now it is in clean shape

* Renaming RAGGenerator to RAGenerator

* Reverting change from finder.py and addressing review comments

* Remove support for RagSequenceForGeneration

* Utilizing embed_passage function from DensePassageRetriever

* Adding sample test data to verify generator output

* Updating testing script

* Updating testing script

* Fixing bug related to top_k

* Updating latest farm dependency

* Comment out farm dependency

* Reverting changes from TransformersReader

* Adding transformers dataset to compare transformers and haystack generator implementation

* Using generator_encoder instead of question_encoder to generate context_input_ids

* Adding workaround to install FARM dependency from master branch

* Removing unnecessary changes

* Fixing generator test

* Removing transformers datasets

* Fixing generator test

* Some cleanup and updating TODO comments

* Adding tutorial notebook

* Updating tutorials with comments

* Explicitly passing token model in RAG test

* Addressing review comments

* Fixing notebook

* Refactoring tests to reduce memory footprint

* Split generator tests in separate ci step and before running it reclaim memory by terminating containers

* Moving tika dependent test to separate dir

* Remove unwanted code

* Brining reader under session scope

* Farm is now session object hence restoring changes from default value

* Updating assert for pdf converter

* Dummy commit to trigger CI flow

* REducing memory footprint required for generator tests

* Fixing mypy issues

* Marking test with tika and elasticsearch markers. Reverting changes in CI and pytest splits

* reducing changes

* Fixing CI

* changing elastic search ci

* Fixing test error

* Disabling return of embedding

* Marking generator test as well

* Refactoring tutorials

* Increasing ES memory to 750M

* Trying another fix for ES CI

* Reverting CI changes

* Splitting tests in CI

* Generator and non-generator markers split

* Adding pytest.ini to add markers and enable strict-markers option

* Reducing elastic search container memory

* Simplifying generator test by using documents with embedding directly

* Bump up farm to 0.5.0
2020-10-30 18:06:02 +01:00
Tanay Soni
db4151bbc0
Fix scoring in Elasticsearch for dot product (#517) 2020-10-23 17:50:49 +02:00
Branden Chan
1cebcb7dda
Create time and performance benchmarks for all readers and retrievers (#339)
* add time and perf benchmark for es

* Add retriever benchmarking

* Add Reader benchmarking

* add nq to squad conversion

* add conversion stats

* clean benchmarks

* Add link to dataset

* Update imports

* add first support for neg psgs

* Refactor test

* set max_seq_len

* cleanup benchmark

* begin retriever speed benchmarking

* Add support for retriever query index benchmarking

* improve reader eval, retriever speed benchmarking

* improve retriever speed benchmarking

* Add retriever accuracy benchmark

* Add neg doc shuffling

* Add top_n

* 3x speedup of SQL. add postgres docker run. make shuffle neg a param. add more logging

* Add models to sweep

* add option for faiss index type

* remove unneeded line

* change faiss to faiss_flat

* begin automatic benchmark script

* remove existing postgres docker for benchmarking

* Add data processing scripts

* Remove shuffle in script bc data already shuffled

* switch hnsw setup from 256 to 128

* change es similarity to dot product by default

* Error includes stack trace

* Change ES default timeout

* remove delete_docs() from timing for indexing

* Add support for website export

* update website on push to benchmarks

* add complete benchmarks results

* new json format

* removed NaN as is not a valid json token

* fix benchmarking for faiss hnsw queries. do sql calls in update_embeddings() as batches

* update benchmarks for hnsw 128,20,80

* don't delete full index in delete_all_documents()

* update texts for charts

* update recall column for retriever

* change scale and add units to desc

* add units to legend

* add axis titles. update desc

* add html tags

Co-authored-by: deepset <deepset@Crenolape.localdomain>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2020-10-12 13:34:42 +02:00
Markus Paff
5d1e208186
Create deploy_website.yml (#450)
Creates a dispatch event on push to master so that we can trigger a build in haystack-website. The website should always have the latest docs version
2020-09-29 19:49:04 +02:00
Dany
f0222ecd27 Add Tika in CI 2020-08-17 11:35:33 +02:00
Tanay Soni
1637ce1184 Revert "Add Tika Converter (#314)"
This reverts commit 5ef59b1901da6d51bfa085683321a243228d4fc9.
2020-08-17 11:13:52 +02:00
Tanay Soni
5ef59b1901
Add Tika Converter (#314) 2020-08-14 14:13:59 +02:00
Malte Pietsch
5023fde2be Update issue templates 2020-07-13 10:45:58 +02:00
Tanay Soni
98f1a3f9a7
Add type hints and mypy checks (#138) 2020-06-10 17:22:37 +02:00
Tanay Soni
180dc8cbd6
Start Elasticsearch with a Github Action (#142) 2020-06-09 12:46:15 +02:00
Tanay Soni
160345f3d5 Update build workflow 2020-06-09 11:45:25 +02:00
Tanay Soni
c4592c1b9a
Create ci.yml 2020-06-09 11:36:27 +02:00
Malte Pietsch
0247e362f7
add stalebot (#131) 2020-06-05 17:55:06 +02:00