Silvano Cerza
fd16ec63cb
refactor: Add support for new filters declaration ( #6397 )
...
* Rework filter logic for InMemoryDocumentStore to support new filters declaration
* Fix legacy filters tests
* Simplify logic and handle dates comparison
* Rework MetadataRouter to support new filters
* Update docstrings
* Add release notes
* Fix linting
* Avoid duplicating filters specifications
* Handle corner case
* Simplify docstring
* Fix filters logic and tests
* Fix Document Store testing legacy filters tests
2023-11-24 11:22:46 +01:00
SebastjanPrachovskij
28c2b09d90
Add SearchApi integration for websearch ( #6400 )
2023-11-24 11:18:43 +01:00
pandasar13
edb40b6c1b
refactor: add batch_size to FAISS __init__ ( #6401 )
...
* refactor: add batch_size to FAISS __init__
* refactor: add batch_size to FAISS __init__
* add release note to refactor: add batch_size to FAISS __init__
* fix release note
* add batch_size to docstrings
---------
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-11-23 17:27:24 +01:00
ZanSara
4ec6a60a76
feat: CohereGenerator ( #6395 )
...
* added CohereGenerator with unit tests
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* 1. added releasenote
2. removed commented files in test-cohere_generators
3. removed unused imports
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* 1. move client creation to __init__
2. remove dict casting of metadata in run
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* few fixes
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* add cohere to git workflows
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* 1. CohereGenerator as top level import in generators
2. small change in doc string
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* 1. corrected git workflow files for cohere import
2. changed api key env var from CO_API_KEY to COHERE_API_KEY
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* added cohere in missed out workflow installs
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* 1. Removed default_streaming_callback from cohere.py and added in test.
2. Added kwargs doc strings for CohereGenerator
3. removed type hints for metadata and replies
4. use COHERE_API_URL instead of hard coded URL.
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
* Update haystack/preview/components/generators/cohere/cohere.py
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update haystack/preview/components/generators/cohere/cohere.py
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update haystack/preview/components/generators/cohere/cohere.py
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update haystack/preview/components/generators/cohere/cohere.py
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update haystack/preview/components/generators/cohere/cohere.py
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* move out of folder
* black
* fix tests
* feedback
* black
* remove api key from tests
* read api key from env var if missing
* typo
* black
* missing import
---------
Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2023-11-23 17:21:07 +01:00
jlonge4
c44e2cf49b
feat: add microsoft pptx file converter ( #6399 )
...
* Create pptx.py
* feat: pptx converter import __init__.py
* feat: add pptx import __init__.py
* feat: add python-pptx dependency
* feat: add sample pptx for testing
* feat: add pptx file-converter test
* feat: release note pptx-file-converter-3e494d2747637eb2.yaml
* feat: Update releasenotes/notes/pptx-file-converter-3e494d2747637eb2.yaml
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* feat: refactor haystack/nodes/file_converter/pptx.py
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* fix imports
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-23 16:46:41 +01:00
Stefano Fiorucci
b0b514778d
fix!: make PyPDFToDocument JSON-serializable ( #6396 )
...
* add registry
* release note
* add checks
* rm superfluous check
* fix typo
* rm print :-)
2023-11-23 15:37:20 +01:00
Ben Heckmann
a492771b4d
feat: PreProcessor split by token (tiktoken & Hugging Face) ( #5276 )
...
* #4983 implemented split by token for tiktoken tokenizer
* #4983 added unit test for tiktoken splitting
* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer
* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test
* mocked HTTP model loading in unit tests, fixed pylint error
* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken
* reno
* rename reno file
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-23 12:26:37 +01:00
Vladimir Blagojevic
e04a1f16bb
feat: Add DynamicPromptBuilder to Haystack 2.x ( #6328 )
...
* Add DynamicPromptBuilder
* Improve pydocs, add unit tests
* Add release note
* Make expected_runtime_variables optional
* Add pydocs usage example
* Add more pydocs
* Remove test markers
* Update type in unit test
* Update after canals upgrade
* add to api ref
* docstrings updates
* Update test/preview/components/builders/test_dynamic_prompt_builder.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Update haystack/preview/components/builders/dynamic_prompt_builder.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Deparametrize init test
* Rename expected_runtime_variables to runtime_variables
* Rephrase docstring so meaning is clearer
---------
Co-authored-by: Darja Fokina <daria.f93@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-23 11:41:57 +01:00
Vladimir Blagojevic
b557f3035e
feat: Add ConditionalRouter Haystack 2.x component ( #6147 )
...
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-23 10:28:08 +01:00
Stefano Fiorucci
e91f7a8a4d
refactor!: improve the public interface of Generators ( #6374 )
...
* merge lazy import blocks
* refactor generators
* release note
* revert unrelated changes
2023-11-22 10:40:48 +01:00
ZanSara
b751978d65
Extends input types of RemoteWhisperTranscriber ( #6218 )
...
* fix tests
* reno
* tests
* retain file name
* paths are strings for openai sdk
* streams->sources
* feedback
* always add name to file
* mypy
* test placeholder with extension
* fallback
* paths
* path test
* path must be a string
* fix test
2023-11-22 09:57:45 +01:00
Ashwin Mathur
e6c8374562
feat: Add ByteStream metadata and other metadata to Documents created by HTMLToDocument ( #6304 )
...
* Refactor HTMLToDocument
* Add release notes
* Add additional tests
* remove progress bar
* Add additional test for metadata
* remove progress bar from release notes
* Update tests
* Use truthiness checks instead of is not None
2023-11-21 21:44:02 +01:00
Daniel Fleischer
0cef17ac13
feat: embedding instructions for dense retrieval ( #6372 )
...
* Embedding instructions in EmbeddingRetriever
Query and documents embeddings are prefixed with instructions, useful
for retrievers finetuned on specific tasks, such as Q&A.
* Tests
Checking vectors 0th component vs. reference, using different stores.
* Normalizing vectors
* Release notes
2023-11-21 12:56:40 +01:00
Silvano Cerza
83c245db74
feat: Implement function to convert legacy filters to new style ( #6314 )
...
* Implement function to convert legacy filters to new style
* Reduce return statements in conversion to fix linting
* Move convert function in different module
* Fix typos in docstrings
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
---------
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 13:00:05 +01:00
ZanSara
9cee2f82c4
feat: extend write_documents to return the number of documents actually written in the document store ( #6006 )
...
* add typing and docstring
* reno
* Update releasenotes/notes/extend-write-documents-855ffc315974f03b.yaml
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
---------
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-11-20 11:54:02 +01:00
Julian Risch
4ef2a680bb
feat: Add DocumentJoiner component 2.0 ( #6105 )
...
* draft DocumentJoiner
* implement merge and rrf
* draft end-to-end test with DocumentJoiner in hybrid doc search pipeline
* adjust for variadics Canals PR #122
* fix text_embedder input
* adapt to the new Document class
* adapt to new doc id
* specify documents input as Variadic in run method
* compare doc ids instead of full docs
* rename text_file_converter input to sources
* update docstring
* Update haystack/preview/components/routers/document_joiner.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from docstring review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* capitalize Documents and Retrievers in docstrings
* fix log message in test
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 10:56:56 +01:00
ZanSara
e905066458
feat: make InMemoryDocumentStore return the number of docs actually written ( #6274 )
...
* make InMemoryDocumentStore return the number of documents actually written
* add fixme
* reno
* add missing continue
2023-11-20 10:03:22 +01:00
x110
d03bffab8b
Promptnode timeout ( #6282 )
2023-11-19 16:32:09 +01:00
Stefano Fiorucci
68be0d7f2c
refactor: improve Document representation ( #6333 )
...
* new repr
* reno
2023-11-17 17:49:00 +01:00
ZanSara
e888852aec
Standardize TextFileToDocument ( #6232 )
...
* simplify textfiletodocument
* fix error handling and tests
* stray print
* reno
* streams->sources
* reno
* feedback
* test
* fix tests
2023-11-17 15:39:39 +01:00
ZanSara
dfc1d452bb
feat: upgrade canals to 0.10.1 ( #6309 )
...
* upgrade canals
* reno
* trigger preview e2e
* bump canals
* fix decorator
* fix test
* test factory
* tests inmemory
* tests writer
* test audio
* tests builders
* tests caching
* tests embedders
* tests converters
* tests generators
* tests rankers
* tests retrievers
* fix pipeline and telemetry tests
* remove trigger
2023-11-17 14:46:23 +01:00
Stefano Fiorucci
dd6e35d675
build: upgrade to transformers==4.35.2 ( #6322 )
...
* upgrade transformers to 4.35.2
* reno
2023-11-17 10:12:34 +01:00
Silvano Cerza
6dda6e5b2d
Change Document.__eq__ to compare all fields ( #6323 )
2023-11-16 17:17:43 +01:00
Massimiliano Pippi
ff3165b8b8
fix: fix un-flattening of metadata ( #6318 )
...
* fix un-flattening of metadata
* test should pass
* add relnote
* change policy: raise an error if both meta and keys are passed
* Update document.py
* support python 3.8
* adjust wording in the error message
2023-11-16 17:10:53 +01:00
Julian Risch
34ecff1d19
build: Upgrade openai-whisper and re-introduce audio extra ( #6319 )
...
* upgrade openai-whisper and re-introduce audio extra
* add audio extra to
2023-11-16 15:04:50 +01:00
Julian Risch
8b092a90c0
test: Add MetadataRouter to preprocessing pipeline in e2e test ( #6321 )
...
* add MetadataRouter to preprocessing pipeline
* replace mimetype check with language check
2023-11-16 11:22:37 +01:00
x110
c4cfe6cb90
fix: Load additional fields from SQUAD-format file to meta field for labels #5978 ( #6301 )
...
* Load additional fields from SQUAD-format file to meta field for labels
* added a test function
* rewritten test using pytest
* added release notes
* improve release note
* clean up test
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-16 10:44:51 +01:00
Vivek Silimkhan
f998bf4a4f
feat: add Amazon Bedrock support ( #6226 )
...
* Add Bedrock
* Update supported models for Bedrock
* Fix supports and add extract response in Bedrock
* fix errors imports
* improve and refactor supports
* fix install
* fix mypy
* fix pylint
* fix existing tests
* Added Anthropic Bedrock
* fix tests
* fix sagemaker tests
* add default prompt handler, constructor and supports tests
* more tests
* invoke refactoring
* refactor model_kwargs
* fix mypy
* lstrip responses
* Add streaming support
* bump boto3 version
* add class docstrings, better exception names
* fix layer name
* add tests for anthropic and cohere model adapters
* update cohere params
* update ai21 args and add tests
* support cohere command light model
* add tital tests
* better class names
* support meta llama 2 model
* fix streaming support
* more future-proof model adapter selection
* fix import
* fix mypy
* fix pylint for preview
* add tests for streaming
* add release notes
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* fix format
* fix tests after msg changes
* fix streaming for cohere
---------
Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
Co-authored-by: tstadel <thomas.stadelmann@deepset.ai>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-15 13:26:29 +01:00
Julian Risch
08ec492039
refactor!: Remove routing from DocumentLanguageClassifier and rename TextLanguageClassifier ( #6307 )
...
* remove routing from DocumentLanguageClassifier
* fix MetadataRouter typo
2023-11-15 13:10:07 +01:00
Julian Risch
5295b40def
docs: Reader returns top_k+1 answers if no_answer is enabled
2023-11-15 10:20:21 +01:00
Ashwin Mathur
4e4d5eb3e2
feat!: Remove unused query parameter from MetaFieldRanker ( #6300 )
...
* Remove unused query parameter from MetaFieldRanker
* Add release notes
2023-11-14 12:33:38 +01:00
Stefano Fiorucci
f708cf6056
refactor!: set scale_score default value to False ( #6276 )
...
* set default scale_score to False
* release note
2023-11-13 11:59:18 +01:00
Silvano Cerza
8e7ce208fc
Fix Document init when passing non existing fields ( #6286 )
...
* Fix Document init when passing non existing fields
* Update releasenotes/notes/fix-document-init-09c1cbb14202be7d.yaml
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* Fix linting
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-11-13 11:42:42 +01:00
Vladimir Blagojevic
b4d8d1c904
feat: Add custom conversion callable to PyPDFToDocument - Haystack 2.x ( #6258 )
...
* Allow user specified converter hook
* Add a release note
* More unit tests
* PR review - Massi, use protocol as converter
2023-11-09 17:35:33 +01:00
Stefano Fiorucci
2b3c77e41d
fix: make JoinDocuments correctly handle duplicate documents w null scores ( #6261 )
...
* fix error with null values
* release note
* simplify
2023-11-09 14:28:56 +01:00
Domenico
676da681d0
feat: MetaField Ranker ( #6189 )
...
* proposal: meta field ranker
* Apply suggestions from code review
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
* update proposal filename
* feat: add metafield ranker
* fix docstrings
* remove proposal file from pr
* add release notes
* update code according to new Document class
* separate loops for each ranking mode in __merge_scores
* change error type in init and new tests for linear score warning
* docstring upd
---------
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-09 12:20:41 +01:00
Sebastian Husch Lee
71d0d92ea2
feat: Add model_kwargs to ExtractiveReader to impact model loading ( #6257 )
...
* Add ability to pass model_kwargs to AutoModelForQuestionAnswering
* Add testing for new model_kwargs
* Add spacing
* Add release notes
* Update haystack/preview/components/readers/extractive.py
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* Make changes suggested by Stefano
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-09 11:25:22 +01:00
Vladimir Blagojevic
cd429a73cd
feat: Add GPTChatGenerator to Haystack 2.x ( #6212 )
...
* Add GPTChatGenerator
* Apply lessons from previous PR
* PR review - Stefano
2023-11-09 10:45:41 +01:00
jambudipa
2f118e857c
feat: add tokenization details for gpt-4-1106-preview ( #6250 )
...
* feat: add tokenization details for gpt-4-1106-preview
* update max_tokens value
* reno
---------
Co-authored-by: jambudipa <mark.norgate@ext.ons.gov.uk>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-08 12:04:08 +01:00
Silvano Cerza
bf884094d1
refactor: Change Document.blob type and remove mime_type field ( #6249 )
...
* Change Document.blob type and remove mime_type field
* Add release notes
* Remove mime_type from Document docstring
2023-11-08 10:35:17 +01:00
Vladimir Blagojevic
5497ca2a45
feat: Adapt GPTGenerator to use str input/output format in Haystack 2.x ( #6214 )
...
* Adapt GPTGenerator to string input/output
* Finishing touches
* punctuation upd
* PR feedback
* Small naming fixes
* Update haystack/preview/components/generators/openai.py
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* Update class pydoc with a printed response
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-07 18:00:43 +01:00
Stefano Fiorucci
fb96aef4dd
refactor!: move classifiers to an appropriate directory/package ( #6240 )
...
* mv classifiers
* release note
2023-11-06 12:00:01 +01:00
Vladimir Blagojevic
d7e1833c40
feat: Add HuggingFaceTGIChatGenerator Haystack 2.x component ( #6199 )
...
* Add ChatHuggingFaceTGIGenerator
* Add release note
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-06 09:48:45 +01:00
Stefano Fiorucci
063d27c522
refactor!: rename TextDocumentSplitter to DocumentSplitter ( #6223 )
...
* rename TextDocumentSplitter to DocumentSplitter
* reno
* fix init
2023-11-03 11:33:20 +01:00
Vladimir Blagojevic
6e2dbdc320
feat: Add HuggingFaceTGIGenerator Haystack 2.x component ( #6205 )
...
* Add HuggingFaceTGIGenerator
* PR review
* PR feedback from Stefano
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-02 19:35:16 +01:00
Stefano Fiorucci
8511b8cd79
feat: HuggingFaceLocalGenerator - allow passing generation_kwargs in run method ( #6220 )
...
* allow custom generation_kwargs in run
* reno
* make pylint ignore too-many-public-methods
2023-11-02 15:29:38 +01:00
Vladimir Blagojevic
f2db68ef0b
fix: Add new rankers to nodes __init__.py ( #6219 )
...
* Add new rankers to nodes __init__.py
* Add release note
2023-11-02 10:56:52 +01:00
Ashwin Mathur
6bf0b9dc7c
feat: Add MarkdownToTextDocument (v2) ( #6159 )
...
* Add MarkdownToTextDocument
* Add release notes
* Update GitHub workflows
* Update GitHub workflows
* Refactor code with minimal dependencies
* Update docstrings
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update document with content and meta for backward compatibility
* Refactor Document Class for Backward Compatibility
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* Update tests
* Improve test assertions
---------
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-31 18:28:13 +01:00
Julian Risch
29b1fefaa4
feat: Add DocumentLanguageClassifier 2.0 ( #6037 )
...
* add DocumentLanguageClassifier and tests
* reno
* fix import, rename DocumentCleaner
* mark example usage as python code
* add assertions to e2e test
* use deserialized document_store
* Apply suggestions from code review
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* remove from/to_dict
* use renamed InMemoryDocumentStore
* adapt to Document refactoring
* improve docstring
* fix test for new Document
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-10-31 15:35:05 +01:00
Silvano Cerza
7287657f0e
refactor: Rename Document's text field to content ( #6181 )
...
* Rework Document serialisation
Make Document backward compatible
Fix InMemoryDocumentStore filters
Fix InMemoryDocumentStore.bm25_retrieval
Add release notes
Fix pylint failures
Enhance Document kwargs handling and docstrings
Rename Document's text field to content
Fix e2e tests
Fix SimilarityRanker tests
Fix typo in release notes
Rename Document's metadata field to meta (#6183 )
* fix bugs
* make linters happy
* fix
* more fix
* match regex
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00