623 Commits

Author SHA1 Message Date
ZanSara
cf79aa1485
feat: add support for single meta dict in TextFileToDocument (#6606)
* add support for single meta dict

* reno

* reno

* mypy

* extract to function

* docstring

* mypy
2023-12-21 14:21:17 +01:00
sahusiddharth
3d17e6ff76
changed metadata to meta (#6605) 2023-12-21 12:39:58 +01:00
Ashwin Mathur
fc88ef7076
feat: Add HuggingFace TEI Embedders - HuggingFaceTEITextEmbedder and HuggingFaceTEIDocumentEmbedder (#6602)
* Add TEI Embedders

* Add release notes

* Update release notes with usage examples
2023-12-21 12:16:36 +01:00
ZanSara
ae5297bfd7
example: self-correcting loop for RAG (#6420)
* add example

* docstrings

* reno

* use condrouter

* move functions

* tests

* reno

* add component

* reno

* add tests

* mypy

* pylint

* logger

* module name

* multiplexer

* draw

* query_multiplexer

* reno

* typo
2023-12-20 11:35:05 +01:00
ZanSara
5a0f0ce22f
feat: Multiplexer (#6592)
* move functions

* tests

* reno

* add component

* reno

* add tests

* mypy

* pylint

* logger

* module name
2023-12-20 11:03:22 +01:00
Silvano Cerza
e836fd6875
fix: Fix Pipeline.connect() when multiple compatible sockets are found (#6594)
* Fix connect not picking the correct socket

* Add release notes
2023-12-20 11:01:18 +01:00
Silvano Cerza
f224f991be
Change DocumentWriter default policy from DuplicatePolicy.FAIL to DuplicatePolicy.NONE (#6596) 2023-12-19 17:46:16 +01:00
ZanSara
f877704839
chore: extract type serialization (#6586)
* move functions

* tests

* reno
2023-12-19 14:16:20 +01:00
Vladimir Blagojevic
2dd5a94b04
feat: Add RAG based OpenAPI service integration (#6555)
* Add OpenAPIServiceConnector and OpenAPIServiceToFunctions

* Add release note

* Add test deps

* Better docs on OpenAPI spec reqs, improve tests

* Silvano PR feedback
2023-12-19 13:27:41 +01:00
Stefano Fiorucci
94cfe5d9ae
feat!: HTMLToDocument - allow choosing the boilerpy3 extractor (#6582)
* allow extractor customizability

* release note

* typo
2023-12-19 10:52:12 +01:00
Sebastian Husch Lee
dcf37c5173
feat: Extractive QA answer deduplication (#6459)
* Add answer deduplication

* Fix test

* Handle None case

* Release notes

* Handle cases where documents or answer spans could be None

* Adding checks for Nones and satisfying mypy

* Add option to turn off deduplication

* Adding unit tests

* Refactored tests to use fixtures

* Added overlap_threshold to run

* Update test

* Fixes related to the merge

* Remove casting, use direct variable names

* Move out if statement and add new test for it

* Update if statement to match comment

* Update how if statements work
2023-12-18 19:27:04 +01:00
Sebastian Husch Lee
c294b8ac8c
feat: Add auto device checks and model_kwargs to TransformersSimilarityRanker (#6561)
* Add device checking and model_kwargs like we do in ExtractiveReader

* Add release notes

* Make a utility function for the device checking

* Better warning message and updated ExtractiveReader to use the util function

* Add unit tests for get_device

* Fix pylint
2023-12-18 15:13:42 +01:00
Ashwin Mathur
46b395eec3
feat: Add Eval and EvaluationResult (#6505)
* Add initial implementation for Eval and EvaluationResult

* Add release notes

* Update files with suggestions from review

* Remove serialization

* Add eval e2e tests

* Update eval e2e tests
2023-12-18 11:29:09 +01:00
Sebastian Husch Lee
3e0e81b1e0
feat: Add meta_fields_to_embed to TransformersSimilarityRanker (#6564)
* Add initial implementation following SentenceTransformersDocumentEmbedder

* Add test for embedding metadata

* Add release notes

* Update name

* Fix tests and to dict

* Fix release notes
2023-12-18 11:28:16 +01:00
Massimiliano Pippi
0ac1bdc6a0
refactor!: uniform run api for LocalWhisperTranscriber (#6542)
* uniform run api for LocalWhisperTranscriber

* add relnote

* fix linter
2023-12-18 10:47:46 +01:00
Massimiliano Pippi
00fed32024
build: depend on haystack_bm25 instead of rank_bm25 (#6578)
* use the forked package

* switch package dependency

* relnote

* fix package name
2023-12-18 10:47:15 +01:00
Stefano Fiorucci
2f034d3c97
refactor!: Converters - standardize inputs (#6540)
* standardize converters inputs: first draft

* fix precommit

* fix precommit 2

* fix precommit 3

* add default for optional param

* rm leftover

* install boilerpy in linting workflow

* add boilerpy3 to the core dependencies

* add reno

* remove boilerpy3 installation from test workflow

* fix pylint: import order and unused import

* fix import order

* add release note

* better Tika docstring

* rm boilerpy from linting

* leftover

* md link brackets

* feat: Converters - allow passing `meta` in the `run` method (#6554)

* first impl for html

* progressing on other components

* fix test

* add tests - run with meta

* release note

* reintroduce patches wrongly deleted

* add patch in test

* fix tika test

* Update haystack/components/converters/azure.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Update releasenotes/notes/converters-standardize-inputs-ed2ba9c97b762974.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* simplify test

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-12-15 16:41:35 +01:00
Vladimir Blagojevic
c642695ec0
feat: Add FileTypeRouter markdown support (#6551)
* Add FileTypeRouter markdown support

* Add release note
2023-12-14 16:30:57 +01:00
Massimiliano Pippi
bc45170f4e
chore: add boilerpy3 to the core dependencies (#6544)
* add boilerpy3 to the core dependencies

* remove boilerpy3 installation from test workflow

* fix pylint: import order and unused import

* fix import order

* add release note

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-12-14 11:53:38 +01:00
Massimiliano Pippi
8d9c3de37e
Remove 'preview' from the release notes template (#6543) 2023-12-14 09:59:48 +01:00
Massimiliano Pippi
a55024bee7
fix: do not dump pipeline graph into the debug payload (#6528) 2023-12-12 18:24:23 +01:00
Massimiliano Pippi
09abcc1d4c
allow connecting the same components multiple times (#6530) 2023-12-12 16:01:09 +01:00
Julian Risch
25a6eaae05
feat!: Rename ExtractiveReader's confidence_threshold to score_threshold (#6532)
* rename to score_threshold

* Update haystack/components/readers/extractive.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-12-12 15:12:28 +01:00
Silvano Cerza
18dbce25fc
refactor: Refactor answer dataclasses (#6523)
* Refactor answer dataclasses

* Add release notes

* Fix tests

* Fix end to end tests

* Enhance ExtractiveReader
2023-12-11 18:50:49 +01:00
bogdankostic
728383a149
fix: Make TransformersSimilarityRanker run with single document list (#6503)
* Make `TransformersSimilarityRanker` run with single document list

* Add release note

* Remove unused import in test
2023-12-08 16:18:46 +01:00
Ashwin Mathur
2767cd2f01
Fix usage examples (#6507) 2023-12-07 14:01:32 +01:00
Bijay Gurung
c5342d1110
fix: Prevent invalid answer from being selected in ExtractiveReader (#6460)
* Fix invalid answer being selected issue on ExtractiveReader

* Rename variables to not shadow arguments
2023-12-06 09:49:02 +01:00
Vladimir Blagojevic
008a322023
feat: Add Indexing Pipeline (#6424)
* Add build_indexing_pipeline utils function

* Pylint fixes

* Move into another package to avoid circular deps

* Revert change

* Revert haystack/utils/__init__.py change

* Add example

* Use DocumentStore type, remove typing checks
2023-12-04 16:08:53 +01:00
ZanSara
a38f871dbd
feat: Add RAG pipeline (#6461)
* add rag pipeline

* Update examples/getting_started/rag.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-12-04 15:25:29 +01:00
Stefano Fiorucci
4912f7cb58
refactor!: improve the deserialization logic for components that use a Document Store (#6466)
* improve deserialization

* rm ds decorator

* improve tests

* fix pylint

* rm decorator from module init

* rm decorator

* rm decorator from factory

* fix tests

* release note

* rm print
2023-12-04 15:17:28 +01:00
Vladimir Blagojevic
b9bf83bbef
feat: Allow flat dictionary Pipeline.run() inputs (#6413)
* Initial implementation, release note, update API and unit test
---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-30 14:37:55 +01:00
Silvano Cerza
831d0611d9
feat: Change default DuplicatePolicy in DocumentStore.write_documents() (#6438)
* Change default DuplicatePolicy in DocumentStore.write_documents()

* Add release notes
2023-11-28 12:30:17 +01:00
Massimiliano Pippi
00e1dd6eb8
chore: rearrange the core package, move tests and clean up (#6427)
* rearrange code

* fix tests

* relnote

* merge test modules

* remove extra

* rearrange draw tests

* forgot

* remove unused import
2023-11-28 09:58:56 +01:00
Silvano Cerza
9a7fd6f2ce
refactor: Add new filters tests for Document Store testing (#6428)
* Add new filters tests for Document Store testing

* Add release notes
2023-11-28 09:57:08 +01:00
Silvano Cerza
fd16ec63cb
refactor: Add support for new filters declaration (#6397)
* Rework filter logic for InMemoryDocumentStore to support new filters declaration

* Fix legacy filters tests

* Simplify logic and handle dates comparison

* Rework MetadataRouter to support new filters

* Update docstrings

* Add release notes

* Fix linting

* Avoid duplicating filters specifications

* Handle corner case

* Simplify docstring

* Fix filters logic and tests

* Fix Document Store testing legacy filters tests
2023-11-24 11:22:46 +01:00
SebastjanPrachovskij
28c2b09d90
Add SearchApi integration for websearch (#6400) 2023-11-24 11:18:43 +01:00
pandasar13
edb40b6c1b
refactor: add batch_size to FAISS __init__ (#6401)
* refactor: add batch_size to FAISS __init__

* refactor: add batch_size to FAISS __init__

* add release note to refactor: add batch_size to FAISS __init__

* fix release note

* add batch_size to docstrings

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-11-23 17:27:24 +01:00
ZanSara
4ec6a60a76
feat: CohereGenerator (#6395)
* added CohereGenerator with unit tests

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. added releasenote
2. removed commented files in test-cohere_generators
3. removed unused imports

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. move client creation to __init__
2. remove dict casting of metadata in run

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* few fixes

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* add cohere to git workflows

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. CohereGenerator as top level import in generators
2. small change in doc string

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. corrected git workflow files for cohere import
2. changed api key env var from CO_API_KEY to COHERE_API_KEY

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* added cohere in missed out workflow installs

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. Removed default_streaming_callback from cohere.py and added in test.
2. Added kwargs doc strings for CohereGenerator
3. removed type hints for metadata and replies
4. use COHERE_API_URL instead of hard coded URL.

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* move out of folder

* black

* fix tests

* feedback

* black

* remove api key from tests

* read api key from env var if missing

* typo

* black

* missing import

---------

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2023-11-23 17:21:07 +01:00
jlonge4
c44e2cf49b
feat: add microsoft pptx file converter (#6399)
* Create pptx.py

* feat: pptx converter import __init__.py

* feat: add pptx import __init__.py

* feat: add python-pptx dependency

* feat: add sample pptx for testing

* feat: add pptx file-converter test

* feat: release note pptx-file-converter-3e494d2747637eb2.yaml

* feat: Update releasenotes/notes/pptx-file-converter-3e494d2747637eb2.yaml

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* feat: refactor haystack/nodes/file_converter/pptx.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* fix imports

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-23 16:46:41 +01:00
Stefano Fiorucci
b0b514778d
fix!: make PyPDFToDocument JSON-serializable (#6396)
* add registry

* release note

* add checks

* rm superfluous check

* fix typo

* rm print :-)
2023-11-23 15:37:20 +01:00
Ben Heckmann
a492771b4d
feat: PreProcessor split by token (tiktoken & Hugging Face) (#5276)
* #4983 implemented split by token for tiktoken tokenizer

* #4983 added unit test for tiktoken splitting

* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer

* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test

* mocked HTTP model loading in unit tests, fixed pylint error

* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken

* reno

* rename reno file

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-23 12:26:37 +01:00
Vladimir Blagojevic
e04a1f16bb
feat: Add DynamicPromptBuilder to Haystack 2.x (#6328)
* Add DynamicPromptBuilder

* Improve pydocs, add unit tests

* Add release note

* Make expected_runtime_variables optional

* Add pydocs usage example

* Add more pydocs

* Remove test markers

* Update type in unit test

* Update after canals upgrade

* add to api ref

* docstrings updates

* Update test/preview/components/builders/test_dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/builders/dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Deparametrize init test

* Rename expected_runtime_variables to runtime_variables

* Rephrase docstring so meaning is clearer

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-23 11:41:57 +01:00
Vladimir Blagojevic
b557f3035e
feat: Add ConditionalRouter Haystack 2.x component (#6147)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-23 10:28:08 +01:00
Stefano Fiorucci
e91f7a8a4d
refactor!: improve the public interface of Generators (#6374)
* merge lazy import blocks

* refactor generators

* release note

* revert unrelated changes
2023-11-22 10:40:48 +01:00
ZanSara
b751978d65
Extends input types of RemoteWhisperTranscriber (#6218)
* fix tests

* reno

* tests

* retain file name

* paths are strings for openai sdk

* streams->sources

* feedback

* always add name to file

* mypy

* test placeholder with extension

* fallback

* paths

* path test

* path must be a string

* fix test
2023-11-22 09:57:45 +01:00
Ashwin Mathur
e6c8374562
feat: Add ByteStream metadata and other metadata to Documents created by HTMLToDocument (#6304)
* Refactor HTMLToDocument

* Add release notes

* Add additional tests

* remove progress bar

* Add additional test for metadata

* remove progress bar from release notes

* Update tests

* Use truthiness checks instead of is not None
2023-11-21 21:44:02 +01:00
Daniel Fleischer
0cef17ac13
feat: embedding instructions for dense retrieval (#6372)
* Embedding instructions in EmbeddingRetriever

Query and documents embeddings are prefixed with instructions, useful
for retrievers finetuned on specific tasks, such as Q&A.

* Tests

Checking vectors 0th component vs. reference, using different stores.

* Normalizing vectors

* Release notes
2023-11-21 12:56:40 +01:00
Silvano Cerza
83c245db74
feat: Implement function to convert legacy filters to new style (#6314)
* Implement function to convert legacy filters to new style

* Reduce return statements in conversion to fix linting

* Move convert function in different module

* Fix typos in docstrings

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

---------

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 13:00:05 +01:00
ZanSara
9cee2f82c4
feat: extend write_documents to return the number of documents actually written in the document store (#6006)
* add typing and docstring

* reno

* Update releasenotes/notes/extend-write-documents-855ffc315974f03b.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-11-20 11:54:02 +01:00
Julian Risch
4ef2a680bb
feat: Add DocumentJoiner component 2.0 (#6105)
* draft DocumentJoiner

* implement merge and rrf

* draft end-to-end test with DocumentJoiner in hybrid doc search pipeline

* adjust for variadics Canals PR #122

* fix text_embedder input

* adapt to the new Document class

* adapt to new doc id

* specify documents input as Variadic in run method

* compare doc ids instead of full docs

* rename text_file_converter input to sources

* update docstring

* Update haystack/preview/components/routers/document_joiner.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from docstring review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* capitalize Documents and Retrievers in docstrings

* fix log message in test

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 10:56:56 +01:00