24 Commits

Author SHA1 Message Date
David S. Batista
3f77d3ab6c
!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617)
* wip: initial import

* wip: refactoring

* wip: refactoring tests

* wip: refactoring tests

* making all NLTKSplitter related tests work

* refactoring

* docstrings

* refactoring and removing NLTKDocumentSplitter

* fixing tests for custom sentence tokenizer

* fixing tests for custom sentence tokenizer

* cleaning up

* adding release notes

* reverting some changes

* cleaning up tests

* fixing serialisation and adding tests

* cleaning up

* wip

* renaming and cleaning

* adding NLTK files

* updating docstring

* adding import to init

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* updating tests

* wip

* adding sentence/period change warning

* fixing LICENSE header

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-12-12 14:22:27 +00:00
David S. Batista
2282c26f17
feat!: SentenceWindowRetriever returns List[Document] with docs ordered by split_idx_start (#8590)
* initial import

* adding a few pylint disable

* adding tests

* fixing integration tests

* adding release notes

* fixing types and docstrings
2024-12-04 16:55:56 +01:00
Alper
a556e11bf1
fix: window_size set during run instead of construction (#8463)
* window_size set during runtime

* revert init and update run with window_size

* improved doc, removed print

* adding release notes

* updating tests

* reverting docstring example

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/components/retrievers/sentence_window_retriever.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-10-22 14:01:26 +00:00
Ajit Singh
6cf13e8b98
enhancement: reduced usage of numpy and substituted built-in libraries (#8418)
* reduced usage of numpy and substituted built-in libraries

* added release note

* edited expit function to support both float and list inputs (the list case was causing a CI error)

* revert code, numpy can't be removed here

* more cleaning

* fix relnote

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-10-18 15:42:19 +02:00
Stefano Fiorucci
842a7b80a8
rm sentence_window_retrieval (#8303) 2024-08-28 10:51:07 +02:00
David S. Batista
2f3257b77a
chore: removing deprecated SentenceWindowRetrieval (#8294)
* removing deprecated SentenceWindowRetrieval

* adding release notes

* Rename TestSentenceWindowRetrieval to TestSentenceWindowRetriever

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-08-28 10:04:52 +02:00
David S. Batista
b411c14414
feat: The SentenceWindowRetriever has now an extra output key containing all the documents belonging to the context window (#8283)
* initial import

* adding release notes

* linting

* improving docs and release notes

* updating example
2024-08-27 10:30:12 +02:00
Stefano Fiorucci
bcc4104729
refactor: utility function for docstore deserialization (#8226)
* refactor docstore deserialization

* more tests

* reno; headers

* expose key
2024-08-14 13:29:27 +02:00
Amna Mubashar
373de97426
Deprecate SentenceWindowRetrieval (#8206) 2024-08-13 13:49:41 +02:00
Amna Mubashar
e0de423ee0
Rename SentenceWindowRetrieval to SentenceWindowRetriever 2024-07-26 17:46:44 +02:00
Sebastian Husch Lee
baed478f23
fix: Fix split_start_idx and _split_overlap information in DocumentSplitter (#8046)
* Fix bug in DocumentSplitter and expand tests to catch said bug

* Fix split overlap information calc and actually test it

* Add release notes

* Remove comments

* Same fix in SentenceWindowRetrieval

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-07-24 15:15:36 +02:00
David S. Batista
431aa4a406
updating sentence window retriever tests (#8034)
* updating sentence window retriever tests

* fix
2024-07-16 22:10:55 +02:00
David S. Batista
ebfeb571d7
feat: add sentence window retrieval (#7997)
* initial import

* adding tests

* adding license and release notes

* adding missing release notes

* working with any type of doc store

* nit

* adding get_class_object to serialization package

* nit

* refactoring get_class_object()

* refactoring get_class_object()

* changing type and var names

* more refactoring

* Update haystack/core/serialization.py

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>

* Update haystack/core/serialization.py

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>

* Update test/core/test_serialization.py

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>

* more refactoring

* more refactoring

* Pydoc syntax

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-07-10 13:13:46 +00:00
Vladimir Blagojevic
678f193f10
feat: Add filter_policy init parameter to in memory retrievers (#7795)
* Add filter_policy init parameter to in-memory retrievers
2024-06-04 17:51:16 +02:00
Silvano Cerza
854c4173f2
feat: Add memory sharing between different instances of InMemoryDocumentStore (#7781)
* Add memory sharing between different instances of InMemoryDocumentStore

* Fix FilterRetriever tests

* Fix InMemoryBM25Retriever tests
2024-05-31 16:44:14 +02:00
Massimiliano Pippi
10c675d534
chore: add license header to all modules (#7675)
* add license header to modules
* check license header at linting time
2024-05-09 13:40:36 +00:00
Bijay Gurung
74683fe74d
Feat: Add FilterRetriever (#6836)
* Add FilterRetriever draft

* Implement FilterRetriever and add tests

* Update comparison to compare whole docs instead of just contents

* Expose FilterRetriever at the retrievers level

* Update docstring (add example usage)

* Add filter_retriever in the API reference docs config

Update retriever search path to start one dir level higher

* simplify _documents_equal

* improve usage example

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-02-08 08:48:46 +01:00
ZanSara
1182c08daf
fix: Dont filter negative scores when using BM25Okapi and scale_score=False (#6889)
* dont filter negatives for unscaled Okapi

* change BM25 algorithm default to BM25L

* Update haystack/document_stores/in_memory/document_store.py

* improve comment
2024-02-06 11:07:27 +01:00
Madeesh Kannan
a5189dd035
fix!: InMemoryBM25Retriever no longer returns documents that have a score of 0.0 (#6717)
* fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0

Also update tests to accommodate the new behavior.

* Remove superfluous code
2024-01-12 17:50:55 +01:00
Massimiliano Pippi
e1ec4e5e4d
refact!: Remove symbols under the haystack.document_stores namespace (#6714)
* remove symbols under the haystack.document_stores namespace

* Update haystack/document_stores/types/protocol.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* fix

* same for retrievers

* leftovers

* more leftovers

* add relnote

* leftovers

* one more

* fix examples

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-01-10 21:20:42 +01:00
Stefano Fiorucci
4912f7cb58
refactor!: improve the deserialization logic for components that use a Document Store (#6466)
* improve deserialization

* rm ds decorator

* improve tests

* fix pylint

* rm decorator from module init

* rm decorator

* rm decorator from factory

* fix tests

* release note

* rm print
2023-12-04 15:17:28 +01:00
Massimiliano Pippi
7c05f37a53
remove unit marker (#6450) 2023-11-29 19:24:25 +01:00
Silvano Cerza
e6637f5ec2
Fix all tests 2023-11-24 14:48:43 +01:00
Massimiliano Pippi
8adb8bbab8
Remove preview folder in test/
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:52:55 +01:00