Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 ( #6108 )
...
* whisper transcriber
* remove from/to_dict from builders
* remove from/to_dict from embedders
* remove from/to_dict from fetcher, file_converters
* remove from/to_dict from generators, preprocessors
* remove from/to_dict from ranker, reader
* remove from/to_dict from router, sampler, websearch
* pylint
* reno
* refactor import
* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components ( #6040 )
...
* move embedding backends
* use token in Sentence Transformers embeddings
* more compact token handling
* token parameter in reader
* add token to ranker
* release note
* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers ( #6076 )
...
* rename doc store and retrievers
* release note
* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list ( #6081 )
...
* Fix TextDocumentSplitter failing if run with empty list
* Release notes
* Simplify check
* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents ( #6083 )
...
* copy id_hash_keys in splitter and cleaner
* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling ( #6049 )
...
* first implementation
* release notes
* fixes
* tests
* better reno
* release note
2023-10-17 10:36:08 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 ( #5976 )
...
* remove whitespaces, substrings, regex, empty lines
* remove repeated substrings
* reno
* return empty string as shortest common ngram
* address first half of review feedback
* address second half of review feedback
* mention \f page separator for header/footer removal
* mention \f page separator for header/footer removal
* mark example usage as python code
2023-10-13 12:39:55 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation ( #6022 )
...
* draft
* still a raw draft
* still a raw draft
* improvements
* minimal impl ok
* tests
* reno
* better language
* examples of generation_kwargs
* incorporate feedback
* lg and format updates
* don't save valid str tokens
* fix style
---------
Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 ( #6026 )
...
* draft TextLanguageClassifier
* implement language detection with langdetect
* add unit test for logging message
* reno
* pylint
* change input from List[str] to str
* remove empty output connections
* add from_dict/to_dict tests
* mark example usage as python code
2023-10-13 10:30:49 +02:00
Stefano Fiorucci
2c2549f13d
move embedding backends ( #6033 )
2023-10-12 17:52:28 +02:00
Vladimir Blagojevic
d51be9edac
Add top_k to SimilarityRanker ( #6036 )
2023-10-12 13:52:01 +02:00
Vladimir Blagojevic
3803d23ff6
feat: Update PyPDFToDocument to process ByteStream inputs ( #6021 )
...
* Update PyPDF converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/pypdf.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:52:08 +02:00
Vladimir Blagojevic
1a6a8863e8
feat: Update HTMLToDocument to handle ByteStream inputs ( #6020 )
...
* Update HTML converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/html.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:15:58 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream ( #5972 )
2023-10-10 17:48:31 +02:00
Vladimir Blagojevic
98215aec0d
feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) ( #5998 )
...
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-10 09:14:04 +02:00
Vladimir Blagojevic
40b83d8a47
feat: Add TopPSampler Haystack 2.0 component ( #5924 )
2023-10-09 13:44:01 +02:00
Vladimir Blagojevic
1cdff6427e
feat: Add SimilarityRanker to Haystack 2.0 ( #5923 )
...
* Initial SimilarityRanker
2023-10-06 16:01:34 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) ( #5907 )
2023-09-28 17:22:28 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder ( #5822 )
...
* first draft
* release note
* mypy fix
* fix test
* corrections
* pr feedback
* better secrets handling and new tests
* missing imports in embedders/__init__.py
* better format condition
* address feedback
2023-09-28 15:42:51 +02:00
ZanSara
83724b74e3
feat: Make metadata optional in AnswerBuilder ( #5909 )
...
* optional metadata
* improve docstring
2023-09-28 14:42:19 +02:00
Stefano Fiorucci
9340c572f9
alternative skipif conditions in azure ocr converter test ( #5906 )
2023-09-28 12:09:19 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) ( #5870 )
...
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) ( #5855 )
...
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Stefano Fiorucci
6aa471ac5e
chore: make preview integration tests reproducible ( #5871 )
...
* relax extractive reader integration tests
* force reader to CPU
* ensure integration tests reproducibility
* move set_all_seeds to testing package
2023-09-25 18:39:10 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) ( #5847 )
...
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
MichelBartels
4da43b6b05
Add link output to SerperDevWebSearch ( #5853 )
...
* add link output
* adjust tests
* fix test
* remove print statements
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 10:03:01 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder ( #5801 )
...
* first draft
* release notes
* avoid serializing secrets
* fix import order
* simplify serialization
* simplification
* monkeypatch delenv
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* docstrings updates
* fix test
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* rm comment
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class ( #5863 )
...
* Remove GPT4Generator class
* Rename GPT35Generator to GPTGenerator
* Fix tests
* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation ( #5553 )
...
* initial ExtractiveReader implementation
* initial ExtractiveReader implementation
* fix mypy
* remove unused import
* Use AutoTokenizer
* rename reader to model
* combine no-answer logit
* support document slicing with proper probabilities
* add variable stride
* validate model
* fix typo
* make postprocessing easier to understand
* remove debug code
* set default reader
* add ExtractiveReader to __init__
* remove validation
* use new answer class
* add batching
* use v2 lazy imports
* move reader
* fix type hints
* add doc strings
* add nucleus sampling
* fix types
* fix doc string
* add no_answer parameter
* remove print statement
* fix gpu support
* turn into binary classification task
* change dataclass so document does not need to be provided for no answer
* add simple tests
* add unit tests
* rename reader folder to readers
* add integration tests
* fix type hints
* add release notes
* remove accidentally included test file
* remove unnecessary __init__ file
* revert __init__ file to main
* rename test script by adding test_ prefix
* undo accidental move of test script after renaming it
* remove use of bisect
* rename _flatten and _unflatten
* make variable name more intuitive
* remove type: ignore
* fix mypy issue
* refactor long tuple
* add doc strings
* explain HF test
* remove unnecessary top_k check
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) ( #5850 )
...
* Initial PyPDFToDocument implementation
* Remove progress bar
* Add release note
* Minor fix
* import check and dependency
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
ZanSara
23fdef929e
chore: move GPT35Generator tests in the main test suite ( #5844 )
...
* move tests
* fix no-test-found error from pytest
* missing self
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-21 11:42:32 +02:00
ZanSara
28f5c4c780
fix: Whisper integration tests ( #5851 )
...
* fix tests
* add ffmpeg
* apt update for ffmpeg
* not run on windows
2023-09-21 00:14:07 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) ( #5824 )
...
* Move filter utilities
* Add MetadataRouter
* Add tests for MetadataRouter
* Add more tests
* Rename FileExtensionClassifier to FileExtensionRouter
* Add support for dates in filters
* Add tests
* Add release note
* Add release note
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
c933bcaa69
chore: move Whisper e2e tests in the main tests suite ( #5845 )
...
* move whisper local tests
* remove e2e file
* move remote tests
* remove e2e file
2023-09-20 14:48:09 +02:00
ZanSara
454988672e
feat: UrlCacheChecker ( #5841 )
...
* add UrlCacheChecker
* rename
* add tests
* reno
* pylint
* review feedback
2023-09-20 14:45:50 +02:00
ZanSara
44f0c468ac
move websearch tests back to main tests suite ( #5842 )
2023-09-20 11:55:18 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component ( #5724 )
...
* Add LinkContentFetcher
* Add release note
* Small fixes
* Fix pydocs
* PR feedback
* Remove handlers registration
* PR feedback
* adjustments
* improve tests
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* use the new document
* add docstring
* fix more tests
* mypy
* fix tests
* add e2e
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* fix test
* Fix end-of-file-fixer
* Post merge fixes
* Move e2e tests back into component
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM ( #5833 )
...
* ci: Simplify Python code with ruff rules SIM
* Revert #5828
* ruff --select=I --fix haystack/modeling/infer.py
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
Stefano Fiorucci
de84a95970
separate classes and tests ( #5819 )
...
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-19 19:21:49 +02:00
Christian Clauss
1bc03ddc73
ci: Fix all ruff pyflakes errors except unused imports ( #5820 )
...
* ci: Fix all ruff pyflakes errors except unused imports
* Delete releasenotes/notes/fix-some-pyflakes-errors-69a1106efa5d0203.yaml
2023-09-15 18:30:33 +02:00
Stefano Fiorucci
1c69070db6
make MemoryEmbeddingRetriever act in non-batch mode ( #5809 )
2023-09-14 15:37:20 +02:00
Stefano Fiorucci
ad5b615503
make SentenceTransformersTextEmbedder non batch ( #5811 )
2023-09-14 12:38:24 +02:00
ZanSara
5888fb7052
make MemoryBM25Retriever non batch ( #5768 )
2023-09-13 15:11:47 +02:00
Stefano Fiorucci
283ecf2760
feat: add prefix and suffix to SentenceTransformersDocumentEmbedder ( #5745 )
...
* add prefix and suffix
* fix test
2023-09-13 12:55:06 +02:00
ZanSara
335a09bc1d
feat: make AnswerBuilder non batch ( #5766 )
...
* make answerbuilder non batch
* fix mypy
* review feedback
* mypy
---------
Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-09-13 12:01:16 +02:00
ZanSara
2c4d839b64
feat: GPT4Generator ( #5744 )
...
* add gpt4generator
* add e2e
* add tests
* reno
* fix e2e
* Update test/preview/components/generators/openai/test_gpt4_generator.py
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-13 10:07:09 +02:00
ZanSara
94c5d6d216
feat: make GPT35Generator non batch ( #5764 )
...
* make gpt35generator not batch
* fix tests
* review feedback
* mypy
2023-09-12 18:19:28 +02:00
ZanSara
6e70d403f8
feat: Improve Document for Haystack 2.0 ( #5738 )
...
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* add docstring
* fix more tests
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* Update haystack/preview/document_stores/memory/document_store.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-11 17:40:00 +02:00
Stefano Fiorucci
2edf85f739
MemoryEmbeddingRetriever (2.0) ( #5726 )
...
* MemoryDocumentStore - Embedding retrieval draft
* add release notes
* fix mypy
* better comment
* improve return_embeddings handling
* MemoryEmbeddingRetriever - first draft
* address PR comments
* release note
* update docstrings
* update docstrings
* incorporated feedback
* add return_embedding to __init__
* rm leftover docstring
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-08 15:52:48 +02:00
bogdankostic
71852c7b06
Fix output of AnswerBuilder ( #5737 )
2023-09-07 12:54:24 +02:00