DanShatford
07048791aa
feat: allow list of file paths in `convert_files_to_docs` ( #5961 )
...
* feat: allow list of file paths in `convert_files_to_docs`
* Fix validation
* Fix check errors
2023-10-09 20:19:03 +02:00
David Berenstein
13fb7c5b5f
feat: added on_agent_final_answer-support to Agent callback_manager ( #5736 )
...
* chore: added on_agent_final_answer-support to Agent callback_manager
* chore: format black
* run pre-commit to format file
* updated release notes
* reverted sorted imports
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-09 18:03:47 +02:00
Vladimir Blagojevic
40b83d8a47
feat: Add TopPSampler Haystack 2.0 component ( #5924 )
2023-10-09 13:44:01 +02:00
Vladimir Blagojevic
1cdff6427e
feat: Add SimilarityRanker to Haystack 2.0 ( #5923 )
...
* Initial SimilarityRanker
2023-10-06 16:01:34 +02:00
Stefano Fiorucci
ccc9f010bb
fix: fix ChatGPT invocation layer (and add async support) ( #5979 )
...
* ChatGPT async
* release note
* fix tests
2023-10-05 18:43:26 +02:00
Tobias Wochinger
d5d3a9eef4
chore: adapt deepset cloud sdk endpoint format for saving pipelines ( #5969 )
...
* chore: adapt to new endpoints formats
* docs: add release notes
2023-10-05 08:56:28 +02:00
Massimiliano Pippi
c2ec3f5fde
feat: add File type to preview package ( #5873 )
...
* add Blob type
* review feedback
* fix tests and naming
* Update add-blob-type-2a9476a39841f54d.yaml
* removed unused import
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-04 17:23:12 +02:00
Stefano Fiorucci
cc70b4b613
deprecation ( #5954 )
2023-10-03 12:48:06 +02:00
Massimiliano Pippi
ac408134f4
feat: add support for async openai calls ( #5946 )
...
* add support for async openai calls
* add actual async call
* split the async api
* ask permission
* Update haystack/utils/openai_utils.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Fix OpenAI content moderation tests
* Fix ChatGPT invocation layer tests
---------
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-10-03 10:42:21 +02:00
Lavesh Akhadkar
1ccf674d73
feat: DocumentWriter returns number of documents written ( #5939 )
...
* Make DocumentWriter return the number of documents it wrote
* Fixed return type
2023-10-03 10:02:33 +02:00
Massimiliano Pippi
0947f59545
feat: add async PromptNode run ( #5890 )
...
* add async promptnode
* Remove unnecessary calls to dict.keys()
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-29 08:40:01 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) ( #5907 )
2023-09-28 17:22:28 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder ( #5822 )
...
* first draft
* release note
* mypy fix
* fix test
* corrections
* pr feedback
* better secrets handling and new tests
* missing imports in embedders/__init__.py
* better format condition
* address feedback
2023-09-28 15:42:51 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) ( #5870 )
...
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) ( #5855 )
...
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Silvano Cerza
cf7f0ebc22
Add Pipelines async run ( #5864 )
...
* Add Pipeline.arun()
* Sleeper node
* Fix async running
* Add e2e tests
To run a Pipeline that doesn't have any async node in async mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline
To run a Pipeline that has a single async node in concurrent mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline
To run a Pipeline that has a single async node in sequential mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline
* Remove unused _adispatch_run method
* Make Pipeline.run work with async nodes
* Revert "Make Pipeline.run work with async nodes"
This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431.
* Rename Pipeline.arun to Pipeline._arun
* Enhance docstring
* Add Sleeper docstring
* Add release notes
* ignore typing across the node
* make pylint happy
* skip pylint on needed unused import
* fix
* if a node has an arun method, use it
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-26 15:37:27 +02:00
ZanSara
6cb7d16e22
feat: preview extra ( #5869 )
...
* copy the deps list over from haystack-ai
* fix lazyimport usage
* keep jinja and openai
* fix ci
* reno
* separate out preview unit tests
* fix import error message for tika
* tika
* add preview to all
* wrap torch
* remove comment
* unwrap openai and jinja
2023-09-26 12:48:15 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) ( #5847 )
...
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder ( #5801 )
...
* first draft
* release notes
* avoid serializing secrets
* fix import order
* simplify serialization
* simplification
* monkeypatch delenv
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* docstrings updates
* fix test
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* rm comment
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Massimiliano Pippi
a5a0dc9f87
feat: optionally pass an id to the Document constructor ( #5862 )
...
* revert #5826
* do not use Optional
2023-09-22 11:09:59 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class ( #5863 )
...
* Remove GPT4Generator class
* Rename GPT35Generator to GPTGenerator
* Fix tests
* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation ( #5553 )
...
* initial ExtractiveReader implementation
* initial ExtractiveReader implementation
* fix mypy
* remove unused import
* Use AutoTokenizer
* rename reader to model
* combine no-answer logit
* support document slicing with proper probabilities
* add variable stride
* validate model
* fix typo
* make postprocessing easier to understand
* remove debug code
* set default reader
* add ExtractiveReader to __init__
* remove validation
* use new answer class
* add batching
* use v2 lazy imports
* move reader
* fix type hints
* add doc strings
* add nucleus sampling
* fix types
* fix doc string
* add no_answer parameter
* remove print statement
* fix gpu support
* turn into binary classification task
* change dataclass so document does not need to be provided for no answer
* add simple tests
* add unit tests
* rename reader folder to readers
* add integration tests
* fix type hints
* add release notes
* remove accidentally included test file
* remove unnecessary __init__ file
* revert __init__ file to main
* rename test script by adding test_ prefix
* undo accidental move of test script after renaming it
* remove use of bisect
* rename _flatten and _unflatten
* make variable name more intuitive
* remove type: ignore
* fix mypy issue
* refactor long tuple
* add doc strings
* explain HF test
* remove unnecessary top_k check
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) ( #5850 )
...
* Initial PyPDFToDocument implementation
* Remove progress bar
* Add release note
* Minor fix
* import check and dependency
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) ( #5824 )
...
* Move filter utilities
* Add MetadataRouter
* Add tests for MetadataRouter
* Add more tests
* Rename FileExtensionClassifier to FileExtensionRouter
* Add support for dates in filters
* Add tests
* Add release note
* Add release note
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
454988672e
feat: UrlCacheChecker ( #5841 )
...
* add UrlCacheChecker
* rename
* add tests
* reno
* pylint
* review feedback
2023-09-20 14:45:50 +02:00
bogdankostic
719c1c040c
feat: Add support for dates in filters (2.0) ( #5823 )
...
* Add support for dates in filters
* Add tests
* Add release note
* Update haystack/preview/utils/filters.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 12:05:56 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component ( #5724 )
...
* Add LinkContentFetcher
* Add release note
* Small fixes
* Fix pydocs
* PR feedback
* Remove handlers registration
* PR feedback
* adjustments
* improve tests
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* use the new document
* add docstring
* fix more tests
* mypy
* fix tests
* add e2e
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* fix test
* Fix end-of-file-fixer
* Post merge fixes
* Move e2e tests back into component
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Malte Pietsch
aa3cc3d5ae
feat: Add support for OpenAI's gpt-3.5-turbo-instruct model ( #5837 )
...
* support gpt-3.5-turbo-instruct
* add release note
2023-09-19 16:06:43 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck ( #5695 )
...
* bug: fix the date_fields request bottleneck
I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours.
After some investigation, I identified the bottleneck: the _get_date_properties function was called for every document, so a request to the Weaviate client was sent and awaited for each one.
To address this, I changed the code to invoke _get_date_properties only when the schema changes. This reduced the indexing time for the same 1 million vectors to approximately 90 minutes.
* bug: fix the date_fields request bottleneck
* fix: executed the pre commit hooks for #9341
2023-09-15 18:12:14 +02:00
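The fix described in #5695 above (look up date properties once per schema change rather than once per document) can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `DatePropertyCache`, `fetch_schema`, and `invalidate` are hypothetical names, and the fetch callable stands in for a real request to the Weaviate client.

```python
from typing import Callable, Dict, List


class DatePropertyCache:
    """Remembers which properties of an index are dates.

    Hypothetical helper: the schema is fetched on first use and again only
    after invalidate() is called, instead of once per written document.
    """

    def __init__(self, fetch_schema: Callable[[str], List[dict]]) -> None:
        # fetch_schema is an assumed stand-in for a (slow) client request.
        self._fetch_schema = fetch_schema
        self._cache: Dict[str, List[str]] = {}

    def get(self, index: str) -> List[str]:
        # Only hit the remote schema when nothing is cached for this index.
        if index not in self._cache:
            properties = self._fetch_schema(index)
            self._cache[index] = [
                prop["name"] for prop in properties if "date" in prop["dataType"]
            ]
        return self._cache[index]

    def invalidate(self, index: str) -> None:
        # Call this whenever the schema of the index changes.
        self._cache.pop(index, None)


# Usage: a fake fetcher that records how often the "remote" schema is requested.
calls: List[str] = []


def fake_fetch(index: str) -> List[dict]:
    calls.append(index)
    return [
        {"name": "created_at", "dataType": ["date"]},
        {"name": "title", "dataType": ["text"]},
    ]


cache = DatePropertyCache(fake_fetch)
for _ in range(1000):  # simulates writing 1000 documents
    date_fields = cache.get("Document")
```

With this shape, the number of schema requests depends on how often the schema changes rather than on the number of documents indexed.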
Silvano Cerza
5c04cd6ba2
Fix Document constructor accepting unused id parameter ( #5826 )
2023-09-15 17:03:03 +02:00
Chivereanu Radu
cab21da87b
fix: Support for Azure 16k gpt 35 deployment ( #5804 )
...
* Support for Azure 16k gpt 35 deployment
* releasenote added
---------
Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>
2023-09-14 18:01:22 +02:00
Ivana Zeljkovic
4bad202197
feat: Pinecone document store refactoring ( #5725 )
...
* Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels
* Fix parameter name in integration test
* Remove code under comment in add_type_metadata_filter method
* Fix mypy and pylint checks
* Add release note
* Apply minimal changes: rename method, update method docs and remove redundant method
* Mypy fixes
* Fix docstrings
* Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit
* Remove unnecessary attributes in PineconeDocumentStore
* Fix unit test
---------
Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io>
Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>
2023-09-14 11:46:47 +02:00
Darion
beb8853412
fix: return types of EntityExtractor to work with FAISSDocumentStore ( #5750 )
...
* Changed entity extractor score from type float32 to float64 and start/stop from int64 to int
* Added release notes
2023-09-14 10:49:54 +02:00
Stefano Fiorucci
28f42fbaab
move release note to the right directory ( #5808 )
2023-09-14 09:57:09 +02:00
Christian Clauss
6dd52d91b2
ci: Fix typos discovered by codespell ( #5778 )
...
* Fix typos discovered by codespell
* pylint: max-args = 38
2023-09-13 16:14:45 +02:00
Julian Risch
4ae0924ea0
feat!: Remove SklearnQueryClassifier ( #5779 )
...
* remove SklearnQueryClassifier
* reno
2023-09-13 12:55:33 +02:00
Stefano Fiorucci
283ecf2760
feat: add prefix and suffix to SentenceTransformersDocumentEmbedder ( #5745 )
...
* add prefix and suffix
* fix test
2023-09-13 12:55:06 +02:00
ZanSara
2c4d839b64
feat: GPT4Generator ( #5744 )
...
* add gpt4generator
* add e2e
* add tests
* reno
* fix e2e
* Update test/preview/components/generators/openai/test_gpt4_generator.py
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-13 10:07:09 +02:00
Christian Clauss
23f7308bec
ci: pre-commit autoupdate ( #5777 )
2023-09-12 14:34:41 +02:00
ZanSara
6e70d403f8
feat: Improve Document for Haystack 2.0 ( #5738 )
...
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* add docstring
* fix more tests
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* Update haystack/preview/document_stores/memory/document_store.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-11 17:40:00 +02:00
Stefano Fiorucci
2edf85f739
MemoryEmbeddingRetriever (2.0) ( #5726 )
...
* MemoryDocumentStore - Embedding retrieval draft
* add release notes
* fix mypy
* better comment
* improve return_embeddings handling
* MemoryEmbeddingRetriever - first draft
* address PR comments
* release note
* update docstrings
* update docstrings
* incorporated feedback
* add return_embedding to __init__
* rm leftover docstring
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-08 15:52:48 +02:00
Stefano Fiorucci
b7bea3ae9c
MemoryDocumentStore - Embedding retrieval (2.0) ( #5715 )
...
* MemoryDocumentStore - Embedding retrieval draft
* add release notes
* fix mypy
* better comment
* improve return_embeddings handling
* address PR comments
* update docstrings
* incorporated feedback
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-07 15:44:07 +02:00
ZanSara
63cbde7287
feat: GPT35Generator ( #5714 )
...
* chatgpt backend
* fix tests
* reno
* remove print
* helpers tests
* add chatgpt generator
* use openai sdk
* remove backend
* tests are broken
* fix tests
* stray param
* move _check_troncated_answers into the class
* wrong import
* rename function
* typo in test
* add openai deps
* mypy
* improve system prompt docstring
* typos update
* Update haystack/preview/components/generators/openai/chatgpt.py
* pylint
* Update haystack/preview/components/generators/openai/chatgpt.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Update haystack/preview/components/generators/openai/chatgpt.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Update haystack/preview/components/generators/openai/chatgpt.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* review feedback
* fix tests
* review feedback
* reno
* remove tenacity mock
* gpt35generator
* fix naming
* remove stray references to chatgpt
* fix e2e
* Update releasenotes/notes/chatgpt-llm-generator-d043532654efe684.yaml
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* add another test
* test wrong model name
* review feedback
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-07 10:06:57 +02:00
Vladimir Blagojevic
c5edb45c10
feat: Add SerperDevWebSearch Haystack 2.0 component ( #5712 )
...
* Add SerperDev
* Add release note
* PR Feedback
* Simplify, remove one-liner
* Update haystack/preview/components/websearch/serper_dev.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* Update haystack/preview/components/websearch/serper_dev.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* Fix formatting
* PR feedback
* Fix tests
* Function rename
* Remove scoring, update tests
* PR feedback
* Fix return
* small adjustments
* fix tests
* add e2e test
* fix release notes
* fix tests
* fix e2e
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-06 17:31:42 +02:00
bogdankostic
639f7cf888
chore: Rename AnswersBuilder to AnswerBuilder ( #5720 )
...
* Add AnswersBuilder
* Add tests for AnswersBuilder
* Add release note
* PR feedback
* Fix mypy
* Remove redundant check for number of groups
* Rename AnswersBuilder to AnswerBuilder
* Update test/preview/components/builders/test_answer_builder.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Rename reno file
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-05 14:34:22 +02:00
Silvano Cerza
2acc41ea85
Add PromptBuilder ( #5713 )
...
* Add PromptBuilder
* Update release note
* Add test
2023-09-05 12:22:21 +02:00
bogdankostic
a5b815690e
feat: Add AnswersBuilder component (2.0) ( #5701 )
...
* Add AnswersBuilder
* Add tests for AnswersBuilder
* Add release note
* PR feedback
* Fix mypy
* Remove redundant check for number of groups
* docstrings upd
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-04 21:16:20 +02:00
bogdankostic
11440395f4
fix: Set model_max_length in the Tokenizer of DefaultPromptHandler ( #5596 )
...
* Set model_max_length in tokenizer in prompt handler
* Add release note
2023-09-01 11:48:41 +02:00
ZanSara
5f1256ac7e
feat: generators (2.0) ( #5690 )
...
* add generators module
* add tests for module helper
* reno
* add another test
* move into openai
* improve tests
2023-08-31 17:33:12 +02:00
Fanli Lin
40d9f34e68
feat: enable passing use_fast to the underlying transformers' pipeline ( #5655 )
...
* copy instead of deepcopy
* fix pylint
* add use_fast
* add release note
* remove irrelevant changes
* black fix
* fix bug
* black
* bug fix
2023-08-30 10:25:18 +02:00