Ashwin Mathur
101bd816f8
refactor: Remove api_key from serialization of AzureOCRDocumentConverter and SerperDevWebSearch ( #6150 )
...
* Remove api_key from serialization of AzureOCRDocumentConverter
* Remove api_key from serialization of SerperDevWebSearch
* Add release notes
* Add init_fail_without_api_key test for SerperDevWebSearch
* Rename env var to AZURE_AI_API_KEY
2023-10-23 12:26:23 +02:00
Silvano Cerza
c8d162ced9
refactor: Change Document.embedding type to list of floats ( #6135 )
...
* Change Document.embedding type
* Add release notes
* Fix document_store testing
* Fix pylint
* Fix tests
2023-10-23 12:26:05 +02:00
Silvano Cerza
8f289282f1
refactor: Remove id_hash_keys field from Document ( #6127 )
...
* Remove id_hash_keys from Document
* Update release notes
* Remove unused import
2023-10-23 10:35:24 +02:00
Silvano Cerza
2a45e7cc06
refactor: Remove id_hash_keys from all file_converters ( #6125 )
...
* Remove id_hash_keys from DocumentCleaner
* Remove id_hash_keys from TextDocumentSplitter
* Remove id_hash_keys from all file_converters
* Fix pylint failure
* Update docstrings
2023-10-20 16:22:14 +02:00
Silvano Cerza
3d69094f9a
refactor: Remove id_hash_keys from TextDocumentSplitter ( #6124 )
...
* Remove id_hash_keys from DocumentCleaner
* Remove id_hash_keys from TextDocumentSplitter
2023-10-20 15:18:28 +02:00
Silvano Cerza
ec376c7dbd
Remove id_hash_keys from DocumentCleaner ( #6123 )
2023-10-20 15:16:06 +02:00
Silvano Cerza
3f98bd9137
refactor: Rework Document.id generation ( #6122 )
...
* Rework Document id generation
* Fix tests
* Add release notes
* Fix failing integration test
* Remove score from Document id generation
* Enhance tests
* Update release notes
---------
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-20 10:34:28 +02:00
Stefano Fiorucci
ef40c7c728
refactor: make sure that Document's id_hash_keys has a valid value ( #6112 )
...
* fix handling id_hash_keys
* reno
* handle empty id_hash_keys in post_init
* fix
* reno
* test
2023-10-19 12:10:19 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 ( #6108 )
...
* whisper transcriber
* remove from/to_dict from builders
* remove from/to_dict from embedders
* remove from/to_dict from fetcher, file_converters
* remove from/to_dict from generators, preprocessors
* remove from/to_dict from ranker, reader
* remove from/to_dict from router, sampler, websearch
* pylint
* reno
* refactor import
* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components ( #6040 )
...
* move embedding backends
* use token in Sentence Transformers embeddings
* more compact token handling
* token parameter in reader
* add token to ranker
* release note
* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers ( #6076 )
...
* rename doc store and retrievers
* release note
* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list ( #6081 )
...
* Fix TextDocumentSplitter failing if run with empty list
* Release notes
* Simplify check
* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents ( #6083 )
...
* copy id_hash_keys in splitter and cleaner
* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling ( #6049 )
...
* first implementation
* release notes
* fixes
* tests
* better reno
* release note
2023-10-17 10:36:08 +02:00
Ivana Zeljkovic
2326f2f9fe
feat: Pinecone document store optimizations ( #5902 )
...
* Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type.
* Fix typo
* Add release note
* Fix mypy errors
* Remove unused import. Fix warning logging message.
* Update release note with description about limits for Starter index type in Pinecone
* Improve code base by:
- Adding new test cases for get_embedding_count method
- Fixing get_embedding_count method
- Improving delete documents
- Fix label retrieval
- Increase default batch size
- Improve get_document_count method
* Remove unused variable
* Fix mypy issues
2023-10-16 19:26:24 +02:00
ZanSara
660f84e6ef
feat: enable telemetry to pick up component data ( #5957 )
...
* add telemetry to pipelines 2.0
* only collect data if telemetry is on
* reno
* add downsampling
* typing
* manual tests
* pylint
* simplify code
* Update haystack/preview/telemetry/__init__.py
* look for _telemetry_data
* rather index by component type
* black
* mypy
* error handling
* comment
* review feedback & small improvements
* defaultdict
* stray changes
* try-catch
* method instead of attribute
* fixes
* remove print statements
* lint
* invert condition
* always send the first event of the day
* collect specs
* track 2nd and 3rd events too
* send first event and then max 1 event a minute
* rename constant
* black
* add test
2023-10-16 17:43:48 +02:00
Nicola Procopio
32e87d37c1
fixed join_docs.py concatenate ( #5970 )
...
* added hybrid search example
Added an example about hybrid search for faq pipeline on covid dataset
* formatted with black formatter
* renamed document
* fixed
* fixed typos
* added test
added test for hybrid search
* fixed whitespaces
* removed test for hybrid search
* fixed pylint
* commented logging
* fixed bug in join_docs.py _concatenate_results
* Update join_docs.py
updated comment
* format with black
* added releasenote on PR
* updated release notes
* updated test_join_documents
* updated test
* updated test
* Update test_join_documents.py
* formatted with black
* fixed test
* fixed
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-16 09:31:52 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 ( #5976 )
...
* remove whitespaces, substrings, regex, empty lines
* remove repeated substrings
* reno
* return empty string as shortest common ngram
* address first half of review feedback
* address second half of review feedback
* mention \f page separator for header/footer removal
* mention \f page separator for header/footer removal
* mark example usage as python code
2023-10-13 12:39:55 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation ( #6022 )
...
* draft
* still a raw draft
* still a raw draft
* improvements
* minimal impl ok
* tests
* reno
* better language
* examples of generation_kwargs
* incorporate feedback
* lg and format updates
* don't save valid str tokens
* fix style
---------
Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 ( #6026 )
...
* draft TextLanguageClassifier
* implement language detection with langdetect
* add unit test for logging message
* reno
* pylint
* change input from List[str] to str
* remove empty output connections
* add from_dict/to_dict tests
* mark example usage as python code
2023-10-13 10:30:49 +02:00
ZanSara
adf7e49af3
chore: review all extra ( #6029 )
2023-10-12 21:50:53 +02:00
Stefano Fiorucci
2c2549f13d
move embedding backends ( #6033 )
2023-10-12 17:52:28 +02:00
Vladimir Blagojevic
d51be9edac
Add top_k to SimilarityRanker ( #6036 )
2023-10-12 13:52:01 +02:00
Vladimir Blagojevic
3803d23ff6
feat: Update PyPDFToDocument to process ByteStream inputs ( #6021 )
...
* Update PyPDF converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/pypdf.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:52:08 +02:00
Vladimir Blagojevic
1a6a8863e8
feat: Update HTMLToDocument to handle ByteStream inputs ( #6020 )
...
* Update HTML converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/html.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:15:58 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream ( #5972 )
2023-10-10 17:48:31 +02:00
Vladimir Blagojevic
98215aec0d
feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) ( #5998 )
...
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-10 09:14:04 +02:00
DanShatford
07048791aa
feat: allow list of file paths in convert_files_to_docs ( #5961 )
...
* feat: allow list of file paths in `convert_files_to_docs`
* Fix validation
* Fix check errors
2023-10-09 20:19:03 +02:00
Vladimir Blagojevic
40b83d8a47
feat: Add TopPSampler Haystack 2.0 component ( #5924 )
2023-10-09 13:44:01 +02:00
Vladimir Blagojevic
1cdff6427e
feat: Add SimilarityRanker to Haystack 2.0 ( #5923 )
...
* Initial SimilarityRanker
2023-10-06 16:01:34 +02:00
Stefano Fiorucci
ccc9f010bb
fix: fix ChatGPT invocation layer (and add async support) ( #5979 )
...
* ChatGPT async
* release note
* fix tests
2023-10-05 18:43:26 +02:00
Vladimir Blagojevic
282419d82b
feat: Unfreeze Document in Haystack 2.0 ( #5974 )
...
* Unfreeze document
* Remove immutability test
2023-10-05 17:55:07 +02:00
Tobias Wochinger
d5d3a9eef4
chore: adapt deepset cloud sdk endpoint format for saving pipelines ( #5969 )
...
* chore: adapt to new endpoints formats
* docs: add release notes
2023-10-05 08:56:28 +02:00
Massimiliano Pippi
c2ec3f5fde
feat: add File type to preview package ( #5873 )
...
* add Blob type
* review feedback
* fix tests and naming
* Update add-blob-type-2a9476a39841f54d.yaml
* removed unused import
---------
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-04 17:23:12 +02:00
Stefano Fiorucci
cc70b4b613
deprecation ( #5954 )
2023-10-03 12:48:06 +02:00
Massimiliano Pippi
ac408134f4
feat: add support for async openai calls ( #5946 )
...
* add support for async openai calls
* add actual async call
* split the async api
* ask permission
* Update haystack/utils/openai_utils.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* Fix OpenAI content moderation tests
* Fix ChatGPT invocation layer tests
---------
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-10-03 10:42:21 +02:00
Massimiliano Pippi
0947f59545
feat: add async PromptNode run ( #5890 )
...
* add async promptnode
* Remove unnecessary calls to dict.keys()
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-29 08:40:01 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) ( #5907 )
2023-09-28 17:22:28 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder ( #5822 )
...
* first draft
* release note
* mypy fix
* fix test
* corrections
* pr feedback
* better secrets handling and new tests
* missing imports in embedders/__init__.py
* better format condition
* address feedback
2023-09-28 15:42:51 +02:00
ZanSara
83724b74e3
feat: Make metadata optional in AnswerBuilder ( #5909 )
...
* optional metadata
* improve docstring
2023-09-28 14:42:19 +02:00
Stefano Fiorucci
9340c572f9
alternative skipif conditions in azure ocr converter test ( #5906 )
2023-09-28 12:09:19 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) ( #5870 )
...
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) ( #5855 )
...
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Stefano Fiorucci
6aa471ac5e
chore: make preview integration tests reproducible ( #5871 )
...
* relax extractive reader integration tests
* force reader to CPU
* ensure integration tests reproducibility
* move set_all_seeds to testing package
2023-09-25 18:39:10 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) ( #5847 )
...
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
MichelBartels
4da43b6b05
Add link output to SerperDevWebSearch ( #5853 )
...
* add link output
* adjust tests
* fix test
* remove print statements
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 10:03:01 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder ( #5801 )
...
* first draft
* release notes
* avoid serializing secrets
* fix import order
* simplify serialization
* simplification
* monkeypatch delenv
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* docstrings updates
* fix test
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* rm comment
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Massimiliano Pippi
a5a0dc9f87
feat: optionally pass an id to the Document constructor ( #5862 )
...
* revert #5826
* do not use Optional
2023-09-22 11:09:59 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class ( #5863 )
...
* Remove GPT4Generator class
* Rename GPT35Generator to GPTGenerator
* Fix tests
* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation ( #5553 )
...
* initial ExtractiveReader implementation
* initial ExtractiveReader implementation
* fix mypy
* remove unused import
* Use AutoTokenizer
* rename reader to model
* combine no-answer logit
* support document slicing with proper probabilities
* add variable stride
* validate model
* fix typo
* make postprocessing easier to understand
* remove debug code
* set default reader
* add ExtractiveReader to __init__
* remove validation
* use new answer class
* add batching
* use v2 lazy imports
* move reader
* fix type hints
* add doc strings
* add nucleus sampling
* fix types
* fix doc string
* add no_answer parameter
* remove print statement
* fix gpu support
* turn into binary classification task
* change dataclass so document does not need to be provided for no answer
* add simple tests
* add unit tests
* rename reader folder to readers
* add integration tests
* fix type hints
* add release notes
* remove accidentally included test file
* remove unnecessary __init__ file
* revert __init__ file to main
* rename test script by adding test_ prefix
* undo accidentally moving of test script after renaming it
* remove use of bisect
* rename _flatten and _unflatten
* make variable name more intuitive
* remove type: ignore
* fix mypy issue
* refactor long tuple
* add doc strings
* explain HF test
* remove unnecessary top_k check
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00