3803 Commits

Author SHA1 Message Date
Silvano Cerza
2a45e7cc06
refactor: Remove id_hash_keys from all file_converters (#6125)
* Remove id_hash_keys from DocumentCleaner

* Remove id_hash_keys from TextDocumentSplitter

* Remove id_hash_keys from all file_converters

* Fix pylint failure

* Update docstrings
2023-10-20 16:22:14 +02:00
Silvano Cerza
3d69094f9a
refactor: Remove id_hash_keys from TextDocumentSplitter (#6124)
* Remove id_hash_keys from DocumentCleaner

* Remove id_hash_keys from TextDocumentSplitter
2023-10-20 15:18:28 +02:00
Silvano Cerza
ec376c7dbd
Remove id_hash_keys from DocumentCleaner (#6123)
2023-10-20 15:16:06 +02:00
Tuana Çelik
366f0366bf
Update gpt.py docstring (#6129)
* Update gpt.py docstring

Noticed this slight issue in docstrings for GPTGenerator, so submitting a fix.

* Update haystack/preview/components/generators/openai/gpt.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-20 14:45:05 +02:00
Julian Risch
64649312bc
build: Upgrade to canals==0.9.0 (#6133)
* build: Upgrade to `canals==0.9.0`

* reno
2023-10-20 13:00:24 +02:00
Silvano Cerza
3f98bd9137
refactor: Rework Document.id generation (#6122)
* Rework Document id generation

* Fix tests

* Add release notes

* Fix failing integration test

* Remove score from Document id generation

* Enhance tests

* Update release notes

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-20 10:34:28 +02:00
Sunil Kumar Dash
957d1be68d
Enrich documents with embeddings for OpenAIDocumentEmbedder (#6126)
* Enrich documents with embeddings

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* add release note

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* try to fix typing

* change embedding field type in Document

---------

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-19 18:29:16 +02:00
Stefano Fiorucci
fe261b9986
mv StopWordsCriteria under lazy_import (#6128)
2023-10-19 17:48:59 +02:00
Stefano Fiorucci
025418c10e
rm unnecessary deps (#6121)
2023-10-19 17:01:02 +02:00
Stefano Fiorucci
ff06da8712
pin fastapi (#6120)
2023-10-19 13:30:50 +02:00
Stefano Fiorucci
ef40c7c728
refactor: make sure that Document's id_hash_keys has a valid value (#6112)
* fix handling id_hash_keys

* reno

* handle empty id_hash_keys in post_init

* fix

* reno

* test
2023-10-19 12:10:19 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
* whisper transcriber

* remove from/to_dict from builders

* remove from/to_dict from embedders

* remove from/to_dict from fetcher, file_converters

* remove from/to_dict from generators, preprocessors

* remove from/to_dict from ranker, reader

* remove from/to_dict from router, sampler, websearch

* pylint

* reno

* refactor import

* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
6df077cbb4
add more-itertools to preview dependencies (#6110)
2023-10-18 17:53:48 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components (#6040)
* move embedding backends

* use token in Sentence Transformers embeddings

* more compact token handling

* token parameter in reader

* add token to ranker

* release note

* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers (#6076)
* rename doc store and retrievers

* release note

* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list (#6081)
* Fix TextDocumentSplitter failing if run with empty list

* Release notes

* Simplify check

* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
* copy id_hash_keys in splitter and cleaner

* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling (#6049)
* first implementation

* release notes

* fixes

* tests

* better reno

* release note
2023-10-17 10:36:08 +02:00
Stefano Fiorucci
c4187eeebe
CI: make only test_preview run when preview e2e tests are changed (#6078)
* make only test_preview workflow run when e2e tests are modified

* revert wrong changes to test_preview

* revert wrong order
2023-10-17 10:06:39 +02:00
Ivana Zeljkovic
2326f2f9fe
feat: Pinecone document store optimizations (#5902)
* Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type.

* Fix typo

* Add release note

* Fix mypy errors

* Remove unused import. Fix warning logging message.

* Update release note with description about limits for Starter index type in Pinecone

* Improve code base by:
- Adding new test cases for get_embedding_count method
- Fixing get_embedding_count method
- Improving delete documents
- Fix label retrieval
- Increase default batch size
- Improve get_document_count method

* Remove unused variable

* Fix mypy issues
2023-10-16 19:26:24 +02:00
ZanSara
b43fc35deb
chore: Telemetry for embedder classes (#6072)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black

* add telemetry for embedders

* add test

* remove boolean values
2023-10-16 18:25:28 +02:00
Stefano Fiorucci
167700de4d
CI: make only linting_preview run on preview e2e tests (#6077)
* apply only linting_preview to preview e2e tests

* add paths to linting_skipper
2023-10-16 18:18:17 +02:00
ZanSara
22a24c8477
chore: HuggingFaceLocalGenerator telemetry (#6070)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black

* add telemetry details to HuggingFaceLocalGenerator

* add test

* check if the model is a string

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-16 18:02:50 +02:00
ZanSara
6b097928ff
chore: telemetry for rankers, readers, retrievers, writers (#6075)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black

* add test

* add telemetry for rankers readers and retrievers

* get only the type of docstore, not the whole object
2023-10-16 18:02:24 +02:00
ZanSara
490de4e119
feat: add _get_telemetry_data to GPTGenerator (#5958)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* add telemetry_data to gptgenerator

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* change attribute to method

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black
2023-10-16 17:45:56 +02:00
ZanSara
660f84e6ef
feat: enable telemetry to pick up component data (#5957)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black

* add test
2023-10-16 17:43:48 +02:00
Silvano Cerza
740436319a
Add missing preview dependency (#6074)
2023-10-16 16:21:49 +02:00
Silvano Cerza
53838ace5a
chore: Fix preview module importing from old haystack (#6052)
* Fix import in link_content.py

* Fix another import

* Move __version__ to separate file to fix circular import

* Fix mypy complaining about redefinition of __version__
2023-10-16 15:44:40 +02:00
Stefano Fiorucci
e629a5d467
add posthog (#6050)
2023-10-16 15:44:24 +02:00
Silvano Cerza
a476486d34
chore: Fix mypy errors when running preview linting in CI (#6073)
* Fix mypy errors when running preview linting in CI

* Trigger CI

* Revert "Trigger CI"

This reverts commit 9b47d19279eaa4e020c645ed1c18c8263acd7695.

* Revert "Fix mypy errors when running preview linting in CI"

This reverts commit 78b5d92ad8085c9b61848ecf6de242bea67f3281.

* Ignore mypy errors

* Trigger CI

* Revert "Trigger CI"

This reverts commit 62050ec0fd057b2efb2f7f0a13da42b0eeabb6b8.
2023-10-16 15:00:58 +02:00
Silvano Cerza
c78e1a7eb3
Add a workflow to verify haystack.preview doesn't import non preview modules (#6053)
2023-10-16 09:36:45 +02:00
Nicola Procopio
32e87d37c1
fixed join_docs.py concatenate (#5970)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* fixed bug in join_docs.py _concatenate_results

* Update join_docs.py

updated comment

* format with black

* added releasenote on PR

* updated release notes

* updated test_join_documents

* updated test

* updated test

* Update test_join_documents.py

* formatted with black

* fixed test

* fixed

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-16 09:31:52 +02:00
Silvano Cerza
92ae169bdf
Proposal: Document Stores filter specification for Haystack 2.x (#6001)
* Filters rework proposal

* Update proposal with received feedback
2023-10-16 09:26:23 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00
Bilge Yücel
ad25041618
Remove old Cohere models and add aliases for existing ones (#6007)
* Remove old cohere models

* Add aliases for the existing models according to Cohere documentation

* Add release note

* put cohere embedding models in a constant

* update doc strings

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-13 12:08:26 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation (#6022)
* draft

* still a raw draft

* still a raw draft

* improvements

* minimal impl ok

* tests

* reno

* better language

* examples of generation_kwargs

* incorporate feedback

* lg and format updates

* don't save valid str tokens

* fix style

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Daria Fokina
41fd0c5458
docs: adding missing docstrings for run and run_batch methods (#5609)
* docstrings for run methods

* updates from pr review

* wrong article

* fix style

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-13 11:23:26 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 (#6026)
* draft TextLanguageClassifier

* implement language detection with langdetect

* add unit test for logging message

* reno

* pylint

* change input from List[str] to str

* remove empty output connections

* add from_dict/to_dict tests

* mark example usage as python code
2023-10-13 10:30:49 +02:00
ZanSara
110aacdc35
feat: add basic telemetry to pipelines 2.0 (#5929)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* rather index by component type

* black

* mypy

* review feedback & small improvements

* defaultdict

* stray changes

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* invert condition

* linting
2023-10-13 09:31:51 +02:00
Akash Goyal
988fa61f84
Addition to the text in ValueError when creating a prompt node to inf… (#6000)
* Addition to the text in ValueError when creating a prompt node to inform users to double check they have authorisation for the loaded model and have logged into the huggingface cli

* Update haystack/nodes/prompt/prompt_model.py

Accepted the suggested changes to the value error text

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-13 09:05:21 +02:00
Julian Risch
59e89b1031
test: Remove anthropic from "getting started" example test (#6024)
2023-10-12 22:36:49 +02:00
ZanSara
adf7e49af3
chore: review all extra (#6029)
2023-10-12 21:50:53 +02:00
Stefano Fiorucci
2c2549f13d
move embedding backends (#6033)
2023-10-12 17:52:28 +02:00
Vladimir Blagojevic
d51be9edac
Add top_k to SimilarityRanker (#6036)
2023-10-12 13:52:01 +02:00
Vladimir Blagojevic
4b8b6e9191
Use forward reference for AnalyzeResult (#6030)
2023-10-11 16:33:02 +02:00
Vladimir Blagojevic
3803d23ff6
feat: Update PyPDFToDocument to process ByteStream inputs (#6021)
* Update PyPDF converter

* Add mixed source unit test

* Update haystack/preview/components/file_converters/pypdf.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:52:08 +02:00
Vladimir Blagojevic
1a6a8863e8
feat: Update HTMLToDocument to handle ByteStream inputs (#6020)
* Update HTML converter

* Add mixed source unit test

* Update haystack/preview/components/file_converters/html.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:15:58 +02:00
Julian Risch
12fe0364dc
test: Utility to compare two lists of documents for equality (#6005)
* check that sorted lists contain same docs

* fix broken tests
2023-10-11 08:16:41 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream (#5972)
2023-10-10 17:48:31 +02:00
Nicola Procopio
c102b152dc
fix: Run update_embeddings in examples (#6008)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* updated hybrid search example

* release notes

* Update hybrid_search_faq_pipeline.py-815df846dca7e872.yaml

* Update hybrid_search_faq_pipeline.py

* mention hybrid search example in release notes

* reduce installed dependencies in examples test workflow

* do not install cuda dependencies

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* keep roberta-base model and inference extra

* pylint

* disable pylint no-logging-basicconfig rule

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-10 16:38:52 +02:00