3597 Commits

Author SHA1 Message Date
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) (#5850)
* Initial PyPDFToDocument implementation

* Remove progress bar

* Add release note

* Minor fix

* import check and dependency

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
ZanSara
23fdef929e
chore: move GPT35Generator tests into the main test suite (#5844)
* move tests

* fix no-test-found error from pytest

* missing self

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-21 11:42:32 +02:00
Julian Risch
5820120f9b
fix: Change retriever return type to list of docs (#5848)
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-21 10:32:40 +02:00
ZanSara
28f5c4c780
fix: Whisper integration tests (#5851)
* fix tests

* add ffmpeg

* apt update for ffmpeg

* not run on windows
2023-09-21 00:14:07 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) (#5824)
* Move filter utilities

* Add MetadataRouter

* Add tests for MetadataRouter

* Add more tests

* Rename FileExtensionClassifier to FileExtensionRouter

* Add support for dates in filters

* Add tests

* Add release note

* Add release note

* Apply suggestions from code review

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
c933bcaa69
chore: move Whisper e2e tests into the main test suite (#5845)
* move whisper local tests

* remove e2e file

* move remote tests

* remove e2e file
2023-09-20 14:48:09 +02:00
ZanSara
454988672e
feat: UrlCacheChecker (#5841)
* add UrlCacheChecker

* rename

* add tests

* reno

* pylint

* review feedback
2023-09-20 14:45:50 +02:00
ZanSara
ea2a5595ca
add missing dependency (#5849) 2023-09-20 12:57:53 +02:00
bogdankostic
719c1c040c
feat: Add support for dates in filters (2.0) (#5823)
* Add support for dates in filters

* Add tests

* Add release note

* Update haystack/preview/utils/filters.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 12:05:56 +02:00
ZanSara
44f0c468ac
move websearch tests back to the main test suite (#5842) 2023-09-20 11:55:18 +02:00
bogdankostic
57d33ee6da
ci: Run preview integration tests in CI (#5843)
* Run preview integration tests in CI

* Only install inference extra
2023-09-20 11:54:41 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component (#5724)
* Add LinkContentFetcher

* Add release note

* Small fixes

* Fix pydocs

* PR feedback

* Remove handlers registration

* PR feedback

* adjustments

* improve tests

* initial draft

* tests

* add proposal

* proposal number

* reno

* fix tests and usage of content and content_type

* update branch & fix more tests

* mypy

* use the new document

* add docstring

* fix more tests

* mypy

* fix tests

* add e2e

* review feedback

* improve __str__

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/preview/dataclasses/document.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* improve __str__

* fix tests

* fix more tests

* fix test

* Fix end-of-file-fixer

* Post merge fixes

* Move e2e tests back into component

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM (#5833)
* ci: Simplify Python code with ruff rules SIM

* Revert #5828

* ruff --select=I --fix haystack/modeling/infer.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
Stefano Fiorucci
de84a95970
separate classes and tests (#5819)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-19 19:21:49 +02:00
Malte Pietsch
aa3cc3d5ae
feat: Add support for OpenAI's gpt-3.5-turbo-instruct model (#5837)
* support gpt-3.5-turbo-instruct

* add release note
2023-09-19 16:06:43 +02:00
Christian Clauss
41126397d6
Revert "ci: Speed up pylint GitHub Action (#5828)" (#5832)
This reverts commit d49c86c845ef9ba5bfc17909cd6cf456910516e1.
2023-09-18 10:05:17 +02:00
Christian Clauss
d49c86c845
ci: Speed up pylint GitHub Action (#5828) 2023-09-16 16:30:13 +02:00
Christian Clauss
66b8b6656c
test: Fix the test_nin_filter_embedding() function (#5829)
* Fix the test_nin_filter_embedding() function

* mypy: type: ignore[arg-type]
2023-09-16 16:28:22 +02:00
Christian Clauss
91ab90a256
perf: Python performance improvements with ruff C4 and PERF fixes (#5803)
* Python performance improvements with ruff C4 and PERF

* pre-commit fixes

* Revert changes to examples/basic_qa_pipeline.py

* Revert changes to haystack/preview/testing/document_store.py

* revert releasenotes

* Upgrade to ruff v0.0.290
2023-09-16 16:26:07 +02:00
Christian Clauss
1bc03ddc73
ci: Fix all ruff pyflakes errors except unused imports (#5820)
* ci: Fix all ruff pyflakes errors except unused imports

* Delete releasenotes/notes/fix-some-pyflakes-errors-69a1106efa5d0203.yaml
2023-09-15 18:30:33 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck (#5695)
* bug: fix the date_fields request bottleneck

I encountered a performance issue while attempting to index 1 million vectors. Although the Weaviate instance showed low utilization, the process was estimated to take around 10 hours.

After some investigation, I identified the bottleneck: the _get_date_properties function was called for every document, so a request to the Weaviate client was sent and awaited for each document.

To address this, I changed the code to invoke _get_date_properties only when the schema changes. This reduced the indexing time for the same 1 million vectors to approximately 90 minutes.

* bug: fix the date_fields request bottleneck

* fix: executed the pre commit hooks for #9341
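
The optimization described in this commit message (look up the date properties once, then again only after a schema change) can be sketched roughly as follows. This is an illustrative sketch, not the code from the commit: `FakeSchemaClient`, `DatePropertyCache`, `mark_schema_changed`, and the dict-based schema are hypothetical names and shapes assumed for the example, and no real Weaviate client call is used.

```python
from typing import Dict, List


class FakeSchemaClient:
    """Hypothetical stand-in for a remote schema API; counts requests made."""

    def __init__(self, schema: Dict[str, str]) -> None:
        self.schema = schema
        self.requests = 0

    def get_schema(self) -> Dict[str, str]:
        self.requests += 1  # each call models one network round trip
        return dict(self.schema)


class DatePropertyCache:
    """Caches date-typed property names; refetches only after a schema change."""

    def __init__(self, client: FakeSchemaClient) -> None:
        self._client = client
        self._date_fields: List[str] = []
        self._stale = True  # nothing has been fetched yet

    def mark_schema_changed(self) -> None:
        # Assumed hook: call this wherever the schema is modified.
        self._stale = True

    def date_fields(self) -> List[str]:
        if self._stale:
            schema = self._client.get_schema()
            self._date_fields = [
                name for name, dtype in schema.items() if dtype == "date"
            ]
            self._stale = False
        return self._date_fields


client = FakeSchemaClient({"title": "text", "published": "date"})
cache = DatePropertyCache(client)

# Simulate indexing many documents: the schema is requested once, not per document.
for _ in range(1000):
    cache.date_fields()
```

With this shape, the per-document cost is a local attribute read, and the remote request happens only when `mark_schema_changed()` has been called since the last lookup.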
2023-09-15 18:12:14 +02:00
Silvano Cerza
5c04cd6ba2
Fix Document constructor accepting unused id parameter (#5826) 2023-09-15 17:03:03 +02:00
Stefano Fiorucci
771113c901
move ruff after black (#5825) 2023-09-15 16:13:02 +02:00
Chivereanu Radu
cab21da87b
fix: Support for Azure 16k gpt 35 deployment (#5804)
* Support for Azure 16k gpt 35 deployment

* releasenote added

---------

Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>
2023-09-14 18:01:22 +02:00
Massimiliano Pippi
c7971a809d
ci: skip mandatory release notes check when not needed (#5817) 2023-09-14 17:00:41 +02:00
Christian Clauss
9405eb90ee
ci: Fix invalid escape sequences in Python code (#5802)
* ci: Use ruff in pre-commit to further limit complexity

* Fix invalid escape sequences in Python code

* Delete releasenotes/notes/ruff-4d2504d362035166.yaml
2023-09-14 16:42:48 +02:00
Massimiliano Pippi
6fc12a2bd1
ci: run apt-get update (#5816)
* run apt-get update

* run when changing the workflow file
2023-09-14 16:37:42 +02:00
ZanSara
9056c43240
fix: remove __future__ import from pinecone.py (#5813)
* remove future import

* fix forward reference
2023-09-14 16:28:39 +02:00
Stefano Fiorucci
1c69070db6
make MemoryEmbeddingRetriever act in non-batch mode (#5809) 2023-09-14 15:37:20 +02:00
bogdankostic
1a212420b7
refactor: Move filter utilities (2.0) (#5797)
* Move filter utilities

* PR feedback
2023-09-14 13:23:53 +02:00
Stefano Fiorucci
ad5b615503
make SentenceTransformersTextEmbedder non batch (#5811) 2023-09-14 12:38:24 +02:00
Ivana Zeljkovic
4bad202197
feat: Pinecone document store refactoring (#5725)
* Refactor codebase so that doc_type metadata is used instead of namespaces to distinguish between documents without embeddings, documents with embeddings, and labels

* Fix parameter name in integration test

* Remove code under comment in add_type_metadata_filter method

* Fix mypy and pylint checks

* Add release note

* Apply minimal changes: rename method, update method docs and remove redundant method

* Mypy fixes

* Fix docstrings

* Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit

* Remove unnecessary attributes in PineconeDocumentStore

* Fix unit test

---------

Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io>
Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>
2023-09-14 11:46:47 +02:00
Darion
beb8853412
fix: return types of EntityExtractor to work with FAISSDocumentStore (#5750)
* Changed entity extractor score from type float32 to float64 and start/stop from int64 to int

* Added release notes
2023-09-14 10:49:54 +02:00
Stefano Fiorucci
28f42fbaab
move release note to the right directory (#5808) 2023-09-14 09:57:09 +02:00
Christian Clauss
6dd52d91b2
ci: Fix typos discovered by codespell (#5778)
* Fix typos discovered by codespell

* pylint: max-args = 38
2023-09-13 16:14:45 +02:00
Christian Clauss
30ca042370
ci: Use ruff in pre-commit to further limit code complexity (#5783)
* ci: Use ruff in pre-commit to further limit complexity

* Delete releasenotes/notes/ruff-4d2504d362035166.yaml

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-13 15:18:16 +02:00
ZanSara
5888fb7052
make MemoryBM25Retriever non batch (#5768) 2023-09-13 15:11:47 +02:00
Shantanu
027980358a
Use newer tiktoken (#5785)
* Use newer tiktoken

* reno

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-13 15:11:21 +02:00
Stefano Fiorucci
cfc75dfdd5
rm sklearn from query-classifier.yml (#5796) 2023-09-13 15:03:22 +02:00
Silvano Cerza
c23cac3215
Try to send event to Datadog only if possible (#5795) 2023-09-13 14:10:30 +02:00
Julian Risch
4ae0924ea0
feat!: Remove SklearnQueryClassifier (#5779)
* remove SklearnQueryClassifier

* reno
2023-09-13 12:55:33 +02:00
Stefano Fiorucci
283ecf2760
feat: add prefix and suffix to SentenceTransformersDocumentEmbedder (#5745)
* add prefix and suffix

* fix test
2023-09-13 12:55:06 +02:00
ZanSara
335a09bc1d
feat: make AnswerBuilder non batch (#5766)
* make answerbuilder non batch

* fix mypy

* review feedback

* mypy

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-09-13 12:01:16 +02:00
Stefano Fiorucci
784034ffc3
Revert "build(deps): bump readmeio/rdme from 8.3.1 to 8.6.6 (#5789)" (#5792)
This reverts commit 55a2e7ab7fc16e4e311ea994b95553031711a506.
2023-09-13 11:56:42 +02:00
dependabot[bot]
55a2e7ab7f
build(deps): bump readmeio/rdme from 8.3.1 to 8.6.6 (#5789)
Bumps [readmeio/rdme](https://github.com/readmeio/rdme) from 8.3.1 to 8.6.6.
- [Release notes](https://github.com/readmeio/rdme/releases)
- [Changelog](https://github.com/readmeio/rdme/blob/next/CHANGELOG.md)
- [Commits](https://github.com/readmeio/rdme/compare/8.3.1...8.6.6)

---
updated-dependencies:
- dependency-name: readmeio/rdme
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-13 11:03:12 +02:00
Silvano Cerza
7e544d4f60
Fix license compliance workflow (#5791)
* Formatting

* Try to send event to Datadog only if possible
2023-09-13 10:43:06 +02:00
dependabot[bot]
e688d3dddb
build(deps): bump aws-actions/configure-aws-credentials (#5790)
Bumps [aws-actions/configure-aws-credentials](https://github.com/aws-actions/configure-aws-credentials) from 2.2.0 to 4.0.0.
- [Release notes](https://github.com/aws-actions/configure-aws-credentials/releases)
- [Changelog](https://github.com/aws-actions/configure-aws-credentials/blob/main/CHANGELOG.md)
- [Commits](5fd3084fc3...8c3f20df09)

---
updated-dependencies:
- dependency-name: aws-actions/configure-aws-credentials
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-09-13 10:25:54 +02:00
Massimiliano Pippi
de6c57e20b
let dependabot update github actions (#5788) 2023-09-13 10:23:30 +02:00
ZanSara
2c4d839b64
feat: GPT4Generator (#5744)
* add gpt4generator

* add e2e

* add tests

* reno

* fix e2e

* Update test/preview/components/generators/openai/test_gpt4_generator.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-13 10:07:09 +02:00
Christian Clauss
75dc60b0bb
ci: Upgrade GitHub Actions (#5787) 2023-09-13 09:58:47 +02:00