3174 Commits

Author SHA1 Message Date
ZanSara
81b2e83d04
feat: separate out preview tests (#5639)
* add preview workflows

* feedback

* feedback

* use preview extra

* remove coverage and add separate e2e

* rename workflow file for consistency

* trigger ci

* undo trigger

* torch import in testing

* add deps to unit tests

* feedback

* run container instead of service

* comment

* add if statement

* fix tika version

* separate out win integration tests

* separate out all CIs

* try installing docker on macos

* exclude tika

* remove tika docker
2023-09-29 13:16:08 +02:00
bogdankostic
d61df24b27
chore: Remove classifiers directory from preview package (#5918) 2023-09-29 10:38:33 +02:00
Massimiliano Pippi
0947f59545
feat: add async PromptNode run (#5890)
* add async promptnode

* Remove unnecessary calls to dict.keys()

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-29 08:40:01 +02:00
ZanSara
578f2b4bbf
feat: update canals to 0.8.1 (#5900)
* Update canals to 0.8.1

* scale up runner
2023-09-28 17:50:46 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) (#5907) 2023-09-28 17:22:28 +02:00
Massimiliano Pippi
dfa48eece9
clean up the Slack integrations (#5908) 2023-09-28 15:49:19 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder (#5822)
* first draft

* release note

* mypy fix

* fix test

* corrections

* pr feedback

* better secrets handling and new tests

* missing imports in embedders/__init__.py

* better format condition

* address feedback
2023-09-28 15:42:51 +02:00
ZanSara
83724b74e3
feat: Make metadata optional in AnswerBuilder (#5909)
* optional metadata

* improve docstring
2023-09-28 14:42:19 +02:00
Stefano Fiorucci
9340c572f9
alternative skipif conditions in azure ocr converter test (#5906) 2023-09-28 12:09:19 +02:00
Silvano Cerza
35ec8cc8fb
Rework evaluation and metrics calculation for Haystack 2.x (#5794)
* draft requirements from discussion

* Add some more information

* Update proposal given new feedback

* More drawbacks

* Decision drivers

* Nitpick

* Summary

* PR number

* Mark code snippets

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* Link correct issue

* Add missing word

* More context on blind evaluation

* Rephrase confusing sentence

* Add a more detailed code example

* Ignore mypy and pylint in example file

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-28 00:51:51 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870)
* draft split by word, sentence, passage

* naive way to split sentences without nltk

* reno

* add tests

* make input list of docs, review feedback

* add source_id and more validation

* update docstrings

* add split delimiters back to strings

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
ZanSara
6665e8ec7f
Add preview extra to e2e tests (#5898) 2023-09-27 10:36:00 +02:00
Stefano Fiorucci
a4787e7b52
pin setuptools_scm only for windows (#5894) 2023-09-26 18:39:50 +02:00
Stefano Fiorucci
61877056ef
pin setuptools_scm in the metrics extra (#5891) 2023-09-26 17:12:59 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) (#5855)
* Add AzureOCRDocumentConverter

* Add tests

* Add release note

* Formatting

* update docstrings

* Apply suggestions from code review

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* PR feedback

* PR feedback

* PR feedback

* Add secrets as environment variables

* Adapt test

* Add azure dependency to CI

* Add azure dependency to CI

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Stefano Fiorucci
c8398eeb6d
test: e2e test for Extractive QA Pipeline (#5879)
* e2e test for e. qa pipeline
2023-09-26 15:44:34 +02:00
Silvano Cerza
cf7f0ebc22
Add Pipelines async run (#5864)
* Add Pipeline.arun()

* Sleeper node

* Fix async running

* Add e2e tests

To run a Pipeline that doesn't have any async node in async mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline

To run a Pipeline that has a single async node in concurrent mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline

To run a Pipeline that has a single async node in sequential mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline

* Remove unused _adispatch_run method

* Make Pipeline.run work with async nodes

* Revert "Make Pipeline.run work with async nodes"

This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431.

* Rename Pipeline.arun to Pipeline._arun

* Enhance docstring

* Add Sleeper docstring

* Add release notes

* ignore typing across the node

* make pylint happy

* skip pylint on needed unused import

* fix

* if a node has an arun method, use it

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-26 15:37:27 +02:00
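
The "if a node has an arun method, use it" step in cf7f0ebc22 can be illustrated with a minimal sketch. This is not the code from the commit: `SyncNode`, `AsyncNode`, and `run_node` are hypothetical names, and the real Pipeline dispatch may differ.

```python
import asyncio
from typing import Any, List


class SyncNode:
    """Hypothetical node that only offers a blocking run()."""

    def run(self, query: str) -> str:
        return query.upper()


class AsyncNode:
    """Hypothetical node that offers both run() and an async arun()."""

    def run(self, query: str) -> str:
        return query.upper()

    async def arun(self, query: str) -> str:
        await asyncio.sleep(0)  # stands in for real awaitable I/O
        return query[::-1]


async def run_node(node: Any, query: str) -> str:
    # Prefer the async entry point when the node provides one,
    # otherwise fall back to the synchronous run().
    arun = getattr(node, "arun", None)
    if arun is not None:
        return await arun(query)
    return node.run(query)


async def main() -> List[str]:
    return [await run_node(node, "abc") for node in (SyncNode(), AsyncNode())]


results = asyncio.run(main())
```

Looking the method up with `getattr` keeps nodes without `arun` working unchanged, which matches the intent of the bullet as far as it can be read from the message.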
github-actions[bot]
8d26057566
Update unstable version (#5887)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
v1.22.0-rc0
2023-09-26 15:23:14 +02:00
ZanSara
6cb7d16e22
feat: preview extra (#5869)
* copy the deps list over from haystack-ai

* fix lazyimport usage

* keep jinja and openai

* fix ci

* reno

* separate out preview unit tests

* fix import error message for tika

* tika

* add preview to all

* wrap torch

* remove comment

* unwrap openai and jinja
v1.21.0-rc0
2023-09-26 12:48:15 +02:00
Stefano Fiorucci
e9d34fc0e3
test: e2e tests for RAG Pipelines (#5876)
* relax extractive reader integration tests

* force reader to CPU

* ensure integration tests reproducibility

* e2e rag tests

* move set_all_seeds to testing package

* refine rag tests

* Update e2e/preview/pipelines/test_rag_pipelines.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-26 11:49:50 +02:00
Stefano Fiorucci
6aa471ac5e
chore: make preview integration tests reproducible (#5871)
* relax extractive reader integration tests

* force reader to CPU

* ensure integration tests reproducibility

* move set_all_seeds to testing package
2023-09-25 18:39:10 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) (#5847)
* Add TikaFileToDocument component

* Add tests

* Add tika service to CI

* Add release note

* Change name

* PR feedback

* Fix naming in tests

* Fix tika version in CI

* Update tests

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
MichelBartels
4da43b6b05
Add link output to SerperDevWebSearch (#5853)
* add link output

* adjust tests

* fix test

* remove print statements

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 10:03:01 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder (#5801)
* first draft

* release notes

* avoid serializing secrets

* fix import order

* simplify serialization

* simplification

* monkeypatch delenv

* Update haystack/preview/components/embedders/openai_text_embedder.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* docstrings updates

* fix test

* Update haystack/preview/components/embedders/openai_text_embedder.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* rm comment

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Massimiliano Pippi
a5a0dc9f87
feat: optionally pass an id to the Document constructor (#5862)
* revert #5826

* do not use Optional
2023-09-22 11:09:59 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class (#5863)
* Remove GPT4Generator class

* Rename GPT35Generator to GPTGenerator

* Fix tests

* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation (#5553)
* initial ExtractiveReader implementation

* initial ExtractiveReader implementation

* fix mypy

* remove unused import

* Use AutoTokenizer

* rename reader to model

* combine no-answer logit

* support document slicing with proper probabilities

* add variable stride

* validate model

* fix typo

* make postprocessing easier to understand

* remove debug code

* set default reader

* add ExtractiveReader to __init__

* remove validation

* use new answer class

* add batching

* use v2 lazy imports

* move reader

* fix type hints

* add doc strings

* add nucleus sampling

* fix types

* fix doc string

* add no_answer parameter

* remove print statement

* fix gpu support

* turn into binary classification task

* change dataclass so document does not need to be provided for no answer

* add simple tests

* add unit tests

* rename reader folder to readers

* add integration tests

* fix type hints

* add release notes

* remove accidentally included test file

* remove unnecessary __init__ file

* revert __init__ file to main

* rename test script by adding test_ prefix

* undo accidental move of test script after renaming it

* remove use of bisect

* rename _flatten and _unflatten

* make variable name more intuitive

* remove type: ignore

* fix mypy issue

* refactor long tuple

* add doc strings

* explain HF test

* remove unnecessary top_k check

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) (#5850)
* Initial PyPDFToDocument implementation

* Remove progress bar

* Add release note

* Minor fix

* import check and dependency

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
ZanSara
23fdef929e
chore: move GPT35Generator tests in the main test suite (#5844)
* move tests

* fix no-test-found error from pytest

* missing self

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-21 11:42:32 +02:00
Julian Risch
5820120f9b
fix: Change retriever return type to list of docs (#5848)
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-21 10:32:40 +02:00
ZanSara
28f5c4c780
fix: Whisper integration tests (#5851)
* fix tests

* add ffmpeg

* apt update for ffmpeg

* not run on windows
2023-09-21 00:14:07 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) (#5824)
* Move filter utilities

* Add MetadataRouter

* Add tests for MetadataRouter

* Add more tests

* Rename FileExtensionClassifier to FileExtensionRouter

* Add support for dates in filters

* Add tests

* Add release note

* Add release note

* Apply suggestions from code review

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
c933bcaa69
chore: move Whisper e2e tests in the main tests suite (#5845)
* move whisper local tests

* remove e2e file

* move remote tests

* remove e2e file
2023-09-20 14:48:09 +02:00
ZanSara
454988672e
feat: UrlCacheChecker (#5841)
* add UrlCacheChecker

* rename

* add tests

* reno

* pylint

* review feedback
2023-09-20 14:45:50 +02:00
ZanSara
ea2a5595ca
add missing dependency (#5849) 2023-09-20 12:57:53 +02:00
bogdankostic
719c1c040c
feat: Add support for dates in filters (2.0) (#5823)
* Add support for dates in filters

* Add tests

* Add release note

* Update haystack/preview/utils/filters.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 12:05:56 +02:00
ZanSara
44f0c468ac
move websearch tests back to main tests suite (#5842) 2023-09-20 11:55:18 +02:00
bogdankostic
57d33ee6da
ci: Run preview integration tests in CI (#5843)
* Run preview integration tests in CI

* Only install inference extra
2023-09-20 11:54:41 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component (#5724)
* Add LinkContentFetcher

* Add release note

* Small fixes

* Fix pydocs

* PR feedback

* Remove handlers registration

* PR feedback

* adjustments

* improve tests

* initial draft

* tests

* add proposal

* proposal number

* reno

* fix tests and usage of content and content_type

* update branch & fix more tests

* mypy

* use the new document

* add docstring

* fix more tests

* mypy

* fix tests

* add e2e

* review feedback

* improve __str__

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/preview/dataclasses/document.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* improve __str__

* fix tests

* fix more tests

* fix test

* Fix end-of-file-fixer

* Post merge fixes

* Move e2e tests back into component

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM (#5833)
* ci: Simplify Python code with ruff rules SIM

* Revert #5828

* ruff --select=I --fix haystack/modeling/infer.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
Stefano Fiorucci
de84a95970
separate classes and tests (#5819)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-19 19:21:49 +02:00
Malte Pietsch
aa3cc3d5ae
feat: Add support for OpenAI's gpt-3.5-turbo-instruct model (#5837)
* support gpt-3.5-turbo-instruct

* add release note
2023-09-19 16:06:43 +02:00
Christian Clauss
41126397d6
Revert "ci: Speed up pylint GitHub Action (#5828)" (#5832)
This reverts commit d49c86c845ef9ba5bfc17909cd6cf456910516e1.
2023-09-18 10:05:17 +02:00
Christian Clauss
d49c86c845
ci: Speed up pylint GitHub Action (#5828) 2023-09-16 16:30:13 +02:00
Christian Clauss
66b8b6656c
test: Fix the test_nin_filter_embedding() function (#5829)
* Fix the test_nin_filter_embedding() function

* mypy: type: ignore[arg-type]
2023-09-16 16:28:22 +02:00
Christian Clauss
91ab90a256
perf: Python performance improvements with ruff C4 and PERF fixes (#5803)
* Python performance improvements with ruff C4 and PERF

* pre-commit fixes

* Revert changes to examples/basic_qa_pipeline.py

* Revert changes to haystack/preview/testing/document_store.py

* revert releasenotes

* Upgrade to ruff v0.0.290
2023-09-16 16:26:07 +02:00
Christian Clauss
1bc03ddc73
ci: Fix all ruff pyflakes errors except unused imports (#5820)
* ci: Fix all ruff pyflakes errors except unused imports

* Delete releasenotes/notes/fix-some-pyflakes-errors-69a1106efa5d0203.yaml
2023-09-15 18:30:33 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck (#5695)
* bug: fix the date_fields request bottleneck

I encountered a performance issue while attempting to index 1 million vectors. Although the Weaviate instance had low utilization, the process was estimated to take around 10 hours.

After some investigation, I identified the bottleneck: the _get_date_properties function was being called for every document, so a request to the Weaviate client was sent and awaited for each document.

To address this, I changed the code to invoke _get_date_properties only when the schema changes. This reduced the indexing time to approximately 90 minutes for the same 1 million vectors.

* bug: fix the date_fields request bottleneck

* fix: executed the pre commit hooks for #9341
2023-09-15 18:12:14 +02:00
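
The fix described in 8af0d816e6 (look up date properties only when the schema changes, rather than once per document) can be sketched as follows. This is an illustrative assumption, not the code from the commit: `FakeSchemaClient` and `DateFieldCache` are hypothetical names, and no real Weaviate client calls are used.

```python
from typing import Dict, List, Optional


class FakeSchemaClient:
    """Hypothetical stand-in for a remote client; counts schema requests."""

    def __init__(self, schema: Dict[str, str]) -> None:
        self.schema = schema
        self.requests = 0

    def get_schema(self) -> Dict[str, str]:
        self.requests += 1  # each call models one network round trip
        return dict(self.schema)


class DateFieldCache:
    """Caches date property names and refreshes them only after invalidation."""

    def __init__(self, client: FakeSchemaClient) -> None:
        self._client = client
        self._date_fields: Optional[List[str]] = None

    def invalidate(self) -> None:
        # Call this whenever the schema is changed.
        self._date_fields = None

    def date_fields(self) -> List[str]:
        if self._date_fields is None:
            schema = self._client.get_schema()
            self._date_fields = [name for name, kind in schema.items() if kind == "date"]
        return self._date_fields


client = FakeSchemaClient({"title": "text", "published": "date"})
cache = DateFieldCache(client)

# Simulate writing many documents: only the first lookup hits the client.
for _ in range(1000):
    cache.date_fields()

# A schema change invalidates the cache, which costs one more request.
client.schema["updated"] = "date"
cache.invalidate()
fields = cache.date_fields()
```

In this sketch the 1000 lookups plus one schema change cost two requests in total, instead of one request per document.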
Silvano Cerza
5c04cd6ba2
Fix Document constructor accepting unused id parameter (#5826) 2023-09-15 17:03:03 +02:00
Stefano Fiorucci
771113c901
move ruff after black (#5825) 2023-09-15 16:13:02 +02:00