ZanSara
81b2e83d04
feat: separate out preview tests ( #5639 )
* add preview workflows
* feedback
* feedback
* use preview extra
* remove coverage and add separate e2e
* rename workflow file for consistency
* trigger ci
* undo trigger
* torch import in testing
* add deps to unit tests
* feedback
* run container instead of service
* comment
* add if statement
* fix tika version
* separate out win integration tests
* separate out all CIs
* try installing docker on macos
* exclude tika
* remove tika docker
2023-09-29 13:16:08 +02:00
bogdankostic
d61df24b27
chore: Remove classifiers directory from preview package ( #5918 )
2023-09-29 10:38:33 +02:00
Massimiliano Pippi
0947f59545
feat: add async PromptNode run ( #5890 )
* add async promptnode
* Remove unnecessary calls to dict.keys()
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-29 08:40:01 +02:00
ZanSara
578f2b4bbf
feat: update canals to 0.8.1 ( #5900 )
* Update canals to 0.8.1
* scale up runner
2023-09-28 17:50:46 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) ( #5907 )
2023-09-28 17:22:28 +02:00
Massimiliano Pippi
dfa48eece9
clean up the Slack integrations ( #5908 )
2023-09-28 15:49:19 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder ( #5822 )
* first draft
* release note
* mypy fix
* fix test
* corrections
* pr feedback
* better secrets handling and new tests
* missing imports in embedders/__init__.py
* better format condition
* address feedback
2023-09-28 15:42:51 +02:00
ZanSara
83724b74e3
feat: Make metadata optional in AnswerBuilder ( #5909 )
* optional metadata
* improve docstring
2023-09-28 14:42:19 +02:00
Stefano Fiorucci
9340c572f9
alternative skipif conditions in azure ocr converter test ( #5906 )
2023-09-28 12:09:19 +02:00
Silvano Cerza
35ec8cc8fb
Rework evaluation and metrics calculation for Haystack 2.x ( #5794 )
* draft requirements from discussion
* Add some more information
* Update proposal given new feedback
* More drawbacks
* Decision drivers
* Nitpick
* Summary
* PR number
* Mark code snippets
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* Link correct issue
* Add missing word
* More context on blind evaluation
* Rephrase confusing sentence
* Add a more detailed code example
* Ignore mypy and pylint in example file
---------
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-28 00:51:51 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) ( #5870 )
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
ZanSara
6665e8ec7f
Add preview extra to e2e tests ( #5898 )
2023-09-27 10:36:00 +02:00
Stefano Fiorucci
a4787e7b52
pin setuptools_scm only for windows ( #5894 )
2023-09-26 18:39:50 +02:00
Stefano Fiorucci
61877056ef
pin setuptools_scm in the metrics extra ( #5891 )
2023-09-26 17:12:59 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) ( #5855 )
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Stefano Fiorucci
c8398eeb6d
test: e2e test for Extractive QA Pipeline ( #5879 )
* e2e test for e. qa pipeline
2023-09-26 15:44:34 +02:00
Silvano Cerza
cf7f0ebc22
Add Pipelines async run ( #5864 )
* Add Pipeline.arun()
* Sleeper node
* Fix async running
* Add e2e tests
To run a Pipeline that doesn't have any async node in async mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline
To run a Pipeline that has a single async node in concurrent mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline
To run a Pipeline that has a single async node in sequential mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline
* Remove unused _adispatch_run method
* Make Pipeline.run work with async nodes
* Revert "Make Pipeline.run work with async nodes"
This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431.
* Rename Pipeline.arun to Pipeline._arun
* Enhance docstring
* Add Sleeper docstring
* Add release notes
* ignore typing across the node
* make pylint happy
* skip pylint on needed unused import
* fix
* if a node has an arun method, use it
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-26 15:37:27 +02:00
github-actions[bot]
8d26057566
Update unstable version ( #5887 )
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
v1.22.0-rc0
2023-09-26 15:23:14 +02:00
ZanSara
6cb7d16e22
feat: preview extra ( #5869 )
* copy the deps list over from haystack-ai
* fix lazyimport usage
* keep jinja and openai
* fix ci
* reno
* separate out preview unit tests
* fix import error message for tika
* tika
* add preview to all
* wrap torch
* remove comment
* unwrap openai and jinja
v1.21.0-rc0
2023-09-26 12:48:15 +02:00
Stefano Fiorucci
e9d34fc0e3
test: e2e tests for RAG Pipelines ( #5876 )
* relax extractive reader integration tests
* force reader to CPU
* ensure integration tests reproducibility
* e2e rag tests
* move set_all_seeds to testing package
* refine rag tests
* Update e2e/preview/pipelines/test_rag_pipelines.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-26 11:49:50 +02:00
Stefano Fiorucci
6aa471ac5e
chore: make preview integration tests reproducible ( #5871 )
* relax extractive reader integration tests
* force reader to CPU
* ensure integration tests reproducibility
* move set_all_seeds to testing package
2023-09-25 18:39:10 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) ( #5847 )
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
MichelBartels
4da43b6b05
Add link output to SerperDevWebSearch ( #5853 )
* add link output
* adjust tests
* fix test
* remove print statements
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 10:03:01 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder ( #5801 )
* first draft
* release notes
* avoid serializing secrets
* fix import order
* simplify serialization
* simplification
* monkeypatch delenv
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* docstrings updates
* fix test
* Update haystack/preview/components/embedders/openai_text_embedder.py
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* rm comment
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Massimiliano Pippi
a5a0dc9f87
feat: optionally pass an id to the Document constructor ( #5862 )
* revert #5826
* do not use Optional
2023-09-22 11:09:59 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class ( #5863 )
* Remove GPT4Generator class
* Rename GPT35Generator to GPTGenerator
* Fix tests
* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation ( #5553 )
* initial ExtractiveReader implementation
* initial ExtractiveReader implementation
* fix mypy
* remove unused import
* Use AutoTokenizer
* rename reader to model
* combine no-answer logit
* support document slicing with proper probabilities
* add variable stride
* validate model
* fix typo
* make postprocessing easier to understand
* remove debug code
* set default reader
* add ExtractiveReader to __init__
* remove validation
* use new answer class
* add batching
* use v2 lazy imports
* move reader
* fix type hints
* add doc strings
* add nucleus sampling
* fix types
* fix doc string
* add no_answer parameter
* remove print statement
* fix gpu support
* turn into binary classification task
* change dataclass so document does not need to be provided for no answer
* add simple tests
* add unit tests
* rename reader folder to readers
* add integration tests
* fix type hints
* add release notes
* remove accidentally included test file
* remove unnecessary __init__ file
* revert __init__ file to main
* rename test script by adding test_ prefix
* undo accidentally moving of test script after renaming it
* remove use of bisect
* rename _flatten and _unflatten
* make variable name more intuitive
* remove type: ignore
* fix mypy issue
* refactor long tuple
* add doc strings
* explain HF test
* remove unnecessary top_k check
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) ( #5850 )
* Initial PyPDFToDocument implementation
* Remove progress bar
* Add release note
* Minor fix
* import check and dependency
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
ZanSara
23fdef929e
chore: move GPT35Generator tests in the main test suite ( #5844 )
* move tests
* fix no-test-found error from pytest
* missing self
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-21 11:42:32 +02:00
Julian Risch
5820120f9b
fix: Change retriever return type to list of docs ( #5848 )
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-21 10:32:40 +02:00
ZanSara
28f5c4c780
fix: Whisper integration tests ( #5851 )
* fix tests
* add ffmpeg
* apt update for ffmpeg
* not run on windows
2023-09-21 00:14:07 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) ( #5824 )
* Move filter utilities
* Add MetadataRouter
* Add tests for MetadataRouter
* Add more tests
* Rename FileExtensionClassifer to FileExtensionRouter
* Add support for dates in filters
* Add tests
* Add release note
* Add release note
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
c933bcaa69
chore: move Whisper e2e tests in the main tests suite ( #5845 )
* move whisper local tests
* remove e2e file
* move remote tests
* remove e2e file
2023-09-20 14:48:09 +02:00
ZanSara
454988672e
feat: UrlCacheChecker ( #5841 )
* add UrlCacheChecker
* rename
* add tests
* reno
* pylint
* review feedback
2023-09-20 14:45:50 +02:00
ZanSara
ea2a5595ca
add missing dependency ( #5849 )
2023-09-20 12:57:53 +02:00
bogdankostic
719c1c040c
feat: Add support for dates in filters (2.0) ( #5823 )
* Add support for dates in filters
* Add tests
* Add release note
* Update haystack/preview/utils/filters.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 12:05:56 +02:00
ZanSara
44f0c468ac
move websearch tests back to main tests suite ( #5842 )
2023-09-20 11:55:18 +02:00
bogdankostic
57d33ee6da
ci: Run preview integration tests in CI ( #5843 )
* Run preview integration tests in CI
* Only install inference extra
2023-09-20 11:54:41 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component ( #5724 )
* Add LinkContentFetcher
* Add release note
* Small fixes
* Fix pydocs
* PR feedback
* Remove handlers registration
* PR feedback
* adjustments
* improve tests
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* use the new document
* add docstring
* fix more tests
* mypy
* fix tests
* add e2e
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* fix test
* Fix end-of-file-fixer
* Post merge fixes
* Move e2e tests back into component
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM ( #5833 )
* ci: Simplify Python code with ruff rules SIM
* Revert #5828
* ruff --select=I --fix haystack/modeling/infer.py
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
Stefano Fiorucci
de84a95970
separate classes and tests ( #5819 )
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-19 19:21:49 +02:00
Malte Pietsch
aa3cc3d5ae
feat: Add support for OpenAI's gpt-3.5-turbo-instruct model ( #5837 )
* support gpt-3.5-turbo-instruct
* add release note
2023-09-19 16:06:43 +02:00
Christian Clauss
41126397d6
Revert "ci: Speed up pylint GitHub Action ( #5828 )" ( #5832 )
This reverts commit d49c86c845ef9ba5bfc17909cd6cf456910516e1.
2023-09-18 10:05:17 +02:00
Christian Clauss
d49c86c845
ci: Speed up pylint GitHub Action ( #5828 )
2023-09-16 16:30:13 +02:00
Christian Clauss
66b8b6656c
test: Fix the test_nin_filter_embedding() function ( #5829 )
* Fix the test_nin_filter_embedding() function
* mypy: type: ignore[arg-type]
2023-09-16 16:28:22 +02:00
Christian Clauss
91ab90a256
perf: Python performance improvements with ruff C4 and PERF fixes ( #5803 )
* Python performance improvements with ruff C4 and PERF
* pre-commit fixes
* Revert changes to examples/basic_qa_pipeline.py
* Revert changes to haystack/preview/testing/document_store.py
* revert releasenotes
* Upgrade to ruff v0.0.290
2023-09-16 16:26:07 +02:00
Christian Clauss
1bc03ddc73
ci: Fix all ruff pyflakes errors except unused imports ( #5820 )
* ci: Fix all ruff pyflakes errors except unused imports
* Delete releasenotes/notes/fix-some-pyflakes-errors-69a1106efa5d0203.yaml
2023-09-15 18:30:33 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck ( #5695 )
* bug: fix the date_fields request bottleneck
I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours.
After some investigation, I identified the bottleneck: the _get_date_properties function was being called for every document, so a request to the Weaviate client was sent and awaited for each document.
To address this, I optimized the code by invoking the _get_date_properties function only when there is a schema change. This modification resulted in a notable performance improvement, reducing the indexing time to approximately 90 minutes for the same 1 million vectors.
* bug: fix the date_fields request bottleneck
* fix: executed the pre commit hooks for #9341
2023-09-15 18:12:14 +02:00
Silvano Cerza
5c04cd6ba2
Fix Document constructor accepting unused id parameter ( #5826 )
2023-09-15 17:03:03 +02:00
Stefano Fiorucci
771113c901
move ruff after black ( #5825 )
2023-09-15 16:13:02 +02:00