haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-10-22 21:39:00 +00:00

Author	SHA1	Message	Date
Silvano Cerza	35ec8cc8fb	Rework evaluation and metrics calculation for Haystack 2.x (#5794 ) * draft requirements from discussion * Add some more information * Update proposal given new feedback * More drawbacks * Decision drivers * Nitpick * Summary * PR number * Mark code snippets Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * Link correct issue * Add missing word * More context on blind evaluation * Rephrase confusing sentence * Add a more detailed code example * Ignore mypy and pylint in example file --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai> Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-28 00:51:51 +02:00
Julian Risch	4413675e64	feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870 ) * draft split by word, sentence, passage * naive way to split sentences without nltk * reno * add tests * make input list of docs, review feedback * add source_id and more validation * update docstrings * add split delimiters back to strings --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-27 12:26:20 +02:00
ZanSara	6665e8ec7f	Add `preview` extra to e2e tests (#5898 )	2023-09-27 10:36:00 +02:00
Stefano Fiorucci	a4787e7b52	pin setuptools_scm only for windows (#5894 )	2023-09-26 18:39:50 +02:00
Stefano Fiorucci	61877056ef	pin setuptools_scm in the metrics extra (#5891 )	2023-09-26 17:12:59 +02:00
bogdankostic	80192589b1	feat: Add `AzureOCRDocumentConverter` (2.0) (#5855 ) * Add AzureOCRDocumentConverter * Add tests * Add release note * Formatting * update docstrings * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> * PR feedback * PR feedback * PR feedback * Add secrets as environment variables * Adapt test * Add azure dependency to CI * Add azure dependency to CI --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-26 15:57:55 +02:00
Stefano Fiorucci	c8398eeb6d	test: e2e test for Extractive QA Pipeline (#5879 ) * e2e test for e. qa pipeline	2023-09-26 15:44:34 +02:00
Silvano Cerza	cf7f0ebc22	Add Pipelines async run (#5864 ) * Add Pipeline.arun() * Sleeper node * Fix async running * Add e2e tests To run a Pipeline that doesn't have any async node in async mode: pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline To run a Pipeline that has a single async node in concurrent mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline To run a Pipeline that has a single async node in sequential mode: pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline * Remove unused _adispatch_run method * Make Pipeline.run work with async nodes * Revert "Make Pipeline.run work with async nodes" This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431. * Rename Pipeline.arun to Pipeline._arun * Enhance docstring * Add Sleeper docstring * Add release notes * ignore typing across the node * make pylint happy * skip pylint on needed unused import * fix * if a node has an arun method, use it --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-26 15:37:27 +02:00
github-actions[bot]	8d26057566	Update unstable version (#5887) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> [tag: v1.22.0-rc0]	2023-09-26 15:23:14 +02:00
ZanSara	6cb7d16e22	feat: `preview` extra (#5869) * copy the deps list over from haystack-ai * fix lazyimport usage * keep jinja and openai * fix ci * reno * separate out preview unit tests * fix import error message for tika * tika * add preview to all * wrap torch * remove comment * unwrap openai and jinja [tag: v1.21.0-rc0]	2023-09-26 12:48:15 +02:00
Stefano Fiorucci	e9d34fc0e3	test: e2e tests for RAG Pipelines (#5876 ) * relax extractive reader integration tests * force reader to CPU * ensure integration tests reproducibility * e2e rag tests * move set_all_seeds to testing package * refine rag tests * Update e2e/preview/pipelines/test_rag_pipelines.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-26 11:49:50 +02:00
Stefano Fiorucci	6aa471ac5e	chore: make preview integration tests reproducible (#5871 ) * relax extractive reader integration tests * force reader to CPU * ensure integration tests reproducibility * move set_all_seeds to testing package	2023-09-25 18:39:10 +02:00
bogdankostic	9a4373bf8e	feat: Add `TikaDocumentConverter` (2.0) (#5847 ) * Add TikaFileToDocument component * Add tests * Add tika service to CI * Add release note * Change name * PR feedback * Fix naming in tests * Fix tika version in CI * Update tests --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 11:47:21 +02:00
MichelBartels	4da43b6b05	Add link output to `SerperDevWebSearch` (#5853 ) * add link output * adjust tests * fix test * remove print statements --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-25 10:03:01 +02:00
Stefano Fiorucci	c0f22372d4	feat: `OpenAITextEmbedder` (#5801 ) * first draft * release notes * avoid serializing secrets * fix import order * simplify serialization * simplification * monkeypatch delenv * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * docstrings updates * fix test * Update haystack/preview/components/embedders/openai_text_embedder.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * rm comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-22 21:54:11 +02:00
Massimiliano Pippi	a5a0dc9f87	feat: optionally pass an id to the Document constructor (#5862 ) * revert #5826 * do not use Optional	2023-09-22 11:09:59 +02:00
Silvano Cerza	cc4f95bf51	Remove unnecessary GPT4Generator class (#5863 ) * Remove GPT4Generator class * Rename GPT35Generator to GPTGenerator * Fix tests * Release notes	2023-09-22 11:05:06 +02:00
MichelBartels	f3dc9edd26	feat: initial ExtractiveReader implementation (#5553 ) * initial ExtractiveReader implementation * initial ExtractiveReader implementation * fix mypy * remove unused import * Use AutoTokenizer * rename reader to model * combine no-answer logit * support document slicing with proper probabilities * add variable stride * validate model * fix typo * make postprocessing easier to understand * remove debug code * set default reader * add ExtractiveReader to __init__ * remove validation * use new answer class * add batching * use v2 lazy imports * move reader * fix type hints * add doc strings * add nucleus sampling * fix types * fix doc string * add no_answer parameter * remove print statement * fix gpu support * turn into binary classification task * change dataclass so document does not need to be provided for no answer * add simple tests * add unit tests * rename reader folder to readers * add integration tests * fix type hints * add release notes * remove accidentally included test file * remove unnecessary __init__ file * revert __init__ file to main * rename test script by adding test_ prefix * undo accidentally moving of test script after renaming it * remove use of bisect * rename _flatten and _unflatten * make variable name more intuitive * remove type: ignore * fix mypy issue * refactor long tuple * add doc strings * explain HF test * remove unnecessary top_k check --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 12:16:51 +02:00
Vladimir Blagojevic	92a6221927	feat: Add PyPDFToDocument component (2.0) (#5850 ) * Initial PyPDFToDocument implementation * Remove progress bar * Add release note * Minor fix * import check and dependency --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-21 11:52:26 +02:00
ZanSara	23fdef929e	chore: move `GPT35Generator` tests in the main test suite (#5844 ) * move tests * fix no-test-found error from pytest * missing self --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-21 11:42:32 +02:00
Julian Risch	5820120f9b	fix: Change retriever return type to list of docs (#5848 ) Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>	2023-09-21 10:32:40 +02:00
ZanSara	28f5c4c780	fix: Whisper integration tests (#5851 ) * fix tests * add ffmpeg * apt update for ffmpeg * not run on windows	2023-09-21 00:14:07 +02:00
bogdankostic	abe2706298	feat: Add `MetadataRouter` (2.0) (#5824 ) * Move filter utilities * Add MetadataRouter * Add tests for MetadataRouter * Add more tests * Rename FileExtensionClassifer to FileExtensionRouter * Add support for dates in filters * Add tests * Add release note * Add release note * Apply suggestions from code review Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 14:49:17 +02:00
ZanSara	c933bcaa69	chore: move Whisper e2e tests in the main tests suite (#5845 ) * move whisper local tests * remove e2e file * move remote tests * remove e2e file	2023-09-20 14:48:09 +02:00
ZanSara	454988672e	feat: `UrlCacheChecker` (#5841 ) * add UrlCacheChecker * rename * add tests * reno * pylint * review feedback	2023-09-20 14:45:50 +02:00
ZanSara	ea2a5595ca	add missing dependency (#5849 )	2023-09-20 12:57:53 +02:00
bogdankostic	719c1c040c	feat: Add support for dates in filters (2.0) (#5823 ) * Add support for dates in filters * Add tests * Add release note * Update haystack/preview/utils/filters.py Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>	2023-09-20 12:05:56 +02:00
ZanSara	44f0c468ac	move websearch tests back to main tests suite (#5842 )	2023-09-20 11:55:18 +02:00
bogdankostic	57d33ee6da	ci: Run preview integration tests in CI (#5843 ) * Run preview integration tests in CI * Only install inference extra	2023-09-20 11:54:41 +02:00
Vladimir Blagojevic	0983fb656a	feat: Add `LinkContentFetcher` Haystack 2.0 component (#5724 ) * Add LinkContentFetcher * Add release note * Small fixes * Fix pydocs * PR feedback * Remove handlers registration * PR feedback * adjustments * improve tests * initial draft * tests * add proposal * proposal number * reno * fix tests and usage of content and content_type * update branch & fix more tests * mypy * use the new document * add docstring * fix more tests * mypy * fix tests * add e2e * review feedback * improve __str__ * Apply suggestions from code review Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/preview/dataclasses/document.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * improve __str__ * fix tests * fix more tests * fix test * Fix end-of-file-fixer * Post merge fixes * Move e2e tests back into component --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2023-09-20 11:03:52 +02:00
Christian Clauss	bf6d306d68	ci: Simplify Python code with ruff rules SIM (#5833 ) * ci: Simplify Python code with ruff rules SIM * Revert #5828 * ruff --select=I --fix haystack/modeling/infer.py --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-20 08:32:44 +02:00
Stefano Fiorucci	de84a95970	separate classes and tests (#5819 ) Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2023-09-19 19:21:49 +02:00
Malte Pietsch	aa3cc3d5ae	feat: Add support for OpenAI's `gpt-3.5-turbo-instruct` model (#5837 ) * support gpt-3.5.-turbo-instruct * add release note	2023-09-19 16:06:43 +02:00
Christian Clauss	41126397d6	Revert "ci: Speed up pylint GitHub Action (#5828 )" (#5832 ) This reverts commit d49c86c845ef9ba5bfc17909cd6cf456910516e1.	2023-09-18 10:05:17 +02:00
Christian Clauss	d49c86c845	ci: Speed up pylint GitHub Action (#5828 )	2023-09-16 16:30:13 +02:00
Christian Clauss	66b8b6656c	test: Fix the test_nin_filter_embedding() function (#5829 ) * Fix the test_nin_filter_embedding() function * mypy: type: ignore[arg-type]	2023-09-16 16:28:22 +02:00
Christian Clauss	91ab90a256	perf: Python performance improvements with ruff C4 and PERF fixes (#5803 ) * Python performance improvements with ruff C4 and PERF * pre-commit fixes * Revert changes to examples/basic_qa_pipeline.py * Revert changes to haystack/preview/testing/document_store.py * revert releasenotes * Upgrade to ruff v0.0.290	2023-09-16 16:26:07 +02:00
Christian Clauss	1bc03ddc73	ci: Fix all ruff pyflakes errors except unused imports (#5820 ) * ci: Fix all ruff pyflakes errors except unused imports * Delete releasenotes/notes/fix-some-pyflakes-errors-69a1106efa5d0203.yaml	2023-09-15 18:30:33 +02:00
Onur Eren Arpacı	8af0d816e6	bug: fix the date_fields request bottleneck (#5695 ) * bug: fix the date_fields request bottleneck I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours. After some investigation, I identified the bottleneck: _get_date_properties function was being called for every document, consequently a request to the Weaviate client was being sent and awaited for each document. To address this, I optimized the code by invoking the _get_date_properties function only when there is a schema change. This modification resulted in a notable performance improvement, reducing the indexing time to approximately 90 minutes for the same 1 million vectors. * bug: fix the date_fields request bottleneck * fix: executed the pre commit hooks for #9341	2023-09-15 18:12:14 +02:00
Silvano Cerza	5c04cd6ba2	Fix Document constructor accepting unused id parameter (#5826 )	2023-09-15 17:03:03 +02:00
Stefano Fiorucci	771113c901	move ruff after black (#5825 )	2023-09-15 16:13:02 +02:00
Chivereanu Radu	cab21da87b	fix: Support for Azure 16k gpt 35 deployment (#5804 ) * Support for Azure 16k gpt 35 deployment * releasenote added --------- Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>	2023-09-14 18:01:22 +02:00
Massimiliano Pippi	c7971a809d	ci: skip mandatory release notes check when not needed (#5817 )	2023-09-14 17:00:41 +02:00
Christian Clauss	9405eb90ee	ci: Fix invalid escape sequences in Python code (#5802 ) * ci: Use ruff in pre-commit to further limit complexity * Fix invalid escape sequences in Python code * Delete releasenotes/notes/ruff-4d2504d362035166.yaml	2023-09-14 16:42:48 +02:00
Massimiliano Pippi	6fc12a2bd1	ci: run apt-get update (#5816 ) * run apt-get update * run when changing the workflow file	2023-09-14 16:37:42 +02:00
ZanSara	9056c43240	fix: remove `__future__` import from `pinecone.py` (#5813 ) * remove future import * fix forward reference	2023-09-14 16:28:39 +02:00
Stefano Fiorucci	1c69070db6	make MemoryEmbeddingRetriever act in non-batch mode (#5809 )	2023-09-14 15:37:20 +02:00
bogdankostic	1a212420b7	refactor: Move filter utilities (2.0) (#5797 ) * Move filter utilities * PR feedback	2023-09-14 13:23:53 +02:00
Stefano Fiorucci	ad5b615503	make SentenceTransformersTextEmbedder non batch (#5811 )	2023-09-14 12:38:24 +02:00
Ivana Zeljkovic	4bad202197	feat: Pinecone document store refactoring (#5725 ) * Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels * Fix parameter name in integration test * Remove code under comment in add_type_metadata_filter method * Fix mypy and pylint checks * Add release note * Apply minimal changes: rename method, update method docs and remove redundant method * Mypy fixes * Fix docstrings * Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit * Remove unnecessary attributes in PineconeDocumentStore * Fix unit test --------- Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io> Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>	2023-09-14 11:46:47 +02:00

3115 Commits