* draft requirements from discussion
* Add some more information
* Update proposal given new feedback
* More drawbacks
* Decision drivers
* Nitpick
* Summary
* PR number
* Mark code snippets
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* Link correct issue
* Add missing word
* More context on blind evaluation
* Rephrase confusing sentence
* Add a more detailed code example
* Ignore mypy and pylint in example file
---------
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
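
The splitting behaviour described in the bullets above (split by word, sentence, or passage, then add the delimiter back to each unit) could be sketched roughly as follows. This is only an illustration: the function name `split_keep_delimiter` and the delimiter table are assumptions, not the component's actual API.

```python
from typing import List

# Hypothetical delimiters, chosen for illustration only.
DELIMITERS = {"word": " ", "sentence": ".", "passage": "\n\n"}


def split_keep_delimiter(text: str, split_by: str) -> List[str]:
    """Naively split text and re-attach the delimiter to each unit (no nltk)."""
    if split_by not in DELIMITERS:
        raise ValueError(f"split_by must be one of {sorted(DELIMITERS)}")
    delimiter = DELIMITERS[split_by]
    parts = text.split(delimiter)
    # Every part except the last was followed by the delimiter in the source text.
    units = [part + delimiter for part in parts[:-1]]
    # Keep a trailing remainder only if it is non-empty.
    if parts[-1]:
        units.append(parts[-1])
    return units
```

Because the delimiters are added back, joining the returned units reproduces the input text exactly.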
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Add Pipeline.arun()
* Sleeper node
* Fix async running
* Add e2e tests
To run a Pipeline that doesn't have any async node in async mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline
To run a Pipeline that has a single async node in concurrent mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline
To run a Pipeline that has a single async node in sequential mode:
pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline
* Remove unused _adispatch_run method
* Make Pipeline.run work with async nodes
* Revert "Make Pipeline.run work with async nodes"
This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431.
* Rename Pipeline.arun to Pipeline._arun
* Enhance docstring
* Add Sleeper docstring
* Add release notes
* ignore typing across the node
* make pylint happy
* skip pylint on needed unused import
* fix
* if a node has an arun method, use it
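
As a rough illustration of the last bullet ("if a node has an arun method, use it"), a dispatcher might look like the sketch below. The helper `run_node` and the two example node classes are hypothetical and are not taken from the Pipeline code; running the blocking `run` in the default executor is likewise an assumption.

```python
import asyncio
import inspect
from typing import Any


async def run_node(node: Any, **kwargs: Any) -> Any:
    """Prefer the node's async `arun`; otherwise fall back to blocking `run`."""
    arun = getattr(node, "arun", None)
    if arun is not None and inspect.iscoroutinefunction(arun):
        return await arun(**kwargs)
    # Assumption: run the blocking method in a worker thread so the
    # event loop is not stalled while it executes.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: node.run(**kwargs))


class SyncNode:
    """Hypothetical node that only has a blocking run()."""

    def run(self, query: str) -> str:
        return query.upper()


class AsyncNode:
    """Hypothetical node that also offers an async arun()."""

    def run(self, query: str) -> str:
        return query

    async def arun(self, query: str) -> str:
        await asyncio.sleep(0)  # stand-in for real async I/O
        return query[::-1]
```

With these definitions, `asyncio.run(run_node(AsyncNode(), query="abc"))` goes through `arun`, while the same call with `SyncNode()` goes through `run`.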
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
* copy the deps list over from haystack-ai
* fix lazyimport usage
* keep jinja and openai
* fix ci
* reno
* separate out preview unit tests
* fix import error message for tika
* tika
* add preview to all
* wrap torch
* remove comment
* unwrap openai and jinja
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* bug: fix the date_fields request bottleneck
I encountered a performance issue while indexing 1 million vectors: although the Weaviate instance showed low utilization, the process was estimated to take around 10 hours.
After some investigation, I identified the bottleneck: the _get_date_properties function was called for every document, so a request to the Weaviate client was sent and awaited for each document.
To address this, the code now invokes _get_date_properties only when the schema changes. This reduced the indexing time for the same 1 million vectors to approximately 90 minutes.
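
The idea can be sketched as a small cache that is refreshed only on schema changes. This is a minimal illustration, not the actual document store code: the class `DatePropertyCache`, its `fetch` callback, and the `invalidate` hook are hypothetical names.

```python
from typing import Callable, Dict, List


class DatePropertyCache:
    """Cache date-typed property names per index; refetch only after invalidation."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        # `fetch` stands in for the expensive schema request to the client.
        self._fetch = fetch
        self._cache: Dict[str, List[str]] = {}

    def get(self, index: str) -> List[str]:
        # One request per index until the schema changes.
        if index not in self._cache:
            self._cache[index] = self._fetch(index)
        return self._cache[index]

    def invalidate(self, index: str) -> None:
        # Call this whenever the schema of `index` is modified.
        self._cache.pop(index, None)
```

Writing many documents then costs a single schema lookup per index instead of one per document, as long as `invalidate` is called on every schema update.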
* bug: fix the date_fields request bottleneck
* fix: run the pre-commit hooks for #9341
* ci: Use ruff in pre-commit to further limit complexity
* Fix invalid escape sequences in Python code
* Delete releasenotes/notes/ruff-4d2504d362035166.yaml
* Refactor codebase to use doc_type metadata instead of namespaces to distinguish between documents without embeddings, documents with embeddings, and labels
* Fix parameter name in integration test
* Remove code under comment in add_type_metadata_filter method
* Fix mypy and pylint checks
* Add release note
* Apply minimal changes: rename method, update method docs and remove redundant method
* Mypy fixes
* Fix docstrings
* Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit
* Remove unnecessary attributes in PineconeDocumentStore
* Fix unit test
---------
Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io>
Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>