ZanSara
e888852aec
Standardize TextFileToDocument
( #6232 )
...
* simplify textfiletodocument
* fix error handling and tests
* stray print
* reno
* streams->sources
* reno
* feedback
* test
* fix tests
2023-11-17 15:39:39 +01:00
ZanSara
dfc1d452bb
feat: upgrade canals to 0.10.1 ( #6309 )
...
* upgrade canals
* reno
* trigger preview e2e
* bump canals
* fix decorator
* fix test
* test factory
* tests inmemory
* tests writer
* test audio
* tests builders
* tests caching
* tests embedders
* tests converters
* tests generators
* tests rankers
* tests retrievers
* fix pipeline and telemetry tests
* remove trigger
2023-11-17 14:46:23 +01:00
Vladimir Blagojevic
b4d8d1c904
feat: Add custom conversion callable to PyPDFToDocument - Haystack 2.x ( #6258 )
...
* Allow user specified converter hook
* Add a release note
* More unit tests
* PR review - Massi, use protocol as converter
2023-11-09 17:35:33 +01:00
Ashwin Mathur
6bf0b9dc7c
feat: Add MarkdownToTextDocument
(v2) ( #6159 )
...
* Add MarkdownToTextDocument
* Add release notes
* Update GitHub workflows
* Update GitHub workflows
* Refactor code with minimal dependencies
* Update docstrings
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
* Update document with content and meta for backward compatibility
* Refactor Document Class for Backward Compatibility
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
* Update tests
* Improve test assertions
---------
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-31 18:28:13 +01:00
Silvano Cerza
7287657f0e
refactor: Rename Document
's text
field to content
( #6181 )
...
* Rework Document serialisation
Make Document backward compatible
Fix InMemoryDocumentStore filters
Fix InMemoryDocumentStore.bm25_retrieval
Add release notes
Fix pylint failures
Enhance Document kwargs handling and docstrings
Rename Document's text field to content
Fix e2e tests
Fix SimilarityRanker tests
Fix typo in release notes
Rename Document's metadata field to meta (#6183 )
* fix bugs
* make linters happy
* fix
* more fix
* match regex
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00
Ashwin Mathur
101bd816f8
refactor: Remove api_key from serialization of AzureOCRDocumentConverter
and SerperDevWebSearch
( #6150 )
...
* Remove api_key from serialization of AzureOCRDocumentConverter
* Remove api_key from serialization of SerperDevWebSearch
* Add release notes
* Add init_fail_without_api_key test for SerperDevWebSearch
* Rename env var to AZURE_AI_API_KEY
2023-10-23 12:26:23 +02:00
Silvano Cerza
2a45e7cc06
refactor: Remove id_hash_keys
from all file_converters
( #6125 )
...
* Remove id_hash_keys from DocumentCleaner
* Remove id_hash_keys from TextDocumentSplitter
* Remove id_hash_keys from all file_converters
* Fix pylint failure
* Update docstrings
2023-10-20 16:22:14 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict
/to_dict
and corresponding tests in 2.0 ( #6108 )
...
* whisper transcriber
* remove from/to_dict from builders
* remove from/to_dict from embedders
* remove from/to_dict from fetcher, file_converters
* remove from/to_dict from generators, preprocessors
* remove from/to_dict from ranker, reader
* remove from/to_dict from router, sampler, websearch
* pylint
* reno
* refactor import
* remove unused import
2023-10-19 11:17:02 +02:00
Vladimir Blagojevic
3803d23ff6
feat: Update PyPDFToDocument
to process ByteStream
inputs ( #6021 )
...
* Update PyPDF converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/pypdf.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:52:08 +02:00
Vladimir Blagojevic
1a6a8863e8
feat: Update HTMLToDocument
to handle ByteStream
inputs ( #6020 )
...
* Update HTML converter
* Add mixed source unit test
* Update haystack/preview/components/file_converters/html.py
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:15:58 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) ( #5907 )
2023-09-28 17:22:28 +02:00
Stefano Fiorucci
9340c572f9
alternative skipif conditions in azure ocr converter test ( #5906 )
2023-09-28 12:09:19 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter
(2.0) ( #5855 )
...
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter
(2.0) ( #5847 )
...
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) ( #5850 )
...
* Initial PyPDFToDocument implementation
* Remove progress bar
* Add release note
* Minor fix
* import check and dependency
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM ( #5833 )
...
* ci: Simplify Python code with ruff rules SIM
* Revert #5828
* ruff --select=I --fix haystack/modeling/infer.py
---------
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
ZanSara
6e70d403f8
feat: Improve Document
for Haystack 2.0 ( #5738 )
...
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* add docstring
* fix more tests
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* Update haystack/preview/document_stores/memory/document_store.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-11 17:40:00 +02:00
ZanSara
b1daa7c647
chore: migrate to canals==0.7.0
( #5647 )
...
* add default_to_dict and default_from_dict placeholders to ease migration to canals 0.7.0
* canals==0.7.0
* whisper components
* add to_dict/from_dict stubs
* import serialization methods in init to hide canals imports
* reno
* export deserializationerror too
* Update haystack/preview/__init__.py
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
* serialization methods for LocalWhisperTranscriber (#5648 )
* chore: serialization methods for `FileExtensionClassifier` (#5651 )
* serialization methods for FileExtensionClassifier
* Update test_file_classifier.py
* chore: serialization methods for `SentenceTransformersDocumentEmbedder` (#5652 )
* serialization methods for SentenceTransformersDocumentEmbedder
* fix device management
* serialization methods for SentenceTransformersTextEmbedder (#5653 )
* serialization methods for TextFileToDocument (#5654 )
* chore: serialization methods for `RemoteWhisperTranscriber` (#5650 )
* serialization methods for RemoteWhisperTranscriber
* remove patches
* Add default to_dict and from_dict in document stores built with factory (#5674 )
* fix tests (#5671 )
* chore: simplify serialization methods for `MemoryDocumentStore` (#5667 )
* simplify serialization for MemoryDocumentStore
* remove redundant tests
* pylint
* chore: serialization methods for `MemoryRetriever` (#5663 )
* serialization method for MemoryRetriever
* more tests
* remove hash from default_document_store_to_dict
* remove diff in factory.py
* chore: serialization methods for `DocumentWriter` (#5661 )
* serialization methods for DocumentWriter
* more tests
* use factory
* black
---------
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-29 18:15:07 +02:00
Silvano Cerza
66f615a3a4
Remove BaseTestComponent ( #5613 )
...
* Remove BaseTestComponent
* Add release notes
2023-08-23 17:03:37 +02:00
ZanSara
5ca4874df9
Migrate existing v2 components to Canals 0.4.0 ( #5532 )
...
* pin canals==0.4.0
* update audio components
* allow audio components to receive whisper_params in init too
* migrating memoryretriever
* migrate memoryretriever
* migrate TextFileToDocument
* fix TextFileToDocument tests
* fix pipeline tests
* fix defaults management
* reno
* inverted assignments
* Simplify release notes
---------
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-08-09 15:51:32 +02:00
bogdankostic
a51ca19fe4
feat: Add TextFileToDocument
component (v2) ( #5467 )
...
* Add TextfileToDocument component
* Add docstrings
* Add unit tests
* Add release note file
* Make use of progress bar
* Add TextfileToDocument to __init__.py
* Use lazy % formatting in logging functions
* Remove f from non-f-string
* Add TextfileToDocument to __init__.py
* Use correct dependency extra
* Compare file path against path object
* PR feedback
* PR feedback
* Update haystack/preview/components/file_converters/txt.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update docstrings
* Add error handling
* Add unit test
* Reintroduce falsely removed caplog
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-08-01 11:34:52 +02:00