Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) (#5907)
2023-09-28 17:22:28 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) (#5855)
...
* Add AzureOCRDocumentConverter
* Add tests
* Add release note
* Formatting
* update docstrings
* Apply suggestions from code review
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
* PR feedback
* PR feedback
* PR feedback
* Add secrets as environment variables
* Adapt test
* Add azure dependency to CI
* Add azure dependency to CI
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) (#5847)
...
* Add TikaFileToDocument component
* Add tests
* Add tika service to CI
* Add release note
* Change name
* PR feedback
* Fix naming in tests
* Fix tika version in CI
* Update tests
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) (#5850)
...
* Initial PyPDFToDocument implementation
* Remove progress bar
* Add release note
* Minor fix
* import check and dependency
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
ZanSara
28f5c4c780
fix: Whisper integration tests (#5851)
...
* fix tests
* add ffmpeg
* apt update for ffmpeg
* not run on windows
2023-09-21 00:14:07 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component (#5724)
...
* Add LinkContentFetcher
* Add release note
* Small fixes
* Fix pydocs
* PR feedback
* Remove handlers registration
* PR feedback
* adjustments
* improve tests
* initial draft
* tests
* add proposal
* proposal number
* reno
* fix tests and usage of content and content_type
* update branch & fix more tests
* mypy
* use the new document
* add docstring
* fix more tests
* mypy
* fix tests
* add e2e
* review feedback
* improve __str__
* Apply suggestions from code review
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update haystack/preview/dataclasses/document.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* improve __str__
* fix tests
* fix more tests
* fix test
* Fix end-of-file-fixer
* Post merge fixes
* Move e2e tests back into component
---------
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
bogdankostic
a51ca19fe4
feat: Add TextFileToDocument component (v2) (#5467)
...
* Add TextfileToDocument component
* Add docstrings
* Add unit tests
* Add release note file
* Make use of progress bar
* Add TextfileToDocument to __init__.py
* Use lazy % formatting in logging functions
* Remove f from non-f-string
* Add TextfileToDocument to __init__.py
* Use correct dependency extra
* Compare file path against path object
* PR feedback
* PR feedback
* Update haystack/preview/components/file_converters/txt.py
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* Update docstrings
* Add error handling
* Add unit test
* Reintroduce falsely removed caplog
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-08-01 11:34:52 +02:00
ZanSara
516db4cb52
RemoteWhisperTranscriber (v2) (#4910)
...
* original-component
* stub
* fix implementation
* fix tests
* review feedback
* review feedback
* upgrade canals
* upgrade canals
* upgrade canals to fix pipeline test
* remove requests_with_retry
* feedback
2023-05-22 16:02:58 +02:00
ZanSara
f2106ab37b
feat: initial implementation of MemoryDocumentStore for new Pipelines (#4447)
...
* add stub implementation
* reimplementation
* test files
* docstore tests
* tests for document
* better testing
* remove mmh3
* readme
* only store, no retrieval yet
* linting
* review feedback
* initial filters implementation
* working on filters
* linters
* filtering works and is isolated by document store
* simplify filters
* comments
* improve filters matching code
* review feedback
* pylint
* move logic into _create_id
* mypy
2023-04-13 09:36:23 +02:00