haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-06 12:53:35 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	28ad78c73d	feat: Add XLSXToDocument converter (#8522) * Add draft of the Excel To Document converter * Add license header * Add release note * Use Union instead of pipe * Add openpyxl as additional dep * Fix zip issue * few updates from Bijay * Update deps * Add markdown test * Adding more example excels and expanding tests * Added more tests * Fix windows test by setting lineterminator * Addressing PR comments * PR comments * Fix linting	2025-01-09 09:03:19 +01:00
Sebastian Husch Lee	911f3523ab	feat: Increase logging transparency for empty Documents during conversion (#8509) * Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter * Add logging statement for PDFMinerToDocument as well * Add tests * Remove unused line * Remove unused line * add reno * Add in PDF file * Update checks in PDF converters and add tests for document splitter * Revert * Remove line * Fix comment * Make mypy happy * Make mypy happy	2024-11-04 09:26:57 +01:00
Vladimir Blagojevic	28161f7bb9	feat: DOCXToDocument: add table extraction (#8457) * DOCXToDocument: add table extraction * Add reno note * mypy fixes * add unit tests * Add csv table support * Update release note * Add TableFormat enum * Add table_format as str init param * Update docx.py Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * PR feedback * PR feedback --------- Co-authored-by: medsriha <medsriha@gmail.com> Co-authored-by: Mo Sriha <22803208+medsriha@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-10-29 16:20:27 +01:00
David S. Batista	a50593ede0	fix: whisper tests using audio file from our github repo (#8454) * adding audio file * temporary removing failing test * removing failing test	2024-10-14 12:56:37 +02:00
Ajit Singh	2dd8089409	chore: Removed deprecated max_loop_allowed argument from Pipeline init (#8409) * Added equality check for sender and receiver in connection function of pipeline * Update base.py irrelevant changes reverted * added release note * removed deprecated param max_loops_allowed from pipeline init * added release note * revert non relevant test * Delete releasenotes/notes/remove-support-to-connect-component-to-self-6eedfb287f2a2a02.yaml * revery non relevant change * Remove unused test_pipeline_deprecated.yaml * Remove PipelineMaxLoops error * Update release notes --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2024-09-30 15:58:05 +02:00
Silvano Cerza	5514676b5e	feat: Deprecate `max_loops_allowed` in favour of new argument `max_runs_per_component` (#8354) * Deprecate max_loops_allowed in favour of new argument max_runs_per_component * Add missing test file * Some enhancements * Add version that will remove deprecate stuff	2024-09-12 11:00:12 +02:00
Sriniketh J	e98a6fea04	Convertor: CSVToDocument (#8328) * carry forwarded initial commit * fix: doc strings * fix: update docstrings * fix: docstring update * fix: csv encoding in actions * fix: line endings through hooks * fix: converter docs addition	2024-09-06 10:59:12 +02:00
Jon Strutz	471f07c8fe	fix: extract page breaks from .docx files (#8232) * fix: extract page breaks from .docx files Context: Currently, DOCXToDocument does not extract page breaks from word documents. This makes it impossible to do things like split by page or get correct page number metadata after using something like DocumentSplitter. For example, if you split by word, the 'page_number' metadata field will be 1 for all documents. Solution: Added a method to DOCXToDocument that extracts page breaks from word documents as '\f' characters so that they are recognized by DocumentSplitter. Caveat: Due to the way the python-docx library is set up, you can only accurately determine the location of the first page break for a given paragraph. In the rare case that a paragraph contains more than one page break (which means it is an extremely long paragraph spanning multiple pages), the 2nd, 3rd, etc. page break locations are not known. To sort of fix this, I just appended the page break characters to the end of the paragraph text to keep the overall page number values for the document consistent. * Apply suggestions from code review --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-08-21 09:48:02 +00:00
Carlos Fernández	c1c339923f	feat: add DocxToDocument converter (#7838) * first fucntioning DocxFileToDocument * fix lazy import message * add reno * Add license headder Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * change DocxFileToDocument to DocxToDocument * Update library install to the maintained version Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * clan try-exvept to only take non haystack errors into account * Add wanring on docstring of component ignoring page brakes, mark test as skip * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Make warnings lazy evaluated Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Solve f bug * Get more metadata from docx files * add 'python-docx' dependency and docs * Change logging import Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Fix typo Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * remake metadata extraction for docx * solve bug regarding _get_docx_metadata method * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Delete unused test --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-06-12 11:58:36 +02:00
Sebastian Husch Lee	2c2c7c9f56	feat: Add PPTXToDocument converter (#7808) * Add first pass at PPTXToDocument converter * Add test and update code * Add doc string * Update docstrings * Add release notes * remove unused imports, add to api docs, update pyproject.toml * Add a new test * Add dep so tests can run	2024-06-07 09:43:29 +00:00
Vladimir Blagojevic	988c360b6d	feat: Azure converter updates (#7409) * Initial commit * Remove old mock tests * Fix current_last_page_number calculation * Carry over unit tests from the other side * Update pydocs, skip failing tests * Fix pylint and mypy * Minor adjustments * Add release note * Minor touch ups * Resolve Document unique id issue by using custom id calculation * Better hashing, add unit tests * Small fixes	2024-04-09 09:45:06 +02:00
Vladimir Blagojevic	c3b96392fd	feat: Use all HTMLToDocument extractors until content is extracted (#7452) * Use all HTMLToDocument extractors until content is extracted * Add release note * Minor doc update * Improvements, unit test fixes * Add try_others init param, update tests * Update haystack/components/converters/html.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback - Stefano * Improve reno release note, add reference * little fixes --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-04-05 16:02:34 +02:00
Vladimir Blagojevic	d871bbbfbd	feat: Add complex types in OpenAPI support (#7065) * Add complex types OpenAPI support * Add release note --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-02-27 18:11:06 +01:00
Vladimir Blagojevic	cb6389d7a2	feat: Improve OpenAPI integration (#7034) * Simplify and improve OpenAPIServiceConnector and OpenAPIServiceToFunctions, add unit tests * Add reno note * Add flask test dependency * Initial PR feedback - Julian * Remove indirection - Silvano * Remove flask end-to-end tests * Remove unused import * Add mixed body unit test * Update unit test, mock properly	2024-02-22 14:03:50 +01:00
Silvano Cerza	f96eb3847f	refactor: Merge `Pipeline`s definition in `core` package (#6973) * Move marshalling functions in core Pipeline * Move telemetry gathering in core Pipeline * Move run logic in core Pipeline * Update root Pipeline import * Add release notes * Update Pipeline docs path * Update releasenotes/notes/merge-pipeline-definitions-1da80e9803e2a8bb.yaml Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2024-02-12 18:25:28 +01:00
Vladimir Blagojevic	37d9de3c4e	feat: Add service_credentials to OpenAPIServiceConnector run (#6962) * Add service_credentials to OpenAPIServiceConnector run * PR feedback Silvano	2024-02-09 16:03:27 +01:00
Silvano Cerza	e6637f5ec2	Fix all tests	2023-11-24 14:48:43 +01:00
Massimiliano Pippi	8adb8bbab8	Remove preview folder in test/ --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-11-24 11:52:55 +01:00

18 Commits