haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-07 05:14:08 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	28ad78c73d	feat: Add XLSXToDocument converter (#8522 ) * Add draft of the Excel To Document converter * Add license header * Add release note * Use Union instead of pipe * Add openpyxl as additional dep * Fix zip issue * few updates from Bijay * Update deps * Add markdown test * Adding more example excels and expanding tests * Added more tests * Fix windows test by setting lineterminator * Addressing PR comments * PR comments * Fix linting	2025-01-09 09:03:19 +01:00
Michele Pangrazzi	21d53d0ec6	update default value of 'store_full_path' to False in converters (#8619 )	2024-12-10 16:03:38 +01:00
Michele Pangrazzi	b32f85cca2	remove deprecated 'converter' init parameter from PyPDFToDocument component (#8609 )	2024-12-06 15:43:43 +01:00
Amna Mubashar	4c8eb54049	feat: Add store_full_path to converters (3/3) (#8585 ) * Add store_full_path params	2024-12-03 13:48:56 +05:00
Stefano Fiorucci	fb42c035c5	feat: `PyPDFToDocument` - add new customization parameters (#8574 ) * deprecat converter in pypdf * fix linting of MetaFieldGroupingRanker * linting * pypdftodocument: add customization params * fix mypy * incorporate feedback	2024-11-26 16:37:59 +01:00
Amna Mubashar	9302d3d9f0	feat: Add store_full_path to converters (2/3) (#8573 )	2024-11-25 15:22:19 +05:00
Amna Mubashar	21906d0558	feat: Add `store_full_path` to converters (1/3) (#8566 ) * Add store_full_path param to 3 converters	2024-11-22 13:55:08 +01:00
Sebastian Husch Lee	911f3523ab	feat: Increase logging transparency for empty Documents during conversion (#8509 ) * Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter * Add logging statement for PDFMinerToDocument as well * Add tests * Remove unused line * Remove unused line * add reno * Add in PDF file * Update checks in PDF converters and add tests for document splitter * Revert * Remove line * Fix comment * Make mypy happy * Make mypy happy	2024-11-04 09:26:57 +01:00
Madeesh Kannan	33675b4caf	chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument` (#8501 ) * chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument` * Remove unused imports --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2024-10-29 16:42:48 +00:00
Vladimir Blagojevic	28161f7bb9	feat: DOCXToDocument: add table extraction (#8457 ) * DOCXToDocument: add table extraction * Add reno note * mypy fixes * add unit tests * Add csv table support * Update release note * Add TableFormat enum * Add table_format as str init param * Update docx.py Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * PR feedback * PR feedback --------- Co-authored-by: medsriha <medsriha@gmail.com> Co-authored-by: Mo Sriha <22803208+medsriha@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-10-29 16:20:27 +01:00
Madeesh Kannan	e7bfd80f3b	fix: (Temporarily) Re-add suport for pre-2.6.0 YAMLs with `PyPDFConverter` (#8443 )	2024-10-08 14:35:43 +02:00
Madeesh Kannan	ee89f6ad57	fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter` (#8430 ) * fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter` * Remove `auto` prefix from serde util function names, add unit tests	2024-10-01 16:35:38 +02:00
Silvano Cerza	d6f073f9b3	Revert "fix: make pypdf converter more robust (#8427 )" (#8428 ) This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.	2024-10-01 11:55:25 +02:00
Tobias Wochinger	d234c75168	fix: make pypdf converter more robust (#8427 ) * fix: make `from_dict` of `PyPDFToDocument` more robust * chore: drop trailing space * converting method to static and making the comment shorter * reverting method to static --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2024-09-30 16:47:23 +00:00
Silvano Cerza	29672d4b42	feat: Add `JSONConverter` Component (#8397 ) * Add JSONConverter Component * Handle some corner cases * Add JSONConverter to pydoc config * Add a way to extract all non content fields as metadata * Small fix in docstring * Fix tests * docstrings upd * Update json.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2024-09-25 12:34:51 +02:00
Sriniketh J	e98a6fea04	Convertor: CSVToDocument (#8328 ) * carry forwarded initial commit * fix: doc strings * fix: update docstrings * fix: docstring update * fix: csv encoding in actions * fix: line endings through hooks * fix: converter docs addition	2024-09-06 10:59:12 +02:00
Silvano Cerza	3e3f79b928	feat: Add `unsafe` init arg in `ConditionalRouter` and `OutputAdapter` to enable previous behaviour (#8176 ) * Add unsafe behaviour to OutputAdapter * Add unsafe behaviour to ConditionalRouter * Add release notes * Fix mypy * Add documentation links --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-09-02 14:14:54 +00:00
Stefano Fiorucci	2e619f06c8	fix: make meta produced by `DOCXToDocument` JSON serializable (#8263 ) * make meta from DOCXToDocument JSON serializable * unused import * update docstrings	2024-08-22 12:24:32 +00:00
Jon Strutz	471f07c8fe	fix: extract page breaks from .docx files (#8232 ) * fix: extract page breaks from .docx files Context: Currently, DOCXToDocument does not extract page breaks from word documents. This makes it impossible to do things like split by page or get correct page number metadata after using something like DocumentSplitter. For example, if you split by word, the 'page_number' metadata field will be 1 for all documents. Solution: Added a method to DOCXToDocument that extracts page breaks from word documents as '\f' characters so that they are recognized by DocumentSplitter. Caveat: Due to the way the python-docx library is set up, you can only accurately determine the location of the first page break for a given paragraph. In the rare case that a paragraph contains more than one page break (which means it is an extremely long paragraph spanning multiple pages), the 2nd, 3rd, etc. page break locations are not known. To sort of fix this, I just appended the page break characters to the end of the paragraph text to keep the overall page number values for the document consistent. * Apply suggestions from code review --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-08-21 09:48:02 +00:00
Vladimir Blagojevic	3318d894c0	Add sede_with_list_output_type_in_pipeline unit test (#8196 )	2024-08-13 14:37:24 +02:00
Marie-Luise Klaus	ec02817f14	fix: OutputAdapter from_dict with custom_filters None (#8173 ) Co-authored-by: Marie-Luise Klaus <marieluise.klaus@deepset.ai>	2024-08-08 14:02:40 +02:00
Stefano Fiorucci	3d1ad10385	fix html test (#8127 )	2024-07-31 10:59:53 +02:00
Corentin Meyer	1c53aae8f0	fix: Tika converter not yielding page break tags (`\f`) (#8082 ) * Fix TikaConverter not having \f page tag by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template. * Add Reno * Fix test by making Mock Tika return XML (before parsing) * refinements and test --------- Co-authored-by: anakin87 <stefanofiorucci@gmail.com>	2024-07-26 20:13:47 +02:00
Madeesh Kannan	8faa3fa465	Revert "fix: make PyPDF backward compatible (#7996 )" (#8014 ) This reverts commit 58b48e36eb56a896365133ab4a9d8e327989948c.	2024-07-11 13:06:08 +00:00
Tobias Wochinger	58b48e36eb	fix: make PyPDF backward compatible (#7996 ) * fix: make PyPDF backward compatible * Add release note --------- Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-07-09 10:08:37 +02:00
Vladimir Blagojevic	0255422eb3	chore: Mark AzureOCRDocumentConverter test_run_with_pdf_file flaky (#7978 ) * Disable AzureOCRDocumentConverter test_run_with_pdf_file on osx * Mark test flaky instead * Remove import	2024-07-04 16:36:32 +02:00
tstadel	aa46466894	fix: meta from ByteStream input for AzureOCRDocumentConverter (#7955 ) * fix: meta from ByteStream input for AzureOCRDocumentConverter * add test * add reno * fix test	2024-07-04 14:42:30 +02:00
Sebastian Husch Lee	6836079686	chore: Capitalize DOCX in DOCXToDocument converter (#7931 ) * Capitalize DOCX in DOCXToDocument converter * Update docstrings * Update test class name * add releease notes	2024-06-27 08:19:01 +02:00
Stefano Fiorucci	c51f8ffb86	PyPDFToDocument: remove deprecated converter_name and CONVERTERS_REGISTRY (#7910 )	2024-06-21 16:52:03 +02:00
Sebastian Husch Lee	3db56d9066	refactor: DocxToDocument update (#7857 ) * Some changes Use tests file path * Update tests * Add another unit test * Shorten _get_docx_metadata * Update tests * Remove try block * Add a dataclass * Add a to dict unit test * Remove unused import * Add release notes * Update docstrings * Use optional instead of pipe * Update docstring * Remove file	2024-06-19 15:48:31 +02:00
Stefano Fiorucci	8de639bd70	DocxDocument forward reference (#7852 )	2024-06-13 11:29:31 +02:00
Carlos Fernández	c1c339923f	feat: add DocxToDocument converter (#7838 ) * first fucntioning DocxFileToDocument * fix lazy import message * add reno * Add license headder Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * change DocxFileToDocument to DocxToDocument * Update library install to the maintained version Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * clan try-exvept to only take non haystack errors into account * Add wanring on docstring of component ignoring page brakes, mark test as skip * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Make warnings lazy evaluated Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Solve f bug * Get more metadata from docx files * add 'python-docx' dependency and docs * Change logging import Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Fix typo Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * remake metadata extraction for docx * solve bug regarding _get_docx_metadata method * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Delete unused test --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-06-12 11:58:36 +02:00
Sebastian Husch Lee	2c2c7c9f56	feat: Add PPTXToDocument converter (#7808 ) * Add first pass at PPTXToDocument converter * Add test and update code * Add doc string * Update docstrings * Add release notes * remove unused imports, add to api docs, update pyproject.toml * Add a new test * Add dep so tests can run	2024-06-07 09:43:29 +00:00
Stefano Fiorucci	7181f6b7e9	feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705 ) * change HTML conversion backed to Trafilatura * rm unused var	2024-05-17 10:38:47 +02:00
Massimiliano Pippi	10c675d534	chore: add license header to all modules (#7675 ) * add license header to modules * check license header at linting time	2024-05-09 13:40:36 +00:00
Mo	2e35f13085	feat: add converter based on pdfminer (#7607 ) * Initial commit pdfminer converter * Revert back naming of argument all_text per pdfminer documentation * Add the component decorator * Add release notes * Reformat code with black * Remove LTPage and comments * Update dependencies in pyproject.toml * Added some tests and incorporated reference doc in docstring * Added some tests and incorporated reference doc in docstring	2024-05-02 10:36:54 +02:00
Vladimir Blagojevic	988c360b6d	feat: Azure converter updates (#7409 ) * Initial commit * Remove old mock tests * Fix current_last_page_number calculation * Carry over unit tests from the other side * Update pydocs, skip failing tests * Fix pylint and mypy * Minor adjustments * Add release note * Minor touch ups * Resolve Document unique id issue by using custom id calculation * Better hashing, add unit tests * Small fixes	2024-04-09 09:45:06 +02:00
Vladimir Blagojevic	c3b96392fd	feat: Use all HTMLToDocument extractors until content is extracted (#7452 ) * Use all HTMLToDocument extractors until content is extracted * Add release note * Minor doc update * Improvements, unit test fixes * Add try_others init param, update tests * Update haystack/components/converters/html.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback - Stefano * Improve reno release note, add reference * little fixes --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-04-05 16:02:34 +02:00
Stefano Fiorucci	6925e3a2e1	refactor!: Improve `PyPDFToDocument` (#7362 ) * first draft * rm kwargs from protocol * Simplify * no breaking changes * reno * one more test of the deprecated registry	2024-03-26 10:09:29 +01:00
Vladimir Blagojevic	0e7c41be5e	feat: Improve OpenAPIServiceToFunctions signature (#7257 ) * Convert OpenAPIServiceToFunctions run interface --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2024-03-04 14:38:58 +01:00
Vladimir Blagojevic	d871bbbfbd	feat: Add complex types in OpenAPI support (#7065 ) * Add complex types OpenAPI support * Add release note --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-02-27 18:11:06 +01:00
Vladimir Blagojevic	cb6389d7a2	feat: Improve OpenAPI integration (#7034 ) * Simplify and improve OpenAPIServiceConnector and OpenAPIServiceToFunctions, add unit tests * Add reno note * Add flask test dependency * Initial PR feedback - Julian * Remove indirection - Silvano * Remove flask end-to-end tests * Remove unused import * Add mixed body unit test * Update unit test, mock properly	2024-02-22 14:03:50 +01:00
Vladimir Blagojevic	8d46a2883e	feat: Make system_messages optional in OpenAPIServiceToFunctions run (#6825 ) * Make system_messages optional in OpenAPIServiceToFunctions run * Adjust unit test * PR feedback Massi	2024-02-14 16:04:35 +01:00
Vladimir Blagojevic	6a776e672f	Add OutputAdapter sede for custom filters (#6985 )	2024-02-13 16:56:43 +01:00
Vladimir Blagojevic	97a0df66d2	feat: Add OutputAdapter (#6936 ) * Add OutputAdapter component --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2024-02-13 13:03:50 +01:00
Madeesh Kannan	27d1af3068	feat!: Use `Secret` for passing authentication secrets to components (#6887 ) * feat!: Use `Secret` for passing authentication secrets to components * Add comment to clarify type ignore	2024-02-05 13:17:01 +01:00
Madeesh Kannan	5d66d040cc	feat: Add serde methods to `HTMLToDocument` (#6758 )	2024-01-18 10:02:01 +01:00
Sebastian Husch Lee	c0b67432e4	feat: Add page breaks to default PDF to Document converter (#6755 ) * Speedup tests for PyPDFToDocument * Added unit test and removed skipping of empty pages * add release note * Add back some integration marks	2024-01-18 08:54:59 +01:00
ZanSara	abd16ab796	feat: support single metadata dictionary in `MarkdownToDocument` (#6629 ) * support single metadata dict in markdown2document * reno * unwrap list * direct key access * typing * add explicit test	2024-01-09 14:44:39 +01:00
ZanSara	175b5baf45	feat: support single metadata dictionary in `AzureOCRDocumentConverter` (#6635 ) * support single metadata dict in azureconverter * reno * tests * Update releasenotes/notes/single-meta-in-azureconverter-ce1cc196a9b161f3.yaml	2024-01-09 10:49:37 +01:00

62 Commits