haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-25 22:46:21 +00:00

Author	SHA1	Message	Date
David S. Batista	da60156174	chore: removing unused imports from tests (#9446 )	2025-05-26 16:22:51 +00:00
David S. Batista	2092bedb90	chore: removing unused imports from tests (#9444 )	2025-05-26 13:41:36 +00:00
David S. Batista	ba41696bba	chore: removing unused fixtures in test functions	2025-05-23 09:43:01 +02:00
Sebastian Husch Lee	0fdb88424b	fix: Fix Azure test on forks (#9312 ) * Fix unit test * Fix test	2025-04-25 11:10:59 +02:00
Stefano Fiorucci	38c39a49de	test: review integration tests (#9306 ) * AzureOCR: convert integration test to unit test and simplify * clean up HuggingFaceAPITextEmbedder * clean up LinkContentFetcher * simplify HuggingFaceLocalGenerator * clean up OpenAIGenerator * OpenAIChatGenerator * SentenceTransformersDiversityRanker * TransformersSimilarityRanker * ChatMessage: rm outdated tests * fail fast false * typo	2025-04-25 09:07:57 +02:00
Stefano Fiorucci	9ae7da8df3	test: workflow for slow/unstable integration tests (#9267 ) * workflow for slow integration tests * try changing skipper * Trigger Build * better names * fix * mv tika to slow * try skipping slow workflow * retry paths-ignore * remove skipper * Revert "remove skipper" This reverts commit 302ed2f07f36b33fa61fde0843b5590d79b98d74. * better skipper * retry * Revert "retry" This reverts commit fe5dff68f496645cc45292d74fcd8d043e868392. * try using one workflow * trigger * try to see if it fails * cosmetic changes * improvements * try matrix * retry * fix * clean up * simplify datadog monitoring and trigger * send event to datadog for nightly failures * tests should run if: manual trigger, scheduled, PR has label, release branch, or relevant files changed * clarify slow marker * improve comments * labels	2025-04-23 10:36:44 +02:00
Sebastian Husch Lee	19cf220136	feat: integrate two ready-made SuperComponents from haystack-experimental (#9235 ) * Add super component decorator * Add reno * MultiFileConverter * Add DocumentPreprocessor * Add reno * Add tests and change doc preprocessor to split first then clean * Remove code from merge * Add to pydoc and missing test file * PR comments * Lint fix * Fix mypy * Fix mypy * Add comment * PR comments * Update haystack/components/converters/multi_file_converter.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/preprocessors/document_preprocessor.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/components/converters/multi_file_converter.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * PR comments * PR comment --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2025-04-17 10:02:26 +00:00
Julian Risch	e64db61973	feat: include hyperlink addresses in DOCXToDocument output (#9109 ) * add DOCXLinkFormat * handle page breaks * add sample docx files * make no link extraction the default * reno * docstring and comment	2025-03-25 13:33:18 +00:00
David S. Batista	c037052581	feat: adding function to detect unmapped CID characters in `PDFMinerToDocument` (#8992 ) * adding function to detect unmapped CID characters * adding release notes * adding test for logs	2025-03-06 15:44:06 +00:00
Stefano Fiorucci	c04c900f26	build: drop Python 3.8 support (#8978 ) * draft * readd typing_extensions * small fix + release note * remove ruff target-version * Update releasenotes/notes/drop-python-3.8-868710963e794c83.yaml Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-03-05 14:59:56 +00:00
Stefano Fiorucci	f3c44be904	refactor!: remove `dataframe` field from `Document` and `ExtractedTableAnswer`; make `pandas` optional (#8906 ) * remove dataframe * release note * small fix * group imports * Update pyproject.toml Co-authored-by: Julian Risch <julian.risch@deepset.ai> * Update pyproject.toml Co-authored-by: Julian Risch <julian.risch@deepset.ai> * address feedback --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2025-03-04 11:06:07 +00:00
Sebastian Husch Lee	52a028251c	refactor!: update `AzureOCRDocumentConverter` to not use the `dataframe` field for tabular Documents (#8885 ) * Save document as a csv table now * Fix tests * Fix tests * Add reno	2025-03-03 12:45:02 +00:00
Sebastian Husch Lee	99a998f90b	feat: Add MSGToDocument converter (#8868 ) * Initial commit of MSG converter from Bijay * Updates to the MSG converter * Add license header * Add tests for msg converter * Update converter * Expanding tests * Update docstrings * add license header * Add reno * Add to inits and pydocs * Add test for empty input * Fix types * Fix mypy --------- Co-authored-by: Bijay Gurung <bijay.learning@gmail.com>	2025-02-24 08:12:32 +01:00
David S. Batista	7d51793727	chore: cleaning up unused imports in tests (#8887 )	2025-02-20 16:56:16 +00:00
Sebastian Husch Lee	71416c81bc	feat: Add store_full_path to converter (#8849 ) * Add missing store_full_path to converter * Add release note * Fix pylint	2025-02-12 17:11:59 +01:00
Stefano Fiorucci	2828d9e4ae	refactor!: `DOCXToDocument` converter - store DOCX metadata as a dict (#8804 ) * DOCXToDocument - store DOCX metadata as a dict * do not export DOCXMetadata to converters package	2025-02-05 14:43:19 +01:00
Sebastian Husch Lee	bba84e5517	fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775 ) * Small fix * Add reno * Trying out license header fix here	2025-01-28 10:29:55 +01:00
David S. Batista	5af2888e23	fix: `PDFMinerToDocument` convert function - adding double new lines between each `container_text` so that passages can be detected. (#8729 ) * initial import * adding double new lines between container_texts so that passages can be detected * reducing type specification to avoid import error * adding release notes * renaming variable	2025-01-17 13:01:16 +00:00
David S. Batista	2c84266d8f	test: adding test for PyPDF to extract passages so that they are detect by DocumentSplitter (#8739 )	2025-01-17 10:56:16 +01:00
Julian Risch	642fa60cdf	fix: PDFMinerToDocument initializes documents with content and meta (#8708 ) * fix: PDFMinerToDocument initializes documents with content and meta * add release note * Apply suggestions from code review Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-13 10:12:06 +00:00
Julian Risch	dd9660f90d	fix: PyPDFToDocument initializes documents with content and meta (#8698 ) * initialize document with content and meta * update test * add test checking that not only content is used for id generation	2025-01-09 19:12:10 +00:00
mathislucka	fe9b1e29d4	CI: fix format after newly introduced formatting rules from ruff release (#8696 )	2025-01-09 16:25:55 +00:00
Sebastian Husch Lee	28ad78c73d	feat: Add XLSXToDocument converter (#8522 ) * Add draft of the Excel To Document converter * Add license header * Add release note * Use Union instead of pipe * Add openpyxl as additional dep * Fix zip issue * few updates from Bijay * Update deps * Add markdown test * Adding more example excels and expanding tests * Added more tests * Fix windows test by setting lineterminator * Addressing PR comments * PR comments * Fix linting	2025-01-09 09:03:19 +01:00
Michele Pangrazzi	21d53d0ec6	update default value of 'store_full_path' to False in converters (#8619 )	2024-12-10 16:03:38 +01:00
Michele Pangrazzi	b32f85cca2	remove deprecated 'converter' init parameter from PyPDFToDocument component (#8609 )	2024-12-06 15:43:43 +01:00
Amna Mubashar	4c8eb54049	feat: Add store_full_path to converters (3/3) (#8585 ) * Add store_full_path params	2024-12-03 13:48:56 +05:00
Stefano Fiorucci	fb42c035c5	feat: `PyPDFToDocument` - add new customization parameters (#8574 ) * deprecat converter in pypdf * fix linting of MetaFieldGroupingRanker * linting * pypdftodocument: add customization params * fix mypy * incorporate feedback	2024-11-26 16:37:59 +01:00
Amna Mubashar	9302d3d9f0	feat: Add store_full_path to converters (2/3) (#8573 )	2024-11-25 15:22:19 +05:00
Amna Mubashar	21906d0558	feat: Add `store_full_path` to converters (1/3) (#8566 ) * Add store_full_path param to 3 converters	2024-11-22 13:55:08 +01:00
Sebastian Husch Lee	911f3523ab	feat: Increase logging transparency for empty Documents during conversion (#8509 ) * Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter * Add logging statement for PDFMinerToDocument as well * Add tests * Remove unused line * Remove unused line * add reno * Add in PDF file * Update checks in PDF converters and add tests for document splitter * Revert * Remove line * Fix comment * Make mypy happy * Make mypy happy	2024-11-04 09:26:57 +01:00
Madeesh Kannan	33675b4caf	chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument` (#8501 ) * chore: Remove deprecated `DefaultConverter` for `PyPDFToDocument` * Remove unused imports --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2024-10-29 16:42:48 +00:00
Vladimir Blagojevic	28161f7bb9	feat: DOCXToDocument: add table extraction (#8457 ) * DOCXToDocument: add table extraction * Add reno note * mypy fixes * add unit tests * Add csv table support * Update release note * Add TableFormat enum * Add table_format as str init param * Update docx.py Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * PR feedback * PR feedback --------- Co-authored-by: medsriha <medsriha@gmail.com> Co-authored-by: Mo Sriha <22803208+medsriha@users.noreply.github.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-10-29 16:20:27 +01:00
Madeesh Kannan	e7bfd80f3b	fix: (Temporarily) Re-add suport for pre-2.6.0 YAMLs with `PyPDFConverter` (#8443 )	2024-10-08 14:35:43 +02:00
Madeesh Kannan	ee89f6ad57	fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter` (#8430 ) * fix: `PyPDFToDocument` correctly serializes custom converters, deprecate `DefaultConverter` * Remove `auto` prefix from serde util function names, add unit tests	2024-10-01 16:35:38 +02:00
Silvano Cerza	d6f073f9b3	Revert "fix: make pypdf converter more robust (#8427 )" (#8428 ) This reverts commit d234c75168dcb49866a6714aa232f37d56f72cab.	2024-10-01 11:55:25 +02:00
Tobias Wochinger	d234c75168	fix: make pypdf converter more robust (#8427 ) * fix: make `from_dict` of `PyPDFToDocument` more robust * chore: drop trailing space * converting method to static and making the comment shorter * reverting method to static --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2024-09-30 16:47:23 +00:00
Silvano Cerza	29672d4b42	feat: Add `JSONConverter` Component (#8397 ) * Add JSONConverter Component * Handle some corner cases * Add JSONConverter to pydoc config * Add a way to extract all non content fields as metadata * Small fix in docstring * Fix tests * docstrings upd * Update json.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2024-09-25 12:34:51 +02:00
Sriniketh J	e98a6fea04	Convertor: CSVToDocument (#8328 ) * carry forwarded initial commit * fix: doc strings * fix: update docstrings * fix: docstring update * fix: csv encoding in actions * fix: line endings through hooks * fix: converter docs addition	2024-09-06 10:59:12 +02:00
Silvano Cerza	3e3f79b928	feat: Add `unsafe` init arg in `ConditionalRouter` and `OutputAdapter` to enable previous behaviour (#8176 ) * Add unsafe behaviour to OutputAdapter * Add unsafe behaviour to ConditionalRouter * Add release notes * Fix mypy * Add documentation links --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-09-02 14:14:54 +00:00
Stefano Fiorucci	2e619f06c8	fix: make meta produced by `DOCXToDocument` JSON serializable (#8263 ) * make meta from DOCXToDocument JSON serializable * unused import * update docstrings	2024-08-22 12:24:32 +00:00
Jon Strutz	471f07c8fe	fix: extract page breaks from .docx files (#8232 ) * fix: extract page breaks from .docx files Context: Currently, DOCXToDocument does not extract page breaks from word documents. This makes it impossible to do things like split by page or get correct page number metadata after using something like DocumentSplitter. For example, if you split by word, the 'page_number' metadata field will be 1 for all documents. Solution: Added a method to DOCXToDocument that extracts page breaks from word documents as '\f' characters so that they are recognized by DocumentSplitter. Caveat: Due to the way the python-docx library is set up, you can only accurately determine the location of the first page break for a given paragraph. In the rare case that a paragraph contains more than one page break (which means it is an extremely long paragraph spanning multiple pages), the 2nd, 3rd, etc. page break locations are not known. To sort of fix this, I just appended the page break characters to the end of the paragraph text to keep the overall page number values for the document consistent. * Apply suggestions from code review --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>	2024-08-21 09:48:02 +00:00
Vladimir Blagojevic	3318d894c0	Add sede_with_list_output_type_in_pipeline unit test (#8196 )	2024-08-13 14:37:24 +02:00
Marie-Luise Klaus	ec02817f14	fix: OutputAdapter from_dict with custom_filters None (#8173 ) Co-authored-by: Marie-Luise Klaus <marieluise.klaus@deepset.ai>	2024-08-08 14:02:40 +02:00
Stefano Fiorucci	3d1ad10385	fix html test (#8127 )	2024-07-31 10:59:53 +02:00
Corentin Meyer	1c53aae8f0	fix: Tika converter not yielding page break tags (`\f`) (#8082 ) * Fix TikaConverter not having \f page tag by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template. * Add Reno * Fix test by making Mock Tika return XML (before parsing) * refinements and test --------- Co-authored-by: anakin87 <stefanofiorucci@gmail.com>	2024-07-26 20:13:47 +02:00
Madeesh Kannan	8faa3fa465	Revert "fix: make PyPDF backward compatible (#7996 )" (#8014 ) This reverts commit 58b48e36eb56a896365133ab4a9d8e327989948c.	2024-07-11 13:06:08 +00:00
Tobias Wochinger	58b48e36eb	fix: make PyPDF backward compatible (#7996 ) * fix: make PyPDF backward compatible * Add release note --------- Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-07-09 10:08:37 +02:00
Vladimir Blagojevic	0255422eb3	chore: Mark AzureOCRDocumentConverter test_run_with_pdf_file flaky (#7978 ) * Disable AzureOCRDocumentConverter test_run_with_pdf_file on osx * Mark test flaky instead * Remove import	2024-07-04 16:36:32 +02:00
tstadel	aa46466894	fix: meta from ByteStream input for AzureOCRDocumentConverter (#7955 ) * fix: meta from ByteStream input for AzureOCRDocumentConverter * add test * add reno * fix test	2024-07-04 14:42:30 +02:00
Sebastian Husch Lee	6836079686	chore: Capitalize DOCX in DOCXToDocument converter (#7931 ) * Capitalize DOCX in DOCXToDocument converter * Update docstrings * Update test class name * add releease notes	2024-06-27 08:19:01 +02:00

84 Commits