haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-12-02 01:46:19 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	6836079686	chore: Capitalize DOCX in DOCXToDocument converter (#7931 ) * Capitalize DOCX in DOCXToDocument converter * Update docstrings * Update test class name * add release notes	2024-06-27 08:19:01 +02:00
Stefano Fiorucci	c51f8ffb86	PyPDFToDocument: remove deprecated converter_name and CONVERTERS_REGISTRY (#7910 )	2024-06-21 16:52:03 +02:00
Sebastian Husch Lee	3db56d9066	refactor: DocxToDocument update (#7857 ) * Some changes Use tests file path * Update tests * Add another unit test * Shorten _get_docx_metadata * Update tests * Remove try block * Add a dataclass * Add a to dict unit test * Remove unused import * Add release notes * Update docstrings * Use optional instead of pipe * Update docstring * Remove file	2024-06-19 15:48:31 +02:00
Stefano Fiorucci	8de639bd70	DocxDocument forward reference (#7852 )	2024-06-13 11:29:31 +02:00
Carlos Fernández	c1c339923f	feat: add DocxToDocument converter (#7838 ) * first functioning DocxFileToDocument * fix lazy import message * add reno * Add license header Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * change DocxFileToDocument to DocxToDocument * Update library install to the maintained version Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * clean try-except to only take non haystack errors into account * Add warning on docstring of component ignoring page breaks, mark test as skip * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * make warnings lazy evaluations Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Make warnings lazy evaluated Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Solve f bug * Get more metadata from docx files * add 'python-docx' dependency and docs * Change logging import Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Fix typo Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * remake metadata extraction for docx * solve bug regarding _get_docx_metadata method * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/converters/docx.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Delete unused test --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-06-12 11:58:36 +02:00
Sebastian Husch Lee	2c2c7c9f56	feat: Add PPTXToDocument converter (#7808 ) * Add first pass at PPTXToDocument converter * Add test and update code * Add doc string * Update docstrings * Add release notes * remove unused imports, add to api docs, update pyproject.toml * Add a new test * Add dep so tests can run	2024-06-07 09:43:29 +00:00
Stefano Fiorucci	7181f6b7e9	feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705 ) * change HTML conversion backend to Trafilatura * rm unused var	2024-05-17 10:38:47 +02:00
Massimiliano Pippi	10c675d534	chore: add license header to all modules (#7675 ) * add license header to modules * check license header at linting time	2024-05-09 13:40:36 +00:00
Mo	2e35f13085	feat: add converter based on pdfminer (#7607 ) * Initial commit pdfminer converter * Revert back naming of argument all_text per pdfminer documentation * Add the component decorator * Add release notes * Reformat code with black * Remove LTPage and comments * Update dependencies in pyproject.toml * Added some tests and incorporated reference doc in docstring * Added some tests and incorporated reference doc in docstring	2024-05-02 10:36:54 +02:00
Vladimir Blagojevic	988c360b6d	feat: Azure converter updates (#7409 ) * Initial commit * Remove old mock tests * Fix current_last_page_number calculation * Carry over unit tests from the other side * Update pydocs, skip failing tests * Fix pylint and mypy * Minor adjustments * Add release note * Minor touch ups * Resolve Document unique id issue by using custom id calculation * Better hashing, add unit tests * Small fixes	2024-04-09 09:45:06 +02:00
Vladimir Blagojevic	c3b96392fd	feat: Use all HTMLToDocument extractors until content is extracted (#7452 ) * Use all HTMLToDocument extractors until content is extracted * Add release note * Minor doc update * Improvements, unit test fixes * Add try_others init param, update tests * Update haystack/components/converters/html.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback - Stefano * Improve reno release note, add reference * little fixes --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-04-05 16:02:34 +02:00
Stefano Fiorucci	6925e3a2e1	refactor!: Improve `PyPDFToDocument` (#7362 ) * first draft * rm kwargs from protocol * Simplify * no breaking changes * reno * one more test of the deprecated registry	2024-03-26 10:09:29 +01:00
Vladimir Blagojevic	0e7c41be5e	feat: Improve OpenAPIServiceToFunctions signature (#7257 ) * Convert OpenAPIServiceToFunctions run interface --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2024-03-04 14:38:58 +01:00
Vladimir Blagojevic	d871bbbfbd	feat: Add complex types in OpenAPI support (#7065 ) * Add complex types OpenAPI support * Add release note --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-02-27 18:11:06 +01:00
Vladimir Blagojevic	cb6389d7a2	feat: Improve OpenAPI integration (#7034 ) * Simplify and improve OpenAPIServiceConnector and OpenAPIServiceToFunctions, add unit tests * Add reno note * Add flask test dependency * Initial PR feedback - Julian * Remove indirection - Silvano * Remove flask end-to-end tests * Remove unused import * Add mixed body unit test * Update unit test, mock properly	2024-02-22 14:03:50 +01:00
Vladimir Blagojevic	8d46a2883e	feat: Make system_messages optional in OpenAPIServiceToFunctions run (#6825 ) * Make system_messages optional in OpenAPIServiceToFunctions run * Adjust unit test * PR feedback Massi	2024-02-14 16:04:35 +01:00
Vladimir Blagojevic	6a776e672f	Add OutputAdapter serde for custom filters (#6985 )	2024-02-13 16:56:43 +01:00
Vladimir Blagojevic	97a0df66d2	feat: Add OutputAdapter (#6936 ) * Add OutputAdapter component --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>	2024-02-13 13:03:50 +01:00
Madeesh Kannan	27d1af3068	feat!: Use `Secret` for passing authentication secrets to components (#6887 ) * feat!: Use `Secret` for passing authentication secrets to components * Add comment to clarify type ignore	2024-02-05 13:17:01 +01:00
Madeesh Kannan	5d66d040cc	feat: Add serde methods to `HTMLToDocument` (#6758 )	2024-01-18 10:02:01 +01:00
Sebastian Husch Lee	c0b67432e4	feat: Add page breaks to default PDF to Document converter (#6755 ) * Speedup tests for PyPDFToDocument * Added unit test and removed skipping of empty pages * add release note * Add back some integration marks	2024-01-18 08:54:59 +01:00
ZanSara	abd16ab796	feat: support single metadata dictionary in `MarkdownToDocument` (#6629 ) * support single metadata dict in markdown2document * reno * unwrap list * direct key access * typing * add explicit test	2024-01-09 14:44:39 +01:00
ZanSara	175b5baf45	feat: support single metadata dictionary in `AzureOCRDocumentConverter` (#6635 ) * support single metadata dict in azureconverter * reno * tests * Update releasenotes/notes/single-meta-in-azureconverter-ce1cc196a9b161f3.yaml	2024-01-09 10:49:37 +01:00
ZanSara	974d65f30a	feat: support single metadata dictionary in `TikaDocumentConverter` (#6698 ) * reno * converter * test * comment	2024-01-09 09:49:47 +01:00
Stefano Fiorucci	bb2b1a20f8	refactor: optimize API keys reading (#6655 ) * centralize API keys handling * fix mypy and pylint * rm utility function, be more explicit	2024-01-05 10:40:03 +01:00
ZanSara	c0f1dab454	feat: support single metadata dictionary in `PyPDFToDocument` (#6615 ) * support single metadata dict in pypdf2document * improve tests * tests * remove line	2023-12-22 14:13:11 +01:00
ZanSara	ff55985e2d	feat: support single metadata dictionary in `HTMLToDocument` (#6613 ) * support single metadata in HTMLToDocument * reno * docstring	2023-12-21 16:45:31 +01:00
ZanSara	cf79aa1485	feat: add support for single meta dict in `TextFileToDocument` (#6606 ) * add support for single meta dict * reno * reno * mypy * extract to function * docstring * mypy	2023-12-21 14:21:17 +01:00
sahusiddharth	3d17e6ff76	changed metadata to meta (#6605 )	2023-12-21 12:39:58 +01:00
Vladimir Blagojevic	2dd5a94b04	feat: Add RAG based OpenAPI service integration (#6555 ) * Add OpenAPIServiceConnector and OpenAPIServiceToFunctions * Add release note * Add test deps * Better docs on OpenAPI spec reqs, improve tests * Silvano PR feedback	2023-12-19 13:27:41 +01:00
Stefano Fiorucci	94cfe5d9ae	feat!: `HTMLToDocument` - allow choosing the boilerpy3 extractor (#6582 ) * allow extractor customizability * release note * typo	2023-12-19 10:52:12 +01:00
Stefano Fiorucci	2f034d3c97	refactor!: Converters - standardize inputs (#6540 ) * standardize converters inputs: first draft * fix precommit * fix precommit 2 * fix precommit 3 * add default for optional param * rm leftover * install boilerpy in linting workflow * add boilerpy3 to the core dependencies * add reno * remove boilerpy3 installation from test workflow * fix pylint: import order and unused import * fix import order * add release note * better Tika docstring * rm boilerpy from linting * leftover * md link brackets * feat: Converters - allow passing `meta` in the `run` method (#6554) * first impl for html * progressing on other components * fix test * add tests - run with meta * release note * reintroduce patches wrongly deleted * add patch in test * fix tika test * Update haystack/components/converters/azure.py Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> * Update releasenotes/notes/converters-standardize-inputs-ed2ba9c97b762974.yaml Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * simplify test --------- Co-authored-by: Massimiliano Pippi <mpippi@gmail.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2023-12-15 16:41:35 +01:00
Massimiliano Pippi	7c05f37a53	remove unit marker (#6450 )	2023-11-29 19:24:25 +01:00
Silvano Cerza	e6637f5ec2	Fix all tests	2023-11-24 14:48:43 +01:00
Massimiliano Pippi	8adb8bbab8	Remove preview folder in test/ --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-11-24 11:52:55 +01:00

35 Commits