haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-28 11:19:58 +00:00

Author	SHA1	Message	Date
Daniel Bichuetti	28724e2e25	feat: add automatic OCR detection mechanism and improve performance (#4329 ) * feat: add automatic OCR detection mechanism and improve performance * refactor: add error message * refactor: ignore pdftoppm bad typing * refactor: add Tesseract install. docstrings * fix: check if OCR var. assigned on mp * tests: add path to windows/linux tests * tests: add tessdata path * tests: include matrix ref. * tests: custom Tesseract matrix install * refactor: improve user guide * tests: fix macos path * tests: remove brew formulae version * fix: macos paths * tests: fix macos path * tests: add Tesseract to Windows Path * tests: pytesseract path * tests: macos path * refactor: fix path message and remove extra path from tests * refactor: raise exception when path not found * refactor: expression simplification Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * refactor: check ocr parameter * tests: mark as integration * tests: mock deprecation warning * refactor: simplify code Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * refactor: change deprecation test Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * refactor: add unit patch * refactor: black formatting --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>	2023-03-13 20:19:22 +05:30
Daniel Bichuetti	7c49fffc71	feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation (#4226 ) * refactor: isolate PDF converters * refactor: remove xpdf dependency and fix tests * refactor: add min. version * feat: enable multiprocessing and add tests * fix: remove unused imports * fix: regression when moved code * refactor: use itertools * fix: mypy claims * refactor: double tool support * refactor: add fallback to xpdf * refactor: black formatting * refactor: make superclass signature compatible * refactor: complete removal of xPdf * refactor: regroup Haystack imports and fix regression * refactor: remove original declaration * docs: fix docstrings * tests: add [pdf] to [all] * refactor: remove redundant checks, avoid extra processes * refactor: add deprecation warning * refactor: add pytest mark * tests: change PDF test file * fix: correct pytest mark * refactor: deprecate parameter and add new * tests: change pdf sample * Add minor lg changes to docstrings * Fix default value in doc strings * Update test/nodes/test_file_converter.py Co-authored-by: bogdankostic <bogdankostic@web.de> * tests: fix page count * refactor: add imported function * refactor: change default value * tests: change parameters and fix typo * Unify sort_by_position parameter names --------- Co-authored-by: bogdankostic <bogdankostic@web.de> Co-authored-by: agnieszka-m <amarzec13@gmail.com>	2023-03-01 22:34:38 +01:00
Massimiliano Pippi	4b8d195288	refact: mark unit tests under the `test/nodes/*` path (#4235 ) document merger * mark unit tests * revert	2023-02-27 15:00:19 +01:00
Bijay Gurung	d4b822646e	feat: Add JsonConverter node (#4130 ) * Add JsonConverter node * Update language * JsonConverter: Remove id_hash_keys overwrite when it's None Also, changes in docstring based on review * Update docstring for JsonConverter --------- Co-authored-by: agnieszka-m <amarzec13@gmail.com> Co-authored-by: Sebastian Lee <sebastian.lee@deepset.ai>	2023-02-21 09:23:42 +01:00
Daniel Bichuetti	3009ac2988	feat: Add page range support to PDF converters. (#3965 ) * feat: add start and eng page to PDF converters * docs: add missing docstrings * refactor: change list set up, add docstrings and comment * fix: add missing parameter * tests: add page range basic test * tests: test correct page numbers * tests: remove OCR page range test Poppler and Tesseract not installed on CI fix: remove mobile change error	2023-01-30 14:09:22 +01:00
Tuana Celik	93312138de	fix: removing code block in `MarkdownConverter` (#3960 ) * first attempt to add frontmatter of markdown to the metadata * remove bug fix * running black and pre-commit * moving the import line * adding a test * adding pydoc * fix to removing code blocks in markdown converter * adding a test * fixing a test * improving tests * adding language to code block	2023-01-27 15:25:54 +01:00
Tuana Celik	790e9acd3e	feat: add frontmatter to meta in `MarkdownConverter` (#3953 ) * first attempt to add frontmatter of markdown to the metadata * remove bug fix * running black and pre-commit * moving the import line * adding a test * adding pydoc	2023-01-26 17:15:02 +01:00
Benjamin BERNARD	eed009eddb	feat: Add `CsvTextConverter` (#3587 ) * feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline Fixes #3550, allow user to build full FAQ using YAML pipeline description and with CSV import and indexing. * feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline Fix linter issues mypy and pylint. * feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline Fix linter issues mypy. * implement proposal's feedback * tidy up for merge * use BaseConverter * use BaseConverter * pylint * black * Revert "black" This reverts commit e1c45cb1848408bd52a630328750cb67c8eb7110. * black * add check for column names * add check for column names * add tests * fix tests * address lists of paths * typo * remove duplicate line Co-authored-by: ZanSara <sarazanzo94@gmail.com>	2023-01-23 15:56:36 +01:00
bogdankostic	60224412bc	feat: Add headline extraction to `ParsrConverter` (#3488 ) * Add headline extraction to ParsrConverter * Add sample PDF file * Add test * Use extract_headlines if set in convert method * Integrate PR feedback	2022-10-31 19:00:02 +01:00
bogdankostic	4fbe80c098	feat: Extraction of headlines in markdown files (#3445 ) * Extract headings from markdown files + adapt PreProcessor * Add tests * Fix mypy * Generate JSON schema * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update haystack/nodes/file_converter/markdown.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Apply black * Add PR feedback Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-10-26 11:57:55 +02:00
Daniel Bichuetti	df1f4205b6	feat: add public layout-base extraction support on PDFToTextConverter (#3137 ) * feat(PDFToTextConverter): add option to get text in physical layout order * test: add physical layout extraction test to PDFToTextConverter * refactor: change layout parameter attribution places * docs: manually trigger pre-commits * docs: generate new docs to comply with pydoc-markdown style	2022-09-13 16:55:21 +02:00
bogdankostic	5c3bfad078	feat: Add page number to Documents coming from PDFConverters and PreProcessor (#2932 ) * Add page number to Documents coming from PDFConverters and PreProcessor * Fix mypy * Update API Docs * Update API Docs * Remove unused imports * Generate JSON schema * Generate JSON schema * Make test variable shorter * Make regex a separate function * Move counting of page breaks to a function * Generate JSON schema * Apply suggestions from code review Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> * Update API Documentation * Don't create instance for testing staticmethod * Update haystack/nodes/preprocessor/preprocessor.py Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com> Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>	2022-08-09 15:55:27 +02:00
Daniel Augustus Bichuetti Silva	1706729e26	Prevent `PDFToTextConverter` from failing on PDFs with spaces in their names (#2786 ) * Change split logic to list * Fix wrong parameter for run * Fix mypy error * Fix layout/raw parameter * Add test for filename with whitespaces on PDFToText * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-07-11 13:30:33 +02:00
tstadel	1168f6365d	Fix using id_hash_keys as pipeline params (#2717 ) * Fix using id_hash_keys as pipeline params * Update Documentation & Code Style * add tests Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-24 09:55:09 +02:00
Sara Zan	59608ca474	[CI Refactoring] Workflow refactoring (#2576 ) * Unify CI tests (from #2466) * Update Documentation & Code Style * Change folder names * Fix markers list * Remove marker 'slow', replaced with 'integration' * Soften children check * Start ES first so it has time to boot while Python is setup * Run the full workflow * Try to make pip upgrade on Windows * Set KG tests as integration * Update Documentation & Code Style * typo * faster pylint * Make Pylint use the cache * filter diff files for pylint * debug pylint statement * revert pylint changes * Remove path from asserted log (fails on Windows) * Skip preprocessor test on Windows * Tackling Windows specific failures * Fix pytest command for windows suites * Remove \ from command * Move poppler test into integration * Skip opensearch test on windows * Add tolerance in reader sas score for Windows * Another pytorch approx * Raise time limit for unit tests :( * Skip poppler test on Windows CI * Specify to pull with FF only in docs check * temporarily run the docs check immediately * Allow merge commit for now * Try without fetch depth * Accelerating test * Accelerating test * Add repository and ref alongside fetch-depth * Separate out code&docs check from tests * Use setup-python cache * Delete custom action * Remove the pull step in the docs check, will find a way to run on bot commits * Add requirements.txt in .github for caching * Actually install dependencies * Change deps group for pylint * Unclear why the requirements.txt is still required :/ * Fix the code check python setup * Install all deps for pylint * Make the autoformat check depend on tests and doc updates workflows * Try installing dependencies in another order * Try again to install the deps * quoting the paths * Ad back the requirements * Try again to install rest_api and ui * Change deps group * Duplicate haystack install line * See if the cache is the problem * Disable also in mypy, who knows * split the install step * Split install step everywhere * Revert "Separate out code&docs check from tests" This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd. * Add back the action * Proactive support for audio (see text2speech branch) * Fix label generator tests * Remove install of libsndfile1 on win temporarily * exclude audio tests on win * install ffmpeg for integration tests Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-06-07 09:23:03 +02:00
Julian Risch	075ed7fbcb	Remove encoding option from PDFToTextOCRConverter (#2553 ) * remove encoding option from PDFToTextOCRConverter * Update Documentation & Code Style * add unused 'encoding' param to PDFToTextOCRConverter * Update Documentation & Code Style * call run instead of convert to use ligature replacing * Update Documentation & Code Style * add text to check installed poppler version * Update Documentation & Code Style Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-05-24 11:31:32 +02:00
Sara Zan	ff4303c51b	[CI refactoring] Categorize tests into folders (#2554 ) * Categorize tests into folders * Fix linux_ci.yml and an import * Wrong path	2022-05-17 09:55:53 +01:00

17 Commits