* Add JsonConverter node
* Update language
* JsonConverter: Remove id_hash_keys overwrite when it's None
Also, changes in docstring based on review
* Update docstring for JsonConverter
---------
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: Sebastian Lee <sebastian.lee@deepset.ai>
* feat: add start and eng page to PDF converters
* docs: add missing docstrings
* refactor: change list set up, add docstrings and comment
* fix: add missing parameter
* tests: add page range basic test
* tests: test correct page numbers
* tests: remove OCR page range test
*Poppler and Tesseract not installed on CI
* fix: remove mobile change error
* first attempt to add frontmatter of markdown to the metadata
* remove bug fix
* running black and pre-commit
* moving the import line
* adding a test
* adding pydoc
* fix to removing code blocks in markdown converter
* adding a test
* fixing a test
* improving tests
* adding language to code block
* first attempt to add frontmatter of markdown to the metadata
* remove bug fix
* running black and pre-commit
* moving the import line
* adding a test
* adding pydoc
* feat(PDFToTextConverter): add option to get text in physical layout order
* test: add physical layout extraction test to PDFToTextConverter
* refactor: change layout parameter attribution places
* docs: manually trigger pre-commits
* docs: generate new docs to comply with pydoc-markdown style
* Add page number to Documents coming from PDFConverters and PreProcessor
* Fix mypy
* Update API Docs
* Update API Docs
* Remove unused imports
* Generate JSON schema
* Generate JSON schema
* Make test variable shorter
* Make regex a separate function
* Move counting of page breaks to a function
* Generate JSON schema
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update API Documentation
* Don't create instance for testing staticmethod
* Update haystack/nodes/preprocessor/preprocessor.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Change split logic to list
* Fix wrong parameter for run
* Fix mypy error
* Fix layout/raw parameter
* Add test for filename with whitespaces on PDFToText
* Update Documentation & Code Style
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Unify CI tests (from #2466)
* Update Documentation & Code Style
* Change folder names
* Fix markers list
* Remove marker 'slow', replaced with 'integration'
* Soften children check
* Start ES first so it has time to boot while Python is setup
* Run the full workflow
* Try to make pip upgrade on Windows
* Set KG tests as integration
* Update Documentation & Code Style
* typo
* faster pylint
* Make Pylint use the cache
* filter diff files for pylint
* debug pylint statement
* revert pylint changes
* Remove path from asserted log (fails on Windows)
* Skip preprocessor test on Windows
* Tackling Windows specific failures
* Fix pytest command for windows suites
* Remove \ from command
* Move poppler test into integration
* Skip opensearch test on windows
* Add tolerance in reader sas score for Windows
* Another pytorch approx
* Raise time limit for unit tests :(
* Skip poppler test on Windows CI
* Specify to pull with FF only in docs check
* temporarily run the docs check immediately
* Allow merge commit for now
* Try without fetch depth
* Accelerating test
* Accelerating test
* Add repository and ref alongside fetch-depth
* Separate out code&docs check from tests
* Use setup-python cache
* Delete custom action
* Remove the pull step in the docs check, will find a way to run on bot commits
* Add requirements.txt in .github for caching
* Actually install dependencies
* Change deps group for pylint
* Unclear why the requirements.txt is still required :/
* Fix the code check python setup
* Install all deps for pylint
* Make the autoformat check depend on tests and doc updates workflows
* Try installing dependencies in another order
* Try again to install the deps
* quoting the paths
* Ad back the requirements
* Try again to install rest_api and ui
* Change deps group
* Duplicate haystack install line
* See if the cache is the problem
* Disable also in mypy, who knows
* split the install step
* Split install step everywhere
* Revert "Separate out code&docs check from tests"
This reverts commit 1cd59b15ffc5b984e1d642dcbf4c8ccc2bb6c9bd.
* Add back the action
* Proactive support for audio (see text2speech branch)
* Fix label generator tests
* Remove install of libsndfile1 on win temporarily
* exclude audio tests on win
* install ffmpeg for integration tests
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>