mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-23 17:00:41 +00:00

* Clarify PDF conversion, languages and encodings

  The parameter name `valid_languages` may be a bit misleading when reading only the tutorials. Users may incorrectly assume that it restricts the conversion to those languages, when it is really more of a check.

  - Provided clarifications in the tutorials to highlight what `valid_languages` does and that changing the encoding may give better results for the user's language of choice
  - Updated the command for `pdftotext` to the correct one

* Allow encodings for `convert_files_to_dicts`

  - Added the option of passing an encoding to the converters. Even when trying some Latin1 languages, the converter does not handle them well.

  A potential issue is that the encoding defaults to None, which is the default for the other converters but not for the `PDFToTextConverter`. We could add a check and change the encoding to Latin1 for PDFs if it is set to None.

  I was considering adding it to `**kwargs`, but since it may be a commonly used feature that should be documented, I added it as a keyword argument instead. Would love to hear your input and feedback on it.

* Set back PDF default encoding

* Update documentation
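
The None-to-Latin1 fallback mentioned above could be sketched roughly as follows. This is only an illustration, not the code in this change: `resolve_encoding` and `build_pdftotext_command` are hypothetical helper names, and the argument list assumes the `pdftotext` CLI's `-enc` and `-layout` options.

```python
from pathlib import Path
from typing import List, Optional


def resolve_encoding(file_path: str, encoding: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: fall back to Latin1 for PDFs when no encoding is given."""
    if encoding is None and Path(file_path).suffix.lower() == ".pdf":
        return "Latin1"
    # Other file types keep whatever the caller passed, including None.
    return encoding


def build_pdftotext_command(file_path: str, encoding: Optional[str] = None) -> List[str]:
    """Hypothetical helper: assemble a pdftotext call that writes text to stdout."""
    command = ["pdftotext"]
    resolved = resolve_encoding(file_path, encoding)
    if resolved is not None:
        # Only pass -enc when an encoding is known.
        command += ["-enc", resolved]
    # "-" as the output file tells pdftotext to print to stdout.
    command += ["-layout", file_path, "-"]
    return command
```

With this shape, a caller who passes `encoding="UTF-8"` overrides the fallback, while a caller who passes nothing still gets Latin1 for PDF input.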