5 Commits

Author SHA1 Message Date
Malte Pietsch
db2b5d913b Fix param in tutorial 8 2021-10-13 14:45:09 +02:00
Markus Sagen
69a0c9f2ed
Clarify docs for PDF conversion, languages and encodings (#1570)
* Clarify PDF conversion, languages and encodings

The parameter name `valid_languages` may be a bit miss-leading from
reading only the tutorials. Users may, incorrectly assume that it
enforces that the conversions only works for those languages, then it's
more of a check.

- Provided clarifications in the tutorials to highlight what
valid_languages does and that changing the encoding may give better
results for their language of choice
- Updated the command for `pdftotext` to the correct one

* Allow encodings for `convert_files_to_dicts`

- Set option of passing encoding to the converters. Trying even for some
Latin1 languages, the converter does not do it in a good way.

Potential issues is that the encoding defaults to None, which is default
for the other converters, but not for the PDFToTextConverter. Could add
a check and change the ending to Latin1 for pdf if set to None.

Was considering adding it to **kwargs, but since it may be a commonly
used feature to be documented, I added it as a keyword argument instead.
Would love to hear your input and feedback on in.

* Set back PDF default encoding

* Update documentation
2021-10-11 09:30:12 +02:00
Branden Chan
783893c3d2
Tutorial update (#1166)
* Add header / footer

* Add Milvus example

* Generate md files

* Fix mypy CI
2021-06-11 11:09:15 +02:00
Julian Risch
3331608e03
Adding a guard that prevents the tutorial code from being executed in every subprocess when using multiprocessing on windows (#729) 2021-01-13 18:17:54 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00