9 Commits

Author SHA1 Message Date
Markus Sagen
69a0c9f2ed
Clarify docs for PDF conversion, languages and encodings (#1570)
* Clarify PDF conversion, languages and encodings

The parameter name `valid_languages` may be a bit misleading when
reading only the tutorials. Users may incorrectly assume that it
restricts the conversion to those languages, when it is
more of a check.

- Provided clarifications in the tutorials to highlight what
valid_languages does and that changing the encoding may give better
results for their language of choice
- Updated the command for `pdftotext` to the correct one

* Allow encodings for `convert_files_to_dicts`

- Added the option of passing an encoding to the converters. Even for some
Latin1 languages, the converter does not handle the text well.

A potential issue is that the encoding defaults to None, which is the default
for the other converters, but not for the PDFToTextConverter. Could add
a check and change the encoding to Latin1 for PDFs if it is set to None.

Was considering adding it to **kwargs, but since it may be a commonly
used feature that should be documented, I added it as a keyword argument instead.
Would love to hear your input and feedback on it.

* Set back PDF default encoding

* Update documentation
2021-10-11 09:30:12 +02:00
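
The encoding point in the commit above can be illustrated without any PDF tooling: the same bytes decode to different text depending on the encoding chosen, which is why passing an explicit encoding may improve results for a given language. The following is a minimal standard-library sketch, not Haystack code; the helper name `decode_with` and the sample string are assumptions made for illustration.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


# Swedish sample text stored as Latin-1 bytes (hypothetical input).
raw = "räksmörgås".encode("latin-1")

# Decoding with the matching encoding preserves the text.
as_latin1 = decode_with(raw, "latin-1")

# Decoding with a mismatched encoding yields U+FFFD replacement characters.
as_utf8 = decode_with(raw, "utf-8")
```

Comparing `as_latin1` and `as_utf8` shows the effect directly: only the first round-trips the original string, while the second contains replacement characters where the non-ASCII letters were.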
bogdankostic
c644e2b4d0
Add comment to tutorial notebooks about restarting runtime in colab (#1486)
* Add comment to tutorial notebooks about restarting runtime in colab

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-09-23 14:36:20 +02:00
Branden Chan
efc03f72db
Make PreProcessor.process() work on lists of documents (#1163)
* Add process_batch method

* Rename methods

* Fix doc string, satisfy mypy

* Fix mypy CI

* Fix typo

* Update tutorial

* Fix argument name

* Change arg name

* Incorporate reviewer feedback
2021-06-23 18:13:51 +02:00
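
The idea behind the commit above, letting one entry point accept either a single document or a list of documents, can be sketched as follows. This is a hedged illustration only and not the actual `PreProcessor` implementation; the `Document` alias, the function names `process_one` and `process`, and the word-based splitting with `split_length` are all assumptions.

```python
from typing import Dict, List, Union

Document = Dict[str, str]  # hypothetical stand-in for a document dict


def process_one(document: Document, split_length: int = 3) -> List[Document]:
    """Split one document's text into chunks of `split_length` words."""
    words = document["text"].split()
    return [
        {"text": " ".join(words[i:i + split_length])}
        for i in range(0, len(words), split_length)
    ]


def process(
    documents: Union[Document, List[Document]], split_length: int = 3
) -> List[Document]:
    """Accept a single document or a list and always return a flat list."""
    if isinstance(documents, dict):
        documents = [documents]  # normalize the single-document case
    chunks: List[Document] = []
    for doc in documents:
        chunks.extend(process_one(doc, split_length))
    return chunks
```

Normalizing the input to a list at the top keeps the per-document logic in one place, so single and batched calls return results in the same shape.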
Branden Chan
783893c3d2
Tutorial update (#1166)
* Add header / footer

* Add Milvus example

* Generate md files

* Fix mypy CI
2021-06-11 11:09:15 +02:00
Julian Risch
40ceaf418a
Fixing grpcio-tools to version of colab's pre-installed grpcio (#1113)
2021-05-31 19:09:10 +02:00
brandenchan
03cda26d85
Fix link in Tutorial 8
2021-02-15 10:45:27 +01:00
Malte Pietsch
e91518ee00
Update tutorials (torch versions, ES version, replace Finder with Pipeline) (#814)
* remove manual torch install on colab

* update elasticsearch version everywhere to 7.9.2

* fix FAQPipeline

* update tutorials with new pipelines

* Add latest docstring and tutorial changes

* revert faqpipeline change. fix field names in tutorial 4

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-02-09 14:56:54 +01:00
Branden Chan
7376185b65
Create DPR training tutorial (#708)
* WIP: Start DPR training tutorial

* Create basics of DPR Train tutorial

* Update documentation

* Allow DPR to be initialized without document store

* WIP: Add param descriptions to DPR notebook

* Clean tutorial

* Improve loading

* Make doc store optional when loading DPR

* Satisfy mypy type check

* Add links

* Add tutorial header

* Add colab badge

* Clear outputs

* Incorporate reviewer feedback

* Add readme links

* Regenerate tutorials

* Add excitement

* Fix typo

* Fix hard negatives comment

* Wrap tutorial for windows users

* Fix mypy issue
2021-01-13 10:33:55 +01:00
Branden Chan
bb8aba18e0
Create Preprocessing Tutorial (#706)
* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
2021-01-06 15:54:05 +01:00