* Add FormRecognizerConverter
* Change signature of convert method + change return type of all converters
* Adapt preprocessing util to new return type of converters
* Parametrize number of lines used for surrounding context of table
* Change name from FormRecognizerConverter to AzureConverter
* Set version of azure-ai-formrecognizer package
* Change tutorial 8 based on new return type of converters
* Add tests
* Add latest docstring and tutorial changes
* Fix typo
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
* Files moved, imports all broken
* Fix most imports and docstrings into
* Fix the paths to the modules in the API docs
* Add latest docstring and tutorial changes
* Add a few pipelines that were lost in the imports
* Fix a bunch of mypy warnings
* Add latest docstring and tutorial changes
* Create a file_classifier module
* Add docs for file_classifier
* Fixed most circular imports, now the REST API can start
* Add latest docstring and tutorial changes
* Tackling more mypy issues
* Reintroduce from FARM and fix last mypy issues hopefully
* Re-enable old-style imports
* Fix some more imports from the top-level package in an attempt to sort out circular imports
* Fix some imports in tests to new-style to prevent failed class equalities from breaking tests
* Change document_store into document_stores
* Update imports in tutorials
* Add latest docstring and tutorial changes
* Probably fixes summarizer tests
* Improve the old-style import allowing module imports (should work)
* Try to fix the docs
* Remove dedicated KnowledgeGraph page from autodocs
* Remove dedicated GraphRetriever page from autodocs
* Fix generate_docstrings.sh with an updated list of yaml files to look for
* Fix some more modules in the docs
* Fix the document stores docs too
* Fix a small issue on Tutorial14
* Add latest docstring and tutorial changes
* Add deprecation warning to old-style imports
* Remove stray folder and import Dict into dense.py
* Change import path for MLFlowLogger
* Add old loggers path to the import path aliases
* Fix debug output of convert_ipynb.py
* Fix circular import on BaseRetriever
* Missed one merge block
* re-run tutorial 5
* Fix imports in tutorial 5
* Re-enable squad_to_dpr CLI from the root package and move get_batches_from_generator into document_stores.base
* Add latest docstring and tutorial changes
* Fix typo in utils __init__
* Fix a few more imports
* Fix benchmarks too
* New-style imports in test_knowledge_graph
* Rollback setup.py
* Rollback squad_to_dpr too
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Update jobs link to personio
* Add latest docstring and tutorial changes
* Change jobs link to main website
* Add latest docstring and tutorial changes
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Clarify PDF conversion, languages and encodings
The parameter name `valid_languages` may be a bit misleading when
reading only the tutorials. Users may incorrectly assume that it
enforces that the conversion only works for those languages, when it
is more of a check.
- Provided clarifications in the tutorials to highlight what
valid_languages does and that changing the encoding may give better
results for their language of choice
- Updated the command for `pdftotext` to the correct one
* Allow encodings for `convert_files_to_dicts`
- Add the option of passing an encoding to the converters. Even for some
Latin1 languages I tried, the converter does not handle the text well.
A potential issue is that the encoding defaults to None, which is the
default for the other converters, but not for the PDFToTextConverter.
Could add a check and change the encoding to Latin1 for PDFs if set to
None. Was considering adding it to **kwargs, but since it may be a
commonly used feature that should be documented, I added it as a keyword
argument instead. Would love to hear your input and feedback on it.
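
The check mentioned above could look roughly like the sketch below. The
helper name `resolve_encoding` is hypothetical and not part of the
library; it only illustrates falling back to Latin1 for PDF files when
no encoding is passed, while leaving None for every other file type.

```python
from pathlib import Path
from typing import Optional


def resolve_encoding(file_path: Path, encoding: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: choose the encoding to hand to a converter."""
    # An explicitly passed encoding always wins.
    if encoding is not None:
        return encoding
    # Assumption from the note above: PDFs fall back to Latin1 when unset.
    if file_path.suffix.lower() == ".pdf":
        return "Latin1"
    # Other file types keep None so each converter applies its own default.
    return None
```

With this approach the keyword argument stays optional, and callers who
need a different encoding for their language can still override it.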
* Set back PDF default encoding
* Update documentation
* WIP: First version of preprocessing tutorial
* stride renamed to overlap, ipynb and py files created
* rename split_stride in test
* Update preprocessor api documentation
* define order for markdown files
* define order of modules in api docs
* Add colab links
* Incorporate review feedback
Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>