Lalit Pagaria
e5b4b62d75
Add CI for windows runner ( #1458 )
...
* Feat: Removing use of temp file while downloading archive from url along with adding CI for windows and mac platform
* Windows CI by default installing pytorch gpu hence updating CI to pick cpu version
* fixing mac cache build issue
* updating windows pip install command for torch
* another attempt
* updating ci
* Adding sudo
* fixing ls failure on windows
* another attempt to fix build issue
* Saving env variable of test files
* Adding debug log
* Github action differ on windows
* adding debug
* anohter attempt
* Windows have different ways to receive env
* fixing template
* minor fx
* Adding debug
* Removing use of json
* Adding back fromJson
* addin toJson
* removing print
* anohter attempt
* disabling parallel run at least for testing
* installing docker for mac runner
* correcting docker install command
* Linux dockers are not suported in windows
* Removing mac changes
* Upgrading pytorch
* using lts pytorch
* Separating win and ubuntu
* Install java 11
* enabling linux container env
* docker cli command
* docker cli command
* start elastic service
* List all service
* correcting service name
* Attempt to fix multiple test run
* convert to json
* another attempt to check
* Updating build cache step
* attempt
* Add tika
* Separating windows CI
* Changing CI name
* Skipping test which does not work in windows
* Skipping tests for windows
* create cleanup function in conftest
* adding skipif marker on tests
* Run windows PR on only push to master
* Addressing review comments
* Enabling windows ci for this PR
* Tika init is being called when importing tika function
* handling tika import issue
* handling tika import issue in test
* Fixing import issue
* removing tika fixure
* Removing fixture from tests
* Disable windows ci on pull request
* Add back extra pytorch install step
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-10-29 10:22:28 +02:00
Malte Pietsch
4a6c9302b3
Redesign primitives - Document
, Answer
, Label
( #1398 )
...
* first draft / notes on new primitives
* wip label / feedback refactor
* rename doc.text -> doc.content. add doc.content_type
* add datatype for content
* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field
* update converters for . Add warning for empty
* renam label.question -> label.query. Allow sorting of Answers.
* WIP primitives
* update ui/reader for new Answer format
* Improve Label. First refactoring of MultiLabel. Adjust eval code
* fixed workflow conflict with introducing new one (#1472 )
* Add latest docstring and tutorial changes
* make add_eval_data() work again
* fix reader formats. WIP fix _extract_docs_and_labels_from_dict
* fix test reader
* Add latest docstring and tutorial changes
* fix another test case for reader
* fix mypy in farm reader.eval()
* fix mypy in farm reader.eval()
* WIP ORM refactor
* Add latest docstring and tutorial changes
* fix mypy weaviate
* make label and multilabel dataclasses
* bump mypy env in CI to python 3.8
* WIP refactor Label ORM
* WIP refactor Label ORM
* simplify tests for individual doc stores
* WIP refactoring markers of tests
* test alternative approach for tests with existing parametrization
* WIP refactor ORMs
* fix skip logic of already parametrized tests
* fix weaviate behaviour in tests - not parametrizing it in our general test cases.
* Add latest docstring and tutorial changes
* fix some tests
* remove sql from document_store_types
* fix markers for generator and pipeline test
* remove inmemory marker
* remove unneeded elasticsearch markers
* add dataclasses-json dependency. adjust ORM to just store JSON repr
* ignore type as dataclasses_json seems to miss functionality here
* update readme and contributing.md
* update contributing
* adjust example
* fix duplicate doc handling for custom index
* Add latest docstring and tutorial changes
* fix some ORM issues. fix get_all_labels_aggregated.
* update drop flags where get_all_labels_aggregated() was used before
* Add latest docstring and tutorial changes
* add to_json(). add + fix tests
* fix no_answer handling in label / multilabel
* fix duplicate docs in memory doc store. change primary key for sql doc table
* fix mypy issues
* fix mypy issues
* haystack/retriever/base.py
* fix test_write_document_meta[elastic]
* fix test_elasticsearch_custom_fields
* fix test_labels[elastic]
* fix crawler
* fix converter
* fix docx converter
* fix preprocessor
* fix test_utils
* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations
* Add latest docstring and tutorial changes
* fix crawler test. fix ocrconverter attribute
* fix test_elasticsearch_custom_query
* fix generator pipeline
* fix ocr converter
* fix ragenerator
* Add latest docstring and tutorial changes
* fix test_load_and_save_yaml for elasticsearch
* fixes for pipeline tests
* fix faq pipeline
* fix pipeline tests
* Add latest docstring and tutorial changes
* fix weaviate
* Add latest docstring and tutorial changes
* trigger CI
* satisfy mypy
* Add latest docstring and tutorial changes
* satisfy mypy
* Add latest docstring and tutorial changes
* trigger CI
* fix question generation test
* fix ray. fix Q-generation
* fix translator test
* satisfy mypy
* wip refactor feedback rest api
* fix rest api feedback endpoint
* fix doc classifier
* remove relation of Labels -> Docs in SQL ORM
* fix faiss/milvus tests
* fix doc classifier test
* fix eval test
* fixing eval issues
* Add latest docstring and tutorial changes
* fix mypy
* WIP replace dataclasses-json with manual serialization
* Add latest docstring and tutorial changes
* revert to dataclass-json serialization for now. remove debug prints.
* update docstrings
* fix extractor. fix Answer Span init
* fix api test
* keep meta data of answers in reader.run()
* fix meta handling
* adress review feedback
* Add latest docstring and tutorial changes
* make document=None for open domain labels
* add import
* fix print utils
* fix rest api
* adress review feedback
* Add latest docstring and tutorial changes
* fix mypy
Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00
Shahrukh Khan
4822536886
Add ImageToTextConverter and PDFToTextOCRConverter that utilize OCR ( #1349 )
...
* add image.py converter
* add PDFtoImageConverter
* add init to PDFtoImageConverter and classes to __init__
* update imagetotext pipeline
* update imagetotext pipeline
* update imagetotext pipeline
* update imagetotext pipeline
* update imagetotext pipeline
* update imagetotext pipeline
* update imagetotext pipeline
* revert change in base.py in file_conv
* Update base.py
* Update pdf.py
* add ocr file_converter testcase & update dockerfile
* fix tesseract exception message typo
* fix _image_to_text doctstring
* add tesseract installation to CI
* add tesseract installation to CI
* add content test for PDF OCR converter
* update PDFToTextOCRConverter constructor doctsring
* replace image files with tmp paths for image.py convert
* replace image files with tmp paths for image.py convert
* Update README.md
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2021-09-01 16:42:25 +02:00
Lalit Pagaria
e904deefa7
Add Markdown file convertor ( #875 )
2021-03-23 16:31:26 +01:00
oryx1729
c4607cbd98
Revamp CI ( #825 )
2021-02-12 13:38:54 +01:00