Sara Zan
8ddeda811a
generate docs for search.engine.py ( #3507 )
2022-10-31 16:57:39 +01:00
Massimiliano Pippi
9fe2f69d56
add workflow to triage new issues with GH projects ( #3508 )
2022-10-31 16:01:59 +01:00
Massimiliano Pippi
b694c7b5cb
Document Store test refactoring ( #3449 )
...
* add new marker
* start using test hierarchies
* move ES tests into their own class
* refactor test workflow
* job steps
* add more tests
* move more tests
* more tests
* test labels
* add more tests
* Update tests.yml
* Update tests.yml
* fix
* typo
* fix es image tag
* map es ports
* try
* fix
* default port
* remove opensearch from the markers sorcery
* revert
* skip new tests in old jobs
* skip opensearch_faiss
2022-10-31 15:30:14 +01:00
Mayank Jobanputra
85cdc1040a
Added telemetry changes ( #3503 )
2022-10-31 12:49:52 +01:00
Sara Zan
adc982a624
fix: do not reference package directory in PDFToTextOCRConverter.convert() ( #3478 )
...
* remove weird temp path from PDFToTextOCRConverter.convert()
* remove debug lines
* remove os import
2022-10-31 12:48:43 +01:00
Massimiliano Pippi
17cd79e2c8
[release process] Create new schema when bumping unstable ( #3416 )
...
* also create new schema when bumping unstable version
* openapi schema
* no need to update the json schema anymore
2022-10-31 12:26:48 +01:00
Sara Zan
54cc9cd4cf
refactor: remove json-schemas ( #3485 )
...
* remove json-schemas
* main schema can be removed too
* add .gitignore to schemas folder
* try to explicitly get the new haystack in the rest api tests
* fix workflow again
* fix version string in rest api tests
* add pip freeze
* debug statements in workflow
* -U prevents schema generation
2022-10-31 11:24:43 +01:00
Massimiliano Pippi
b52ed52c4e
fix docker minimal deprecated image ( #3497 )
2022-10-28 16:46:48 +02:00
Sebastian
384663981d
Fixed bug in onnx converter for XLMRoberta architecture ( #3470 )
2022-10-28 15:35:53 +02:00
Massimiliano Pippi
9f4a9a76a3
fix: pattern to match tags push ( #3469 )
2022-10-28 14:52:30 +02:00
Sara Zan
823d0d3006
Add Schemas badge on README.md ( #3493 )
2022-10-28 13:57:42 +02:00
Sara Zan
a66e7caa34
feat: hatch-autorun generates schemas ( #3484 )
...
* hatch-run generates the schemas
* fix path
* keep schemas for now
* fix path
* schemas
* Do not generate rc schemas
* make the autorun hook self-destroy
* typo
* schemas
* schemas were ok
* improve logs to make generate_schema.py usable standalone too
* fix warning
* Update warning
* Update generate_schema.py
* black
2022-10-28 13:55:11 +02:00
Massimiliano Pippi
1f9f4ab03a
fix: fix docs badge ( #3491 )
...
* fix: fix docs badge
* format
Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2022-10-28 11:59:49 +02:00
Sara Zan
f377b78263
refactor: replace YAML schema check with a dispatch call ( #3482 )
...
* Replace yaml check with a dispatch call
* split workflow
* add branch for testing
* access secrets properly
* remove testing branch trigger
2022-10-28 10:48:59 +02:00
Sebastian
8db7dfb884
refactor: TableReader ( #3456 )
...
* Refactoring table reader
2022-10-26 20:57:28 +02:00
Sebastian
59857cb492
feat: Speed up reader tests ( #3476 )
...
* Use a smaller reader where possible
* Change scope to module of reader to get faster load times
2022-10-26 19:04:18 +02:00
Tuana Celik
a4002ae87c
Updating readme to point to new docs site ( #3336 )
...
* Updating readme to point to new docs site
* updating some links
* updating docs link
2022-10-26 17:28:46 +02:00
Sara Zan
dd774b867d
add missing schemas ( #3480 )
2022-10-26 17:27:42 +02:00
Sara Zan
05c68b6624
feat: add document_store to all BaseRetriever.retrieve() and BaseRetriever.retrieve_batch() implementations ( #3379 )
...
* add document_store to retrieve()
* mypy & pylint
* pass docstore to embedding encoders
* schemas
* mypy and pylint
* fix tfidfretriever
* pylint
* mypy
* pylint
* fix tfidf
* mypy
* pylint
* schemas
* another fix for tfidf
* fix question generation tests
* remove docstore from embedding encoder signature
* pylint
* revert accidental test changes
* Apply suggestions from code review
* check for docstore similarity function only if the docstore is present
* check for docstore similarity function only if the docstore is present
2022-10-26 15:47:06 +02:00
Julian Risch
d0691a4bd5
bug: replace decorator with counter attribute for pipeline event ( #3462 )
2022-10-26 12:09:04 +02:00
bogdankostic
4fbe80c098
feat: Extraction of headlines in markdown files ( #3445 )
...
* Extract headings from markdown files + adapt PreProcessor
* Add tests
* Fix mypy
* Generate JSON schema
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/nodes/file_converter/markdown.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply black
* Add PR feedback
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2022-10-26 11:57:55 +02:00
Vladimir Blagojevic
5ca96357ff
feat: Add CohereEmbeddingEncoder to EmbeddingRetriever ( #3453 )
2022-10-25 17:52:29 +02:00
Branden Chan
7b15799853
Change slug and title ( #3474 )
2022-10-25 16:41:27 +01:00
Stefano Fiorucci
a2d459dbed
fix: warning if doc store similarity function is incompatible with Sentence Transformers model ( #3455 )
...
* check_docstore_similarity_function
* remove import
2022-10-25 17:00:35 +02:00
Stefano Fiorucci
54ec13eaf7
refactor: Change no_answer attribute ( #3411 )
...
* always run validation
* update schemas
* no_answer as a property. break things!
* forgotten schema
* fix
* update openapi
* removed my unnecessary test
* fix sql document store
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-10-25 13:07:00 +02:00
Julian Risch
6a422d588f
fix: disabling telemetry prevents writing config ( #3465 )
...
* fix: disabling telemetry prevents writing config
* set user id to empty string if telemetry disabled
* Update haystack/telemetry.py
* set id to None instead of "" in error case
* remove RuntimeError if user id is not set
* Revert "remove RuntimeError if user id is not set"
This reverts commit c59f06d47216afa7ada6199b03f1b09a2b936c02.
Co-authored-by: Sara Zan <sara.zanzottera@deepset.ai>
2022-10-25 12:11:54 +02:00
Mayank Jobanputra
d48577b4e7
bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow ( #3368 )
...
* Removed explicit passage formatting by name field
* passing correct input type for embedding the docs
* Updated test, updated similarity scores and added results
* changed expected input to embed method
2022-10-25 14:52:05 +05:30
Vladimir Blagojevic
1b9586ae40
Add indexing pipeline type ( #3461 )
2022-10-24 17:26:15 +02:00
Timo Moeller
9b931bbf66
Fix prompt length computation ( #3448 )
2022-10-24 11:59:54 +02:00
Sara Zan
cbf44413d8
feat: add __contains__ to Span ( #3446 )
...
* add __contains__
* add tests
2022-10-21 13:58:17 +02:00
Unai Garay Maestre
e41cb24358
Feat: allow decreasing size of datasets loaded from BEIR ( #3392 )
...
* Adds cropping of dataset in eval beir
* Adapts queries to remaining cropped documents
* Adds logging warning if num_documents has an invalid value
* Adapts to linting suggestions
2022-10-21 13:54:20 +02:00
Branden Chan
03ba07dcb5
docs: Extend utils API docs coverage ( #3402 )
...
* Add more utils modules
* Format docstrings
* Incorporate reviewer feedback
2022-10-21 12:51:11 +01:00
Massimiliano Pippi
df4d20d32c
fix the readme version to sync ( #3417 )
2022-10-20 16:50:36 +02:00
Vladimir Blagojevic
79c6063ac2
feat: send event if number of queries exceeds threshold ( #3419 )
...
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2022-10-20 16:02:45 +02:00
Branden Chan
3f956c75f4
Add multimodal retrieval to API docs ( #3430 )
2022-10-20 15:07:48 +02:00
Stefano Fiorucci
abdcb8124b
update pyworld pin ( #3435 )
2022-10-20 12:28:38 +02:00
Stefano Fiorucci
8c1a34494d
refactor: update package strategy in ui ( #3396 )
...
* update ui package: first try
* update README
* fixes
* update schemas
* restore schemas
* use matrix folder in tests
* fix tests
* fix schemas
* really fix schemas
* don't use matrix folder
* remove blank line
* cleaner pytest command
2022-10-20 12:18:03 +02:00
Stefano Fiorucci
3860bb9966
fix: improve Document __repr__ ( #3385 )
...
* fix document __repr__
* take the best from 2 approaches
* fix schema
2022-10-19 22:32:23 +02:00
Vladimir Blagojevic
8f31228211
feat: Add exponential backoff decorator; apply it to OpenAI requests ( #3398 )
2022-10-19 17:47:38 +02:00
Massimiliano Pippi
5335e9e4d9
Add new schema for latest unstable ( #3415 )
...
* add new schema for latest unstable
* openapi
2022-10-19 13:21:05 +02:00
Julian Risch
16723bf180
bug: change type of split_by to Literal including None ( #3389 )
...
* change type of split_by
* fix mypy and update schema files
* change split_by type to Literal
* handle ImportError for Literal py<3.8
2022-10-19 10:11:41 +02:00
github-actions[bot]
f4a49f7178
Bump version ( #3409 )
...
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2022-10-18 18:05:48 +02:00
Ursin Brunner
5fedfb03b0
fix: Fix the error of wrong page numbers when documents contain empty pages. ( #3330 )
...
* Fix the error of wrong page numbers when documents contain empty pages.
* Reformat using git hooks.
* Use a more descriptive placeholder
2022-10-18 17:51:02 +02:00
Sebastian
51d4fe01c3
fix: Update env variable for model caching timeout ( #3405 )
...
* fix: Update env variable for model caching timeout
The environment variable that sets the timeout for the model caching step contained a typo that originated with the maintainers of `actions/cache@v3`, which is why the timeout was not taking effect (see the comment [here](https://github.com/actions/cache/issues/810#issuecomment-1281895575)).
* Removed newline
2022-10-18 17:36:25 +02:00
Branden Chan
cf4642a5f8
[CI] Create Github Workflow that creates a new version branch in Haystack and Readme ( #3335 )
...
* Test readme_integration.yml
* Test readme_integration.yml
* Test variables
* Test variables
* Test variables
* Test variables
* Test commit
* Test commit
* Test commit
* Trigger action
* Add v
* Trigger action
* Trigger action
* Trigger action
* Trigger action
* Update API docs headers
* Revert "Update API docs headers"
This reverts commit 34e665063f4de29854befe575a795dbfef04415c.
* Trigger action
* Trigger action
* Trigger action
* Update release
* Update release
* Update release
* Delete File
* Split steps into own files
* Edit action names
* Start making changes
* Start implementing version bump
* Implement minor version release
* Fix github action
* Test action
* Test action
* Test action
* Test action
* Test action
* Change back to main
* Add comments
* Remove line
* Format docstring
* Incorporate reviewer feedback
* Fix variable name
* Print version.txt
* Incorporate Reviewer feedback
* Rename variables for clarity
* Add fetch
* Change branch
* Change branch
* Change branch
* Change branch
* Change branch
* Revert docstring changes
* Incorporate reviewer feedback
* Run black
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2022-10-18 17:09:43 +02:00
Sebastian
93817f63b4
feat: Speed up integration tests (nodes) ( #3408 )
...
* Changed summarizer model to a smaller one (2GB to 500MB) to save on space and speed up the tests.
* Removed google pegasus from cache
2022-10-18 16:23:57 +02:00
Branden Chan
3bf5d4350f
docs: Add comment about the generation of no-answer samples in FARMReader training ( #3404 )
...
* Add comment about no-answer generation
* Add comment about no-answer generation
* Fix typo
Co-authored-by: Sebastian <sjrl@users.noreply.github.com>
* Incorporate reviewer feedback
* Incorporate reviewer feedback
Co-authored-by: Sebastian <sjrl@users.noreply.github.com>
2022-10-18 14:37:37 +02:00
Sebastian
15a59fd040
feat: Updated EntityExtractor to handle long texts and added better postprocessing ( #3154 )
...
* Remove dependence on HuggingFace TokenClassificationPipeline and group all postprocessing functions under one class
* Added copyright notice for HF and deepset to entity file to acknowledge that a lot of the postprocessing parts came from the transformers library.
* Fixed text squishing problem. Added additional unit test for it.
Co-authored-by: ju-gu <julian.gutsch@deepset.ai>
2022-10-17 21:26:44 +02:00
Unai Garay Maestre
3a2c8ae3c5
bug: Adds better way of checking query in BaseRetriever and Pipeline.run() ( #3304 )
...
* changes how query and queries are checked if they have been passed in BaseRetriever
* Fixes checking query properly in Pipeline run
* Fixes checking query properly in Pipeline run
* Adds test for FilterRetriever using run method when query is empty
* Adds mock filter retriever and adapts test
* Removes old test, adds MockRetriever to test file and test uses document_store
* Logs error when query is not of type string with a new test for run batch
* Update test/nodes/test_retriever.py
* schemas
2022-10-17 19:00:13 +02:00
Sara Zan
101d2bc86c
feat: MultiModalRetriever ( #2891 )
...
* Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly
* content_types
* Splitting classes into respective folders
* small changes
* Fix EOF
* eof
* black
* API
* EOF
* whitespace
* api
* improve multimodal similarity processor
* tokenizer -> feature extractor
* Making feature vectors come out of the feature extractor in the similarity head
* embed_queries is now self-sufficient
* couple trivial errors
* Implemented separate language model classes for multimodal inference
* Document embedding seems to work
* removing batch_encode_plus, is deprecated anyway
* Realized the base Data2Vec models are not trained on retrieval tasks
* Issue with the generated embeddings
* Add batching
* Try to fit CLIP in
* Stub of CLIP integration
* Retrieval goes through but returns noise only
* Still working on the scores
* Introduce temporary adapter for CLIP models
* Image retrieval now works with sentence-transformers
* Tidying up the code
* Refactoring is now functional
* Add MPNet to the supported sentence transformers models
* Remove unused classes
* pylint
* docs
* docs
* Remove the method renaming
* mypy first pass
* docs
* tutorial
* schema
* mypy
* Move devices setup into get_model
* more mypy
* mypy
* pylint
* Move a few params in HaystackModel's init
* make feature extractor work with squadprocessor
* fix feature_extractor_kwargs forwarding
* Forgotten part of the fix
* Revert unrelated ES change
* Revert unrelated memdocstore changes
* comment
* Small corrections
* mypy and pylint
* mypy
* typo
* mypy
* Refactor the call
* mypy
* Do not make FARMReader use the new FeatureExtractor
* mypy
* Detach DPR tests from FeatureExtractor too
* Detach processor tests too
* Add end2end marker
* extract end2end feature extractor tests
* temporary disable feature extraction tests
* Introduce end2end tests for tokenizer tests
* pylint
* Fix model loading from folder in FeatureExtractor
* working on end2end
* end2end keeps failing
* Restructuring retriever tests
* Restructuring retriever tests
* remove covert_dataset_to_dataloader
* remove comment
* Better check sentence-transformers models
* Use embed_meta_fields properly
* rename passage into document
* Embedding dims can't be found
* Add check for models that support it
* pylint
* Split all retriever tests into suites, running mostly on InMemory only
* fix mypy
* fix tfidf test
* fix weaviate tests
* Parallelize on every docstore
* Fix schema and specify modality in base retriever suite
* tests
* Add first image tests
* remove comment
* Revert to simpler tests
* Update docs/_src/api/api/primitives.md
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/modeling/model/multimodal/__init__.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* get_args
* mypy
* Update haystack/modeling/model/multimodal/__init__.py
* Update haystack/modeling/model/multimodal/base.py
* Update haystack/modeling/model/multimodal/base.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/modeling/model/multimodal/sentence_transformers.py
* Update haystack/modeling/model/multimodal/sentence_transformers.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/modeling/model/multimodal/transformers.py
* Update haystack/modeling/model/multimodal/transformers.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/modeling/model/multimodal/transformers.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* Update haystack/nodes/retriever/multimodal/retriever.py
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
* mypy
* mypy
* removing more ContentTypes
* more ContentTypes
* pylint
* add to __init__
* revert end2end workflow for now
* missing integration markers
* Update haystack/nodes/retriever/multimodal/embedder.py
Co-authored-by: bogdankostic <bogdankostic@web.de>
* review feedback, removing HaystackImageTransformerModel
* review feedback part 2
* mypy & pylint
* mypy
* mypy
* fix multimodal docs also for Pinecone
* add note on internal constants
* Fix pinecone write_documents
* schemas
* keep support for sentence-transformers only
* fix pinecone test
* schemas
* fix pinecone again
* temporarily disable some tests, need to understand if they're still relevant
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-10-17 18:58:35 +02:00