haystack/pyproject.toml
Sara Zan 101d2bc86c
feat: MultiModalRetriever (#2891)
* Adding Data2VecVision and Data2VecText to the supported models and adapt Tokenizers accordingly

* content_types

* Splitting classes into respective folders

* small changes

* Fix EOF

* eof

* black

* API

* EOF

* whitespace

* api

* improve multimodal similarity processor

* tokenizer -> feature extractor

* Making feature vectors come out of the feature extractor in the similarity head

* embed_queries is now self-sufficient

* couple trivial errors

* Implemented separate language model classes for multimodal inference

* Document embedding seems to work

* removing batch_encode_plus, is deprecated anyway

* Realized the base Data2Vec models are not trained on retrieval tasks

* Issue with the generated embeddings

* Add batching

* Try to fit CLIP in

* Stub of CLIP integration

* Retrieval goes through but returns noise only

* Still working on the scores

* Introduce temporary adapter for CLIP models

* Image retrieval now works with sentence-transformers

* Tidying up the code

* Refactoring is now functional

* Add MPNet to the supported sentence transformers models

* Remove unused classes

* pylint

* docs

* docs

* Remove the method renaming

* mypy first pass

* docs

* tutorial

* schema

* mypy

* Move devices setup into get_model

* more mypy

* mypy

* pylint

* Move a few params in HaystackModel's init

* make feature extractor work with squadprocessor

* fix feature_extractor_kwargs forwarding

* Forgotten part of the fix

* Revert unrelated ES change

* Revert unrelated memdocstore changes

* comment

* Small corrections

* mypy and pylint

* mypy

* typo

* mypy

* Refactor the  call

* mypy

* Do not make FARMReader use the new FeatureExtractor

* mypy

* Detach DPR tests from FeatureExtractor too

* Detach processor tests too

* Add end2end marker

* extract end2end feature extractor tests

* temporary disable feature extraction tests

* Introduce end2end tests for tokenizer tests

* pylint

* Fix model loading from folder in FeatureExtractor

* working on end2end

* end2end keeps failing

* Restructuring retriever tests

* Restructuring retriever tests

* remove covert_dataset_to_dataloader

* remove comment

* Better check sentence-transformers models

* Use embed_meta_fields properly

* rename passage into document

* Embedding dims can't be found

* Add check for models that support it

* pylint

* Split all retriever tests into suites, running mostly on InMemory only

* fix mypy

* fix tfidf test

* fix weaviate tests

* Parallelize on every docstore

* Fix schema and specify modality in base retriever suite

* tests

* Add first image tests

* remove comment

* Revert to simpler tests

* Update docs/_src/api/api/primitives.md

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/__init__.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* get_args

* mypy

* Update haystack/modeling/model/multimodal/__init__.py

* Update haystack/modeling/model/multimodal/base.py

* Update haystack/modeling/model/multimodal/base.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/sentence_transformers.py

* Update haystack/modeling/model/multimodal/sentence_transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/modeling/model/multimodal/transformers.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Update haystack/nodes/retriever/multimodal/retriever.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* mypy

* mypy

* removing more ContentTypes

* more contentypes

* pylint

* add to __init__

* revert end2end workflow for now

* missing integration markers

* Update haystack/nodes/retriever/multimodal/embedder.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* review feedback, removing HaystackImageTransformerModel

* review feedback part 2

* mypy & pylint

* mypy

* mypy

* fix multimodal docs also for Pinecone

* add note on internal constants

* Fix pinecone write_documents

* schemas

* keep support for sentence-transformers only

* fix pinecone test

* schemas

* fix pinecone again

* temporarily disable some tests, need to understand if they're still relevant

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: bogdankostic <bogdankostic@web.de>
2022-10-17 18:58:35 +02:00


[build-system]
requires = [
"hatchling>=1.8.0",
]
build-backend = "hatchling.build"
[project]
name = "farm-haystack"
dynamic = [
"version",
]
description = "Neural Question Answering & Semantic Search at Scale. Use modern transformer based models like BERT to find answers in large document collections"
readme = "README.md"
license = "Apache-2.0"
requires-python = ">=3.7"
authors = [
{ name = "deepset.ai", email = "malte.pietsch@deepset.ai" },
]
keywords = [
"BERT",
"QA",
"Question-Answering",
"Reader",
"Retriever",
"albert",
"language-model",
"mrc",
"roberta",
"search",
"semantic-search",
"squad",
"transfer-learning",
"transformer",
]
classifiers = [
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Science/Research",
"License :: Freely Distributable",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
"importlib-metadata; python_version < '3.8'",
"torch>1.9,<1.13",
"requests",
"pydantic",
"transformers==4.21.2",
"nltk",
"pandas",
# Utils
"dill", # pickle extension for (de-)serialization
"tqdm", # progress bars in model download and training scripts
"networkx", # graphs library
"mmh3", # fast hashing function (murmurhash3)
"quantulum3", # quantities extraction from text
"posthog", # telemetry
"azure-ai-formrecognizer>=3.2.0b2", # forms reader
# audio's espnet-model-zoo requires huggingface-hub version <0.8 while we need >=0.5 to be able to use create_repo in FARMReader
"huggingface-hub>=0.5.0",
# Preprocessing
"more_itertools", # for windowing
"python-docx",
"langdetect", # for PDF conversions
"tika", # Apache Tika (text & metadata extractor)
# See haystack/nodes/retriever/_embedding_encoder.py, _SentenceTransformersEmbeddingEncoder
"sentence-transformers>=2.2.0",
# for stats in run_classifier
"scipy>=1.3.2",
"scikit-learn>=1.0.0",
# Metrics and logging
"seqeval",
"mlflow",
# Elasticsearch
"elasticsearch>=7.7,<8",
# context matching
"rapidfuzz>=2.0.15,<2.8.0", # FIXME https://github.com/deepset-ai/haystack/pull/3199
# Schema validation
"jsonschema",
]
[project.optional-dependencies]
sql = [
"sqlalchemy>=1.4.2,<2",
"sqlalchemy_utils",
"psycopg2-binary; platform_system != 'Windows'",
]
only-faiss = [
"faiss-cpu>=1.6.3,<2",
]
faiss = [
"farm-haystack[sql,only-faiss]",
]
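# Usage sketch (not part of the original file): an extra is selected at install
# time with bracket syntax, for example
#   pip install "farm-haystack[faiss]"
# Because "faiss" refers back to farm-haystack itself, this should resolve to the
# "sql" and "only-faiss" groups above; self-referencing extras are assumed to
# need a reasonably recent pip.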
only-faiss-gpu = [
"faiss-gpu>=1.6.3,<2",
]
faiss-gpu = [
"farm-haystack[sql,only-faiss-gpu]",
]
only-milvus1 = [
"pymilvus<2.0.0", # Refer to the Milvus version support matrix at https://github.com/milvus-io/pymilvus#install-pymilvus
]
milvus1 = [
"farm-haystack[sql,only-milvus1]",
]
only-milvus = [
"pymilvus>=2.0.0,<3", # Refer to the Milvus version support matrix at https://github.com/milvus-io/pymilvus#install-pymilvus
]
milvus = [
"farm-haystack[sql,only-milvus]",
]
weaviate = [
"weaviate-client==3.6.0",
]
only-pinecone = [
"pinecone-client>=2.0.11,<3",
]
pinecone = [
"farm-haystack[sql,only-pinecone]",
]
graphdb = [
"SPARQLWrapper",
]
inmemorygraph = [
"SPARQLWrapper",
]
opensearch = [
"opensearch-py>=2",
]
docstores = [
"farm-haystack[faiss,milvus,weaviate,graphdb,inmemorygraph,pinecone,opensearch]",
]
docstores-gpu = [
"farm-haystack[faiss-gpu,milvus,weaviate,graphdb,inmemorygraph,pinecone,opensearch]",
]
audio = [
"pyworld<=0.2.12",
"espnet",
"espnet-model-zoo",
"pydub",
]
beir = [
"beir; platform_system != 'Windows'",
]
crawler = [
"selenium>=4.0.0,!=4.1.4", # Avoid 4.1.4 due to https://github.com/SeleniumHQ/selenium/issues/10612
"webdriver-manager",
]
preprocessing = [
"beautifulsoup4",
"markdown",
"python-magic; platform_system != 'Windows'", # Depends on libmagic: https://pypi.org/project/python-magic/
"python-magic-bin; platform_system == 'Windows'", # Needs to be installed without python-magic, otherwise Windows CI gets stuck.
]
ocr = [
"pytesseract>0.3.7",
"pillow",
"pdf2image>1.14",
]
onnx = [
"onnxruntime",
"onnxruntime_tools",
]
onnx-gpu = [
"onnxruntime-gpu",
"onnxruntime_tools",
]
ray = [
"ray>=1.9.1,<2; platform_system != 'Windows'",
"ray>=1.9.1,<2,!=1.12.0; platform_system == 'Windows'", # Avoid 1.12.0 due to https://github.com/ray-project/ray/issues/24169 (fails on Windows)
"aiorwlock>=1.3.0,<2",
]
colab = [
"grpcio==1.47.0",
"requests>=2.25", # Needed to avoid dependency conflict with crawler https://github.com/deepset-ai/haystack/pull/2921
]
dev = [
"pre-commit",
# Type check
"mypy",
"typing_extensions; python_version < '3.8'",
# Test
"pytest",
"pytest-custom_exit_code", # used in the CI
"responses",
"tox",
"coverage",
"python-multipart",
"psutil",
# Linting
"pylint",
# Code formatting
"black[jupyter]==22.6.0",
# Documentation
"pydoc-markdown",
"mkdocs",
"jupytercontrib",
"watchdog", # ==1.0.2
"requests-cache",
]
test = [
"farm-haystack[docstores,audio,crawler,preprocessing,ocr,ray,dev]",
]
all = [
"farm-haystack[docstores,audio,crawler,preprocessing,ocr,ray,dev,onnx,beir]",
]
all-gpu = [
"farm-haystack[docstores-gpu,audio,crawler,preprocessing,ocr,ray,dev,onnx-gpu,beir]",
]
[project.urls]
"CI: GitHub" = "https://github.com/deepset-ai/haystack/actions"
"Docs: RTD" = "https://haystack.deepset.ai/overview/intro"
"GitHub: issues" = "https://github.com/deepset-ai/haystack/issues"
"GitHub: repo" = "https://github.com/deepset-ai/haystack"
Homepage = "https://github.com/deepset-ai/haystack"
[tool.hatch.version]
path = "VERSION.txt"
pattern = "(?P<version>.+)"
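# Note (illustrative, assuming VERSION.txt holds a single line such as "1.2.3",
# which is a made-up value): the named group "version" is the value Hatch uses
# as the package version, and ".+" captures everything on the first non-empty
# line of the file.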
[tool.hatch.build.targets.sdist]
include = [
"/haystack",
"/VERSION.txt",
]
[tool.hatch.build.targets.wheel]
packages = [
"haystack",
]
[tool.black]
line-length = 120
skip_magic_trailing_comma = true # For compatibility with pydoc>=4.6, check if still needed.
[tool.pylint.'MESSAGES CONTROL']
max-line-length=120
disable = [
# To keep
"fixme",
"c-extension-no-member",
"wrong-spelling-in-comment",
"wrong-spelling-in-docstring",
# To review:
"missing-docstring",
"unused-argument",
"no-member",
"line-too-long",
"protected-access",
"too-few-public-methods",
"raise-missing-from",
"invalid-name",
"logging-fstring-interpolation",
"too-many-locals",
"duplicate-code",
"too-many-arguments",
"arguments-differ",
"consider-using-f-string",
"no-else-return",
"unused-variable",
"attribute-defined-outside-init",
"too-many-instance-attributes",
"super-with-arguments",
"anomalous-backslash-in-string",
"redefined-builtin",
"logging-format-interpolation",
"f-string-without-interpolation",
"abstract-method",
"too-many-branches",
"trailing-whitespace",
"unspecified-encoding",
"unidiomatic-typecheck",
"no-name-in-module",
"dangerous-default-value",
"consider-using-with",
"redefined-outer-name",
"arguments-renamed",
"unnecessary-pass",
"broad-except",
"unnecessary-comprehension",
"subprocess-run-check",
"singleton-comparison",
"consider-iterating-dictionary",
"too-many-nested-blocks",
"undefined-loop-variable",
"too-many-statements",
"consider-using-in",
"bare-except",
"too-many-lines",
"unexpected-keyword-arg",
"simplifiable-if-expression",
"use-list-literal",
# To review later
"cyclic-import",
"import-outside-toplevel",
"deprecated-method",
]
[tool.pylint.'DESIGN']
max-args=7
[tool.pylint.'SIMILARITIES']
min-similarity-lines=6
[tool.pytest.ini_options]
minversion = "6.0"
addopts = "--strict-markers"
markers = [
"unit: unit tests",
"integration: integration tests",
"generator: generator tests",
"summarizer: summarizer tests",
"embedding_dim: uses a document store with non-default embedding dimension (e.g. @pytest.mark.embedding_dim(128))",
"tika: requires Tika container",
"parsr: requires Parsr container",
"ocr: requires Tesseract",
"elasticsearch: requires Elasticsearch container",
"graphdb: requires GraphDB container",
"weaviate: requires Weaviate container",
"pinecone: requires Pinecone credentials",
"faiss: uses FAISS",
"milvus: requires a Milvus 2 setup",
"milvus1: requires a Milvus 1 container",
"opensearch",
]
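# Usage sketch (not part of the original file): registered markers can be
# selected or excluded on the command line, for example
#   pytest -m "unit"
#   pytest -m "integration and not elasticsearch"
# With --strict-markers in addopts, a marker that is not listed above is
# reported as an error instead of being silently accepted.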
log_cli = true
[tool.mypy]
warn_return_any = false
warn_unused_configs = true
ignore_missing_imports = true
plugins = [
"pydantic.mypy",
]