haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2026-01-08 04:56:45 +00:00

Pipeline's YAML: syntax validation (#2226)

* Add BasePipeline.validate_config, BasePipeline.validate_yaml, and some new custom exception classes

* Make error composition work properly

* Clarify typing

* Help mypy a bit more

* Update Documentation & Code Style

* Enable autogenerated docs for Milvus1 and 2 separately

* Revert "Enable autogenerated docs for Milvus1 and 2 separately"

This reverts commit 282be4a78a6e95862a9b4c924fc3dea5ca71e28d.

* Update Documentation & Code Style

* Re-enable 'additionalProperties: False'

* Add pipeline.type to JSON Schema, was somehow forgotten

* Disable additionalProperties on the pipeline properties too

* Fix json-schemas for 1.1.0 and 1.2.0 (should not do it again in the future)

* Call super in PipelineValidationError

* Improve _read_pipeline_config_from_yaml's error handling

* Fix generate_json_schema.py to include document stores

* Fix json schemas (retro-fix 1.1.0 again)

* Improve custom errors printing, add link to docs

* Add function in BaseComponent to list its subclasses in a module

* Make some document stores base classes abstract

* Add marker 'integration' in pytest flags

* Slightly improve validation of pipelines at load

* Adding tests for YAML loading and validation

* Make custom_query Optional for validation issues

* Fix bug in _read_pipeline_config_from_yaml

* Improve error handling in BasePipeline and Pipeline and add DAG check

* Move json schema generation into haystack/nodes/_json_schema.py (useful for tests)

* Simplify errors slightly

* Add some YAML validation tests

* Remove load_from_config from BasePipeline, it was never used anyway

* Improve tests

* Include json-schemas in package

* Fix conftest imports

* Make BasePipeline abstract

* Improve mocking by making the test independent from the YAML version

* Add exportable_to_yaml decorator to forget about set_config on mock nodes

* Fix mypy errors

* Comment out one monkeypatch

* Fix typing again

* Improve error message for validation

* Add required properties to pipelines

* Fix YAML version for REST API YAMLs to 1.2.0

* Fix load_from_yaml call in load_from_deepset_cloud

* fix HaystackError.__getattr__

* Add super().__init__() in most nodes and docstore, comment set_config

* Remove type from REST API pipelines

* Remove useless init from doc2answers

* Call super in Seq2SeqGenerator

* Typo in deepsetcloud.py

* Fix rest api indexing error mismatch and mock version of JSON schema in all tests

* Working on pipeline tests

* Improve errors printing slightly

* Add back test_pipeline.yaml

* _json_schema.py supports different versions with identical schemas

* Add type to 0.7 schema for backwards compatibility

* Fix small bug in _json_schema.py

* Try alternative to generate json schemas on the CI

* Update Documentation & Code Style

* Make linux CI match autoformat CI

* Fix super-init-not-called

* Accidentally committed file

* Update Documentation & Code Style

* fix test_summarizer_translation.py's import

* Mock YAML in a few suites, split and simplify test_pipeline_debug_and_validation.py::test_invalid_run_args

* Fix json schema for ray tests too

* Update Documentation & Code Style

* Reintroduce validation

* Use unstable version in tests and rest api

* Make unstable support the latest versions

* Update Documentation & Code Style

* Remove needless fixture

* Make type in pipeline optional in the strings validation

* Fix schemas

* Fix string validation for pipeline type

* Improve validate_config_strings

* Remove type from test pipelines

* Update Documentation & Code Style

* Fix test_pipeline

* Removing more type from pipelines

* Temporary CI patch

* Fix issue with exportable_to_yaml never invoking the wrapped init

* rm stray file

* pipeline tests are green again

* Linux CI now needs .[all] to generate the schema

* Bugfixes, pipeline tests seem to be green

* Typo in version after merge

* Implement missing methods in Weaviate

* Trying to avoid FAISS tests from running in the Milvus1 test suite

* Fix some stray test paths and faiss index dumping

* Fix pytest markers list

* Temporarily disable cache to be able to see tests failures

* Fix pyproject.toml syntax

* Use only tmp_path

* Fix preprocessor signature after merge

* Fix faiss bug

* Fix Ray test

* Fix documentation issue by removing quotes from faiss type

* Update Documentation & Code Style

* use document properly in preprocessor tests

* Update Documentation & Code Style

* make preprocessor capable of handling documents

* import document

* Revert support for documents in preprocessor, do later

* Fix bug in _json_schema.py that was breaking validation

* re-enable cache

* Update Documentation & Code Style

* Simplify calling _json_schema.py from the CI

* Remove redundant ABC inheritance

* Ensure exportable_to_yaml works only on implementations

* Rename subclass to class_ in Meta

* Make run() and get_config() abstract in BasePipeline

* Revert unintended change in preprocessor

* Move outgoing_edges_input_node check inside try block

* Rename VALID_CODE_GEN_INPUT_REGEX into VALID_INPUT_REGEX

* Add check for a RecursionError on validate_config_strings

* Address usages of _pipeline_config in data silo and elasticsearch

* Rename _pipeline_config into _init_parameters

* Fix pytest marker and remove unused imports

* Remove most redundant ABCs

* Rename _init_parameters into _component_configuration

* Remove set_config and type from _component_configuration's dict

* Remove last instances of set_config and replace with super().__init__()

* Implement __init_subclass__ approach

* Simplify checks on the existence of _component_configuration

* Fix faiss issue

* Dynamic generation of node schemas & weed out old schemas

* Add debatable test

* Add docstring to debatable test

* Positive diff between schemas implemented

* Improve diff printing

* Rename REST API YAML files to trigger IDE validation

* Fix typing issues

* Fix more typing

* Typo in YAML filename

* Remove needless type:ignore

* Add tests

* Fix tests & validation feedback for accessory classes in custom nodes

* Refactor RAGeneratorType out

* Fix broken import in conftest

* Improve source error handling

* Remove unused import in test_eval.py breaking tests

* Fix changed error message in tests matches too

* Normalize generate_openapi_specs.py and generate_json_schema.py in the actions

* Fix path to generate_openapi_specs.py in autoformat.yml

* Update Documentation & Code Style

* Add test for FAISSDocumentStore-like situations (superclass with init params)

* Update Documentation & Code Style

* Fix indentation

* Remove commented set_config

* Store model_name_or_path in FARMReader to use in DistillationDataSilo

* Rename _component_configuration into _component_config

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2022-03-15 11:17:26 +01:00

.github	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
annotation_tool	Add faq annotation (#1333)	2021-08-10 14:55:31 +02:00
docs	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
haystack	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
json-schemas	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
rest_api	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
test	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
tutorials	Replace dpr with embeddingretriever tut11 (#2287)	2022-03-15 08:30:00 +01:00
	Pylint: solve or silence locally rare warnings (#2170)	2022-02-21 20:16:14 +01:00
.gitignore	Add /documents/get_by_filters endpoint (#1580)	2021-10-12 10:53:54 +02:00
code_of_conduct.txt	Add code of conduct	2021-03-18 16:39:16 +01:00
conftest.py	Fix skipping of tests using document stores (#2268)	2022-03-03 15:19:27 +01:00
CONTRIBUTING.md	Testing actions (@ZanSara) (#2200)	2022-02-17 12:07:56 +01:00
docker-compose-gpu.yml	Fix dependency related build issues in Dockerfiles (#2135)	2022-02-09 17:35:18 +01:00
docker-compose.mitm.yml	Add docker-compose override file for Traffic Monitoring (#2224)	2022-02-21 16:12:34 +01:00
docker-compose.yml	Fix dependency related build issues in Dockerfiles (#2135)	2022-02-09 17:35:18 +01:00
Dockerfile	Fix dependency related build issues in Dockerfiles (#2135)	2022-02-09 17:35:18 +01:00
Dockerfile-GPU	Fix dependency related build issues in Dockerfiles (#2135)	2022-02-09 17:35:18 +01:00
Dockerfile-GPU-minimal	Adding a minimal haystack gpu build (#2185)	2022-02-21 13:34:44 +01:00
LICENSE	Fix name	2021-10-12 10:22:41 +02:00
pyproject.toml	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
README.md	adding quotes for zsh shell issue (#2289)	2022-03-08 17:29:08 +01:00
setup.cfg	Pipeline's YAML: syntax validation (#2226)	2022-03-15 11:17:26 +01:00
setup.py	Apply black formatting (#2115)	2022-02-03 13:43:18 +01:00
VERSION.txt	Bump version to 1.2.1rc0 (#2245)	2022-03-03 16:06:07 +01:00

README.md

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want to perform Question Answering or semantic document search, you can use the state-of-the-art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. Haystack is built in a modular fashion so that you can combine the best technology from other open-source projects like Hugging Face's Transformers, Elasticsearch, or Milvus.

What to build with Haystack

Ask questions in natural language and find granular answers in your documents.
Perform semantic search and retrieve documents according to meaning, not keywords.
Use off-the-shelf models or fine-tune them to your domain.
Use user feedback to evaluate, benchmark, and continuously improve your live models.
Leverage existing knowledge bases and better handle the long tail of queries that chatbots receive.
Automate processes by automatically applying a list of questions to new documents and using the extracted answers.

Core Features

Latest models: Use the latest transformer-based models (e.g., BERT, RoBERTa, MiniLM) for extractive QA, generative QA, and document retrieval.
Modular: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter, or modeling framework.
Pipelines: The Node and Pipeline design of Haystack allows for custom routing of queries to only the relevant components.
Open: 100% compatible with Hugging Face's model hub. Tight interfaces to other frameworks (e.g., Transformers, FARM, sentence-transformers).
Scalable: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS, and a FastAPI REST API.
End-to-End: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling, etc.
Developer friendly: Easy to debug, extend and modify.
Customizable: Fine-tune models to your domain or implement your custom DocumentStore.
Continuous Learning: Collect new training data via user feedback in production and improve your models continuously.
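
To make the Node and Pipeline idea from the list above more concrete, here is a minimal, framework-free sketch of routing a query through a directed graph of nodes. It does not use Haystack's actual classes: `MiniPipeline`, its `add_node` and `run` methods, and the toy `retriever` and `reader` functions are hypothetical names chosen for illustration, and the real API may differ.

```python
from typing import Callable, Dict, List


class MiniPipeline:
    """Hypothetical stand-in for a pipeline; NOT Haystack's Pipeline class."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[dict], dict]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_node(self, name: str, func: Callable[[dict], dict], inputs: List[str]) -> None:
        if name in self.nodes or name == "Query":
            raise ValueError(f"Duplicate node name: {name}")
        # Every input must already exist, so edges only point "forward"
        # and the graph stays acyclic by construction.
        for parent in inputs:
            if parent != "Query" and parent not in self.nodes:
                raise ValueError(f"Unknown input node: {parent}")
        self.nodes[name] = func
        self.inputs[name] = inputs

    def run(self, query: str) -> dict:
        outputs = {"Query": {"query": query}}
        last = "Query"
        # Insertion order is a valid topological order (see add_node).
        for name, func in self.nodes.items():
            merged: dict = {}
            for parent in self.inputs[name]:
                merged.update(outputs[parent])
            outputs[name] = func(merged)
            last = name
        return outputs[last]


DOCS = ["Berlin is the capital of Germany.", "Paris is the capital of France."]


def retriever(data: dict) -> dict:
    # Toy keyword overlap; real retrievers are far more sophisticated.
    words = set(data["query"].lower().rstrip("?").split())
    best = max(DOCS, key=lambda d: len(words & set(d.lower().rstrip(".").split())))
    return {**data, "documents": [best]}


def reader(data: dict) -> dict:
    # Toy "reader": takes the first word of the top document as the answer.
    return {**data, "answer": data["documents"][0].split()[0]}


pipe = MiniPipeline()
pipe.add_node("Retriever", retriever, inputs=["Query"])
pipe.add_node("Reader", reader, inputs=["Retriever"])
result = pipe.run("What is the capital of France?")
print(result["answer"])  # prints: Paris
```

Requiring each node's inputs to exist before the node is added is one simple way to keep the graph acyclic; a real framework would more likely validate the whole graph after loading it.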


📒 Docs	Overview, Components, Guides, API documentation
💾 Installation	How to install Haystack
🎓 Tutorials	See what Haystack can do with our Notebooks & Scripts
🔰 Quick Demo	Deploy a Haystack application with Docker Compose and a REST API
🖖 Community	Slack, Twitter, Stack Overflow, GitHub Discussions
❤️ Contributing	We welcome all contributions!
📊 Benchmarks	Speed & Accuracy of Retriever, Readers and DocumentStores
🔭 Roadmap	Public roadmap of Haystack
📰 Blog	Read our articles on Medium
☎️ Jobs	We're hiring! Have a look at our open positions

💾 Installation

1. Basic Installation

You can install a basic version of Haystack's latest release by using pip.

    pip3 install farm-haystack

This command will install everything needed for basic Pipelines that use an Elasticsearch Document Store.

2. Full Installation

If you plan to use more advanced features like Milvus, FAISS, Weaviate, OCR, or Ray, you will need to install the full version of Haystack. The following commands install the latest version of Haystack from the master branch.

    git clone https://github.com/deepset-ai/haystack.git
    cd haystack
    pip install --upgrade pip
    pip install -e '.[all]'  # or 'all-gpu' for the GPU-enabled dependencies

If you cannot upgrade pip to version 21.3 or higher, you will need to replace:

'.[all]' with '.[sql,only-faiss,only-milvus1,weaviate,graphdb,crawler,preprocessing,ocr,onnx,ray,dev]'
'.[all-gpu]' with '.[sql,only-faiss-gpu,only-milvus1,weaviate,graphdb,crawler,preprocessing,ocr,onnx-gpu,ray,dev]'
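
The choice between the short and the expanded extras can be sketched as a simple version comparison. This is only an illustration: `parse_version` and `extras_for_pip` are hypothetical helpers, not part of Haystack or pip, and the extras strings are copied from the two lines above.

```python
import re
from typing import Tuple

# Expanded extras, copied verbatim from the installation notes.
ALL_EXPANDED = ".[sql,only-faiss,only-milvus1,weaviate,graphdb,crawler,preprocessing,ocr,onnx,ray,dev]"
ALL_GPU_EXPANDED = ".[sql,only-faiss-gpu,only-milvus1,weaviate,graphdb,crawler,preprocessing,ocr,onnx-gpu,ray,dev]"


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '21.3.1'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def extras_for_pip(pip_version: str, gpu: bool = False) -> str:
    """Pick the install target that the given pip version can handle."""
    if parse_version(pip_version) >= (21, 3):
        return ".[all-gpu]" if gpu else ".[all]"
    return ALL_GPU_EXPANDED if gpu else ALL_EXPANDED


print(extras_for_pip("21.3"))    # prints: .[all]
print(extras_for_pip("20.2.3"))  # prints the expanded CPU extras
```

The returned string is what you would pass to `pip install -e`; the installed pip version itself is reported by `pip --version`.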

For a complete list of the available dependency groups, have a look at the haystack/setup.cfg file.

To install the REST API and UI, run the following from the root directory of the Haystack repo:

    pip install rest_api/
    pip install ui/

3. Installing on Windows

    pip install farm-haystack -f https://download.pytorch.org/whl/torch_stable.html

4. Installing on Apple Silicon (M1)

M1 MacBooks require some extra dependencies in order to install Haystack.

    # some additional dependencies needed on M1 Macs
    brew install postgresql
    brew install cmake
    brew install rust

    # haystack installation
    GRPC_PYTHON_BUILD_SYSTEM_ZLIB=true pip install git+https://github.com/deepset-ai/haystack.git

5. Learn More

See our installation guide for more options. You can find out more about our PyPI package on our PyPI page.

🎓 Tutorials

Follow our introductory tutorial to set up a question answering system using Python and start performing queries! Explore the rest of our tutorials to learn how to tweak pipelines, train models, and perform evaluation.

Tutorial 1 - Basic QA Pipeline: Jupyter notebook | Colab | Python
Tutorial 2 - Fine-tuning a model on own data: Jupyter notebook | Colab | Python
Tutorial 3 - Basic QA Pipeline without Elasticsearch: Jupyter notebook | Colab | Python
Tutorial 4 - FAQ-style QA: Jupyter notebook | Colab | Python
Tutorial 5 - Evaluation of the whole QA-Pipeline: Jupyter notebook | Colab | Python
Tutorial 6 - Better Retrievers via "Dense Passage Retrieval": Jupyter notebook | Colab | Python
Tutorial 7 - Generative QA via "Retrieval-Augmented Generation": Jupyter notebook | Colab | Python
Tutorial 8 - Preprocessing: Jupyter notebook | Colab | Python
Tutorial 9 - DPR Training: Jupyter notebook | Colab | Python
Tutorial 10 - Knowledge Graph: Jupyter notebook | Colab | Python
Tutorial 11 - Pipelines: Jupyter notebook | Colab | Python
Tutorial 12 - Long-Form Question Answering: Jupyter notebook | Colab | Python
Tutorial 13 - Question Generation: Jupyter notebook | Colab | Python
Tutorial 14 - Query Classifier: Jupyter notebook | Colab | Python
Tutorial 15 - TableQA: Jupyter notebook | Colab | Python

🔰 Quick Demo

Hosted

Try out our hosted Explore The World live demo here! Ask any question on countries or capital cities and let Haystack return the answers to you.

Local

Start up a Haystack service via Docker Compose. With this you can begin calling it directly via the REST API or even interact with it using the included Streamlit UI.

Step-by-step guide:

1. Update/install Docker and Docker Compose, then launch Docker

    apt-get update && apt-get install docker && apt-get install docker-compose
    service docker start

2. Clone Haystack repository

    git clone https://github.com/deepset-ai/haystack.git

3. Pull images & launch demo app

    cd haystack
    docker-compose pull
    docker-compose up
    
    # Or on a GPU machine: docker-compose -f docker-compose-gpu.yml up

You should be able to see the following in your terminal window as part of the log output:

    ..
    ui_1             |   You can now view your Streamlit app in your browser.
    ..
    ui_1             |   External URL: http://192.168.108.218:8501
    ..
    haystack-api_1   | [2021-01-01 10:21:58 +0000] [17] [INFO] Application startup complete.

4. Open the Streamlit UI for Haystack by pointing your browser to the "External URL" from above.
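
If the log output scrolls by too quickly, the "External URL" can also be pulled out of captured log text. The sketch below is an illustration only: `find_external_url` is a hypothetical helper, and the sample lines are copied from the log excerpt above.

```python
import re
from typing import Optional

# Sample lines copied from the log excerpt in the guide.
SAMPLE_LOG = """\
ui_1             |   You can now view your Streamlit app in your browser.
ui_1             |   External URL: http://192.168.108.218:8501
haystack-api_1   | [2021-01-01 10:21:58 +0000] [17] [INFO] Application startup complete.
"""


def find_external_url(log_text: str) -> Optional[str]:
    """Return the first URL that follows 'External URL:', or None if absent."""
    match = re.search(r"External URL:\s*(\S+)", log_text)
    return match.group(1) if match else None


print(find_external_url(SAMPLE_LOG))  # prints: http://192.168.108.218:8501
```

The address in your own logs will differ, since it depends on the network of the machine running the containers.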

You can then try different queries against a pre-defined set of indexed articles related to Game of Thrones.

Note: The following containers are started as a part of this demo:

Haystack API: listens on port 8000
DocumentStore (Elasticsearch): listens on port 9200
Streamlit UI: listens on port 8501

Please note that the demo will publish the container ports to the outside world. We suggest that you review the firewall settings depending on your system setup and the security guidelines.
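
Because these ports are published, a quick reachability check can help when reviewing your setup. The following is a minimal sketch using only the Python standard library: `is_port_open` and `check_demo_ports` are hypothetical helper names, and the port numbers are taken from the list above.

```python
import socket
from typing import Dict

# Ports as listed in the note above.
DEMO_PORTS = {"Haystack API": 8000, "Elasticsearch": 9200, "Streamlit UI": 8501}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_demo_ports(host: str = "127.0.0.1") -> Dict[str, bool]:
    """Check every demo port on the given host."""
    return {name: is_port_open(host, port) for name, port in DEMO_PORTS.items()}


if __name__ == "__main__":
    for name, reachable in check_demo_ports().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

Running the same check from another machine, with that machine's view of the host address, shows whether the ports are actually exposed beyond the local system.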

🖖 Community

Haystack has an active community that we interact with regularly. If you have a feature request or a bug report, feel free to open an issue on GitHub. We check these regularly, and you can expect a quick response. If you'd like to discuss a topic or get more general advice on how to make Haystack work for your project, you can start a thread in GitHub Discussions or our Slack channel. We also check Twitter and Stack Overflow.

❤️ Contributing

We are very open to the community's contributions - be it a quick fix of a typo, or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To learn how to get started, check out our Contributor Guidelines first. You can also find instructions to run the tests locally there.

Thanks so much to all those who have contributed to our project!

Who uses Haystack

Here's a list of organizations that use Haystack. Don't hesitate to send a PR to let the world know that you use Haystack. Join our growing community!

Description

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

ai bert chatgpt generative-ai gpt-3 information-retrieval language-model large-language-models llm machine-learning nlp python pytorch question-answering rag retrieval-augmented-generation semantic-search squad summarization transformers

Readme Apache-2.0 128 MiB

Languages

MDX 66.4%

Python 32.2%

HTML 0.6%

JavaScript 0.5%

CSS 0.2%