---
title: "Serializing Pipelines"
id: serialization
slug: "/serialization"
description: "Save your pipelines into a custom format and explore the serialization options."
---

# Serializing Pipelines

Save your pipelines into a custom format and explore the serialization options.

Serialization means converting a pipeline to a format that you can save on your disk and load later.

:::info Serialization formats
Haystack 2.0 only supports YAML format at this time. We will be rolling out more formats gradually.
:::

## Converting a Pipeline to YAML

Use the `dumps()` method to convert a Pipeline object to YAML:

```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

# Prints:
#
# components: {}
# connections: []
# max_loops_allowed: 100
# metadata: {}
```

You can also use the `dump()` method to save the YAML representation of a pipeline in a file:

```python
with open("/content/test.yml", "w") as file:
    pipe.dump(file)
```

## Converting a Pipeline Back to Python

You can convert a YAML pipeline back into Python. Use the `loads()` method to convert a string representation of a pipeline (`str`, `bytes`, or `bytearray`), or the `load()` method to convert a pipeline represented in a file-like object into the corresponding Python object.

Both loading methods support callbacks that let you modify components during the deserialization process.

Here is an example script:

```python
from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks
from typing import Type, Dict, Any

# This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_loops_allowed: 100
metadata: {}
"""

def component_pre_init_callback(component_name: str, component_cls: Type, init_params: Dict[str, Any]):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__
        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")

pipe = Pipeline.loads(pipeline_yaml, callbacks=DeserializationCallbacks(component_pre_init_callback))
```
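
The example above uses `loads()` on a string. For the file-based counterpart, here is a minimal sketch that assumes a pipeline was previously saved to `/content/test.yml` with `dump()`, as shown earlier:

```python
from haystack import Pipeline

# Rebuild the pipeline from a file-like object with `load()`.
# The same `callbacks` argument shown above can also be passed here.
with open("/content/test.yml", "r") as file:
    loaded_pipe = Pipeline.load(file)
```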

## Performing Custom Serialization

Haystack can serialize pipelines and simple components, including custom ones, out of the box. Code like this just works:

```python
from haystack import component

@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return {"result": word * self.times}
```
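
For instance, a minimal sketch (the component and pipeline names are illustrative): adding `RepeatWordComponent` to a pipeline and calling `dumps()` serializes it without any extra code:

```python
from haystack import Pipeline

pipe = Pipeline()
pipe.add_component("repeater", RepeatWordComponent(times=3))

# The custom component and its `times` init parameter end up in the pipeline YAML.
print(pipe.dumps())
```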

On the other hand, this code doesn't work if the final format is JSON, as the `set` type is not JSON-serializable:

```python
from haystack import component

@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return {"result": data.intersection(self.intersect_with)}
```
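
To make the failure concrete, here is a tiny standalone sketch: JSON has no representation for Python's `set`, so converting a dictionary that contains one raises an error:

```python
import json

# Raises: TypeError: Object of type set is not JSON serializable
json.dumps({"intersect_with": {"a", "b", "c"}})
```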

In such cases, you can provide your own `from_dict` and `to_dict` implementations for the component:

```python
from haystack import component, default_from_dict, default_to_dict

@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return {"result": data.intersection(self.intersect_with)}

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON.
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list stored in the dict representation back into a set
        # before the component is re-created.
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```
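
As a quick check of the custom methods above, a minimal round-trip sketch (the variable names are illustrative):

```python
component_instance = SetIntersector(intersect_with={1, 2, 3})

# `to_dict` stores the set as a JSON-friendly list...
serialized = component_instance.to_dict()

# ...and `from_dict` turns it back into a set when the component is re-created.
restored = SetIntersector.from_dict(serialized)
print(restored.intersect_with)  # {1, 2, 3}
```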

## Saving a Pipeline to a Custom Format

Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.

A `Marshaller` is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the `Marshaller` [protocol](https://github.com/deepset-ai/haystack/blob/main/haystack/marshal/protocol.py), providing the methods `marshal` and `unmarshal`.

This is the code for a custom TOML marshaller that relies on the `rtoml` library:

```python
# This code requires `pip install rtoml`
from typing import Dict, Any, Union
import rtoml

class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```

You can then pass a Marshaller instance to the methods `dump`, `dumps`, `load`, and `loads`:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())
# returns:
# 'max_loops_allowed = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```
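
A brief round-trip sketch, assuming `loads()` accepts the marshaller as its second positional argument, mirroring how `dumps()` takes it as the first:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

marshaller = TomlMarshaller()

pipe = Pipeline()
toml_string = pipe.dumps(marshaller)

# Load the TOML string back with the same marshaller.
restored = Pipeline.loads(toml_string, marshaller)
```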

## Additional References

:notebook: Tutorial: [Serializing LLM Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)