- Title: Document class for Haystack 2.0
- Decision driver: ZanSara
- Start Date: 2023-09-07
- Proposal PR: 5738
Summary
With Haystack 2.0 we want to provide a lot more flexibility to Pipelines and Components. In a lot of situations, we found that the Document class inherited from Haystack 1.x was not up to the task: therefore we chose to expand its API to work best in this new paradigm.
Basic example
Documents 2.0 have two fundamental differences with Documents 1.x:
- They have more than one content field. Documents 1.x only have a content: Any field, whose meaning has to match the content_type field. Documents 2.0 instead support text, array, dataframe and blob, each typed correctly.
- The content_type field is gone. In Haystack 1.x we used the content_type field to interpret the data contained in the content field; with the new design this is no longer necessary. Documents 2.0 do, however, have a mime_type field that helps interpret the content of the blob field if necessary.
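The contrast between the two designs can be sketched with two stand-in dataclasses. LegacyDocument and NewDocument are hypothetical names used only for illustration, not the Haystack classes, and they model just the fields discussed above with the standard library.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LegacyDocument:
    """Hypothetical stand-in for a 1.x-style Document (not the Haystack class)."""
    content: Any                 # untyped payload
    content_type: str = "text"   # tells the reader how to interpret `content`


@dataclass
class NewDocument:
    """Hypothetical, simplified stand-in for a 2.0-style Document."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"  # only needed to interpret `blob`


# 1.x style: the consumer must check content_type before touching content.
old = LegacyDocument(content="Hello world", content_type="text")
old_text = old.content if old.content_type == "text" else None

# 2.0 style: each kind of content lives in its own typed field.
new = NewDocument(text="Hello world")
pdf = NewDocument(blob=b"%PDF-1.7", mime_type="application/pdf")
```

In the second style a component that only needs text can read the text field directly, without first inspecting a separate type marker.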
Motivation
During the development of Haystack 2.0 components, we often found ourselves held back by the design limitations of the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example, they might contain the file they originated from, they might support more datatypes, etc.
Therefore we decided to extend the Document class to support a wider array of data.
Detailed design
The design of this class was inspired by the DocArray API.
Here is the high-level API of the new Document class:
@dataclass(frozen=True)
class Document:
    id: str = field(default_factory=str)
    text: Optional[str] = field(default=None)
    array: Optional[numpy.ndarray] = field(default=None)
    dataframe: Optional[pandas.DataFrame] = field(default=None)
    blob: Optional[bytes] = field(default=None)
    mime_type: str = field(default="text/plain")
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
    id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)
    score: Optional[float] = field(default=None, compare=True)
    embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

    def to_dict(self):
        """
        Saves the Document into a dictionary.
        """

    def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):
        """
        Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.
        """

    @classmethod
    def from_dict(cls, dictionary):
        """
        Creates a new Document object from a dictionary of its fields.
        """

    @classmethod
    def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):
        """
        Creates a new Document object from a JSON string.
        """

    def flatten(self) -> Dict[str, Any]:
        """
        Returns a dictionary with all the document fields and metadata on the same level.
        Helpful for filtering in document stores.
        """
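To illustrate the serialization methods, here is a minimal sketch of a JSON round trip. MiniDocument is a hypothetical, reduced stand-in with only a few of the fields, and the way the blob is dropped (set to None before encoding) is an assumption: the proposal only states that binary data is dropped, not how.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class MiniDocument:
    """Hypothetical reduced Document used only for this sketch."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_json(self, **json_kwargs) -> str:
        data = self.to_dict()
        # Assumption: bytes are not JSON serializable, so the blob is replaced by None.
        data["blob"] = None
        return json.dumps(data, **json_kwargs)

    @classmethod
    def from_dict(cls, dictionary: Dict[str, Any]) -> "MiniDocument":
        return cls(**dictionary)

    @classmethod
    def from_json(cls, data: str, **json_kwargs) -> "MiniDocument":
        return cls.from_dict(json.loads(data, **json_kwargs))


doc = MiniDocument(text="hi", blob=b"\x00\x01", metadata={"path": "a.txt"})
restored = MiniDocument.from_json(doc.to_json())
```

After the round trip, restored keeps the text and metadata of doc but its blob is None, so in this sketch a JSON round trip is lossy for binary content.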
As you can see, the main difference is the management of the content fields. We now have:
- text: for text data
- array: for array-like data, for example images, audio, video
- dataframe: for tabular data
- blob: for binary data
To help interpret the content of these fields, there's a mime_type field that components can use to figure out
how to use the content fields they need.
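As a rough sketch of how a component might use mime_type, the helper below decides whether it can obtain text from a document. MiniDocument and readable_text are hypothetical names, and treating every text/* blob as UTF-8 is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniDocument:
    """Hypothetical reduced Document used only for this sketch."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"


def readable_text(doc: MiniDocument) -> Optional[str]:
    """Return text this component can process, or None if it cannot handle the document."""
    if doc.text is not None:
        return doc.text
    if doc.blob is not None and doc.mime_type.startswith("text/"):
        # Assumption: textual blobs are UTF-8 encoded.
        return doc.blob.decode("utf-8")
    return None


plain = readable_text(MiniDocument(text="already text"))
markdown = readable_text(MiniDocument(blob=b"# Title", mime_type="text/markdown"))
image = readable_text(MiniDocument(blob=b"\x89PNG", mime_type="image/png"))
```

Here the text field is used when present, a textual blob is decoded, and an image blob is rejected by returning None.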
There is additional information that we may want to add, for example path. For now such information can be
kept in the metadata: if we realize we access it extremely often while processing Documents, we should consider
bringing those fields out of metadata as top-level properties of the dataclass.
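A minimal sketch of keeping path in the metadata and exposing it through flatten() could look as follows. MiniDocument is a hypothetical reduced class, and letting top-level fields win over metadata keys with the same name is an assumption of this sketch, not something the proposal specifies.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class MiniDocument:
    """Hypothetical reduced Document used only for this sketch."""
    id: str = ""
    text: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

    def flatten(self) -> Dict[str, Any]:
        fields = asdict(self)
        metadata = fields.pop("metadata")
        # Assumption: on a name clash, the top-level field overrides the metadata key.
        return {**metadata, **fields}


doc = MiniDocument(id="doc-1", text="Hello", metadata={"path": "docs/hello.txt"})
flat = doc.flatten()
```

With this layout a filter can address path in the flattened dictionary the same way it addresses id or text, whether or not path is later promoted to a top-level field.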
Drawbacks
As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues for DocumentStores, because they now need to be able to store not only text but binary blobs as well.
We can imagine that some very simple DocumentStore will refuse to store the binary blobs. Fully-featured, production-ready document stores instead should be able to find a way to store such blobs.
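The behavior of such a very simple store can be sketched as below. TextOnlyStore and MiniDocument are hypothetical names, not Haystack classes, and raising ValueError is just one possible way of refusing a blob.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MiniDocument:
    """Hypothetical reduced Document used only for this sketch."""
    id: str
    text: Optional[str] = None
    blob: Optional[bytes] = None


class TextOnlyStore:
    """Hypothetical in-memory store that refuses documents carrying binary data."""

    def __init__(self) -> None:
        self._docs: Dict[str, MiniDocument] = {}

    def write(self, doc: MiniDocument) -> None:
        if doc.blob is not None:
            # Assumption: the store rejects the whole document instead of dropping the blob.
            raise ValueError(f"Document {doc.id!r} has a blob, which this store cannot hold")
        self._docs[doc.id] = doc

    def count(self) -> int:
        return len(self._docs)


store = TextOnlyStore()
store.write(MiniDocument(id="a", text="plain text is accepted"))
```

A fully-featured store would replace the ValueError branch with whatever binary storage mechanism its backend offers.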
Unresolved questions
Are the 4 content fields appropriate? Are there other content types we can consider adding?