- Title: Document class for Haystack 2.0
- Decision driver: ZanSara
- Start Date: 2023-09-07
- Proposal PR: 5738
# Summary
With Haystack 2.0 we want to provide a lot more flexibility to Pipelines and Components. In many situations
we found that the Document class inherited from Haystack 1.x was not up to the task, so we chose to expand its
API to work better in this new paradigm.
# Basic example
Documents 2.0 differ from Documents 1.x in two fundamental ways:
- They have more than one content field. Documents 1.x only have a `content: Any` field, whose meaning must match
the `content_type` field. Documents 2.0 instead support `text`, `array`, `dataframe` and `blob`, each typed
correctly.
- The `content_type` field is gone. In Haystack 1.x we used the `content_type` field to interpret the data contained
in the `content` field; with the new design, this is no longer necessary. Documents 2.0, however, have
a `mime_type` field that helps interpret the content of the `blob` field if necessary.
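
A minimal sketch of the difference, using simplified stand-in classes rather than the real Haystack ones. `DocumentV1` and `DocumentV2` are hypothetical names, and only a subset of the fields is modelled:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DocumentV1:
    """Stand-in for a Haystack 1.x Document (hypothetical, simplified)."""
    content: Any                # meaning depends on content_type
    content_type: str = "text"


@dataclass
class DocumentV2:
    """Stand-in for a Haystack 2.0 Document (hypothetical, simplified)."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"


# 1.x: the reader has to check content_type to know what content holds.
old_doc = DocumentV1(content="Hello world", content_type="text")

# 2.0: each kind of content lives in its own typed field.
text_doc = DocumentV2(text="Hello world")
pdf_doc = DocumentV2(blob=b"%PDF-1.7 ...", mime_type="application/pdf")
```
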
# Motivation
During the development of Haystack 2.0 components, we often found ourselves held back by the design limitations of
the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example,
they might contain the file they originated from, they might support more datatypes, etc.
Therefore we decided to extend the Document class to support a wider array of data.
# Detailed design
The design of this class was inspired by the [DocArray API](https://docarray.jina.ai/fundamentals/document/).
Here is the high-level API of the new Document class:
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Type

import numpy
import pandas

# DocumentEncoder and DocumentDecoder are custom JSON encoder/decoder classes (not shown here).


@dataclass(frozen=True)
class Document:
    id: str = field(default_factory=str)
    text: Optional[str] = field(default=None)
    array: Optional[numpy.ndarray] = field(default=None)
    dataframe: Optional[pandas.DataFrame] = field(default=None)
    blob: Optional[bytes] = field(default=None)
    mime_type: str = field(default="text/plain")
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
    id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)
    score: Optional[float] = field(default=None, compare=True)
    embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

    def to_dict(self):
        """
        Saves the Document into a dictionary.
        """

    def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):
        """
        Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.
        """

    @classmethod
    def from_dict(cls, dictionary):
        """
        Creates a new Document object from a dictionary of its fields.
        """

    @classmethod
    def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):
        """
        Creates a new Document object from a JSON string.
        """

    def flatten(self) -> Dict[str, Any]:
        """
        Returns a dictionary with all the document fields and metadata on the same level.
        Helpful for filtering in document stores.
        """
```
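
To illustrate what `flatten()` is meant to return, here is a rough sketch on a reduced stand-in class. `MiniDocument` is hypothetical, and letting top-level fields win over metadata keys with the same name is an assumption: the proposal does not specify how collisions are handled.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class MiniDocument:
    """Reduced stand-in for the proposed Document (hypothetical)."""
    id: str = ""
    text: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)

    def flatten(self) -> Dict[str, Any]:
        # Collect every top-level field except the nested metadata dict.
        top_level = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.name != "metadata"
        }
        # Assumption: on a name clash, the top-level field wins.
        return {**self.metadata, **top_level}


doc = MiniDocument(id="doc-1", text="Hello", metadata={"path": "notes.txt", "page": 3})
flat = doc.flatten()
```

A filter such as `page == 3` could then be evaluated against `flat` without knowing whether `page` is a field or a metadata key.
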
As you can see, the main difference is how content is managed. There are now four content fields:
- `text`: for text data
- `array`: for array-like data, for example images, audio, video
- `dataframe`: for tabular data
- `blob`: for binary data.
To help interpret the content of these fields, there's a `mime_type` field that components can use to figure out
how to use the content fields they need.
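
As a sketch of how a component might use `mime_type`, the hypothetical helper below decides how to read a document's `blob`. `MiniDocument` and `blob_as_text` are illustrative names, not part of Haystack, and decoding as UTF-8 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniDocument:
    """Reduced stand-in for the proposed Document (hypothetical)."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"


def blob_as_text(doc: MiniDocument) -> Optional[str]:
    """Return the blob decoded as text if mime_type says it is textual."""
    if doc.blob is None:
        return None
    # Only text/* blobs are decoded; anything else is left to
    # a component that understands that format.
    if doc.mime_type.startswith("text/"):
        return doc.blob.decode("utf-8")  # assumption: UTF-8 encoded
    return None


markdown_doc = MiniDocument(blob=b"# Title", mime_type="text/markdown")
image_doc = MiniDocument(blob=b"\x89PNG\r\n", mime_type="image/png")
```
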
There is additional information that we may want to add, for example `path`. For now such information can be
kept in the metadata. If we realize that we access it extremely often while processing Documents, we should consider
bringing those fields out of `metadata` as top-level properties of the dataclass.
# Drawbacks
As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues
for DocumentStores, because they now need to be able to store not only text but binary blobs as well.
We can imagine that some very simple DocumentStores will refuse to store binary blobs. Fully-featured,
production-ready document stores should instead be able to find a way to store such blobs.
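
The behaviour described for a very simple store could look like the sketch below. `MiniDocument` and `TextOnlyStore` are hypothetical and do not follow the real DocumentStore interface; raising `ValueError` is just one possible way to refuse.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MiniDocument:
    """Reduced stand-in for the proposed Document (hypothetical)."""
    id: str
    text: Optional[str] = None
    blob: Optional[bytes] = None


class TextOnlyStore:
    """Toy in-memory store that only accepts text (hypothetical)."""

    def __init__(self) -> None:
        self._docs: Dict[str, MiniDocument] = {}

    def write(self, doc: MiniDocument) -> None:
        # Refuse documents carrying binary data instead of silently dropping it.
        if doc.blob is not None:
            raise ValueError(f"Document {doc.id!r} has a blob; this store only keeps text.")
        self._docs[doc.id] = doc

    def count(self) -> int:
        return len(self._docs)


store = TextOnlyStore()
store.write(MiniDocument(id="a", text="plain text is fine"))

try:
    store.write(MiniDocument(id="b", blob=b"\x00\x01"))
    refused = False
except ValueError:
    refused = True
```
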
# Unresolved questions
Are the 4 content fields appropriate? Are there other content types we can consider adding?