- Title: Document class for Haystack 2.0
- Decision driver: ZanSara
- Start Date: 2023-09-07
- Proposal PR: 5738

# Summary

With Haystack 2.0 we want to provide a lot more flexibility to Pipelines and Components. In many situations,
we found that the Document class inherited from Haystack 1.x was not up to the task, so we chose to expand its
API to work better in this new paradigm.

# Basic example

Documents 2.0 differ from Documents 1.x in two fundamental ways:

- They have more than one content field. Documents 1.x only have a `content: Any` field whose meaning has to match
  the `content_type` field. Documents 2.0 instead support `text`, `array`, `dataframe` and `blob`, each typed
  correctly.

- The `content_type` field is gone. In Haystack 1.x we used the `content_type` field to interpret the data contained
  in the `content` field; with the new design, this is no longer necessary. Documents 2.0, however, have
  a `mime_type` field that helps interpret the content of the `blob` field if necessary.

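The two differences above can be sketched side by side. This is only an illustration: `LegacyDocument` is a hypothetical, heavily simplified stand-in for the 1.x class, and the `Document` stand-in below omits `array`, `dataframe` and most other fields so that the snippet runs with the standard library alone.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class LegacyDocument:
    """Hypothetical, simplified stand-in for a Haystack 1.x Document."""
    content: Any
    content_type: str = "text"


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the proposed 2.0 Document (several fields omitted)."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)


# 1.x style: `content` is untyped, so `content_type` must be checked to interpret it.
old_doc = LegacyDocument(content="Hello, Haystack!", content_type="text")

# 2.0 style: each kind of content lives in its own typed field.
text_doc = Document(text="Hello, Haystack!")
pdf_doc = Document(blob=b"%PDF-1.7 ...", mime_type="application/pdf")
```
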
# Motivation

During the development of Haystack 2.0 components, we often found ourselves held back by the design limitations of
the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example,
they might contain the file they originated from, they might support more datatypes, etc.

Therefore we decided to extend the Document class to support a wider array of data.

# Detailed design

The design of this class was inspired by the [DocArray API](https://docarray.jina.ai/fundamentals/document/).

Here is the high-level API of the new Document class:

```python
@dataclass(frozen=True)
class Document:
    id: str = field(default_factory=str)
    text: Optional[str] = field(default=None)
    array: Optional[numpy.ndarray] = field(default=None)
    dataframe: Optional[pandas.DataFrame] = field(default=None)
    blob: Optional[bytes] = field(default=None)
    mime_type: str = field(default="text/plain")
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
    id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)
    score: Optional[float] = field(default=None, compare=True)
    embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

    def to_dict(self):
        """
        Saves the Document into a dictionary.
        """

    def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):
        """
        Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.
        """

    @classmethod
    def from_dict(cls, dictionary):
        """
        Creates a new Document object from a dictionary of its fields.
        """

    @classmethod
    def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):
        """
        Creates a new Document object from a JSON string.
        """

    def flatten(self) -> Dict[str, Any]:
        """
        Returns a dictionary with all the document fields and metadata on the same level.
        Helpful for filtering in document stores.
        """
```

As you can see, the main difference is how the content fields are managed. We now have:

- `text`: for text data
- `array`: for array-like data, for example images, audio, video
- `dataframe`: for tabular data
- `blob`: for binary data.

To help interpret the content of these fields, there's a `mime_type` field that components can use to figure out
how to use the content fields they need.

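As a rough illustration of how a component might rely on `mime_type`, the sketch below picks a content field to read text from. The `extract_text` helper and the simplified `Document` stand-in are hypothetical and not part of the proposal, and the UTF-8 decoding of textual blobs is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the proposed Document (most fields omitted)."""
    text: Optional[str] = None
    blob: Optional[bytes] = None
    mime_type: str = "text/plain"


def extract_text(doc: Document) -> str:
    """Hypothetical component logic: decide which content field to read."""
    # Prefer the typed `text` field whenever it is populated.
    if doc.text is not None:
        return doc.text
    # Fall back to the blob only if its MIME type says it is textual.
    # Assumption: textual blobs are UTF-8 encoded.
    if doc.blob is not None and doc.mime_type.startswith("text/"):
        return doc.blob.decode("utf-8")
    raise ValueError(f"Cannot extract text from content of type {doc.mime_type!r}")


plain = extract_text(Document(text="already text"))
decoded = extract_text(Document(blob=b"caf\xc3\xa9", mime_type="text/plain"))
```
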
There is additional information that we may want to add, for example `path`. For now such information can be
kept in the metadata: if we realize we access it extremely often while processing Documents, we should consider
bringing those fields out of `metadata` as top-level properties of the dataclass.

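A minimal sketch of this arrangement, assuming a simplified `Document` stand-in and a free-standing `flatten` helper (the proposal only specifies `flatten` as a method returning fields and metadata on the same level; the body and the clash-handling rule below are illustrative assumptions):

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the proposed Document (most fields omitted)."""
    id: str
    text: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)


def flatten(doc: Document) -> Dict[str, Any]:
    """Put document fields and metadata entries on the same level."""
    fields = asdict(doc)
    metadata = fields.pop("metadata")
    # Assumption: on a name clash the top-level field wins over the metadata entry.
    return {**metadata, **fields}


# `path` is not a top-level field yet, so it travels inside `metadata`.
doc = Document(id="doc-1", text="Hello", metadata={"path": "docs/hello.txt", "language": "en"})
flat = flatten(doc)

# After flattening, a filter can address `path` the same way it addresses `text`.
matches = flat["path"].endswith(".txt")
```

If `path` were later promoted to a top-level property, code that filters on the flattened dictionary would keep working unchanged, which is one argument for deferring that decision.
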
# Drawbacks

As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues
for DocumentStores, because they now need to store not only text but binary blobs as well.

We can imagine that some very simple DocumentStores will refuse to store binary blobs. Fully-featured,
production-ready document stores should instead be able to find a way to store such blobs.

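To make the first case concrete, here is a sketch of what a very simple store that refuses blobs could look like. `TextOnlyDocumentStore` and its methods are hypothetical names used for illustration only, and the `Document` class is again a reduced stand-in.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for the proposed Document (most fields omitted)."""
    id: str
    text: Optional[str] = None
    blob: Optional[bytes] = None


class TextOnlyDocumentStore:
    """Hypothetical minimal store that keeps documents in a dict and rejects blobs."""

    def __init__(self) -> None:
        self._documents: Dict[str, Document] = {}

    def write_documents(self, documents: List[Document]) -> None:
        # Validate the whole batch first so a failure leaves the store unchanged.
        for doc in documents:
            if doc.blob is not None:
                raise ValueError(f"Document {doc.id!r} has binary content, which this store cannot hold")
        for doc in documents:
            self._documents[doc.id] = doc

    def count_documents(self) -> int:
        return len(self._documents)


store = TextOnlyDocumentStore()
store.write_documents([Document(id="1", text="plain text is fine")])
```

Failing loudly at write time, rather than silently dropping the blob, keeps the limitation visible to whoever assembles the pipeline.
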
# Unresolved questions

Are the 4 content fields appropriate? Are there other content types we can consider adding?