
  • Title: Document class for Haystack 2.0
  • Decision driver: ZanSara
  • Start Date: 2023-09-07
  • Proposal PR: 5738

Summary

With Haystack 2.0 we want to provide much more flexibility to Pipelines and Components. In many situations, we found that the Document class inherited from Haystack 1.x was not up to the task; therefore, we chose to expand its API to work best in this new paradigm.

Basic example

Documents 2.0 differ from Documents 1.x in two fundamental ways:

  • They have more than one content field. Documents 1.x only have a content: Any field whose meaning has to match the content_type field. Documents 2.0 instead support text, array, dataframe and blob, each correctly typed.

  • The content_type field is gone. In Haystack 1.x we used the content_type field to interpret the data contained in the content field; with the new design, this is no longer necessary. Documents 2.0, however, have a mime_type field that helps interpret the content of the blob field if necessary, as sketched below.
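
For illustration, here is a minimal sketch of how these fields could be used. The import path is assumed from the preview package and all values are made up:

import pandas
from haystack.preview.dataclasses import Document  # assumed import path

# Each content type now has its own, correctly typed field.
text_doc = Document(text="Berlin is the capital of Germany.")
table_doc = Document(dataframe=pandas.DataFrame({"city": ["Berlin"], "population": [3_700_000]}))
binary_doc = Document(blob=b"<raw PNG bytes>", mime_type="image/png")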

Motivation

During the development of Haystack 2.0 components, we often found ourselves held back by the design limitations of the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example, they might contain the file they originated from, they might support more data types, and so on.

Therefore, we decided to extend the Document class to support a wider array of data.

Detailed design

The design of this class was inspired by the DocArray API.

Here is the high-level API of the new Document class:

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Type

import numpy
import pandas

# DocumentEncoder and DocumentDecoder are custom JSON encoder/decoder helpers (not shown in this proposal).


@dataclass(frozen=True)
class Document:
    id: str = field(default_factory=str)
    text: Optional[str] = field(default=None)
    array: Optional[numpy.ndarray] = field(default=None)
    dataframe: Optional[pandas.DataFrame] = field(default=None)
    blob: Optional[bytes] = field(default=None)
    mime_type: str = field(default="text/plain")
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
    id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)
    score: Optional[float] = field(default=None, compare=True)
    embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

    def to_dict(self):
        """
        Saves the Document into a dictionary.
        """

    def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):
        """
        Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.
        """

    @classmethod
    def from_dict(cls, dictionary):
        """
        Creates a new Document object from a dictionary of its fields.
        """

    @classmethod
    def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):
        """
        Creates a new Document object from a JSON string.
        """

    def flatten(self) -> Dict[str, Any]:
        """
        Returns a dictionary with all the document fields and metadata on the same level.
        Helpful for filtering in document stores.
        """

As you can notice, the main difference is the management of the content fields. We now have:

  • text: for text data
  • array: for array-like data, for example images, audio, or video
  • dataframe: for tabular data
  • blob: for binary data

To help interpret the content of these fields, there is a mime_type field that components can use to figure out how to handle the content fields they need.
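
As an illustrative sketch (the routing logic and MIME types below are hypothetical, not part of the proposal), a converter component could branch on mime_type to decide how to handle a blob:

from haystack.preview.dataclasses import Document  # assumed import path

doc = Document(blob=b"%PDF-1.7 ...", mime_type="application/pdf")

if doc.mime_type == "application/pdf":
    pass  # hand the blob over to a PDF converter
elif doc.mime_type.startswith("text/"):
    text = doc.blob.decode("utf-8")  # plain text blobs can be decoded directly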

There is additional information that we may want to add, for example a path. For now, such information can be kept in the metadata; if we realize we access it very often while processing Documents, we should consider promoting those fields from metadata to top-level properties of the dataclass.
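
For instance, a path can live in metadata today and still be easy to filter on thanks to flatten(). This is a sketch; the exact keys of the flattened dictionary depend on the final implementation:

from haystack.preview.dataclasses import Document  # assumed import path

doc = Document(
    text="The quick brown fox jumps over the lazy dog.",
    metadata={"path": "/data/animals/fox.txt", "language": "en"},
)

flat = doc.flatten()
# Document fields and metadata keys end up on the same level,
# e.g. flat["text"], flat["path"] and flat["language"],
# which makes filtering in DocumentStores straightforward.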

Drawbacks

As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues for DocumentStores, because they now need to be able to store not only text but binary blobs as well.

We can imagine that some very simple DocumentStores will refuse to store binary blobs. Fully-featured, production-ready DocumentStores, on the other hand, should be able to find a way to store them.

Unresolved questions

Are the four content fields appropriate? Are there other content types we should consider adding?