haystack/proposals/text/5738-document-2.0.md

- Title: Document class for Haystack 2.0
- Decision driver: ZanSara
- Start Date: 2023-09-07
- Proposal PR: 5738

# Summary

With Haystack 2.0 we want to provide a lot more flexibility to Pipelines and Components. In a lot of situations,
we found that the Document class inherited from Haystack 1.x was not up to the task: therefore we chose to expand its
API to work best in this new paradigm.

# Basic example

Documents 2.0 have two fundamental differences with Documents 1.x:

- They have more than one content field. Documents 1.x only have a `content: Any` field that needs to match with the
    `content_type` field in meaning. Documents 2.0 instead support `text`, `array`, `dataframe` and `blob`, each typed
    correctly.

- The `content_type` field is gone: In Haystack 1.x we used the `content_type` field to interpret the data contained
    in the `content` field: with the new design, this won't be necessary any longer. Haystack 2.0, however, have
    a `mime_type` field that helps interpret the content of the `blob` field if necessary.

# Motivation

During the development of Haystack 2.0 components, we often found ourselves hold back by the design limitations of
the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example,
they might contain the file they originated from, they might support more datatypes, etc.

Therefore we decided to extend the Document class to support a wider array of data.

# Detailed design

The design of this class was inspired by the [DocArray API](https://docarray.jina.ai/fundamentals/document/).

Here is the high-level API of the new Document class:

```python
@dataclass(frozen=True)
class Document:
    id: str = field(default_factory=str)
    text: Optional[str] = field(default=None)
    array: Optional[numpy.ndarray] = field(default=None)
    dataframe: Optional[pandas.DataFrame] = field(default=None)
    blob: Optional[bytes] = field(default=None)
    mime_type: str = field(default="text/plain")
    metadata: Dict[str, Any] = field(default_factory=dict, hash=False)
    id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)
    score: Optional[float] = field(default=None, compare=True)
    embedding: Optional[numpy.ndarray] = field(default=None, repr=False)

    def to_dict(self):
        """
        Saves the Document into a dictionary.
        """

    def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):
        """
        Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.
        """

    @classmethod
    def from_dict(cls, dictionary):
        """
        Creates a new Document object from a dictionary of its fields.
        """

    @classmethod
    def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):
        """
        Creates a new Document object from a JSON string.
        """

    def flatten(self) -> Dict[str, Any]:
        """
        Returns a dictionary with all the document fields and metadata on the same level.
        Helpful for filtering in document stores.
        """

```

As you can notice, the main difference is the management of the content fields: we now have:

- `text`: for text data
- `array`: for array-like data, for example images, audio, video
- `dataframe`: for tabular data
- `blob`: for binary data.

In order to help interpret the content of these field, there's a `mime_type` field that components can use to figure out
how to use the content fields they need.

There are additional information that we may want to add, for example `path`. For now such info can be
kept into the metadata: if we realize we access it extremely often while processing Documents we should consider
bringing those fields out of `metadata` as top-level properties of the dataclass.


# Drawbacks

As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues
to DocumentStores, because now they not only need to be able to store text but binary blobs as well.

We can imagine that some very simple DocumentStore will refuse to store the binary blobs. Fully-featured,
production-ready document stores instead should be able to find a way to store such blobs.


# Unresolved questions

Are the 4 content fields appropriate? Are there other content types we can consider adding?
feat: Improve `Document` for Haystack 2.0 (#5738) * initial draft * tests * add proposal * proposal number * reno * fix tests and usage of content and content_type * update branch & fix more tests * mypy * add docstring * fix more tests * review feedback * improve __str__ * Apply suggestions from code review Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * Update haystack/preview/dataclasses/document.py Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * improve __str__ * fix tests * fix more tests * Update haystack/preview/document_stores/memory/document_store.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> 2023-09-11 16:40:00 +01:00			`- Title: Document class for Haystack 2.0`
			`- Decision driver: ZanSara`
			`- Start Date: 2023-09-07`
			`- Proposal PR: 5738`

			`# Summary`

			`With Haystack 2.0 we want to provide a lot more flexibility to Pipelines and Components. In a lot of situations,`
			`we found that the Document class inherited from Haystack 1.x was not up to the task: therefore we chose to expand its`
			`API to work best in this new paradigm.`

			`# Basic example`

			`Documents 2.0 have two fundamental differences with Documents 1.x:`

			- They have more than one content field. Documents 1.x only have a `content: Any` field that needs to match with the
			`content_type` field in meaning. Documents 2.0 instead support `text`, `array`, `dataframe` and `blob`, each typed
			`correctly.`

			- The `content_type` field is gone: In Haystack 1.x we used the `content_type` field to interpret the data contained
			in the `content` field: with the new design, this won't be necessary any longer. Haystack 2.0, however, have
			a `mime_type` field that helps interpret the content of the `blob` field if necessary.

			`# Motivation`

			`During the development of Haystack 2.0 components, we often found ourselves hold back by the design limitations of`
			`the Document class. Unlike in Haystack 1.x, Documents now carry more information across the pipeline: for example,`
			`they might contain the file they originated from, they might support more datatypes, etc.`

			`Therefore we decided to extend the Document class to support a wider array of data.`

			`# Detailed design`

			`The design of this class was inspired by the [DocArray API](https://docarray.jina.ai/fundamentals/document/).`

			`Here is the high-level API of the new Document class:`

			```python
			`@dataclass(frozen=True)`
			`class Document:`
			`id: str = field(default_factory=str)`
			`text: Optional[str] = field(default=None)`
			`array: Optional[numpy.ndarray] = field(default=None)`
			`dataframe: Optional[pandas.DataFrame] = field(default=None)`
			`blob: Optional[bytes] = field(default=None)`
			`mime_type: str = field(default="text/plain")`
			`metadata: Dict[str, Any] = field(default_factory=dict, hash=False)`
			`id_hash_keys: List[str] = field(default_factory=lambda: ["text", "array", "dataframe", "blob"], hash=False)`
			`score: Optional[float] = field(default=None, compare=True)`
			`embedding: Optional[numpy.ndarray] = field(default=None, repr=False)`

			`def to_dict(self):`
			`"""`
			`Saves the Document into a dictionary.`
			`"""`

			`def to_json(self, json_encoder: Optional[Type[DocumentEncoder]] = None, **json_kwargs):`
			`"""`
			`Saves the Document into a JSON string that can be later loaded back. Drops all binary data from the blob field.`
			`"""`

			`@classmethod`
			`def from_dict(cls, dictionary):`
			`"""`
			`Creates a new Document object from a dictionary of its fields.`
			`"""`

			`@classmethod`
			`def from_json(cls, data, json_decoder: Optional[Type[DocumentDecoder]] = None, **json_kwargs):`
			`"""`
			`Creates a new Document object from a JSON string.`
			`"""`

			`def flatten(self) -> Dict[str, Any]:`
			`"""`
			`Returns a dictionary with all the document fields and metadata on the same level.`
			`Helpful for filtering in document stores.`
			`"""`

			```

			`As you can notice, the main difference is the management of the content fields: we now have:`

			- `text`: for text data
			- `array`: for array-like data, for example images, audio, video
			- `dataframe`: for tabular data
			- `blob`: for binary data.

			In order to help interpret the content of these field, there's a `mime_type` field that components can use to figure out
			`how to use the content fields they need.`

			There are additional information that we may want to add, for example `path`. For now such info can be
			`kept into the metadata: if we realize we access it extremely often while processing Documents we should consider`
			bringing those fields out of `metadata` as top-level properties of the dataclass.


			`# Drawbacks`

			`As the Document class becomes a bit more complex, components need to be adapted to it. This may cause some issues`
			`to DocumentStores, because now they not only need to be able to store text but binary blobs as well.`

			`We can imagine that some very simple DocumentStore will refuse to store the binary blobs. Fully-featured,`
			`production-ready document stores instead should be able to find a way to store such blobs.`


			`# Unresolved questions`

			`Are the 4 content fields appropriate? Are there other content types we can consider adding?`