  • Title: Addition of a JsonConverter node
  • Decision driver: @bglearning
  • Start Date: 2023-01-26
  • Proposal PR: #3959

Summary

Right now we don't have a node that can take json files as input to be fed into a pipeline.

Proposal: Add a JsonConverter node that takes in a json file, parses it, and generates Documents. It would also support the jsonl format with one line corresponding to one document.

Basic example

from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")

With data_file.json containing a list of json representations of documents:

[
    {
        "content": "...",
        "content_type": "text", "meta": {...}
    },
    {
        "content": [["h1", "h2"], ["val1", "val2"]],
        "content_type": "table", "meta": {...}
    }
]

Alternatively, the data can also be jsonl, with one document per line. By default, the converter will try to auto-detect between json and jsonl.
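For illustration, the jsonl variant would contain one such document dictionary per line:

{"content": "...", "content_type": "text", "meta": {...}}
{"content": [["h1", "h2"], ["val1", "val2"]], "content_type": "table", "meta": {...}}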

The main use case would be the ability to include this node directly in a YAML pipeline specification:

...

pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
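
For illustration, such a YAML specification could then be loaded and run in the usual way. This is a minimal usage sketch; the file names are purely illustrative:

from pathlib import Path

from haystack.pipelines import Pipeline

# Load the indexing pipeline defined in the YAML above and feed it a json file.
indexing_pipeline = Pipeline.load_from_yaml(Path("indexing_pipeline.yml"), pipeline_name="indexing")
indexing_pipeline.run(file_paths=["data_file.json"])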

Motivation

Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally this would happen without the need to create a custom endpoint.

For much of this kind of processing, json is a convenient intermediate format, as it allows for things like specifying metadata.

Specifically, one use case that has come up for a team using Haystack: they want to use a PDF parser (for tables) that is currently not available in Haystack. They therefore want to handle the parsing themselves outside of Haystack, put the parsed result into a json file, and then pass it on to a Haystack API endpoint.

Having a JsonConverter node would allow users to set up a Haystack pipeline to ingest such data without having to create a custom node for it.

Detailed design

The converter would primarily be a wrapper around Document.from_dict.

The accepted schema would be a list of json dictionaries of Documents, i.e. the following, with content being the only compulsory field:

[
    {
        "content": str or list[list],
        "content_type": str,
        "meta": dict,
        "id_hash_keys": list,
        "score": float,
        "embedding": array
    },
    ...
]
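
For reference, a single entry of this list maps onto Document.from_dict roughly like so (a minimal illustration; the values are made up):

from haystack.schema import Document

doc = Document.from_dict(
    {"content": "some text", "content_type": "text", "meta": {"source": "example.pdf"}}
)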
A rough sketch of the converter implementation (simplified):

class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            # Plain json case; jsonl auto-detection is left out of this sketch.
            data = json.load(f)
            for doc_dict in data:
                # Copy so the parsed data isn't mutated.
                doc_dict = dict(doc_dict)
                doc_dict['id_hash_keys'] = id_hash_keys
                doc_dict['meta'] = doc_dict.get('meta', dict())

                # Metadata passed to convert() is merged into each document's meta.
                if meta:
                    doc_dict['meta'].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
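
The sketch above only covers the plain json case. One possible way to auto-detect between json and jsonl is sketched below; the helper name and the exact fallback strategy are assumptions, not part of the proposal:

import json
from pathlib import Path
from typing import Dict, List


def load_document_dicts(file_path: Path, encoding: str = "UTF-8") -> List[Dict]:
    # Try to parse the whole file as json first; if that fails, fall back to
    # jsonl, i.e. one json document dictionary per non-empty line.
    with open(file_path, encoding=encoding, errors="ignore") as f:
        text = f.read()
    try:
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]

convert() could then call such a helper in place of json.load.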

Drawbacks

  • It would add another node that needs to be maintained and documented.

Alternatives

  • This node could be created as a custom node for the particular application where it is required. But it could be better to have it available out-of-the-box.
  • Design alternative: additionally provide options to map custom fields to Document fields (e.g. {"review": "content"}), which could make this node a bit more flexible and might spare the user some pre-formatting beforehand. This could be a future development; see the sketch after this list.
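
A minimal sketch of how such a mapping might work (the field_map option and the example data are illustrative assumptions, not part of the proposal):

# Hypothetical usage: JsonConverter(field_map={"review": "content"})
field_map = {"review": "content"}

raw = {"review": "Great product!", "meta": {"stars": 5}}
doc_dict = {field_map.get(key, key): value for key, value in raw.items()}
# doc_dict is now {"content": "Great product!", "meta": {"stars": 5}}
# and can be passed to Document.from_dict as before.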

Adoption strategy

It doesn't introduce a breaking change and wouldn't require changes in existing pipelines.

How we teach this

It would be good to have this be part of the Guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials, for instance in the preprocessing tutorial where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".

Unresolved questions

  • Should the allowed content_type values be restricted (e.g. to only "text" and "table")? And relatedly, should the name be more specific, e.g. JsonTableTextConverter rather than JsonConverter? Currently leaning towards no restriction and the JsonConverter name.