From 79f57d84601d445db8809cc1512a3a0495ba1a6b Mon Sep 17 00:00:00 2001
From: Bijay Gurung
Date: Thu, 9 Feb 2023 09:57:00 +0100
Subject: [PATCH] Proposal: Add a JsonConverter node (#3959)

* Add Proposal: JsonConverter

* Add jsonl support + schema to JsonConverter Proposal

* Remove format option from JsonConverter Proposal

---------

Co-authored-by: ZanSara
---
 proposals/text/3959-json-converter.md | 142 ++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 proposals/text/3959-json-converter.md

diff --git a/proposals/text/3959-json-converter.md b/proposals/text/3959-json-converter.md
new file mode 100644
index 000000000..1643929ad
--- /dev/null
+++ b/proposals/text/3959-json-converter.md
@@ -0,0 +1,142 @@

- Title: Addition of a `JsonConverter` node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959

# Summary

Right now, we don't have a node that can take JSON files as input to be fed into a pipeline.

Proposal: add a `JsonConverter` node that takes in a JSON file, parses it, and generates `Document`s.
It would also support the `jsonl` format, with one line corresponding to one document.

# Basic example

```python
from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")
```

With `data_file.json` containing a list of JSON representations of documents:

```json
[
    {
        "content": "...",
        "content_type": "text", "meta": {...}
    },
    {
        "content": [["h1", "h2"], ["val1", "val2"]],
        "content_type": "table", "meta": {...}
    }
]
```

Alternatively, the data can also be `jsonl`.
By default, the converter will try to auto-detect between `json` and `jsonl`.

The main use case would be the ability to include this node directly in a YAML pipeline specification:

```yaml
...
pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
```

# Motivation

Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally, this would happen without the need to create a custom endpoint.

For much of this kind of processing, JSON is a convenient intermediate format, as it allows for things like specifying metadata alongside the content.

Specifically, one use case that has come up for a team using Haystack: they want to use a PDF parser (for tables) that is currently not in Haystack. As such, they want to handle the parsing themselves outside of Haystack, put the parsed result into a JSON file, and then pass it on to a Haystack API endpoint.

Having a `JsonConverter` node would let users set up a Haystack pipeline to ingest such data without having to create a custom node for it.

# Detailed design

The converter would primarily be a wrapper around `Document.from_dict`.

The accepted schema would be a list of JSON dictionaries of `Document`s.
That is, the following, with `content` being the only compulsory field:

```
[
    {
        "content": str or list[list],
        "content_type": str,
        "meta": dict,
        "id_hash_keys": list,
        "score": float,
        "embedding": array
    },
    ...
]
```

```python
class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)
            for doc_dict in data:
                # Copy so the parsed dict isn't mutated in place
                doc_dict = dict(doc_dict)
                doc_dict["id_hash_keys"] = id_hash_keys
                doc_dict["meta"] = doc_dict.get("meta", dict())

                if meta:
                    doc_dict["meta"].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
```

# Drawbacks

- It would add another node that needs to be maintained and documented.

# Alternatives

- This node could be created as a custom node for the particular application where it is required. But it could be better to have it available out of the box.
- Design alternative: also provide options to map custom fields to `Document` fields (e.g. `{"review": "content"}`), which would make the node a bit more flexible and might mean the user doesn't have to do some pre-formatting beforehand. But this can be a future development.

# Adoption strategy

It doesn't introduce a breaking change and wouldn't require changes to existing pipelines.

# How we teach this

It would be good to have this be part of the Guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials. For instance, in the preprocessing tutorial, where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".

# Unresolved questions

- Should the allowed `content_type` values be restricted (e.g. to only "text" and "table")? And, relatedly, should the name be more specific, e.g. `JsonTableTextConverter` rather than `JsonConverter`? Currently leaning towards no restriction and the `JsonConverter` name.
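
# Appendix: sketch of the `json`/`jsonl` auto-detection

The auto-detection between `json` and `jsonl` mentioned in the basic example is not shown in the `convert` sketch above. One minimal way to implement it (a sketch only; the standalone `load_records` helper and the try/fallback strategy are illustrative assumptions, not part of the proposed API) is to try parsing the whole file as JSON first, and fall back to parsing one JSON object per line:

```python
import json
from pathlib import Path
from typing import List


def load_records(file_path, encoding: str = "UTF-8") -> List[dict]:
    """Hypothetical helper: parse a file as json, falling back to jsonl."""
    text = Path(file_path).read_text(encoding=encoding)
    try:
        # Whole-file JSON: either a list of document dicts or a single dict
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # jsonl fallback: one JSON object per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One downside of this whole-file-first approach is that a malformed `json` file would be silently retried as `jsonl` and produce a confusing error; dispatching on the file extension instead would be a simpler, more predictable alternative.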