From 79f57d84601d445db8809cc1512a3a0495ba1a6b Mon Sep 17 00:00:00 2001
From: Bijay Gurung
Date: Thu, 9 Feb 2023 09:57:00 +0100
Subject: [PATCH] Proposal: Add a JsonConverter node (#3959)

* Add Proposal: JsonConverter

* Add jsonl support + schema to JsonConverter Proposal

* Remove format option from JsonConverter Proposal

---------

Co-authored-by: ZanSara
---
 proposals/text/3959-json-converter.md | 142 ++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 proposals/text/3959-json-converter.md

diff --git a/proposals/text/3959-json-converter.md b/proposals/text/3959-json-converter.md
new file mode 100644
index 000000000..1643929ad
--- /dev/null
+++ b/proposals/text/3959-json-converter.md
@@ -0,0 +1,142 @@

- Title: Addition of a `JsonConverter` node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959

# Summary

Right now, we don't have a node that can take JSON files as input to be fed into a pipeline.

Proposal: add a `JsonConverter` node that takes in a JSON file, parses it, and generates `Document`s.
It would also support the `jsonl` format, with one line corresponding to one document.

# Basic example

```python
from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")
```

With `data_file.json` containing a list of JSON representations of documents:

```json
[
    {
        "content": "...",
        "content_type": "text", "meta": {...}
    },
    {
        "content": [["h1", "h2"], ["val1", "val2"]],
        "content_type": "table", "meta": {...}
    }
]
```

Alternatively, the data can also be `jsonl`.
By default, the converter will try to auto-detect between `json` and `jsonl`.

The main use case would be the ability to include this node directly in a YAML pipeline specification:

```yaml
...
pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
```

# Motivation

Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally, this would happen without the need to create a custom endpoint.

For much of this kind of processing, JSON is a convenient intermediate format, as it allows for things like specifying metadata alongside the content.

Specifically, one use case that has come up for a team using Haystack: they want to use a PDF parser (for tables) that is currently not in Haystack. As such, they want to handle the parsing themselves outside of Haystack, put the parsed result into a JSON file, and then pass it on to a Haystack API endpoint.

Having a `JsonConverter` node would let users set up a Haystack pipeline to ingest such data without having to create a custom node for it.

# Detailed design

The converter would primarily be a wrapper around `Document.from_dict`.

The accepted schema would be a list of JSON dictionaries of `Document`s.
That is, the following, with `content` being the only compulsory field:

```
[
    {
        "content": str or list[list],
        "content_type": str,
        "meta": dict,
        "id_hash_keys": list,
        "score": float,
        "embedding": array
    },
    ...
]
```

```python
class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)
            for doc_dict in data:
                # Copy so the parsed dict isn't mutated in place
                doc_dict = dict(doc_dict)
                doc_dict["id_hash_keys"] = id_hash_keys
                doc_dict["meta"] = doc_dict.get("meta", dict())

                if meta:
                    doc_dict["meta"].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
```

# Drawbacks

- It would add another node that needs to be maintained and documented.

# Alternatives

- This node could be created as a custom node for the particular application where it is required. But it could be better to have it available out of the box.
- Design alternative: also provide options to map custom fields to `Document` fields (e.g. `{"review": "content"}`), which would make the node a bit more flexible and might mean the user doesn't have to do some pre-formatting beforehand. But this can be a future development.

# Adoption strategy

It doesn't introduce a breaking change and wouldn't require changes to existing pipelines.

# How we teach this

It would be good to have this be part of the Guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials. For instance, in the preprocessing tutorial, where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".

# Unresolved questions

- Should the allowed `content_type` values be restricted (e.g. to only "text" and "table")? And, relatedly, should the name be more specific, e.g. `JsonTableTextConverter` rather than `JsonConverter`? Currently leaning towards no restriction and the `JsonConverter` name.
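
# Appendix: sketch of the `json`/`jsonl` auto-detection

The auto-detection between `json` and `jsonl` mentioned in the basic example is not shown in the `convert` sketch above. One minimal way to implement it (a sketch only; the standalone `load_records` helper and the try/fallback strategy are illustrative assumptions, not part of the proposed API) is to try parsing the whole file as JSON first, and fall back to parsing one JSON object per line:

```python
import json
from pathlib import Path
from typing import List


def load_records(file_path, encoding: str = "UTF-8") -> List[dict]:
    """Hypothetical helper: parse a file as json, falling back to jsonl."""
    text = Path(file_path).read_text(encoding=encoding)
    try:
        # Whole-file JSON: either a list of document dicts or a single dict
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # jsonl fallback: one JSON object per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One downside of this whole-file-first approach is that a malformed `json` file would be silently retried as `jsonl` and produce a confusing error; dispatching on the file extension instead would be a simpler, more predictable alternative.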