* Add Proposal: JsonConverter * Add jsonl support + schema to JsonConverter Proposal * Remove format option from JsonConverter Proposal --------- Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
- Title: Addition of a JsonConverter node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959
Summary
Right now we don't have a node that accepts json files as input to a pipeline.
Proposal: Add a JsonConverter node that takes in a json file, parses it, and generates Documents.
It would also support the jsonl format, where each line corresponds to one document.
Basic example
from haystack.nodes import JsonConverter
converter = JsonConverter()
# Receive back List[Document]
docs = converter.convert("data_file.json")
Here, data_file.json contains a list of json representations of documents:
[
  {
    "content": "...",
    "content_type": "text",
    "meta": {...}
  },
  {
    "content": [["h1", "h2"], ["val1", "val2"]],
    "content_type": "table",
    "meta": {...}
  }
]
Alternatively, the data can also be jsonl.
By default, the converter will try to auto-detect between json and jsonl.
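One way the auto-detection could work is to attempt a regular json parse first and fall back to line-by-line parsing. The sketch below is only an illustration: load_json_or_jsonl is a hypothetical helper, not part of haystack, and the fallback rule is an assumption rather than a decided behaviour of this proposal.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_json_or_jsonl(file_path: Path, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Hypothetical helper: return a list of document dicts from a json or jsonl file."""
    text = Path(file_path).read_text(encoding=encoding)
    try:
        # A regular json file parses as a single value.
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: anything else is jsonl, one json object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A lone object (for example a one-line jsonl file) becomes a one-element list.
    return data if isinstance(data, list) else [data]
```

A jsonl file with more than one line is not valid json as a whole, so the first parse fails and the fallback takes over. A line that is itself malformed still raises json.JSONDecodeError from the fallback.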
The main use case would be to include this node directly in the YAML specification:
...
pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
Motivation
Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally this would happen without the need to create a custom endpoint.
For much of this processing, json is a convenient intermediate format because it can carry things like metadata alongside the content.
Specifically, one use-case that has come up for a team using haystack: they want to use a PDF parser (for tables) currently not in haystack. As such, they want to handle the parsing themselves outside of haystack, put the parsed result into a json file, and then pass it on to a haystack API endpoint.
Having a JsonConverter node would allow users to set up a haystack pipeline to ingest such data without having to create a custom node for it.
Detailed design
The converter would primarily be a wrapper around Document.from_dict.
The accepted schema would be a list of json dictionaries, each representing one Document.
That is, the following, with content being the only compulsory field.
[
  {
    "content": str or list[list],
    "content_type": str,
    "meta": dict,
    "id_hash_keys": list,
    "score": float,
    "embedding": array
  },
  ...
]
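To make the schema concrete, the following is a minimal sketch of a check that one entry of the list has the expected shape. check_doc_dict is a hypothetical name used only for illustration. The proposal itself relies on Document.from_dict and does not prescribe a separate validation step.

```python
from typing import Any, Dict


def check_doc_dict(doc_dict: Dict[str, Any]) -> None:
    """Hypothetical check: raise ValueError if an entry does not match the schema."""
    if "content" not in doc_dict:
        raise ValueError("'content' is the only compulsory field and is missing")
    content = doc_dict["content"]
    # Text documents carry a string, table documents a list of rows.
    is_text = isinstance(content, str)
    is_table = isinstance(content, list) and all(isinstance(row, list) for row in content)
    if not (is_text or is_table):
        raise ValueError("'content' must be a string or a list of lists")
    if not isinstance(doc_dict.get("meta", {}), dict):
        raise ValueError("'meta' must be a dictionary when present")
```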
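import json
from pathlib import Path
from typing import Dict, List, Optional

from haystack import Document
from haystack.nodes import BaseConverter


class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        # Fall back to the converter-level setting when none is passed.
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)

        for doc_dict in data:
            # Copy the entry and its meta so the parsed input is left untouched.
            doc_dict = dict(doc_dict)
            doc_dict["id_hash_keys"] = id_hash_keys
            doc_dict["meta"] = dict(doc_dict.get("meta", {}))
            # Metadata passed to convert() overrides per-document keys.
            if meta:
                doc_dict["meta"].update(meta)
            documents.append(Document.from_dict(doc_dict))

        return documents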
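The per-document handling in convert can be read in isolation. The sketch below mirrors it as a standalone function without the haystack dependency. prepare_doc_dict is a hypothetical name, and the precedence shown (metadata passed to the converter wins over per-document metadata) follows from the update call in the code above.

```python
from typing import Any, Dict, List, Optional


def prepare_doc_dict(
    doc_dict: Dict[str, Any],
    meta: Optional[Dict[str, str]] = None,
    id_hash_keys: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the dict that would be handed to Document.from_dict."""
    prepared = dict(doc_dict)
    prepared["id_hash_keys"] = id_hash_keys
    # Per-document meta first, then converter-level meta, which overrides on key clashes.
    prepared["meta"] = {**doc_dict.get("meta", {}), **(meta or {})}
    return prepared


entry = {"content": "hello", "meta": {"source": "parser", "lang": "en"}}
prepared = prepare_doc_dict(entry, meta={"source": "upload"})
```

Here prepared["meta"] holds "source" set to "upload" and keeps "lang" set to "en", while entry itself is unchanged.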
Drawbacks
- It would add another node that needs to be maintained and documented.
Alternatives
- This node could be created as a custom node for the particular application where it is required, but it would be better to have it available out of the box.
- Design Alternative: Also provide options to map custom fields to Document fields (e.g. {"review": "content"}), which could make this node a bit more flexible and might spare the user some pre-formatting. But this can be a future development.
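The field-mapping alternative could look roughly like the sketch below. apply_field_map and its field_map argument are hypothetical names chosen for illustration. The proposal does not define this option.

```python
from typing import Any, Dict


def apply_field_map(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: rename custom keys to Document field names."""
    # Keys missing from field_map are passed through unchanged.
    return {field_map.get(key, key): value for key, value in record.items()}


record = {"review": "Great film", "meta": {"stars": 5}}
mapped = apply_field_map(record, {"review": "content"})
```

With the mapping {"review": "content"}, the "review" value ends up under "content" and the remaining keys are left as they are.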
Adoption strategy
It doesn't introduce a breaking change and wouldn't require changes in existing pipelines.
How we teach this
It would be good to have this be part of the Guide (perhaps under File Converters).
It could also be mentioned in one of the tutorials. For instance, in the preprocessing tutorial where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".
Unresolved questions
- Should the allowed content_type be restricted (e.g. to only "text" and "table")? Relatedly, should the name be more specific, e.g. JsonTableTextConverter rather than JsonConverter? Currently leaning towards no restriction and the JsonConverter name.