mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-26 14:38:36 +00:00
Proposal: Add a JsonConverter node (#3959)
* Add Proposal: JsonConverter
* Add jsonl support + schema to JsonConverter Proposal
* Remove format option from JsonConverter Proposal

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
parent
508d9f6b32
commit
79f57d8460
142
proposals/text/3959-json-converter.md
Normal file
@ -0,0 +1,142 @@
- Title: Addition of a `JsonConverter` node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959

# Summary

Right now, we don't have a node that can take JSON files as input for a pipeline.

Proposal: Add a `JsonConverter` node that takes in a JSON file, parses it, and generates `Document`s.
It would also support the `jsonl` format, with one line corresponding to one document.

# Basic example

```python
from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")
```

Here, `data_file.json` contains a list of JSON representations of documents:

```json
[
    {
        "content": "...",
        "content_type": "text", "meta": {...}
    },
    {
        "content": [["h1", "h2"], ["val1", "val2"]],
        "content_type": "table", "meta": {...}
    }
]
```

Alternatively, the data can also be `jsonl`.
By default, the converter will try to auto-detect between `json` and `jsonl`.
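
The proposal does not say how auto-detection would work. One possible approach, shown here as a minimal sketch using only the standard library, is to try parsing the whole file as a single JSON value and fall back to line-by-line parsing if that fails. The helper name `load_json_or_jsonl` is hypothetical and not part of Haystack.

```python
import json
from typing import Any, Dict, List


def load_json_or_jsonl(text: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: return a list of document dicts from JSON or JSONL text."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON value: treat every non-empty line as one JSON document.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A lone object (for example a one-line JSONL file) becomes a one-element list.
    return data if isinstance(data, list) else [data]
```

With this approach a one-line `jsonl` file and a JSON file holding a single object are indistinguishable, and both yield one document.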

The main use case would be to include this node directly in the YAML specification:

```yaml
...

pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
```

# Motivation

Users may want to process the data themselves, persist it, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally, this would happen without the need to create a custom endpoint.

For much of this processing, JSON is a convenient intermediate format because it allows for things like specifying metadata.

Specifically, one use case has come up for a team using Haystack: they want to use a PDF parser (for tables) that is currently not in Haystack. They therefore want to handle the parsing themselves outside of Haystack, put the parsed result into a JSON file, and then pass it on to a Haystack API endpoint.

Having a `JsonConverter` node would allow users to set up a Haystack pipeline that ingests such data without having to create a custom node for it.

# Detailed design

The converter would primarily be a wrapper around `Document.from_dict`.

The accepted schema would be a list of JSON dictionaries, each representing a `Document`.
That is, the following, with `content` being the only compulsory field:

```
[
    {
        "content": str or list[list],
        "content_type": str,
        "meta": dict,
        "id_hash_keys": list,
        "score": float,
        "embedding": array
    },
    ...
]
```
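
The proposal does not specify whether the converter would validate entries before building documents. As an illustration only, a check for the one compulsory field could look like the following sketch. The function name `check_doc_dict` is hypothetical, and the type rule simply mirrors the `str or list[list]` annotation in the schema.

```python
from typing import Any, Dict


def check_doc_dict(doc_dict: Dict[str, Any]) -> None:
    """Hypothetical check: raise ValueError if `content` is missing or has the wrong shape."""
    if "content" not in doc_dict:
        raise ValueError("'content' is the only compulsory field and is missing")
    content = doc_dict["content"]
    # Tables are represented as a list of rows, each row being a list.
    is_table = isinstance(content, list) and all(isinstance(row, list) for row in content)
    if not (isinstance(content, str) or is_table):
        raise ValueError("'content' must be a string or a list of lists")
```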

```python
import json
from pathlib import Path
from typing import Dict, List, Optional

from haystack import Document
from haystack.nodes import BaseConverter


class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        # Fall back to the value configured on the node.
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)
            for doc_dict in data:
                doc_dict = dict(doc_dict)
                doc_dict['id_hash_keys'] = id_hash_keys
                doc_dict['meta'] = doc_dict.get('meta', dict())

                # Metadata passed to convert() overrides metadata from the file.
                if meta:
                    doc_dict['meta'].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
```
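
To make the merge behaviour of the loop above easier to see without installing Haystack, here is a standalone sketch that builds only the dictionaries that would be handed to `Document.from_dict`. The helper name `prepare_doc_dicts` is hypothetical. Unlike the proposal's loop, it copies each entry's `meta` rather than updating it in place, which is an assumption made for the sketch.

```python
from typing import Any, Dict, List, Optional


def prepare_doc_dicts(
    data: List[Dict[str, Any]],
    meta: Optional[Dict[str, str]] = None,
    id_hash_keys: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: build the dicts that would be handed to `Document.from_dict`."""
    prepared = []
    for entry in data:
        doc_dict = dict(entry)
        doc_dict["id_hash_keys"] = id_hash_keys
        # Copy the entry's meta so the loaded data is not mutated, then let
        # the call-level meta override keys that appear in both.
        doc_dict["meta"] = {**entry.get("meta", {}), **(meta or {})}
        prepared.append(doc_dict)
    return prepared


data = [{"content": "hello", "meta": {"source": "file", "lang": "en"}}, {"content": "bye"}]
prepared = prepare_doc_dicts(data, meta={"source": "upload"})
```

Keys present in both places take the value passed at call time, and entries without `meta` receive only the call-level metadata.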

# Drawbacks

- It would add another node that needs to be maintained and documented.

# Alternatives

- This node could be created as a custom node for the particular application where it is required. However, it would be better to have it available out of the box.
- Design alternative: also provide options to map custom fields to `Document` fields (e.g. `{"review": "content"}`), which could make this node a bit more flexible and might spare the user some pre-formatting. This can be a future development.
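
A field mapping of the kind mentioned in the design alternative could be sketched as a simple key rename applied before the dictionary is turned into a `Document`. The helper name `remap_fields` and the sample record are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict


def remap_fields(record: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: rename custom keys to `Document` field names."""
    return {field_map.get(key, key): value for key, value in record.items()}


record = {"review": "Great product", "meta": {"stars": 5}}
doc_dict = remap_fields(record, {"review": "content"})
```

Keys that are not listed in the mapping pass through unchanged.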

# Adoption strategy

The proposal doesn't introduce a breaking change and wouldn't require changes in existing pipelines.

# How we teach this

It would be good to have this be part of the Guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials, for instance in the preprocessing tutorial, where we say "Haystack expects data to be provided as a list documents in the following dictionary format".

# Unresolved questions

- Should the allowed `content_type` values be restricted (e.g. to only "text" and "table")? Relatedly, should the name be more specific, e.g. `JsonTableTextConverter` rather than `JsonConverter`? Currently leaning towards no restriction and the `JsonConverter` name.