- Title: Addition of a `JsonConverter` node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959

# Summary

Right now we don't have a node that can take JSON files as input to be fed into a pipeline.

Proposal: Add a `JsonConverter` node that takes in a JSON file, parses it, and generates `Document`s.
It would also support the `jsonl` format, with one line corresponding to one document.

# Basic example

```python
from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")
```

With `data_file.json` being a list of JSON representations of documents:

```json
[
  {
    "content": "...",
    "content_type": "text",
    "meta": {...}
  },
  {
    "content": [["h1", "h2"], ["val1", "val2"]],
    "content_type": "table",
    "meta": {...}
  }
]
```

Alternatively, the data can also be `jsonl`.
By default, the converter will try to auto-detect between `json` and `jsonl`.
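
One way such auto-detection could work is to attempt a regular JSON parse first and fall back to line-by-line parsing. This is only a sketch of the idea, not the proposal's specified behavior; the helper name `load_json_or_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def load_json_or_jsonl(file_path: Path) -> List[Any]:
    # Hypothetical helper sketching json/jsonl auto-detection.
    text = Path(file_path).read_text(encoding="utf-8")
    try:
        # A regular json file holds a list of document dicts.
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to jsonl: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Note that a single-line `jsonl` file is also valid JSON, so the two formats overlap; for a list of document dicts the result is the same either way.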

The main use case would be to be able to include this directly in the YAML specification:

```yaml
...
pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
```

# Motivation

Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally this would happen without the need to create a custom endpoint.

For much of this processing, JSON is a convenient intermediate format, as it allows for things like specifying the metadata.

Specifically, one use case that has come up for a team using Haystack: they want to use a PDF parser (for tables) currently not in Haystack. As such, they want to handle the parsing themselves outside of Haystack, put the parsed result into a JSON file, and then pass it on to a Haystack API endpoint.

Having a `JsonConverter` node would allow users to set up a Haystack pipeline to ingest such data without having to create a custom node for it.

# Detailed design

The converter would primarily be a wrapper around `Document.from_dict`.

The accepted schema would be a list of JSON dictionaries of `Document`s.
So, the following, with `content` being the only compulsory field:

```
[
  {
    "content": str or list[list],
    "content_type": str,
    "meta": dict,
    "id_hash_keys": list,
    "score": float,
    "embedding": array
  },
  ...
]
```

```python
class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)
            for doc_dict in data:
                doc_dict = dict(doc_dict)
                doc_dict['id_hash_keys'] = id_hash_keys
                doc_dict['meta'] = doc_dict.get('meta', dict())

                if meta:
                    doc_dict['meta'].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
```

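The metadata handling in `convert` means per-document `meta` entries are kept as a base, while the file-level `meta` argument wins on conflicting keys. That merge can be isolated as a stdlib-only sketch (the helper name `merge_meta` is illustrative, not part of the proposed API):

```python
from typing import Dict, Optional


def merge_meta(
    doc_meta: Optional[Dict[str, str]], file_meta: Optional[Dict[str, str]]
) -> Dict[str, str]:
    # Per-document meta is the base; the file-level meta passed to
    # convert() overrides it on conflicting keys, mirroring dict.update().
    merged = dict(doc_meta or {})
    if file_meta:
        merged.update(file_meta)
    return merged
```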
# Drawbacks

- It would add another node that needs to be maintained and documented.

# Alternatives

- This node could be created as a custom node for the particular application where it is required. But it could be better to have it available out of the box.
- Design alternative: Also provide options to map custom fields to `Document` fields (e.g. `{"review": "content"}`), which could make this node a bit more flexible and might mean the user doesn't have to do some pre-formatting beforehand. But this can be a future development.

# Adoption strategy

It doesn't introduce a breaking change and wouldn't require changes in existing pipelines.

# How we teach this

It would be good to have this be part of the guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials. For instance, in the preprocessing tutorial, where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".

# Unresolved questions

- Should the allowed `content_type` values be restricted (e.g. to only "text" and "table")? And relatedly, should the name be more specific, e.g. `JsonTableTextConverter` rather than `JsonConverter`? Currently leaning towards no restriction and the `JsonConverter` name.
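
If the restriction were adopted, the check would be small; here is a hypothetical sketch (the `ALLOWED_CONTENT_TYPES` constant and `validate_content_type` helper are illustrative, not proposed API):

```python
# Hypothetical restriction, only needed if the question above is resolved
# in favor of limiting content types.
ALLOWED_CONTENT_TYPES = {"text", "table"}


def validate_content_type(doc_dict: dict) -> str:
    # Default to "text" when the field is absent, as in Document.from_dict.
    content_type = doc_dict.get("content_type", "text")
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"Unsupported content_type: {content_type!r}")
    return content_type
```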