- Title: Addition of a `JsonConverter` node
- Decision driver: @bglearning
- Start Date: 2023-01-26
- Proposal PR: #3959

# Summary

Right now we don't have a node that can take JSON files as input to be fed into a pipeline.

Proposal: Add a `JsonConverter` node that takes in a JSON file, parses it, and generates `Document`s.
It would also support the `jsonl` format, with one line corresponding to one document.

# Basic example

```python
from haystack.nodes import JsonConverter

converter = JsonConverter()

# Receive back List[Document]
docs = converter.convert("data_file.json")
```

With `data_file.json` being a list of JSON representations of documents:

```json
[
  {
    "content": "...",
    "content_type": "text",
    "meta": {...}
  },
  {
    "content": [["h1", "h2"], ["val1", "val2"]],
    "content_type": "table",
    "meta": {...}
  }
]
```

Alternatively, the data can also be `jsonl`.
By default, the converter will try to auto-detect between `json` and `jsonl`.
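
One way such auto-detection could work is to attempt a regular JSON parse first and fall back to line-by-line parsing. This is only a sketch of the idea, not the proposal's specified behavior; the helper name `load_json_or_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def load_json_or_jsonl(file_path: Path) -> List[Any]:
    # Hypothetical helper sketching json/jsonl auto-detection.
    text = Path(file_path).read_text(encoding="utf-8")
    try:
        # A regular json file holds a list of document dicts.
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to jsonl: one JSON document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Note that a single-line `jsonl` file is also valid JSON, so the two formats overlap; for a list of document dicts the result is the same either way.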

The main use case would be to be able to include this directly in the YAML specification:

```yaml
...
pipelines:
  - name: indexing
    nodes:
      - name: JsonConverter
        inputs: [File]
      - name: Retriever
        inputs: [JsonConverter]
      - name: DocumentStore
        inputs: [Retriever]
```

# Motivation

Users may want to do some processing of the data themselves, persist it somehow, and only then pass it on to a Haystack pipeline (for instance, by uploading it to the REST API endpoint). Ideally this would happen without the need to create a custom endpoint.

For much of this processing, JSON is a convenient intermediate format, as it allows for things like specifying the metadata.

Specifically, one use case that has come up for a team using Haystack: they want to use a PDF parser (for tables) currently not in Haystack. As such, they want to handle the parsing themselves outside of Haystack, put the parsed result into a JSON file, and then pass it on to a Haystack API endpoint.

Having a `JsonConverter` node would allow users to set up a Haystack pipeline to ingest such data without having to create a custom node for it.

# Detailed design

The converter would primarily be a wrapper around `Document.from_dict`.

The accepted schema would be a list of JSON dictionaries of `Document`s.
So, the following, with `content` being the only compulsory field:

```
[
  {
    "content": str or list[list],
    "content_type": str,
    "meta": dict,
    "id_hash_keys": list,
    "score": float,
    "embedding": array
  },
  ...
]
```

```python
class JsonConverter(BaseConverter):
    def __init__(self, ...):
        ...

    def convert(
        self,
        file_path: Path,
        meta: Optional[Dict[str, str]] = None,
        encoding: Optional[str] = "UTF-8",
        id_hash_keys: Optional[List[str]] = None,
        ...
    ) -> List[Document]:
        if id_hash_keys is None:
            id_hash_keys = self.id_hash_keys

        documents = []
        with open(file_path, encoding=encoding, errors="ignore") as f:
            data = json.load(f)
            for doc_dict in data:
                doc_dict = dict(doc_dict)
                doc_dict['id_hash_keys'] = id_hash_keys
                doc_dict['meta'] = doc_dict.get('meta', dict())

                if meta:
                    doc_dict['meta'].update(meta)

                documents.append(Document.from_dict(doc_dict))

        return documents
```

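The metadata handling in `convert` means per-document `meta` entries are kept as a base, while the file-level `meta` argument wins on conflicting keys. That merge can be isolated as a stdlib-only sketch (the helper name `merge_meta` is illustrative, not part of the proposed API):

```python
from typing import Dict, Optional


def merge_meta(
    doc_meta: Optional[Dict[str, str]], file_meta: Optional[Dict[str, str]]
) -> Dict[str, str]:
    # Per-document meta is the base; the file-level meta passed to
    # convert() overrides it on conflicting keys, mirroring dict.update().
    merged = dict(doc_meta or {})
    if file_meta:
        merged.update(file_meta)
    return merged
```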
# Drawbacks

- It would add another node that needs to be maintained and documented.

# Alternatives

- This node could be created as a custom node for the particular application where it is required. But it could be better to have it available out of the box.
- Design alternative: Also provide options to map custom fields to `Document` fields (e.g. `{"review": "content"}`), which could make this node a bit more flexible and might mean the user doesn't have to do some pre-formatting beforehand. But this can be a future development.

# Adoption strategy

It doesn't introduce a breaking change and wouldn't require changes in existing pipelines.

# How we teach this

It would be good to have this be part of the guide (perhaps under File Converters).

It could also be mentioned in one of the tutorials. For instance, in the preprocessing tutorial, where we say "Haystack expects data to be provided as a list of documents in the following dictionary format".

# Unresolved questions

- Should the allowed `content_type` values be restricted (e.g. to only "text" and "table")? And relatedly, should the name be more specific, e.g. `JsonTableTextConverter` rather than `JsonConverter`? Currently leaning towards no restriction and the `JsonConverter` name.
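
If the restriction were adopted, the check would be small; here is a hypothetical sketch (the `ALLOWED_CONTENT_TYPES` constant and `validate_content_type` helper are illustrative, not proposed API):

```python
# Hypothetical restriction, only needed if the question above is resolved
# in favor of limiting content types.
ALLOWED_CONTENT_TYPES = {"text", "table"}


def validate_content_type(doc_dict: dict) -> str:
    # Default to "text" when the field is absent, as in Document.from_dict.
    content_type = doc_dict.get("content_type", "text")
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"Unsupported content_type: {content_type!r}")
    return content_type
```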