kotaemon/docs/pages/app/index/file.md

The file index stores files in a local folder and index them for retrieval.
This file index provides the following infrastructure to support the indexing:

- SQL table Source: store the list of files that are indexed by the system
- Vector store: contain the embedding of segments of the files
- Document store: contain the text of segments of the files. Each text stored
  in this document store is associated with a vector in the vector store.
- SQL table Index: store the relationship between (1) the source and the
  docstore, and (2) the source and the vector store.

The indexing and retrieval pipelines are encouraged to use the above software
infrastructure.

## Indexing pipeline

The ktem has default indexing pipeline: `ktem.index.file.pipelines.IndexDocumentPipeline`.

This default pipeline works as follow:

- **Input**: list of file paths
- **Output**: list of nodes that are indexed into database
- **Process**:
  - Read files into texts. Different file types has different ways to read texts.
  - Split text files into smaller segments
  - Run each segments into embeddings.
  - Store the embeddings into vector store. Store the texts of each segment
    into docstore. Store the list of files in Source. Store the linking
    between Sources and docstore + vectorstore in Index table.

You can customize this default pipeline if your indexing process is close to the
default pipeline. You can create your own indexing pipeline if there are too
much different logic.

### Customize the default pipeline

The default pipeline provides the contact points in `flowsettings.py`.

1. `FILE_INDEX_PIPELINE_FILE_EXTRACTORS`. Supply overriding file extractor,
   based on file extension. Example: `{".pdf": "path.to.PDFReader", ".xlsx": "path.to.ExcelReader"}`
2. `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_SIZE`. The expected number of characters
   of each text segment. Example: 1024.
3. `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_OVERLAP`. The expected number of
   characters that consecutive text segments should overlap with each other.
   Example: 256.

### Create your own indexing pipeline

Your indexing pipeline will subclass `BaseFileIndexIndexing`.

You should define the following methods:

- `run(self, file_paths)`: run the indexing given the pipeline
- `get_pipeline(cls, user_settings, index_settings)`: return the
  fully-initialized pipeline, ready to be used by ktem.
  - `user_settings`: is a dictionary contains user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page, and will supply these user settings to your `get_pipeline` method.
  - `index_settings`: is a dictionary. Currently it's empty for File Index.
- `get_user_settings`: to declare user settings, return a dictionary.

By subclassing `BaseFileIndexIndexing`, You will have access to the following resources:

- `self._Source`: the source table
- `self._Index`: the index table
- `self._VS`: the vector store
- `self._DS`: the docstore

Once you have prepared your pipeline, register it in `flowsettings.py`: `FILE_INDEX_PIPELINE = "<python.path.to.your.pipeline>"`.

## Retrieval pipeline

The ktem has default retrieval pipeline:
`ktem.index.file.pipelines.DocumentRetrievalPipeline`. This pipeline works as
follow:

- Input: user text query & optionally a list of source file ids
- Output: the output segments that match the user text query
- Process:
  - If a list of source file ids is given, get the list of vector ids that
    associate with those file ids.
  - Embed the user text query.
  - Query the vector store. Provide a list of vector ids to limit query scope
    if the user restrict.
  - Return the matched text segments

### Create your own retrieval pipeline

Your retrieval pipeline will subclass `BaseFileIndexRetriever`. The retriever
has the same database, vectorstore and docstore accesses like the indexing
pipeline.

You should define the following methods:

- `run(self, query, file_ids)`: retrieve relevant documents relating to the
  query. If `file_ids` is given, you should restrict your search within these
  `file_ids`.
- `get_pipeline(cls, user_settings, index_settings, selected)`: return the
  fully-initialized pipeline, ready to be used by ktem.
  - `user_settings`: is a dictionary contains user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page, and will supply these user settings to your `get_pipeline` method.
    - `index_settings`: is a dictionary. Currently it's empty for File Index.
    - `selected`: a list of file ids selected by user. If user doesn't select
      anything, this variable will be None.
- `get_user_settings`: to declare user settings, return a dictionary.

Once you build the retrieval pipeline class, you can register it in
`flowsettings.py`: `FILE_INDEXING_RETRIEVER_PIPELIENS = ["path.to.retrieval.pipelie"]`. Because there can be
multiple parallel pipelines within an index, this variable takes a list of
string rather than a string.

## Software infrastructure

| Infra            | Access        | Schema                                                                                                                                                                                                                                                                                      | Ref                                                        |
| ---------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| SQL table Source | self.\_Source | - id (int): id of the source (auto)<br>- name (str): the name of the file<br>- path (str): the path of the file<br>- size (int): the file size in bytes<br>- note (dict): allow extra optional information about the file<br>- date_created (datetime): the time the file is created (auto) | This is SQLALchemy ORM class. Can consult                  |
| SQL table Index  | self.\_Index  | - id (int): id of the index entry (auto)<br>- source_id (int): the id of a file in the Source table<br>- target_id: the id of the segment in docstore or vector store<br>- relation_type (str): if the link is "document" or "vector"                                                       | This is SQLAlchemy ORM class                               |
| Vector store     | self.\_VS     | - self.\_VS.add: add the list of embeddings to the vector store (optionally associate metadata and ids)<br>- self.\_VS.delete: delete vector entries based on ids<br>- self.\_VS.query: get embeddings based on embeddings.                                                                 | kotaemon > storages > vectorstores > BaseVectorStore       |
| Doc store        | self.\_DS     | - self.\_DS.add: add the segments to document stores<br>- self.\_DS.get: get the segments based on id<br>- self.\_DS.get_all: get all segments<br>- self.\_DS.delete: delete segments based on id                                                                                           | kotaemon > storages > docstores > base > BaseDocumentStore |
Restructure index to allow it to be dynamically created by end-user (#151) 1. Introduce the concept of "collection_name" to docstore and vector store. Each collection can be viewed similarly to a table in a SQL database. It allows better organizing information within this data source. 2. Move the `Index` and `Source` tables from the application scope into the index scope. For each new index created by user, these tables should increase accordingly. So it depends on the index, rather than the app. 3. Make each index responsible for the UI components in the app. 4. Construct the File UI page. 2024-03-07 01:50:47 +07:00			`The file index stores files in a local folder and index them for retrieval.`
			`This file index provides the following infrastructure to support the indexing:`

			`- SQL table Source: store the list of files that are indexed by the system`
			`- Vector store: contain the embedding of segments of the files`
			`- Document store: contain the text of segments of the files. Each text stored`
			`in this document store is associated with a vector in the vector store.`
			`- SQL table Index: store the relationship between (1) the source and the`
			`docstore, and (2) the source and the vector store.`

			`The indexing and retrieval pipelines are encouraged to use the above software`
			`infrastructure.`

			`## Indexing pipeline`

			The ktem has default indexing pipeline: `ktem.index.file.pipelines.IndexDocumentPipeline`.

			`This default pipeline works as follow:`

Fix integrating indexing and retrieval pipelines to FileIndex (#155) * Add docs for settings * Add mdx_truly_sane_lists to doc requirements 2024-03-10 16:41:42 +07:00			`- Input: list of file paths`
			`- Output: list of nodes that are indexed into database`
			`- Process:`
			`- Read files into texts. Different file types has different ways to read texts.`
Restructure index to allow it to be dynamically created by end-user (#151) 1. Introduce the concept of "collection_name" to docstore and vector store. Each collection can be viewed similarly to a table in a SQL database. It allows better organizing information within this data source. 2. Move the `Index` and `Source` tables from the application scope into the index scope. For each new index created by user, these tables should increase accordingly. So it depends on the index, rather than the app. 3. Make each index responsible for the UI components in the app. 4. Construct the File UI page. 2024-03-07 01:50:47 +07:00			`- Split text files into smaller segments`
			`- Run each segments into embeddings.`
			`- Store the embeddings into vector store. Store the texts of each segment`
			`into docstore. Store the list of files in Source. Store the linking`
			`between Sources and docstore + vectorstore in Index table.`

			`You can customize this default pipeline if your indexing process is close to the`
			`default pipeline. You can create your own indexing pipeline if there are too`
			`much different logic.`

			`### Customize the default pipeline`

			The default pipeline provides the contact points in `flowsettings.py`.

			1. `FILE_INDEX_PIPELINE_FILE_EXTRACTORS`. Supply overriding file extractor,
			based on file extension. Example: `{".pdf": "path.to.PDFReader", ".xlsx": "path.to.ExcelReader"}`
			2. `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_SIZE`. The expected number of characters
			`of each text segment. Example: 1024.`
			3. `FILE_INDEX_PIPELINE_SPLITTER_CHUNK_OVERLAP`. The expected number of
			`characters that consecutive text segments should overlap with each other.`
			`Example: 256.`

			`### Create your own indexing pipeline`

			Your indexing pipeline will subclass `BaseFileIndexIndexing`.

			`You should define the following methods:`

			- `run(self, file_paths)`: run the indexing given the pipeline
			- `get_pipeline(cls, user_settings, index_settings)`: return the
			`fully-initialized pipeline, ready to be used by ktem.`
			- `user_settings`: is a dictionary contains user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page, and will supply these user settings to your `get_pipeline` method.
			- `index_settings`: is a dictionary. Currently it's empty for File Index.
Fix integrating indexing and retrieval pipelines to FileIndex (#155) * Add docs for settings * Add mdx_truly_sane_lists to doc requirements 2024-03-10 16:41:42 +07:00			- `get_user_settings`: to declare user settings, return a dictionary.
Restructure index to allow it to be dynamically created by end-user (#151) 1. Introduce the concept of "collection_name" to docstore and vector store. Each collection can be viewed similarly to a table in a SQL database. It allows better organizing information within this data source. 2. Move the `Index` and `Source` tables from the application scope into the index scope. For each new index created by user, these tables should increase accordingly. So it depends on the index, rather than the app. 3. Make each index responsible for the UI components in the app. 4. Construct the File UI page. 2024-03-07 01:50:47 +07:00
			By subclassing `BaseFileIndexIndexing`, You will have access to the following resources:

			- `self._Source`: the source table
			- `self._Index`: the index table
			- `self._VS`: the vector store
			- `self._DS`: the docstore

			Once you have prepared your pipeline, register it in `flowsettings.py`: `FILE_INDEX_PIPELINE = "<python.path.to.your.pipeline>"`.

			`## Retrieval pipeline`

			`The ktem has default retrieval pipeline:`
			`ktem.index.file.pipelines.DocumentRetrievalPipeline`. This pipeline works as
			`follow:`

			`- Input: user text query & optionally a list of source file ids`
			`- Output: the output segments that match the user text query`
			`- Process:`
			`- If a list of source file ids is given, get the list of vector ids that`
			`associate with those file ids.`
			`- Embed the user text query.`
			`- Query the vector store. Provide a list of vector ids to limit query scope`
			`if the user restrict.`
			`- Return the matched text segments`

Fix integrating indexing and retrieval pipelines to FileIndex (#155) * Add docs for settings * Add mdx_truly_sane_lists to doc requirements 2024-03-10 16:41:42 +07:00			`### Create your own retrieval pipeline`

			Your retrieval pipeline will subclass `BaseFileIndexRetriever`. The retriever
			`has the same database, vectorstore and docstore accesses like the indexing`
			`pipeline.`

			`You should define the following methods:`

			- `run(self, query, file_ids)`: retrieve relevant documents relating to the
			query. If `file_ids` is given, you should restrict your search within these
			`file_ids`.
			- `get_pipeline(cls, user_settings, index_settings, selected)`: return the
			`fully-initialized pipeline, ready to be used by ktem.`
			- `user_settings`: is a dictionary contains user settings (e.g. `{"pdf_mode": True, "num_retrieval": 5}`). You can declare these settings in the `get_user_settings` classmethod. ktem will collect these settings into the app Settings page, and will supply these user settings to your `get_pipeline` method.
			- `index_settings`: is a dictionary. Currently it's empty for File Index.
			- `selected`: a list of file ids selected by user. If user doesn't select
			`anything, this variable will be None.`
			- `get_user_settings`: to declare user settings, return a dictionary.

			`Once you build the retrieval pipeline class, you can register it in`
			`flowsettings.py`: `FILE_INDEXING_RETRIEVER_PIPELIENS = ["path.to.retrieval.pipelie"]`. Because there can be
			`multiple parallel pipelines within an index, this variable takes a list of`
			`string rather than a string.`

Restructure index to allow it to be dynamically created by end-user (#151) 1. Introduce the concept of "collection_name" to docstore and vector store. Each collection can be viewed similarly to a table in a SQL database. It allows better organizing information within this data source. 2. Move the `Index` and `Source` tables from the application scope into the index scope. For each new index created by user, these tables should increase accordingly. So it depends on the index, rather than the app. 3. Make each index responsible for the UI components in the app. 4. Construct the File UI page. 2024-03-07 01:50:47 +07:00			`## Software infrastructure`

feat: merge develop (#123) * Support hybrid vector retrieval * Enable figures and table reading in Azure DI * Retrieve with multi-modal * Fix mixing up table * Add txt loader * Add Anthropic Chat * Raising error when retrieving help file * Allow same filename for different people if private is True * Allow declaring extra LLM vendors * Show chunks on the File page * Allow elasticsearch to get more docs * Fix Cohere response (#86) * Fix Cohere response * Remove Adobe pdfservice from dependency kotaemon doesn't rely more pdfservice for its core functionality, and pdfservice uses very out-dated dependency that causes conflict. --------- Co-authored-by: trducng <trungduc1992@gmail.com> * Add confidence score (#87) * Save question answering data as a log file * Save the original information besides the rewritten info * Export Cohere relevance score as confidence score * Fix style check * Upgrade the confidence score appearance (#90) * Highlight the relevance score * Round relevance score. Get key from config instead of env * Cohere return all scores * Display relevance score for image * Remove columns and rows in Excel loader which contains all NaN (#91) * remove columns and rows which contains all NaN * back to multiple joiner options * Fix style --------- Co-authored-by: linhnguyen-cinnamon <cinmc0019@CINMC0019-LinhNguyen.local> Co-authored-by: trducng <trungduc1992@gmail.com> * Track retriever state * Bump llama-index version 0.10 * feat/save-azuredi-mhtml-to-markdown (#93) * feat/save-azuredi-mhtml-to-markdown * fix: replace os.path to pathlib change theflow.settings * refactor: base on pre-commit * chore: move the func of saving content markdown above removed_spans --------- Co-authored-by: jacky0218 <jacky0218@github.com> * fix: losing first chunk (#94) * fix: losing first chunk. * fix: update the method of preventing losing chunks --------- Co-authored-by: jacky0218 <jacky0218@github.com> * fix: adding the base64 image in markdown (#95) * feat: more chunk info on UI * fix: error when reindexing files * refactor: allow more information exception trace when using gpt4v * feat: add excel reader that treats each worksheet as a document * Persist loader information when indexing file * feat: allow hiding unneeded setting panels * feat: allow specific timezone when creating conversation * feat: add more confidence score (#96) * Allow a list of rerankers * Export llm reranking score instead of filter with boolean * Get logprobs from LLMs * Rename cohere reranking score * Call 2 rerankers at once * Run QA pipeline for each chunk to get qa_score * Display more relevance scores * Define another LLMScoring instead of editing the original one * Export logprobs instead of probs * Call LLMScoring * Get qa_score only in the final answer * feat: replace text length with token in file list * ui: show index name instead of id in the settings * feat(ai): restrict the vision temperature * fix(ui): remove the misleading message about non-retrieved evidences * feat(ui): show the reasoning name and description in the reasoning setting page * feat(ui): show version on the main windows * feat(ui): show default llm name in the setting page * fix(conf): append the result of doc in llm_scoring (#97) * fix: constraint maximum number of images * feat(ui): allow filter file by name in file list page * Fix exceeding token length error for OpenAI embeddings by chunking then averaging (#99) * Average embeddings in case the text exceeds max size * Add docstring * fix: Allow empty string when calling embedding * fix: update trulens LLM ranking score for retrieval confidence, improve citation (#98) * Round when displaying not by default * Add LLMTrulens reranking model * Use llmtrulensscoring in pipeline * fix: update UI display for trulen score --------- Co-authored-by: taprosoft <tadashi@cinnamon.is> * feat: add question decomposition & few-shot rewrite pipeline (#89) * Create few-shot query-rewriting. Run and display the result in info_panel * Fix style check * Put the functions to separate modules * Add zero-shot question decomposition * Fix fewshot rewriting * Add default few-shot examples * Fix decompose question * Fix importing rewriting pipelines * fix: update decompose logic in fullQA pipeline --------- Co-authored-by: taprosoft <tadashi@cinnamon.is> * fix: add encoding utf-8 when save temporal markdown in vectorIndex (#101) * fix: improve retrieval pipeline and relevant score display (#102) * fix: improve retrieval pipeline by extending first round top_k with multiplier * fix: minor fix * feat: improve UI default settings and add quick switch option for pipeline * fix: improve agent logics (#103) * fix: improve agent progres display * fix: update retrieval logic * fix: UI display * fix: less verbose debug log * feat: add warning message for low confidence * fix: LLM scoring enabled by default * fix: minor update logics * fix: hotfix image citation * feat: update docx loader for handle merged table cells + handle zip file upload (#104) * feat: update docx loader for handle merged table cells * feat: handle zip file * refactor: pre-commit * fix: escape text in download UI * feat: optimize vector store query db (#105) * feat: optimize vector store query db * feat: add file_id to chroma metadatas * feat: remove unnecessary logs and update migrate script * feat: iterate through file index * fix: remove unused code --------- Co-authored-by: taprosoft <tadashi@cinnamon.is> * fix: add openai embedidng exponential back-off * fix: update import download_loader * refactor: codespell * fix: update some default settings * fix: update installation instruction * fix: default chunk length in simple QA * feat: add share converstation feature and enable retrieval history (#108) * feat: add share converstation feature and enable retrieval history * fix: update share conversation UI --------- Co-authored-by: taprosoft <tadashi@cinnamon.is> * fix: allow exponential backoff for failed OCR call (#109) * fix: update default prompt when no retrieval is used * fix: create embedding for long image chunks * fix: add exception handling for additional table retriever * fix: clean conversation & file selection UI * fix: elastic search with empty doc_ids * feat: add thumbnail PDF reader for quick multimodal QA * feat: add thumbnail handling logic in indexing * fix: UI text update * fix: PDF thumb loader page number logic * feat: add quick indexing pipeline and update UI * feat: add conv name suggestion * fix: minor UI change * feat: citation in thread * fix: add conv name suggestion in regen * chore: add assets for usage doc * chore: update usage doc * feat: pdf viewer (#110) * feat: update pdfviewer * feat: update missing files * fix: update rendering logic of infor panel * fix: improve thumbnail retrieval logic * fix: update PDF evidence rendering logic * fix: remove pdfjs built dist * fix: reduce thumbnail evidence count * chore: update gitignore * fix: add js event on chat msg select * fix: update css for viewer * fix: add env var for PDFJS prebuilt * fix: move language setting to reasoning utils --------- Co-authored-by: phv2312 <kat87yb@gmail.com> Co-authored-by: trducng <trungduc1992@gmail.com> * feat: graph rag (#116) * fix: reload server when add/delete index * fix: rework indexing pipeline to be able to disable vectorstore and splitter if needed * feat: add graphRAG index with plot view * fix: update requirement for graphRAG and lighten unnecessary packages * feat: add knowledge network index (#118) * feat: add Knowledge Network index * fix: update reader mode setting for knet * fix: update init knet * fix: update collection name to index pipeline * fix: missing req --------- Co-authored-by: jeff52415 <jeff.yang@cinnamon.is> * fix: update info panel return for graphrag * fix: retriever setting graphrag * feat: local llm settings (#122) * feat: expose context length as reasoning setting to better fit local models * fix: update context length setting for agents * fix: rework threadpool llm call * fix: fix improve indexing logic * fix: fix improve UI * feat: add lancedb * fix: improve lancedb logic * feat: add lancedb vectorstore * fix: lighten requirement * fix: improve lanceDB vs * fix: improve UI * fix: openai retry * fix: update reqs * fix: update launch command * feat: update Dockerfile * feat: add plot history * fix: update default config * fix: remove verbose print * fix: update default setting * fix: update gradio plot return * fix: default gradio tmp * fix: improve lancedb docstore * fix: fix question decompose pipeline * feat: add multimodal reader in UI * fix: udpate docs * fix: update default settings & docker build * fix: update app startup * chore: update documentation * chore: update README * chore: update README --------- Co-authored-by: trducng <trungduc1992@gmail.com> * chore: update README * chore: update README --------- Co-authored-by: trducng <trungduc1992@gmail.com> Co-authored-by: cin-ace <ace@cinnamon.is> Co-authored-by: Linh Nguyen <70562198+linhnguyen-cinnamon@users.noreply.github.com> Co-authored-by: linhnguyen-cinnamon <cinmc0019@CINMC0019-LinhNguyen.local> Co-authored-by: cin-jacky <101088014+jacky0218@users.noreply.github.com> Co-authored-by: jacky0218 <jacky0218@github.com> Co-authored-by: kan_cin <kan@cinnamon.is> Co-authored-by: phv2312 <kat87yb@gmail.com> Co-authored-by: jeff52415 <jeff.yang@cinnamon.is> 2024-08-26 08:50:37 +07:00			`\| Infra \| Access \| Schema \| Ref \|`
			`\| ---------------- \| ------------- \| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- \| ---------------------------------------------------------- \|`
			`\| SQL table Source \| self.\_Source \| - id (int): id of the source (auto)<br>- name (str): the name of the file<br>- path (str): the path of the file<br>- size (int): the file size in bytes<br>- note (dict): allow extra optional information about the file<br>- date_created (datetime): the time the file is created (auto) \| This is SQLALchemy ORM class. Can consult \|`
			`\| SQL table Index \| self.\_Index \| - id (int): id of the index entry (auto)<br>- source_id (int): the id of a file in the Source table<br>- target_id: the id of the segment in docstore or vector store<br>- relation_type (str): if the link is "document" or "vector" \| This is SQLAlchemy ORM class \|`
			`\| Vector store \| self.\_VS \| - self.\_VS.add: add the list of embeddings to the vector store (optionally associate metadata and ids)<br>- self.\_VS.delete: delete vector entries based on ids<br>- self.\_VS.query: get embeddings based on embeddings. \| kotaemon > storages > vectorstores > BaseVectorStore \|`
			`\| Doc store \| self.\_DS \| - self.\_DS.add: add the segments to document stores<br>- self.\_DS.get: get the segments based on id<br>- self.\_DS.get_all: get all segments<br>- self.\_DS.delete: delete segments based on id \| kotaemon > storages > docstores > base > BaseDocumentStore \|`