# Data & Data Structure Components

The data & data structure components include:

- The `Document` class.
- The document store.
- The vector store.

## Data Loader

- PdfLoader
- Layout-aware PdfLoader with table parsing

  - MathPixLoader: Requires a MathPix API key. Refer to the [MathPix docs](https://docs.mathpix.com/#introduction) for more information.
  - OCRLoader: Uses lib-table and a Flax pipeline to perform OCR and read table structure from PDF files (TODO: add more info about deployment of this module).
  - Output:

    - Document: text plus metadata identifying whether the content is a table or plain text

      ```
      - "source": source file name
      - "type": "table" or "text"
      - "table_origin": original table in markdown format (to be fed to an LLM or visualized using external tools)
      - "page_label": page number in the original PDF document
      ```
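
The metadata keys listed above can be used downstream to treat tables and plain text differently. The following is a minimal sketch, not the project's actual API: the `Document` dataclass is a hypothetical stand-in for the real `Document` class, `split_tables` and `llm_text` are illustrative helper names, and the sample values (file name, integer page number, `"table_origin"` being present only on table documents) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for the real Document class: text plus metadata."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_tables(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Partition loader output into (tables, texts) using the "type" key."""
    tables = [d for d in docs if d.metadata.get("type") == "table"]
    texts = [d for d in docs if d.metadata.get("type") != "table"]
    return tables, texts


def llm_text(doc: Document) -> str:
    """Prefer the markdown table for table documents, else the plain text."""
    if doc.metadata.get("type") == "table":
        return doc.metadata.get("table_origin", doc.text)
    return doc.text


# Sample loader output (values are made up for illustration).
docs = [
    Document(
        text="Revenue grew in both regions.",
        metadata={"source": "report.pdf", "type": "text", "page_label": 1},
    ),
    Document(
        text="Region Revenue North 10 South 12",
        metadata={
            "source": "report.pdf",
            "type": "table",
            "table_origin": "| Region | Revenue |\n|---|---|\n| North | 10 |\n| South | 12 |",
            "page_label": 2,
        },
    ),
]

tables, texts = split_tables(docs)
for doc in tables + texts:
    print(doc.metadata["page_label"], doc.metadata["type"])
    print(llm_text(doc))
```

Keeping the markdown form in `"table_origin"` means the table structure survives even when the extracted `text` has been flattened, which is why the sketch prefers it when building LLM input.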

## Document Store

- InMemoryDocumentStore

## Vector Store

- ChromaVectorStore
- InMemoryVectorStore