187 Commits

Author SHA1 Message Date
yangdx
453efeb924 Fix file path length checking to use UTF-8 byte length instead of char count 2025-08-18 13:59:27 +08:00
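A minimal sketch of the distinction this commit draws. The helper names `utf8_len` and `path_fits_limit` and the 255-byte limit are assumptions for illustration, not taken from the commit. Python's `len()` counts code points, whereas storage limits are usually enforced on encoded bytes, so a path containing CJK characters can pass a character-count check and still be too long.

```python
def utf8_len(text: str) -> int:
    """Return the length of `text` in UTF-8 encoded bytes."""
    return len(text.encode("utf-8"))


def path_fits_limit(file_path: str, max_bytes: int = 255) -> bool:
    """Hypothetical check: compare the UTF-8 byte length, not the character count."""
    return utf8_len(file_path) <= max_bytes


name = "文件.txt"
# 6 characters, but 10 bytes: each CJK character takes 3 bytes in UTF-8
print(len(name), utf8_len(name))
```

With this check, a string of 100 CJK characters (300 bytes) is rejected even though its character count is well under 255.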
yangdx
14e083a1a6 fix: replace pyuca with pypinyin for Chinese pinyin sorting and add file_path sort 2025-08-17 15:21:24 +08:00
yangdx
61469c0a56 Add Chinese pinyin sorting support across document operations
• Replace pyuca with centralized utils function
• Add pinyin sort keys for file paths
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sorting order controlled by the database collation order
2025-08-17 12:45:48 +08:00
yangdx
4a19d0de25 Add chunk tracking system to monitor chunk sources and frequencies
• Track chunk sources (E/R/C types)
• Log frequency and order metadata
• Preserve chunk_id through processing
• Add debug logging for chunk tracking
• Handle rerank and truncation operations
2025-08-14 22:58:26 +08:00
yangdx
a8b7890470 Rename chunk selection functions for better clarity 2025-08-14 16:01:13 +08:00
yangdx
a11e8d77eb Improve missing-vector warning logic in vector similarity
- Check for any missing vectors
- Separate no-vector vs partial-vector warnings
- Ensure early return on empty vectors
2025-08-14 14:24:15 +08:00
yangdx
2e5487305e Merge branch 'main' into pick-trunk-by-vector 2025-08-14 03:12:38 +08:00
yangdx
7fb11193b0 Fix linting 2025-08-14 03:07:29 +08:00
yangdx
331dcf0509 Remove query params from cache key generation for keyword extraction 2025-08-14 02:57:39 +08:00
yangdx
3343833571 Remove query params from cache key generation for keyword extraction 2025-08-14 02:36:01 +08:00
yangdx
f1dafa0d01 feat: KG related chunks selection by vector similarity
- Add env switch to toggle weighted polling vs vector-similarity strategy
- Implement similarity-based sorting with fallback to weighted
- Introduce batch vector read API for vector storage
- Implement vector store and retrieve functions for Nanovector DB
- Preserve default behavior (weighted polling selection method)
2025-08-13 18:16:42 +08:00
zrguo
f1c7233763 Avoid UTF-8 BOM 2025-08-12 17:06:54 +08:00
yangdx
0463963520 fix: include all query parameters in LLM cache hash key generation
- Add missing query parameters (top_k, enable_rerank, max_tokens, etc.) to cache key generation in kg_query, naive_query, and extract_keywords_only functions
- Add queryparam field to CacheData structure and PostgreSQL storage for debugging
- Update PostgreSQL schema with automatic migration for queryparam JSONB column
- Prevent incorrect cache hits between queries with different parameters

Fixes issue where different query parameters incorrectly shared the same cached results.
2025-08-05 18:03:10 +08:00
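The idea behind this fix can be sketched as follows. The function name `compute_cache_key`, the payload layout, and the use of MD5 are assumptions for illustration; the commit does not show the project's actual hashing code. The point is only that every parameter that can change the answer has to feed into the key.

```python
import hashlib
import json


def compute_cache_key(query: str, mode: str, **params) -> str:
    """Hypothetical helper: hash the query together with every query parameter."""
    payload = {"query": query, "mode": mode, "params": params}
    # sort_keys makes the serialization independent of keyword-argument order
    serialized = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


key_a = compute_cache_key("what is RAG?", "hybrid", top_k=10, enable_rerank=True)
key_b = compute_cache_key("what is RAG?", "hybrid", top_k=40, enable_rerank=True)
print(key_a != key_b)  # different top_k values no longer share a cache entry
```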
yangdx
cb75e6631e Remove quantized embedding info from LLM cache
- Delete quantize_embedding function
- Delete dequantize_embedding function
- Remove embedding fields from CacheData
- Update save_to_cache to exclude embedding data
- Clean up unused quantization-related code
2025-08-05 17:58:34 +08:00
yangdx
32af45ff46 refactor: improve JSON parsing reliability with json-repair library
Replace regex-based JSON extraction with json-repair for better handling of malformed LLM responses. Remove deprecated JSON parsing utilities and clean up keyword_extraction parameter across LLM providers.

- Remove locate_json_string_body_from_string() and convert_response_to_json()
- Use json_repair.loads() in extract_keywords_only() for robust parsing
- Clean up LLM interfaces and remove unused parameters
- Add json-repair dependency
2025-08-01 19:36:20 +08:00
yangdx
2af8a93dc7 fix: resolve _sort_key error in Redis get_docs_paginated function 2025-07-31 02:16:56 +08:00
yangdx
d0bc5e7c4a Extend path filter to also cover POST requests 2025-07-31 02:06:56 +08:00
yangdx
3e5efd0b27 Add /documents/paginated to filtered logging paths 2025-07-31 02:00:00 +08:00
yangdx
6014b9bf73 feat: add track_id support for document processing progress monitoring
- Add get_docs_by_track_id() method to all storage backends (MongoDB, PostgreSQL, Redis, JSON)
- Implement automatic track_id generation with upload_/insert_ prefixes
- Add /track_status/{track_id} API endpoint for frontend progress queries
- Create database indexes for efficient track_id lookups
- Enable real-time document processing status tracking across all storage types
2025-07-29 22:24:21 +08:00
yangdx
9923821d75 refactor: Remove deprecated max_token_size from embedding configuration
This parameter is no longer used. Its removal simplifies the API and clarifies that token length management is handled by upstream text chunking logic rather than the embedding wrapper.
2025-07-29 10:49:35 +08:00
yangdx
e09929b42e Refine rerank filtering log message for clarity 2025-07-27 16:57:38 +08:00
yangdx
f4bca7bfb2 Fix linting 2025-07-27 16:50:45 +08:00
yangdx
a9565d7379 feat: Skip rerank filtering when min_rerank_score is 0.0 2025-07-27 16:50:12 +08:00
yangdx
ebaff228aa feat: Add rerank score filtering with configurable threshold
- Add DEFAULT_MIN_RERANK_SCORE constant (default: 0.0)
- Add MIN_RERANK_SCORE environment variable support
- Filter chunks with rerank scores below threshold in process_chunks_unified
- Add info-level logging for filtering operations
- Handle empty results gracefully after filtering
- Maintain backward compatibility with non-reranked chunks
2025-07-27 16:37:44 +08:00
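A minimal sketch of threshold filtering as described in this commit, combined with the skip-at-0.0 behaviour from a9565d7379. The function name `filter_by_rerank_score` and the `rerank_score` field name are assumptions; the real logic lives in process_chunks_unified and may differ.

```python
def filter_by_rerank_score(chunks: list[dict], min_score: float = 0.0) -> list[dict]:
    """Drop chunks whose rerank score is below min_score (hypothetical helper)."""
    if min_score <= 0.0:
        # A threshold of 0.0 means filtering is disabled
        return list(chunks)
    return [
        chunk
        for chunk in chunks
        # Chunks that were never reranked carry no score and are kept as-is
        if "rerank_score" not in chunk or chunk["rerank_score"] >= min_score
    ]


chunks = [
    {"content": "a", "rerank_score": 0.9},
    {"content": "b", "rerank_score": 0.2},
    {"content": "c"},  # not reranked
]
kept = filter_by_rerank_score(chunks, min_score=0.5)
print([c["content"] for c in kept])  # ['a', 'c']
```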
yangdx
a67f93acc9 Replace hardcoded max tokens with DEFAULT_MAX_TOTAL_TOKENS constant
- Use constant in process_chunks_unified
- Update WebUI default to match (32000)
2025-07-26 11:23:54 +08:00
yangdx
7b915b34f6 Refactor: move build_file_path function from operate.py to utils.py 2025-07-26 10:52:59 +08:00
yangdx
d78fda1d89 Optimize logger message 2025-07-24 04:31:06 +08:00
yangdx
d97913873b Update logger message 2025-07-24 03:44:02 +08:00
yangdx
3075691f72 Refactor: move reranking utilities from operate.py to utils.py
• Move apply_rerank_if_enabled to utils
• Move process_chunks_unified to utils
2025-07-24 03:33:38 +08:00
yangdx
5a5d32dc32 Optimize logger message 2025-07-24 02:13:39 +08:00
yangdx
02f79508e0 Optimize context building with weighted polling and round-robin data selection 2025-07-24 01:18:21 +08:00
zrguo
1541034816 Add DEFAULT_RELATED_CHUNK_NUMBER 2025-07-15 21:35:12 +08:00
SLKun
5f330ec11a Remove <think> tags for entity and keyword extraction 2025-07-08 14:59:15 +08:00
yangdx
e56734cb8b Refactor: Optimize document deletion performance
- Add chunks_list to doc_status
- Add llm_cache_list to text_chunks
- Implemented storage types: JsonKV and Redis
2025-07-03 04:18:25 +08:00
yangdx
271722405f feat: Flatten LLM cache structure for improved recall efficiency
Refactored the LLM cache to a flat Key-Value (KV) structure, replacing the previous nested format. The old structure used the 'mode' as a key and stored specific cache content as JSON nested under it. This change significantly enhances cache recall efficiency.
2025-07-02 16:11:53 +08:00
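To illustrate the structural change, here is a sketch that converts a nested cache (mode, then entry id) into a flat key-value mapping. The `mode:id` key format and the function name `flatten_cache` are assumptions for illustration; the commit message does not specify the actual key layout.

```python
def flatten_cache(nested: dict[str, dict[str, dict]]) -> dict[str, dict]:
    """Turn {mode: {cache_id: content}} into {"mode:cache_id": content}."""
    flat = {}
    for mode, entries in nested.items():
        for cache_id, content in entries.items():
            # One lookup by composite key replaces loading the whole mode bucket
            flat[f"{mode}:{cache_id}"] = content
    return flat


old = {"hybrid": {"abc123": {"return": "answer 1"}}, "naive": {"def456": {"return": "answer 2"}}}
new = flatten_cache(old)
print(new["hybrid:abc123"]["return"])  # answer 1
```

With a flat layout, reading one entry no longer requires fetching and parsing every entry stored under the same mode.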
zrguo
ead82a8dbd update delete_by_doc_id 2025-06-09 18:52:34 +08:00
yangdx
38b862e993 Remove unused functions 2025-05-18 07:16:52 +08:00
sa9arr
36b606d0db Fix: Correct GraphML to JSON mapping in xml_to_json function 2025-05-17 19:32:25 +05:45
yangdx
2845e268e4 Ensure priority_limit_async_func_call decorator receives a callable 2025-05-13 02:00:01 +08:00
yangdx
4d57370c94 Refactor: Move get_env_value from api.config to utils
Relocates the `get_env_value` utility function
from `lightrag.api.config` to `lightrag.utils` to decouple
LightRAG core from API Server
2025-05-10 08:58:18 +08:00
yangdx
3eb3b170ab Remove list_of_list_to_dict function 2025-05-07 18:01:23 +08:00
yangdx
156244e260 Refactor: Unify naive context to JSON format
- Merges 'mix' mode query handling into 'hybrid' mode, simplifying query logic by removing the dedicated `mix_kg_vector_query` function
- Standardizes vector search result by using JSON string format to build context
- Fixes a bug in `query_with_keywords` ensuring `hl_keywords` and `ll_keywords` are correctly passed to `kg_query_with_keywords`
2025-05-07 17:42:14 +08:00
yangdx
3146309fde Change function name from list_of_list_to_json to list_of_list_to_dict 2025-05-07 10:52:26 +08:00
yangdx
dbfcf30801 Fix linting 2025-05-06 22:03:40 +08:00
yangdx
c8ecfa2d68 feat: Centralize configuration and update defaults
This commit introduces `lightrag/constants.py` to centralize default values for various configurations across the API and core components.

Key changes:
- Added `constants.py` to centralize default values
- Improved the `get_env_value` function in `api/config.py` to correctly handle string "None" as a None value and to catch `TypeError` during value conversion.
- Updated the default `SUMMARY_LANGUAGE` to "English"
- Set default `WORKERS` to 2
2025-05-06 22:00:43 +08:00
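The two behaviours named above (treating the string "None" as a None value and catching `TypeError` during conversion) can be sketched like this. The function below is an illustrative stand-in named `get_env_value_sketch`, not the project's `get_env_value`; its signature and the `DEMO_` variable names are assumptions.

```python
import os


def get_env_value_sketch(name: str, default, value_type=str):
    """Read an environment variable, falling back to `default` on bad input."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    if raw.strip() == "None":
        # The literal string "None" is interpreted as an explicit None
        return None
    try:
        return value_type(raw)
    except (ValueError, TypeError):
        # Conversion failed: fall back rather than crash at startup
        return default


os.environ["DEMO_WORKERS"] = "4"
os.environ["DEMO_LIMIT"] = "None"
print(get_env_value_sketch("DEMO_WORKERS", 2, int))  # 4
print(get_env_value_sketch("DEMO_LIMIT", 10, int))   # None
```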
yangdx
a36abce8d6 Update comments 2025-05-05 11:26:31 +08:00
yangdx
62fd4a0540 Optimize log messages 2025-04-30 13:53:03 +08:00
yangdx
81953e6d46 Enhance the robustness of concurrency control and scheduling logic 2025-04-29 13:38:11 +08:00
yangdx
1afcbcbfb5 Fix race condition for health_check and ensure_workers 2025-04-29 00:08:52 +08:00
yangdx
1fc26127d5 Fix linting 2025-04-28 23:21:34 +08:00