589 Commits

Author SHA1 Message Date
yangdx
3acb32f547 Add comments explaining chunk deduplication behavior in query context 2025-08-15 02:19:01 +08:00
yangdx
f733ac829c Remove debug logging statements from query context building 2025-08-14 23:44:34 +08:00
yangdx
4a19d0de25 Add chunk tracking system to monitor chunk sources and frequencies
• Track chunk sources (E/R/C types)
• Log frequency and order metadata
• Preserve chunk_id through processing
• Add debug logging for chunk tracking
• Handle rerank and truncation operations
2025-08-14 22:58:26 +08:00
yangdx
a8b7890470 Rename chunk selection functions for better clarity 2025-08-14 16:01:13 +08:00
yangdx
3343833571 Remove query params from cache key generation for keyword extraction 2025-08-14 02:36:01 +08:00
yangdx
bac09118d5 Simplify embedding func extraction 2025-08-14 01:09:18 +08:00
yangdx
ac3b5605a1 Refactor logging for relation chunk discovery with dedup info 2025-08-14 00:41:58 +08:00
yangdx
edac10906c fix: Add total_relation_chunks statistics and improve logging in _find_related_text_unit_from_relations 2025-08-13 23:45:31 +08:00
yangdx
5a40ff654e Change KG chunk selection default to VECTOR
- Set KG_CHUNK_PICK_METHOD default to VECTOR
- Update env.example with new config option
2025-08-13 23:10:42 +08:00
yangdx
f1dafa0d01 feat: KG related chunks selection by vector similarity
- Add env switch to toggle weighted polling vs vector-similarity strategy
- Implement similarity-based sorting with fallback to weighted
- Introduce batch vector read API for vector storage
- Implement vector store and retrieve function for Nanovector DB
- Preserve default behavior (weighted polling selection method)
2025-08-13 18:16:42 +08:00
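A minimal sketch of the similarity-based selection with fallback described in f1dafa0d01. The function names, the in-memory `chunk_vectors` mapping, and the use of cosine similarity are assumptions for illustration, not the project's actual code; `chunk_ids` is assumed to arrive already ordered by the weighted polling strategy.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_chunks(chunk_ids, query_vec, chunk_vectors, limit):
    """Hypothetical helper: rank chunks by similarity, else keep the given order."""
    # Fall back to the incoming (weighted) order when vectors are unavailable.
    if query_vec is None or any(cid not in chunk_vectors for cid in chunk_ids):
        return chunk_ids[:limit]
    ranked = sorted(
        chunk_ids,
        key=lambda cid: cosine(query_vec, chunk_vectors[cid]),
        reverse=True,
    )
    return ranked[:limit]


vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
selected = pick_chunks(["c1", "c2", "c3"], [0.0, 1.0], vectors, limit=2)
fallback = pick_chunks(["c1", "c2", "c3"], [0.0, 1.0], {"c1": [1.0, 0.0]}, limit=2)
```

Here `selected` prefers the chunks closest to the query vector, while `fallback` simply keeps the first two ids because some vectors are missing.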
yangdx
095e0cbfa2 Refac: Add workspace information to all logger output for all storage types 2025-08-12 01:19:09 +08:00
yangdx
cf064579ce Remove deprecated keyword extraction query methods
- Delete query_with_keywords function
- Remove kg_query_with_keywords helper
- Drop separate keyword extraction methods
2025-08-08 14:59:39 +08:00
yangdx
eded6d1187 Unify document chunks context format in only_need_context query
- Update Document Chunks label to include (DC) abbreviation
2025-08-08 00:02:53 +08:00
yangdx
0463963520 fix: include all query parameters in LLM cache hash key generation
- Add missing query parameters (top_k, enable_rerank, max_tokens, etc.) to cache key generation in kg_query, naive_query, and extract_keywords_only functions
- Add queryparam field to CacheData structure and PostgreSQL storage for debugging
- Update PostgreSQL schema with automatic migration for queryparam JSONB column
- Prevent incorrect cache hits between queries with different parameters

Fixes issue where different query parameters incorrectly shared the same cached results.
2025-08-05 18:03:10 +08:00
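The idea behind 0463963520 can be sketched as follows: every parameter that can change the answer goes into the hashed payload, so two queries that differ only in, say, `top_k` no longer collide. The function name `compute_cache_key`, the payload layout, and the choice of MD5 are assumptions for illustration only.

```python
import hashlib
import json


def compute_cache_key(mode: str, query: str, params: dict) -> str:
    """Hypothetical helper: hash the query together with its parameters."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"mode": mode, "query": query, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


key_a = compute_cache_key("hybrid", "What is LightRAG?", {"top_k": 40, "enable_rerank": True})
key_b = compute_cache_key("hybrid", "What is LightRAG?", {"top_k": 10, "enable_rerank": True})
key_c = compute_cache_key("hybrid", "What is LightRAG?", {"enable_rerank": True, "top_k": 40})
```

`key_a` and `key_b` differ because `top_k` differs; `key_a` and `key_c` match because only the parameter order changed.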
yangdx
cb75e6631e Remove quantized embedding info from LLM cache
- Delete quantize_embedding function
- Delete dequantize_embedding function
- Remove embedding fields from CacheData
- Update save_to_cache to exclude embedding data
- Clean up unused quantization-related code
2025-08-05 17:58:34 +08:00
yangdx
091f2b42c3 feat(performance): Optimize document deletion with entity/relation index
- Introduces an index mapping documents to their corresponding entities and relations. This significantly speeds up `adelete_by_doc_id` by replacing slow graph traversal with a fast key-value lookup.
- Refactors the ingestion pipeline (`merge_nodes_and_edges`) to populate this new index. Adds a one-time data migration script to backfill the index for existing data.
2025-08-03 09:19:02 +08:00
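A rough sketch of the document-to-entity/relation index from 091f2b42c3, assuming a plain in-memory mapping. The class and method names are hypothetical; the real index lives in the project's key-value storage.

```python
from collections import defaultdict


class DocEntityIndex:
    """Hypothetical index: doc_id -> entity names and relation pairs it produced."""

    def __init__(self):
        self._entities = defaultdict(set)
        self._relations = defaultdict(set)

    def record(self, doc_id, entities, relations):
        """Called during ingestion to remember what a document contributed."""
        self._entities[doc_id].update(entities)
        self._relations[doc_id].update(relations)

    def pop(self, doc_id):
        """Called on deletion: one lookup instead of a graph traversal."""
        return self._entities.pop(doc_id, set()), self._relations.pop(doc_id, set())


index = DocEntityIndex()
index.record("doc-1", ["Alice", "Bob"], [("Alice", "Bob")])
index.record("doc-2", ["Bob"], [])
entities, relations = index.pop("doc-1")
```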
yangdx
32af45ff46 refactor: improve JSON parsing reliability with json-repair library
Replace regex-based JSON extraction with json-repair for better handling of malformed LLM responses. Remove deprecated JSON parsing utilities and clean up keyword_extraction parameter across LLM providers.

- Remove locate_json_string_body_from_string() and convert_response_to_json()
- Use json-repair.loads() in extract_keywords_only() for robust parsing
- Clean up LLM interfaces and remove unused parameters
- Add json-repair dependency
2025-08-01 19:36:20 +08:00
yangdx
598eecd06d Refactor: Rename llm_model_max_token_size to summary_max_tokens
This commit renames the parameter 'llm_model_max_token_size' to 'summary_max_tokens' for better clarity, as it specifically controls the token limit for entity relation summaries.
2025-07-28 00:49:08 +08:00
yangdx
3951a44666 Revert file_path build method, built from related chunks 2025-07-27 21:56:20 +08:00
yangdx
f2d051eea5 Fix: Improve keyword extraction prompt for robust JSON output.
*   Emphasize strict JSON output in keyword extraction prompt
*   Clean up prompt examples in keyword extraction prompt
*   Log raw LLM response on JSON error
2025-07-27 21:10:47 +08:00
yangdx
99e3812c38 refactor: unify file_path handling across merge and rebuild functions
- Replace simple string concatenation with build_file_path() in:
  - _merge_edges_then_upsert
  - _rebuild_single_entity
  - _rebuild_single_relationship
- Ensures consistent deduplication, length limiting, and error handling
- Aligns with existing _merge_nodes_then_upsert implementation
2025-07-27 12:37:24 +08:00
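The behavior attributed to build_file_path() in 99e3812c38 and a943265257 (deduplication, order preservation, length limiting, skipping empty paths) might look roughly like the sketch below. The function name `build_file_path_sketch`, the `<SEP>` separator, and the policy of dropping paths once the limit is reached are assumptions, not the project's implementation.

```python
def build_file_path_sketch(paths, max_length, sep="<SEP>"):
    """Hypothetical helper: join unique, non-empty paths without exceeding max_length."""
    seen = set()
    kept = []
    total = 0
    for path in paths:
        if not path or path in seen:
            continue  # skip empty entries and duplicates, keeping first-seen order
        seen.add(path)
        extra = len(path) + (len(sep) if kept else 0)
        if total + extra > max_length:
            break  # assumed drop policy: stop once the limit would be exceeded
        kept.append(path)
        total += extra
    return sep.join(kept)


joined = build_file_path_sketch(["b.txt", "a.txt", "b.txt", ""], max_length=100)
truncated = build_file_path_sketch(["b.txt", "a.txt"], max_length=8)
```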
yangdx
7b915b34f6 Refactor: move build_file_path function from operate.py to utils.py 2025-07-26 10:52:59 +08:00
yangdx
c8c3545454 refactor: extract file path length limit to shared constant
• Add DEFAULT_MAX_FILE_PATH_LENGTH constant
• Replace hardcoded 4090 in Milvus impl
2025-07-26 10:45:03 +08:00
yangdx
a943265257 fix: preserve file path order in build_file_path function 2025-07-26 10:21:32 +08:00
yangdx
6efa8ab263 Improve file path length warning message clarity and urgency
• Change debug to warning level
• Simplify message wording
2025-07-26 10:00:18 +08:00
xuewei
56c3cb2dbe Improve build_file_path log 2025-07-26 08:38:02 +08:00
xuewei
b4da3de7d9 Improve file_path drop policy 2025-07-26 00:46:02 +08:00
yangdx
d78fda1d89 Optimize logger message 2025-07-24 04:31:06 +08:00
yangdx
3075691f72 Refactor: move reranking utilities from operate.py to utils.py
• Move apply_rerank_if_enabled to utils
• Move process_chunks_unified to utils
2025-07-24 03:33:38 +08:00
yangdx
5a5d32dc32 Optimize logger message 2025-07-24 02:13:39 +08:00
yangdx
42710221f5 Update log messages 2025-07-24 01:31:49 +08:00
yangdx
02f79508e0 Optimize context building with weighted polling and round-robin data selection 2025-07-24 01:18:21 +08:00
yangdx
7d96ca98f7 Fix linting 2025-07-23 16:16:37 +08:00
yangdx
6cc9411c86 fix: handle empty tasks list in merge_nodes_and_edges to prevent ValueError
- Add empty tasks check before calling asyncio.wait()
- Return early with logging when no entities/relationships to process
2025-07-23 16:06:47 +08:00
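The guard in 6cc9411c86 matters because `asyncio.wait()` raises `ValueError` when given an empty collection. A minimal sketch, with `wait_for_all` as a hypothetical name:

```python
import asyncio


async def wait_for_all(coros):
    """Hypothetical helper: run coroutines concurrently, tolerating an empty list."""
    tasks = [asyncio.create_task(c) for c in coros]
    if not tasks:
        # Nothing to process: return early instead of calling asyncio.wait([]).
        return []
    await asyncio.wait(tasks)
    return [t.result() for t in tasks]


async def _double(x):
    return x * 2


empty_result = asyncio.run(wait_for_all([]))
doubled = asyncio.run(wait_for_all([_double(1), _double(2)]))
```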
yangdx
2d41e5313a Remove redundant tokenizer checks 2025-07-23 10:19:45 +08:00
yangdx
ce9dac9bcf VDB no longer stores rank 2025-07-21 17:04:23 +08:00
yangdx
cb3bf3291c Fix: rename rerank parameter from top_k to top_n
The change aligns with the API parameter naming used by Jina and Cohere rerank services, ensuring consistency and clarity.
2025-07-20 00:26:27 +08:00
yangdx
7e3914052d Optimize text chunk retrieval with batch fetching
- Replace individual chunk fetches with batch get
- Simplify deduplication logic
- Improve error handling for missing data
2025-07-19 21:01:03 +08:00
xuewei
7acca59dfb Improve query for find_text_unit 2025-07-19 17:27:28 +08:00
yangdx
cba97c62fe Merge branch 'fix-memgraph' into fix-keyed-lock 2025-07-19 11:55:24 +08:00
yangdx
2d3a530ce8 Fix: Implemented entity-keyed locks for edge merging operations to ensure robust race condition protection
- Replacing string concatenation with direct list passing for lock keys
- Eliminating deadlock risks by removing the lock around node insertion within the edge merge
2025-07-19 11:48:19 +08:00
yangdx
9f5399c2f1 Replace tenacity retries with manual Memgraph transaction retries
- Implement manual retry logic
- Add exponential backoff with jitter
- Improve error handling for transient errors
2025-07-19 11:31:21 +08:00
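A manual retry loop of the kind described in 9f5399c2f1 can be sketched as below. `TransientError` is a hypothetical stand-in for whatever retryable error the database driver raises, and the delay values and "full jitter" variant are illustrative assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable database error."""


def run_with_retry(operation, max_retries=5, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the backoff schedule can be observed in tests without actually waiting.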
yangdx
6e1657a771 Improve thread safety for relationship rebuilding
- Sort src and tgt for consistent lock keys
- Maintain order-independent locking
2025-07-19 10:25:48 +08:00
yangdx
05bc5cfb64 Improve task execution with early failure detection
- Add early failure detection for async tasks
- Cancel pending tasks on first exception
2025-07-19 10:14:22 +08:00
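Early failure detection as described in 05bc5cfb64 can be sketched with `asyncio.wait(..., return_when=asyncio.FIRST_EXCEPTION)`. The helper name `run_with_early_failure` is hypothetical, and this is one plausible way to do it rather than the commit's exact code.

```python
import asyncio


async def run_with_early_failure(coros):
    """Hypothetical helper: stop on the first exception and cancel the rest."""
    tasks = [asyncio.create_task(c) for c in coros]
    if not tasks:
        return []
    # Returns as soon as any task raises, or when all tasks have finished.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    failed = next(
        (t for t in done if not t.cancelled() and t.exception() is not None), None
    )
    if failed is not None:
        for t in pending:
            t.cancel()
        if pending:
            # Let the cancellations settle before re-raising.
            await asyncio.gather(*pending, return_exceptions=True)
        raise failed.exception()
    return [t.result() for t in tasks]
```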
yangdx
12d4f12e57 fix: sort edge_key components in _locked_process_edges for consistent locking
- Ensures bidirectional relationships use same lock key
- Maintains thread safety for knowledge graph edge operations
2025-07-19 07:36:50 +08:00
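The order-independent lock keys from 12d4f12e57 and 6e1657a771 come down to sorting the two endpoints before building the key, so (A, B) and (B, A) resolve to the same lock. The sketch below uses `threading.Lock` and a module-level registry purely for illustration; the names are hypothetical and the project's keyed-lock mechanism may differ.

```python
import threading

_edge_locks = {}
_registry_guard = threading.Lock()


def edge_lock_key(src, tgt):
    """Sort the endpoints so both directions of an edge share one key."""
    return tuple(sorted((src, tgt)))


def get_edge_lock(src, tgt):
    """Hypothetical helper: return the single lock guarding this undirected edge."""
    key = edge_lock_key(src, tgt)
    with _registry_guard:
        return _edge_locks.setdefault(key, threading.Lock())


forward = get_edge_lock("Alice", "Bob")
backward = get_edge_lock("Bob", "Alice")
```

`forward` and `backward` are the same lock object, so concurrent merges of the edge in either direction serialize on it.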
yangdx
be2d938c84 Fix file path handling in graph operations
- Filter out empty file paths
- Handle missing file_path fields
2025-07-17 18:33:14 +08:00
yangdx
7184c7b3ab fix: change default edge weight from 0.0 to 1.0 in entity extraction and graph storage
- Update extract_entities function in operate.py to use 1.0 as default weight
- Fix Neo4j implementation to use 1.0 instead of 0.0 for missing edge weights
- Fix Memgraph implementation to use 1.0 instead of 0.0 for missing edge weights
- Ensures consistent non-zero default weights across all graph storage backends
2025-07-17 11:30:49 +08:00
yangdx
b1276a079f Fix linting 2025-07-15 23:57:24 +08:00
yangdx
5f7cb437e8 Centralize query parameters into LightRAG class
This commit refactors query parameter management by consolidating settings like `top_k`, token limits, and thresholds into the `LightRAG` class, and consistently sourcing parameters from a single location.
2025-07-15 23:56:49 +08:00
zrguo
3ead0489b8 Remove "rank", "weight", "keywords" 2025-07-15 21:47:33 +08:00