552 Commits

Author SHA1 Message Date
yangdx
0e67ead8fa Rename MAX_TOKENS to SUMMARY_MAX_TOKENS for clarity 2025-08-21 10:15:20 +08:00
yangdx
9b7ed84e05 Improve document deletion error handling and message consistency
- Standardize deletion log messages
- Add try-catch for file operations
- Improve enqueued file error handling
2025-08-20 11:01:24 +08:00
yangdx
485c4b7de7 Change document deletion warnings to info level logging 2025-08-20 03:28:42 +08:00
yangdx
806081645f Refactor text cleaning to use sanitize_text_for_encoding consistently
• Replace clean_text with sanitize_text
• Remove deprecated clean_text function
• Add whitespace trimming to sanitizer
• Improve UTF-8 encoding safety
• Consolidate text cleaning logic
2025-08-19 19:20:01 +08:00
yangdx
e38df464ea Ensure front-end file type uploads are synchronized with back-end 2025-08-19 15:10:13 +08:00
yangdx
1c4d6fde58 Change log level from info to debug for document storage message 2025-08-18 20:04:29 +08:00
yangdx
377f1a022e fix: reset PROCESSING/FAILED docs to PENDING at the beginning of the document processing pipeline
- Reset documents with PROCESSING/FAILED status to PENDING when they pass consistency checks
- Update doc_status storage and clear error messages/metadata on reset
2025-08-18 00:49:52 +08:00
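The reset described in this commit can be sketched roughly as below. This is a minimal illustration, not the project's code: the plain dicts standing in for the doc_status and full_docs storages, the `reset_for_retry` helper, and the record field names are all assumptions made for the example.

```python
def reset_for_retry(doc_status: dict, full_docs: dict) -> list:
    """Reset PROCESSING/FAILED records to PENDING when their content still exists.

    Both arguments are plain dicts keyed by document id (hypothetical stand-ins
    for the real storages). Returns the ids that were reset.
    """
    reset_ids = []
    for doc_id, record in doc_status.items():
        passes_check = doc_id in full_docs  # simplified consistency check
        if record["status"] in ("PROCESSING", "FAILED") and passes_check:
            record["status"] = "PENDING"
            record["error_msg"] = ""   # clear the stale error message
            record["metadata"] = {}    # clear stale metadata
            reset_ids.append(doc_id)
    return reset_ids


doc_status = {
    "d1": {"status": "FAILED", "error_msg": "boom", "metadata": {"start_time": 1}},
    "d2": {"status": "PROCESSING", "error_msg": "", "metadata": {}},
    "d3": {"status": "PROCESSED", "error_msg": "", "metadata": {}},
    "d4": {"status": "FAILED", "error_msg": "no content", "metadata": {}},
}
full_docs = {"d1": "text", "d2": "text", "d3": "text"}  # d4 has no content
reset_ids = reset_for_retry(doc_status, full_docs)
```

In this sketch `d4` keeps its FAILED status because it does not pass the simplified check.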
yangdx
add8b07a21 Improve logging messages for document processing clarity 2025-08-18 00:22:04 +08:00
yangdx
1941df9cf6 Simplify warning message format for document deletion 2025-08-17 13:30:55 +08:00
yangdx
3e4214cef3 Standardize document deletion warning messages for consistency 2025-08-17 09:35:46 +08:00
yangdx
cceb46b320 fix: subdirectories are no longer processed during file scans
• Change rglob to glob for file scanning
• Simplify error logging messages
2025-08-16 23:46:33 +08:00
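The behavioral difference between the two `pathlib` calls named in this commit can be seen in a small sketch. The helper names, the file pattern, and the temporary directory layout are invented for illustration and are not taken from the project.

```python
import tempfile
from pathlib import Path


def scan_top_level(root: Path, pattern: str = "*.txt") -> list:
    """Path.glob with a plain pattern matches direct children only."""
    return sorted(p.name for p in root.glob(pattern) if p.is_file())


def scan_recursive(root: Path, pattern: str = "*.txt") -> list:
    """Path.rglob also descends into every subdirectory."""
    return sorted(p.name for p in root.rglob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("top level")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("nested")
    top_level = scan_top_level(root)   # only the file directly under root
    recursive = scan_recursive(root)   # includes the file in sub/
```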
yangdx
f5b0c3d38c feat: Record file extraction error status in the document pipeline
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
  * File access errors (permissions, not found)
  * UTF-8 encoding errors
  * Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
  * Content validation errors
  * Unsupported file type errors

This implementation ensures all file extraction failures are properly tracked and recorded in the doc_status storage system, providing better visibility into document processing issues and enabling improved error monitoring and debugging capabilities.
2025-08-16 23:08:52 +08:00
yangdx
ca4c18baaa Preserve failed documents during data consistency validation for manual review 2025-08-16 22:29:46 +08:00
yangdx
e1310c5262 Optimize document processing pipeline by removing duplicate step 2025-08-16 17:23:01 +08:00
yangdx
5591ef3ac8 Fix document filtering logic and improve logging for ignored docs 2025-08-16 17:22:08 +08:00
yangdx
5c7ae8721b Merge branch 'main' into pick-trunk-by-vector 2025-08-14 13:11:14 +08:00
yangdx
3bba5fc506 Fix linting 2025-08-14 13:03:23 +08:00
yangdx
65a4437f78 Fix: Persist document data immediately after index update 2025-08-14 12:33:36 +08:00
yangdx
28fc075c59 Simplify inconsistency logging and cleanup messages 2025-08-14 11:49:58 +08:00
yangdx
17faeb2fb8 refactor: integrate document consistency validation into pipeline processing
This ensures data consistency validation is part of the main processing pipeline and provides better monitoring of inconsistent document cleanup operations.
2025-08-14 11:38:36 +08:00
yangdx
a3f7bc5b7e Merge branch 'main' into pick-trunk-by-vector 2025-08-14 06:19:57 +08:00
yangdx
b5ae84fac6 fix: Add data consistency validation to document processing pipeline
- Add _validate_and_fix_document_consistency() method to detect and fix documents with missing content in full_docs storage
- Integrate consistency check into apipeline_process_enqueue_documents() to automatically mark inconsistent documents as FAILED before processing
- Prevent processing errors caused by documents having status records but missing actual content data
2025-08-14 06:18:34 +08:00
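The kind of check this commit describes might look like the sketch below. It assumes dict-based stand-ins for the doc_status and full_docs storages; the function name `mark_inconsistent_docs` and the error text are hypothetical, and the real `_validate_and_fix_document_consistency()` method may differ.

```python
def mark_inconsistent_docs(doc_status: dict, full_docs: dict) -> list:
    """Mark documents that have a status record but no stored content as FAILED.

    Returns the ids of the documents that were marked.
    """
    inconsistent = []
    for doc_id, record in doc_status.items():
        if doc_id not in full_docs:
            record["status"] = "FAILED"
            record["error_msg"] = "content missing from full_docs"
            inconsistent.append(doc_id)
    return inconsistent


status_records = {
    "doc-1": {"status": "PENDING", "error_msg": ""},
    "doc-2": {"status": "PENDING", "error_msg": ""},
}
stored_content = {"doc-1": "some text"}  # doc-2 has a status record only
marked = mark_inconsistent_docs(status_records, stored_content)
```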
yangdx
f1dafa0d01 feat: KG related chunks selection by vector similarity
- Add env switch to toggle weighted polling vs vector-similarity strategy
- Implement similarity-based sorting with fallback to weighted
- Introduce batch vector read API for vector storage
- Implement vector store and retrieve function for Nanovector DB
- Preserve default behavior (weighted polling selection method)
2025-08-13 18:16:42 +08:00
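A similarity-based selection with a fallback could be sketched as follows. This is an assumption-laden illustration: the `pick_chunks` function, its parameters, and the use of the incoming order as a stand-in for the weighted polling fallback are not taken from the project.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_chunks(query_vec, chunk_ids, chunk_vectors, top_k):
    """Rank chunks by similarity to the query; fall back when vectors are unavailable."""
    missing = any(cid not in chunk_vectors for cid in chunk_ids)
    if query_vec is None or missing:
        # Fallback: keep the incoming order (stand-in for the weighted strategy).
        return chunk_ids[:top_k]
    ranked = sorted(
        chunk_ids,
        key=lambda cid: cosine(query_vec, chunk_vectors[cid]),
        reverse=True,
    )
    return ranked[:top_k]


vectors = {"c1": [0.0, 1.0], "c2": [1.0, 0.0], "c3": [1.0, 1.0]}
by_similarity = pick_chunks([1.0, 0.0], ["c1", "c2", "c3"], vectors, top_k=2)
by_fallback = pick_chunks([1.0, 0.0], ["c1", "c2", "c3"], {"c1": [0.0, 1.0]}, top_k=2)
```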
yangdx
0b2c3d06c7 Remove redundant collection listing check 2025-08-12 15:24:06 +08:00
yangdx
fc8ca1a706 Fix: add multi-process lock for initialize and drop methods for all storage 2025-08-12 04:25:09 +08:00
yangdx
44204abef7 Fix linting 2025-08-10 10:59:32 +08:00
yangdx
eb2320e556 Fix: Initialize first_stage_tasks and entity_relation_task to prevent empty-task cancel errors
- Initialize first_stage_tasks = [] and entity_relation_task = None at coroutine start
- Ensure cancel block safely handles no-op when tasks lists are empty
2025-08-10 10:45:41 +08:00
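The pattern behind this fix can be sketched with plain `asyncio`. Only the two variable names come from the commit message; the coroutine, its `fail_early` flag, and the placeholder work are hypothetical.

```python
import asyncio


async def process(fail_early: bool) -> str:
    # Bind both names up front so the cleanup block never hits an unbound variable.
    first_stage_tasks = []
    entity_relation_task = None
    try:
        if fail_early:
            raise RuntimeError("failed before any task was created")
        first_stage_tasks = [asyncio.create_task(asyncio.sleep(0)) for _ in range(2)]
        await asyncio.gather(*first_stage_tasks)
        entity_relation_task = asyncio.create_task(asyncio.sleep(0))
        await entity_relation_task
        return "done"
    except RuntimeError:
        # With an empty list and a None task, this cleanup is a safe no-op.
        for task in first_stage_tasks:
            if not task.done():
                task.cancel()
        if entity_relation_task is not None and not entity_relation_task.done():
            entity_relation_task.cancel()
        return "cancelled cleanly"


early = asyncio.run(process(fail_early=True))
normal = asyncio.run(process(fail_early=False))
```

Without the two initial assignments, the early failure path would raise `UnboundLocalError` inside the `except` block instead of cleaning up.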
yangdx
cf064579ce Remove deprecated keyword extraction query methods
- Delete query_with_keywords function
- Remove kg_query_with_keywords helper
- Drop separate keyword extraction methods
2025-08-08 14:59:39 +08:00
yangdx
c22315ea6d refactor: remove selective LLM cache clearing functionality
- Remove optional 'modes' parameter from aclear_cache() and clear_cache() methods
- Replace deprecated drop_cache_by_modes() with drop() method for complete cache clearing
- Update API endpoint to ignore mode-specific parameters and clear all cache
- Simplify frontend clearCache() function to send empty request body

This change ensures all LLM cache is cleared together.
2025-08-05 23:51:51 +08:00
yangdx
01bce8c26e feat: add warning logs for deleting non-completed documents 2025-08-05 12:21:08 +08:00
yangdx
63496698a1 Fix: ensure data migration is handled by a single process
- Wrap migration logic with get_data_init_lock() to ensure single-process execution
- Prevent race conditions when multiple processes start simultaneously
2025-08-04 01:47:20 +08:00
yangdx
bf9a6d699b Fix(lightrag): Handle undirected edges in data migration
The `_migrate_entity_relation_data` function previously processed directed edges from `get_all_edges`, which could lead to duplicates (e.g., (A,B) and (B,A)) and an incorrect relation count.

This commit normalizes edges by sorting their source and target nodes before adding them to the relation set. This ensures all edges are treated as undirected and are properly deduplicated.
2025-08-03 22:14:24 +08:00
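The normalization described above amounts to sorting each edge's endpoints before inserting it into a set. A minimal sketch, with a hypothetical `dedupe_undirected` helper and made-up node names:

```python
def dedupe_undirected(edges):
    """Collapse (A, B) and (B, A) into one undirected relation."""
    relations = set()
    for source, target in edges:
        # Sorting the endpoints gives every undirected edge one canonical form.
        relations.add(tuple(sorted((source, target))))
    return relations


directed = [("A", "B"), ("B", "A"), ("B", "C")]
relations = dedupe_undirected(directed)
relation_count = len(relations)
```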
yangdx
e8d8afa846 Removed auto storage management from LightRAG instance creation
- The `initialize_storages` method must be explicitly called after LightRAG creation.
- The `finalize_storages` method should be called before the LightRAG instance is destroyed.
- Added explicit data migration check
2025-08-03 12:42:57 +08:00
yangdx
06efab4af2 Revert "Remove auto_manage_storages_states option"
This reverts commit bfe6657b316f7e50bc9c5f0cc71d9fbb2b605ddd.
2025-08-03 12:12:13 +08:00
yangdx
bfe6657b31 Remove auto_manage_storages_states option
- Always manage storage states by LightRAG
- Remove rag.initialize_storages() from all examples
2025-08-03 10:29:36 +08:00
yangdx
091f2b42c3 feat(performance): Optimize document deletion with entity/relation index
- Introduces an index mapping documents to their corresponding entities and relations. This significantly speeds up `adelete_by_doc_id` by replacing slow graph traversal with a fast key-value lookup.
- Refactors the ingestion pipeline (`merge_nodes_and_edges`) to populate this new index. Adds a one-time data migration script to backfill the index for existing data.
2025-08-03 09:19:02 +08:00
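The idea of replacing graph traversal with a key-value lookup can be illustrated with an in-memory index. The `DocEntityIndex` class and its methods are invented for this sketch and do not reflect the project's storage layout.

```python
from collections import defaultdict


class DocEntityIndex:
    """Maps each document id to the entities and relations extracted from it."""

    def __init__(self):
        self._entities = defaultdict(set)
        self._relations = defaultdict(set)

    def record(self, doc_id, entities, relations):
        """Called during ingestion to populate the index."""
        self._entities[doc_id].update(entities)
        self._relations[doc_id].update(relations)

    def pop(self, doc_id):
        """Return and remove everything tied to a document in one lookup."""
        return self._entities.pop(doc_id, set()), self._relations.pop(doc_id, set())


index = DocEntityIndex()
index.record("doc-1", {"Alice", "Bob"}, {("Alice", "Bob")})
index.record("doc-2", {"Carol"}, set())
entities, relations = index.pop("doc-1")   # no graph traversal needed
remaining = index.pop("doc-1")             # already removed: two empty sets
```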
yangdx
32af45ff46 refactor: improve JSON parsing reliability with json-repair library
Replace regex-based JSON extraction with json-repair for better handling of malformed LLM responses. Remove deprecated JSON parsing utilities and clean up keyword_extraction parameter across LLM providers.

- Remove locate_json_string_body_from_string() and convert_response_to_json()
- Use json-repair.loads() in extract_keywords_only() for robust parsing
- Clean up LLM interfaces and remove unused parameters
- Add json-repair dependency
2025-08-01 19:36:20 +08:00
yangdx
8271e1f6f1 Move OllamaServerInfos class to base module
- Eliminate dependency of the core module on the API module.
2025-07-31 23:24:49 +08:00
yangdx
9d5603d35e Set the default LLM temperature to 1.0 and centralize constant management 2025-07-31 17:15:10 +08:00
yangdx
c7bc4fc42c Add track_id return to document processing pipeline 2025-07-30 10:27:12 +08:00
yangdx
cbaede8455 Add ScanResponse type for scan endpoint in webui 2025-07-30 03:11:09 +08:00
yangdx
7207598fc4 Fix track_id bugs and add track_id to scanning response 2025-07-30 03:06:20 +08:00
yangdx
93afa7d8a7 feat: add processing time tracking to document status with metadata field
- Add metadata field to DocProcessingStatus with start_time and end_time tracking
- Record processing timestamps using Unix time format (seconds precision)
- Update all storage backends (JSON, MongoDB, Redis, PostgreSQL) for new field support
- Maintain backward compatibility with default values for existing data
- Add error_msg field for better error tracking during document processing
2025-07-29 23:42:33 +08:00
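A status record with the new fields could be sketched as a dataclass. The class name `DocStatusSketch`, the `run_with_timing` helper, and the status strings are assumptions for the example; only the `metadata`, `start_time`, `end_time`, and `error_msg` names come from the commit message.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocStatusSketch:
    status: str
    error_msg: Optional[str] = None
    # default_factory keeps records written before the field existed loadable
    metadata: Dict[str, Any] = field(default_factory=dict)


def run_with_timing(doc: DocStatusSketch, work) -> DocStatusSketch:
    """Run `work` and record Unix timestamps with seconds precision."""
    doc.metadata["start_time"] = int(time.time())
    try:
        work()
        doc.status = "PROCESSED"
    except Exception as exc:
        doc.status = "FAILED"
        doc.error_msg = str(exc)
    finally:
        doc.metadata["end_time"] = int(time.time())
    return doc


def _failing_work():
    raise ValueError("bad file")


legacy = DocStatusSketch(status="PROCESSED")  # old record without metadata
ok = run_with_timing(DocStatusSketch(status="PENDING"), lambda: None)
failed = run_with_timing(DocStatusSketch(status="PENDING"), _failing_work)
```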
yangdx
6014b9bf73 feat: add track_id support for document processing progress monitoring
- Add get_docs_by_track_id() method to all storage backends (MongoDB, PostgreSQL, Redis, JSON)
- Implement automatic track_id generation with upload_/insert_ prefixes
- Add /track_status/{track_id} API endpoint for frontend progress queries
- Create database indexes for efficient track_id lookups
- Enable real-time document processing status tracking across all storage types
2025-07-29 22:24:21 +08:00
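A rough sketch of prefixed track ids and a lookup by track id follows. The id format (prefix plus a short random hex suffix) is an assumption, since the commit message only states the `upload_`/`insert_` prefixes; the dict-based storage and both function bodies are likewise hypothetical.

```python
import uuid


def generate_track_id(prefix: str) -> str:
    """Build an id such as 'upload_<hex>'; the suffix format is an assumption."""
    if prefix not in ("upload", "insert"):
        raise ValueError(f"unsupported prefix: {prefix}")
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


def get_docs_by_track_id(doc_status: dict, track_id: str) -> dict:
    """Return every status record that carries the given track_id."""
    return {
        doc_id: record
        for doc_id, record in doc_status.items()
        if record.get("track_id") == track_id
    }


upload_id = generate_track_id("upload")
insert_id = generate_track_id("insert")
tracked_status = {
    "doc-1": {"status": "PENDING", "track_id": upload_id},
    "doc-2": {"status": "PROCESSED", "track_id": upload_id},
    "doc-3": {"status": "PENDING", "track_id": insert_id},
}
batch = get_docs_by_track_id(tracked_status, upload_id)
```

A real backend would answer this query through an index on the track_id column or key rather than a full scan, which is what the commit's database indexes are for.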
yangdx
8274ed52d1 feat: separate document content from doc_status to improve performance
This optimization significantly improves doc_status query/update performance by avoiding large string operations during frequent status checks.
2025-07-29 14:20:07 +08:00
yangdx
f2ffff063b feat: refactor ollama server configuration management
- Add ollama_server_infos attribute to LightRAG class with default initialization
- Move default values to constants.py for centralized configuration
- Refactor OllamaServerInfos class with property accessors and CLI support
- Update OllamaAPI to get configuration through rag object instead of direct import
- Add command line arguments for simulated model name and tag
- Fix type imports to avoid circular dependencies
2025-07-28 01:38:35 +08:00
yangdx
598eecd06d Refactor: Rename llm_model_max_token_size to summary_max_tokens
This commit renames the parameter 'llm_model_max_token_size' to 'summary_max_tokens' for better clarity, as it specifically controls the token limit for entity relation summaries.
2025-07-28 00:49:08 +08:00
yangdx
ebaff228aa feat: Add rerank score filtering with configurable threshold
- Add DEFAULT_MIN_RERANK_SCORE constant (default: 0.0)
- Add MIN_RERANK_SCORE environment variable support
- Filter chunks with rerank scores below threshold in process_chunks_unified
- Add info-level logging for filtering operations
- Handle empty results gracefully after filtering
- Maintain backward compatibility with non-reranked chunks
2025-07-27 16:37:44 +08:00
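The threshold filter described in this commit might be sketched like this. `DEFAULT_MIN_RERANK_SCORE` and the `MIN_RERANK_SCORE` variable come from the commit message; the function name, the `rerank_score` key, and the chunk layout are assumptions.

```python
import os

DEFAULT_MIN_RERANK_SCORE = 0.0


def filter_by_rerank_score(chunks, min_score=None):
    """Drop chunks whose rerank score is below the threshold.

    Chunks without a score (not reranked) are kept for backward compatibility.
    """
    if min_score is None:
        min_score = float(os.getenv("MIN_RERANK_SCORE", DEFAULT_MIN_RERANK_SCORE))
    kept = []
    for chunk in chunks:
        score = chunk.get("rerank_score")  # hypothetical key name
        if score is None or score >= min_score:
            kept.append(chunk)
    return kept


chunks = [
    {"id": "a", "rerank_score": 0.9},
    {"id": "b", "rerank_score": 0.2},
    {"id": "c"},  # never reranked
]
kept_ids = [c["id"] for c in filter_by_rerank_score(chunks, min_score=0.5)]
none_left = filter_by_rerank_score([{"id": "b", "rerank_score": 0.2}], min_score=0.5)
```

An empty list after filtering is a valid result here, so callers need to handle it rather than assume at least one chunk survives.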
yangdx
b3c2987006 Reduce default MAX_TOKENS from 32000 to 10000 2025-07-26 08:13:49 +08:00
yangdx
983bacd87e Update logger messages 2025-07-24 16:49:28 +08:00