LightRAG

mirror of https://github.com/HKUDS/LightRAG.git synced 2025-11-22 04:56:59 +00:00

Author	SHA1	Message	Date
yangdx	0e67ead8fa	Rename MAX_TOKENS to SUMMARY_MAX_TOKENS for clarity	2025-08-21 10:15:20 +08:00
yangdx	9b7ed84e05	Improve document deletion error handling and message consistency - Standardize deletion log messages - Add try-catch for file operations - Improve enqueued file error handling	2025-08-20 11:01:24 +08:00
yangdx	485c4b7de7	Change document deletion warnings to info level logging	2025-08-20 03:28:42 +08:00
yangdx	806081645f	Refactor text cleaning to use sanitize_text_for_encoding consistently • Replace clean_text with sanitize_text • Remove deprecated clean_text function • Add whitespace trimming to sanitizer • Improve UTF-8 encoding safety • Consolidate text cleaning logic	2025-08-19 19:20:01 +08:00
yangdx	e38df464ea	Ensure front-end file type uploads are synchronized with back-end	2025-08-19 15:10:13 +08:00
yangdx	1c4d6fde58	Change log level from info to debug for document storage message	2025-08-18 20:04:29 +08:00
yangdx	377f1a022e	fix: reset PROCESSING/FAILED docs to PENDING at the beginning of document processing pipeline - Reset documents with PROCESSING/FAILED status to PENDING when they pass consistency checks - Update doc_status storage and clear error messages/metadata on reset	2025-08-18 00:49:52 +08:00
yangdx	add8b07a21	Improve logging messages for document processing clarity	2025-08-18 00:22:04 +08:00
yangdx	1941df9cf6	Simplify warning message format for document deletion	2025-08-17 13:30:55 +08:00
yangdx	3e4214cef3	Standardize document deletion warning messages for consistency	2025-08-17 09:35:46 +08:00
yangdx	cceb46b320	fix: subdirectories are no longer processed during file scans • Change rglob to glob for file scanning • Simplify error logging messages	2025-08-16 23:46:33 +08:00
yangdx	f5b0c3d38c	feat: Recording file extraction error status to document pipeline - Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage - Enhance pipeline_enqueue_file with detailed error handling for all file processing stages: * File access errors (permissions, not found) * UTF-8 encoding errors * Format-specific processing errors (PDF, DOCX, PPTX, XLSX) * Content validation errors * Unsupported file type errors This implementation ensures all file extraction failures are properly tracked and recorded in the doc_status storage system, providing better visibility into document processing issues and enabling improved error monitoring and debugging capabilities.	2025-08-16 23:08:52 +08:00
yangdx	ca4c18baaa	Preserve failed documents during data consistency validation for manual review	2025-08-16 22:29:46 +08:00
yangdx	e1310c5262	Optimize document processing pipeline by removing duplicate step	2025-08-16 17:23:01 +08:00
yangdx	5591ef3ac8	Fix document filtering logic and improve logging for ignored docs	2025-08-16 17:22:08 +08:00
yangdx	5c7ae8721b	Merge branch 'main' into pick-trunk-by-vector	2025-08-14 13:11:14 +08:00
yangdx	3bba5fc506	Fix linting	2025-08-14 13:03:23 +08:00
yangdx	65a4437f78	Fix: Persist document data immediately after index update	2025-08-14 12:33:36 +08:00
yangdx	28fc075c59	Simplify inconsistency logging and cleanup messages	2025-08-14 11:49:58 +08:00
yangdx	17faeb2fb8	refactor: integrate document consistency validation into pipeline processing This ensures data consistency validation is part of the main processing pipeline and provides better monitoring of inconsistent document cleanup operations.	2025-08-14 11:38:36 +08:00
yangdx	a3f7bc5b7e	Merge branch 'main' into pick-trunk-by-vector	2025-08-14 06:19:57 +08:00
yangdx	b5ae84fac6	fix: Add data consistency validation to document processing pipeline - Add _validate_and_fix_document_consistency() method to detect and fix documents with missing content in full_docs storage - Integrate consistency check into apipeline_process_enqueue_documents() to automatically mark inconsistent documents as FAILED before processing - Prevent processing errors caused by documents having status records but missing actual content data	2025-08-14 06:18:34 +08:00
yangdx	f1dafa0d01	feat: KG related chunks selection by vector similarity - Add env switch to toggle weighted polling vs vector-similarity strategy - Implement similarity-based sorting with fallback to weighted - Introduce batch vector read API for vector storage - Implement vector store and retrieve function for Nanovector DB - Preserve default behavior (weighted polling selection method)	2025-08-13 18:16:42 +08:00
yangdx	0b2c3d06c7	- Remove redundant collection listing check	2025-08-12 15:24:06 +08:00
yangdx	fc8ca1a706	Fix: add multi-process lock for initialize and drop methods for all storage	2025-08-12 04:25:09 +08:00
yangdx	44204abef7	Fix linting	2025-08-10 10:59:32 +08:00
yangdx	eb2320e556	Fix: Initialize first_stage_tasks and entity_relation_task to prevent empty-task cancel errors - Initialize first_stage_tasks = [] and entity_relation_task = None at coroutine start - Ensure cancel block safely handles no-op when tasks lists are empty	2025-08-10 10:45:41 +08:00
yangdx	cf064579ce	Remove deprecated keyword extraction query methods - Delete query_with_keywords function - Remove kg_query_with_keywords helper - Drop separate keyword extraction methods	2025-08-08 14:59:39 +08:00
yangdx	c22315ea6d	refactor: remove selective LLM cache clearing functionality - Remove optional 'modes' parameter from aclear_cache() and clear_cache() methods - Replace deprecated drop_cache_by_modes() with drop() method for complete cache clearing - Update API endpoint to ignore mode-specific parameters and clear all cache - Simplify frontend clearCache() function to send empty request body This change ensures all LLM cache is cleared together.	2025-08-05 23:51:51 +08:00
yangdx	01bce8c26e	feat: add warning logs for deleting non-completed documents	2025-08-05 12:21:08 +08:00
yangdx	63496698a1	Fix: ensure data migration is handled by single-process - Wrap migration logic with get_data_init_lock() to ensure single-process execution - Prevent race conditions when multiple processes start simultaneously	2025-08-04 01:47:20 +08:00
yangdx	bf9a6d699b	Fix(lightrag): Handle undirected edges in data migration The `_migrate_entity_relation_data` function previously processed directed edges from `get_all_edges`, which could lead to duplicates (e.g., (A,B) and (B,A)) and an incorrect relation count. This commit normalizes edges by sorting their source and target nodes before adding them to the relation set. This ensures all edges are treated as undirected and are properly deduplicated.	2025-08-03 22:14:24 +08:00
yangdx	e8d8afa846	Removed auto storage management from LightRAG instance creation - The `initialize_storages` method must be explicitly called after LightRAG creation. The `finalize_storages` method should be called before LightRAG is destroyed. - Added explicit data migration check	2025-08-03 12:42:57 +08:00
yangdx	06efab4af2	Revert "Remove auto_manage_storages_states option" This reverts commit bfe6657b316f7e50bc9c5f0cc71d9fbb2b605ddd.	2025-08-03 12:12:13 +08:00
yangdx	bfe6657b31	Remove auto_manage_storages_states option - Always manage storage states by LightRAG - Remove rag.initialize_storages() from all examples	2025-08-03 10:29:36 +08:00
yangdx	091f2b42c3	feat(performance): Optimize document deletion with entity/relation index - Introduces an index mapping documents to their corresponding entities and relations. This significantly speeds up `adelete_by_doc_id` by replacing slow graph traversal with a fast key-value lookup. - Refactors the ingestion pipeline (`merge_nodes_and_edges`) to populate this new index. Adds a one-time data migration script to backfill the index for existing data.	2025-08-03 09:19:02 +08:00
yangdx	32af45ff46	refactor: improve JSON parsing reliability with json-repair library Replace regex-based JSON extraction with json-repair for better handling of malformed LLM responses. Remove deprecated JSON parsing utilities and clean up keyword_extraction parameter across LLM providers. - Remove locate_json_string_body_from_string() and convert_response_to_json() - Use json-repair.loads() in extract_keywords_only() for robust parsing - Clean up LLM interfaces and remove unused parameters - Add json-repair dependency	2025-08-01 19:36:20 +08:00
yangdx	8271e1f6f1	Move OllamaServerInfos class to base module - Eliminate dependency of the core module on the API module.	2025-07-31 23:24:49 +08:00
yangdx	9d5603d35e	Set the default LLM temperature to 1.0 and centralize constant management	2025-07-31 17:15:10 +08:00
yangdx	c7bc4fc42c	Add track_id return to document processing pipeline	2025-07-30 10:27:12 +08:00
yangdx	cbaede8455	Add ScanResponse type for scan endpoint in webui	2025-07-30 03:11:09 +08:00
yangdx	7207598fc4	Fix track_id bugs and add track_id to scanning response	2025-07-30 03:06:20 +08:00
yangdx	93afa7d8a7	feat: add processing time tracking to document status with metadata field - Add metadata field to DocProcessingStatus with start_time and end_time tracking - Record processing timestamps using Unix time format (seconds precision) - Update all storage backends (JSON, MongoDB, Redis, PostgreSQL) for new field support - Maintain backward compatibility with default values for existing data - Add error_msg field for better error tracking during document processing	2025-07-29 23:42:33 +08:00
yangdx	6014b9bf73	feat: add track_id support for document processing progress monitoring - Add get_docs_by_track_id() method to all storage backends (MongoDB, PostgreSQL, Redis, JSON) - Implement automatic track_id generation with upload_/insert_ prefixes - Add /track_status/{track_id} API endpoint for frontend progress queries - Create database indexes for efficient track_id lookups - Enable real-time document processing status tracking across all storage types	2025-07-29 22:24:21 +08:00
yangdx	8274ed52d1	feat: separate document content from doc_status to improve performance This optimization significantly improves doc_status query/update performance by avoiding large string operations during frequent status checks.	2025-07-29 14:20:07 +08:00
yangdx	f2ffff063b	feat: refactor ollama server configuration management - Add ollama_server_infos attribute to LightRAG class with default initialization - Move default values to constants.py for centralized configuration - Refactor OllamaServerInfos class with property accessors and CLI support - Update OllamaAPI to get configuration through rag object instead of direct import - Add command line arguments for simulated model name and tag - Fix type imports to avoid circular dependencies	2025-07-28 01:38:35 +08:00
yangdx	598eecd06d	Refactor: Rename llm_model_max_token_size to summary_max_tokens This commit renames the parameter 'llm_model_max_token_size' to 'summary_max_tokens' for better clarity, as it specifically controls the token limit for entity relation summaries.	2025-07-28 00:49:08 +08:00
yangdx	ebaff228aa	feat: Add rerank score filtering with configurable threshold - Add DEFAULT_MIN_RERANK_SCORE constant (default: 0.0) - Add MIN_RERANK_SCORE environment variable support - Filter chunks with rerank scores below threshold in process_chunks_unified - Add info-level logging for filtering operations - Handle empty results gracefully after filtering - Maintain backward compatibility with non-reranked chunks	2025-07-27 16:37:44 +08:00
yangdx	b3c2987006	Reduce default MAX_TOKENS from 32000 to 10000	2025-07-26 08:13:49 +08:00
yangdx	983bacd87e	Update logger messages	2025-07-24 16:49:28 +08:00
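
Commit bf9a6d699b describes deduplicating edges by sorting each edge's source and target before adding it to the relation set. A minimal sketch of that idea follows. The function name `normalize_edges` and the sample data are hypothetical and are not taken from LightRAG's `_migrate_entity_relation_data`.

```python
from typing import Iterable, Set, Tuple


def normalize_edges(edges: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Treat edges as undirected: sort each endpoint pair, then deduplicate.

    Hypothetical helper for illustration only; not LightRAG's actual code.
    """
    relations: Set[Tuple[str, str]] = set()
    for src, tgt in edges:
        # (A, B) and (B, A) both become (A, B), so the set keeps one copy.
        first, second = sorted((src, tgt))
        relations.add((first, second))
    return relations


# Directed edge list in which A-B appears in both directions.
directed = [("A", "B"), ("B", "A"), ("B", "C")]
unique = normalize_edges(directed)
```

Counting `unique` instead of `directed` avoids the inflated relation count that the commit message mentions.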

552 Commits