107 Commits

Author SHA1 Message Date
yangdx
14e083a1a6 fix: replace pyuca with pypinyin for Chinese pinyin sorting and add file_path sort 2025-08-17 15:21:24 +08:00
yangdx
61469c0a56 Add Chinese pinyin sorting support across document operations
• Replace pyuca with centralized utils function
• Add pinyin sort keys for file paths
• Update MongoDB indexes with zh collation
• Migrate existing indexes for compatibility
• Support Chinese chars in Redis/JSON storage
• Keep PostgreSQL sort order controlled by the database collation order
2025-08-17 12:45:48 +08:00
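
The pinyin sort key described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual utils function: the name `pinyin_sort_key` and the plain-string fallback used when pypinyin is not installed are assumptions made for the example.

```python
try:
    # pypinyin is a third-party package; lazy_pinyin returns a list of
    # pinyin syllables and passes non-Chinese text through unchanged.
    from pypinyin import lazy_pinyin
except ImportError:  # assumption: degrade to plain string ordering
    lazy_pinyin = None


def pinyin_sort_key(text: str) -> str:
    """Hypothetical helper: build a case-insensitive sort key from pinyin."""
    if lazy_pinyin is None:
        return text.lower()
    return "".join(lazy_pinyin(text)).lower()


def sort_by_file_path(docs: list[dict]) -> list[dict]:
    """Sort document records by their file_path using the pinyin key."""
    return sorted(docs, key=lambda d: pinyin_sort_key(d.get("file_path", "")))
```

A key function like this suits the in-memory backends (JSON, Redis), while database backends would more likely rely on collation, as the bullets above suggest.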
yangdx
cceb46b320 fix: subdirectories are no longer processed during file scans
• Change rglob to glob for file scanning
• Simplify error logging messages
2025-08-16 23:46:33 +08:00
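
The difference between `rglob` and `glob` that this fix relies on can be shown with a short sketch. The function name `scan_top_level` and its parameters are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def scan_top_level(input_dir: Path, extensions: tuple[str, ...]) -> list[Path]:
    """Return files directly inside input_dir whose suffix is in extensions."""
    found: list[Path] = []
    for ext in extensions:
        # glob("*.txt") matches only direct children of input_dir;
        # rglob("*.txt") would also descend into subdirectories.
        found.extend(p for p in input_dir.glob(f"*{ext}") if p.is_file())
    return sorted(found)
```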
yangdx
f5b0c3d38c feat: Recording file extraction error status to document pipeline
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
  * File access errors (permissions, not found)
  * UTF-8 encoding errors
  * Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
  * Content validation errors
  * Unsupported file type errors

This implementation ensures all file extraction failures are properly tracked and recorded in the doc_status storage system, providing better visibility into document processing issues and enabling improved error monitoring and debugging capabilities.
2025-08-16 23:08:52 +08:00
yangdx
5d00c4c7a8 feat: move processed files to __enqueued__ directory after processing with filename conflicts handling 2025-08-16 13:19:20 +08:00
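
One way to move a processed file into an `__enqueued__` directory while handling filename conflicts is sketched below. The function name and the `_001`-style numeric suffix are assumptions; the commit message does not say how conflicting names are resolved.

```python
import shutil
from pathlib import Path


def move_to_enqueued(file_path: Path) -> Path:
    """Move file_path into a sibling __enqueued__ directory.

    If a file with the same name already exists there, append a numeric
    suffix (hypothetical scheme) instead of overwriting it.
    """
    enqueued_dir = file_path.parent / "__enqueued__"
    enqueued_dir.mkdir(exist_ok=True)

    target = enqueued_dir / file_path.name
    counter = 1
    while target.exists():
        target = enqueued_dir / f"{file_path.stem}_{counter:03d}{file_path.suffix}"
        counter += 1

    shutil.move(str(file_path), str(target))
    return target
```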
yangdx
3bba5fc506 Fix linting 2025-08-14 13:03:23 +08:00
yangdx
772f981e7e fix: check and process queued docs even when upload directory is empty 2025-08-14 12:35:39 +08:00
yangdx
fd0ae4646f Fixes crash when processing files with UTF-8 encoding error
- Fix TypeError "cannot unpack non-iterable bool object" in document processing
- Change all error returns from `False` to `(False, "")` for consistency
- Ensure pipeline_enqueue_file always returns tuple (bool, str)
- Add missing return statement for no-content-extracted case
- Improve error handling for UTF-8 encoding issues and unsupported file types
2025-08-14 05:31:38 +08:00
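
The crash described here comes from unpacking a bare `False` where the caller expects a two-element tuple. The sketch below shows the consistent return shape; `enqueue_file_sketch` and the placeholder identifier `"doc-demo"` are hypothetical and stand in for the real pipeline_enqueue_file logic.

```python
def enqueue_file_sketch(raw: bytes) -> tuple[bool, str]:
    """Always return (success, identifier), even on failure paths."""
    try:
        content = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Returning a bare False here would break `ok, ident = ...` callers.
        return False, ""
    if not content.strip():
        # The no-content case needs an explicit return as well.
        return False, ""
    return True, "doc-demo"  # placeholder identifier for illustration


ok, ident = enqueue_file_sketch(b"\xff\xfe")  # invalid UTF-8, unpacks safely
```

With this shape every caller can unpack the result unconditionally, which is the consistency the commit aims for.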
yangdx
c22315ea6d refactor: remove selective LLM cache clearing functionality
- Remove optional 'modes' parameter from aclear_cache() and clear_cache() methods
- Replace deprecated drop_cache_by_modes() with drop() method for complete cache clearing
- Update API endpoint to ignore mode-specific parameters and clear all cache
- Simplify frontend clearCache() function to send empty request body

This change ensures all LLM cache is cleared together.
2025-08-05 23:51:51 +08:00
yangdx
e04d8ed8a7 Improved storage drop logging with namespace details
- Added namespace and workspace to drop logs
2025-08-04 00:56:39 +08:00
yangdx
7505195303 fix: add full_entities and full_relations to clear_documents storage list 2025-08-03 23:02:58 +08:00
yangdx
0eac1a883a Feat: add file path sorting for document manager
- Add file_path sorting support to all database backends (JSON, Redis, PostgreSQL, MongoDB)
- Implement smart column header switching between "ID" and "File Name" based on display mode
- Add automatic sort field switching when toggling between ID and file name display
- Create composite indexes for workspace+file_path in PostgreSQL and MongoDB for better query performance
- Update frontend to maintain sort state when switching display modes
- Add internationalization support for "fileName" in English and Chinese locales

This enhancement improves user experience by providing intuitive file-based sorting
while maintaining performance through optimized database indexes.
2025-07-30 18:46:55 +08:00
yangdx
74eecc46e5 feat(pagination): Implement document list pagination backends and frontend UI
- Add pagination support to BaseDocStatusStorage interface and all implementations (PostgreSQL, MongoDB, Redis, JSON)
- Implement RESTful API endpoints for paginated document queries and status counts
- Create reusable pagination UI components with internationalization support
- Optimize performance with database-level pagination and efficient in-memory processing
- Maintain backward compatibility while adding configurable page sizes (10-200 items)
2025-07-30 17:58:32 +08:00
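
For the in-memory backends, pagination with a bounded page size might look like the sketch below. The function name, the returned dictionary keys, and the choice to clamp out-of-range page sizes (rather than reject them) are assumptions; only the 10-200 range comes from the commit message.

```python
MIN_PAGE_SIZE = 10
MAX_PAGE_SIZE = 200


def paginate(items: list, page: int = 1, page_size: int = 50) -> dict:
    """Return one page of items plus basic pagination metadata."""
    # Clamp the requested values into the supported range.
    page_size = max(MIN_PAGE_SIZE, min(MAX_PAGE_SIZE, page_size))
    page = max(1, page)

    start = (page - 1) * page_size
    return {
        "items": items[start:start + page_size],
        "total": len(items),
        "page": page,
        "page_size": page_size,
    }
```

Database backends would typically push the same offset and limit into the query instead of slicing in memory.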
yangdx
c24c2ff2f6 Remove deprecated temp file saving function
- Delete unused save_temp_file function
2025-07-30 14:23:08 +08:00
yangdx
29e829113b Fix status key serialization issue in get_track_status 2025-07-30 04:45:48 +08:00
yangdx
7207598fc4 Fix track_id bugs and add track_id to scanning response 2025-07-30 03:06:20 +08:00
yangdx
6f958d5aee feat: add metadata timestamps to document processing and update frontend compatibility
- Add metadata field to doc_status storage with Unix timestamps for processing start/end times
- Update frontend API types: error -> error_msg, add track_id and metadata support
- Add getTrackStatus API method for document tracking functionality
- Fix frontend DocumentManager to use error_msg field for proper error display
- Ensure full compatibility between backend metadata changes and frontend UI
2025-07-30 00:04:27 +08:00
yangdx
6014b9bf73 feat: add track_id support for document processing progress monitoring
- Add get_docs_by_track_id() method to all storage backends (MongoDB, PostgreSQL, Redis, JSON)
- Implement automatic track_id generation with upload_/insert_ prefixes
- Add /track_status/{track_id} API endpoint for frontend progress queries
- Create database indexes for efficient track_id lookups
- Enable real-time document processing status tracking across all storage types
2025-07-29 22:24:21 +08:00
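
A prefixed track_id could be generated along the following lines. Only the `upload_` and `insert_` prefixes come from the commit message; the timestamp-plus-random-suffix layout and the function name are assumptions made for this sketch.

```python
import uuid
from datetime import datetime, timezone

ALLOWED_PREFIXES = ("upload", "insert")


def generate_track_id(prefix: str) -> str:
    """Build an identifier such as 'upload_<date>_<time>_<random hex>'."""
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError(f"unsupported track_id prefix: {prefix!r}")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    # A short random suffix keeps ids distinct within the same second.
    return f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}"
```

A client would then poll the `/track_status/{track_id}` endpoint with the returned value to follow processing progress.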
yangdx
910c6973f3 Limit file deletion to current directory only after document cleaning 2025-07-16 20:35:24 +08:00
yangdx
033098c1bc Feat: Add WORKSPACE support to all storage types 2025-07-07 00:57:21 +08:00
yangdx
98150e80b8 Improved empty/whitespace file handling
- Better detection of whitespace-only files
- Changed error to warning for empty chunks
2025-07-05 23:16:39 +08:00
xuewei
49cb51b5dc No content extracted when parsing PDF files 2025-07-05 13:47:47 +08:00
yangdx
04d793abbd Update logger message 2025-07-03 22:15:32 +08:00
yangdx
67f51597c2 Bump api version to 0178 2025-07-03 21:37:47 +08:00
yangdx
05231233f1 Feat: Check request_pending after document deletion
- Add double-check for pipeline status to prevent race conditions
- Implement automatic processing of pending indexing requests after deletion
2025-07-03 21:36:35 +08:00
yangdx
a506753548 Fix linting 2025-06-27 02:33:20 +08:00
yangdx
60777d535b fix: prevent Path Traversal vulnerability in upload endpoint
- Add sanitize_filename() function to validate and clean uploaded filenames
- Remove path separators, traversal sequences, and control characters
- Verify final paths stay within input directory using Path.resolve()
- Return HTTP 400 errors for unsafe filenames
- Prevents directory traversal attacks like ../../../etc/passwd
2025-06-27 02:33:05 +08:00
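
The checks listed in this fix can be approximated by the sketch below. It is an illustration under stated assumptions, not the project's `sanitize_filename()`: the name `sanitize_filename_sketch` is hypothetical, and it raises `ValueError`, which an API layer would presumably translate into the HTTP 400 response mentioned above.

```python
from pathlib import Path


def sanitize_filename_sketch(filename: str, input_dir: Path) -> str:
    """Return a cleaned filename or raise ValueError if it is unsafe."""
    if not filename or not filename.strip():
        raise ValueError("filename is empty")

    # Drop path separators and traversal sequences.
    clean = filename.replace("/", "").replace("\\", "").replace("..", "")
    # Drop ASCII control characters.
    clean = "".join(c for c in clean if ord(c) >= 32 and c != "\x7f")
    clean = clean.strip().strip(".")
    if not clean:
        raise ValueError("filename is empty after sanitization")

    # Final guard: the resolved path must stay inside input_dir.
    base = input_dir.resolve()
    if not (base / clean).resolve().is_relative_to(base):
        raise ValueError("filename escapes the input directory")
    return clean
```

Stripping separators and then verifying the resolved path gives two independent layers, so a gap in the string cleaning is still caught by the directory check.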
yangdx
8fb1c09b08 Refac: pipeline message 2025-06-26 01:00:54 +08:00
yangdx
bdcd55a871 Feat: Add delete upload file option to document deletion 2025-06-25 19:02:46 +08:00
yangdx
51bb0471cd Change the API for deleting documents to support deleting multiple documents at once. 2025-06-25 16:19:49 +08:00
yangdx
495d6c8cce Improve the pipeline status message for document deletion 2025-06-25 15:46:58 +08:00
yangdx
2aaa6d5f7d Fix linting 2025-06-25 14:59:45 +08:00
yangdx
49baeb7318 Change document deletion API to async 2025-06-25 14:59:10 +08:00
yangdx
922484915b Remove deprecated API endpoint. 2025-06-25 13:55:47 +08:00
yangdx
8b6dcfb6eb Pls do not use /delete_document API endpoint 2025-06-24 11:26:38 +08:00
yangdx
5ae945c1e5 Improved error handling for document deletion
Added HTTPException for not_found status
Added HTTPException for fail status
2025-06-24 01:12:25 +08:00
yangdx
c18065a912 Disable document deletion when LLM cache for extraction is off 2025-06-23 22:41:27 +08:00
yangdx
1973c80dca Feat: Add entity and relation deletion endpoints 2025-06-23 22:14:50 +08:00
yangdx
bd487dd252 Unify document APIs return status string 2025-06-23 21:38:47 +08:00
yangdx
5099ac8213 Fix linting 2025-06-23 18:41:30 +08:00
yangdx
dffe659388 Feat: Add document deletion by ID API endpoint
- New DELETE endpoint for document removal
- Implements doc_id-based deletion
- Handles pipeline status during operation
- Includes proper error handling
- Updates pipeline status messages
2025-06-23 18:10:40 +08:00
yangdx
a6046bf827 Fix linting 2025-05-22 10:06:09 +08:00
Benjamin L
1b6ddcaf5b change validator method names 2025-05-21 16:06:35 +02:00
Benjamin L
62b536ea6f Adding file_source(s) as optional attribute to text(s) requests 2025-05-21 15:10:27 +02:00
yangdx
36f8787bc7 Fix linting 2025-05-01 10:04:31 +08:00
yangdx
a561be0cff Fix time zone problem of doc status 2025-05-01 02:16:19 +08:00
yangdx
31bd274601 Add Unicode collation for Chinese file sorting of document scanning 2025-04-25 01:02:09 +08:00
yangdx
3aab5b41f2 Fix linting 2025-04-24 14:15:10 +08:00
yangdx
fc425f1397 Send all found files to pipeline at once 2025-04-24 14:00:43 +08:00
cuikunyu
135a40d696 Optimize: Use python-docx for better parsing. 2025-04-11 03:10:20 +00:00