4753 Commits

Author SHA1 Message Date
yangdx
45365ff6ef Bump api version to 0202 2025-08-16 23:53:01 +08:00
yangdx
cceb46b320 fix: subdirectories are no longer processed during file scans
• Change rglob to glob for file scanning
• Simplify error logging messages
2025-08-16 23:46:33 +08:00
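The rglob-to-glob change above can be illustrated with a minimal sketch. The function names and the extension tuple are hypothetical and are not taken from the repository; only `Path.glob` and `Path.rglob` are standard-library calls.

```python
from pathlib import Path


def scan_top_level(input_dir: Path, extensions: tuple[str, ...]) -> list[Path]:
    """Return matching files directly inside input_dir (subdirectories skipped)."""
    found: list[Path] = []
    for ext in extensions:
        # glob("*.ext") only matches entries in input_dir itself
        found.extend(p for p in input_dir.glob(f"*{ext}") if p.is_file())
    return sorted(found)


def scan_recursive(input_dir: Path, extensions: tuple[str, ...]) -> list[Path]:
    """Return matching files in input_dir and every subdirectory below it."""
    found: list[Path] = []
    for ext in extensions:
        # rglob("*.ext") also descends into subdirectories
        found.extend(p for p in input_dir.rglob(f"*{ext}") if p.is_file())
    return sorted(found)
```

With `glob`, a file such as `inputs/sub/b.txt` is ignored when scanning `inputs`, which matches the behavior described in the commit message.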
yangdx
f5b0c3d38c feat: Record file extraction error status in document pipeline
- Add apipeline_enqueue_error_documents function to LightRAG class for recording file processing errors in doc_status storage
- Enhance pipeline_enqueue_file with detailed error handling for all file processing stages:
  * File access errors (permissions, not found)
  * UTF-8 encoding errors
  * Format-specific processing errors (PDF, DOCX, PPTX, XLSX)
  * Content validation errors
  * Unsupported file type errors

This implementation ensures all file extraction failures are properly tracked and recorded in the doc_status storage system, providing better visibility into document processing issues and enabling improved error monitoring and debugging capabilities.
2025-08-16 23:08:52 +08:00
Matt23-star
a0593ec1c9 feat: enhance query performance by restructuring relationships, entities, and chunks retrieval in PostgreSQL.
Fixed: duplicate items query
2025-08-16 22:49:54 +08:00
Matt23-star
6a7e3092ea feat: optimize node and edge queries in PostgreSQL by querying tables directly 2025-08-16 22:37:48 +08:00
Matt23-star
a7da48e05c feat: add batch size parameter to node and edge retrieval methods 2025-08-16 22:35:22 +08:00
yangdx
ca4c18baaa Preserve failed documents during data consistency validation for manual review 2025-08-16 22:29:46 +08:00
yangdx
e1310c5262 Optimize document processing pipeline by removing duplicate step 2025-08-16 17:23:01 +08:00
yangdx
5591ef3ac8 Fix document filtering logic and improve logging for ignored docs 2025-08-16 17:22:08 +08:00
yangdx
5d00c4c7a8 feat: move processed files to __enqueued__ directory after processing, with filename conflict handling 2025-08-16 13:19:20 +08:00
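A rough sketch of moving a processed file into an `__enqueued__` subdirectory while avoiding name collisions. The function name and the numeric-suffix naming scheme are assumptions made for illustration; the commit does not state how conflicts are resolved.

```python
import shutil
from pathlib import Path


def move_to_enqueued(file_path: Path) -> Path:
    """Move file_path into a sibling __enqueued__ directory and return the new path."""
    enqueued_dir = file_path.parent / "__enqueued__"
    enqueued_dir.mkdir(exist_ok=True)

    target = enqueued_dir / file_path.name
    counter = 1
    # Assumed conflict policy: append _001, _002, ... until the name is free
    while target.exists():
        target = enqueued_dir / f"{file_path.stem}_{counter:03d}{file_path.suffix}"
        counter += 1

    shutil.move(str(file_path), str(target))
    return target
```

Under this assumed policy, uploading `doc.txt` twice leaves `__enqueued__/doc.txt` and `__enqueued__/doc_001.txt`, so an earlier file is never overwritten.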
SJ
f7ca9ae16a Ruff formatted 2025-08-15 22:21:34 +00:00
yangdx
dc7a6e1c5b Update README 2025-08-16 06:15:27 +08:00
SJ
3aa3332505
Merge pull request #1 from HKUDS/main
merge
2025-08-15 17:09:03 -05:00
Daniel.y
bdd1169cfb
Merge pull request #1959 from danielaskdd/pick-trunk-by-vector
Feat: add KG related chunks selection by vector similarity
2025-08-15 19:33:51 +08:00
yangdx
2a781dfb91 Update Neo4j database naming in env.example 2025-08-15 19:14:38 +08:00
yangdx
3a227e37b8 Add get_vectors_by_ids method to MongoVectorDBStorage 2025-08-15 16:53:14 +08:00
yangdx
7a7385a200 Add efficient vector retrieval by IDs to PGVectorStorage 2025-08-15 16:51:41 +08:00
yangdx
8f7031b882 Add get_vectors_by_ids method to QdrantVectorDBStorage 2025-08-15 16:46:52 +08:00
yangdx
a71499a180 Add get_vectors_by_ids method to MilvusVectorDBStorage 2025-08-15 16:36:50 +08:00
yangdx
1e2d5252d7 Add get_vectors_by_ids method and filter out vector data from query results 2025-08-15 16:32:26 +08:00
yangdx
6cab68bb47 Improve KG chunk selection documentation and configuration clarity 2025-08-15 10:09:44 +08:00
yangdx
3acb32f547 Add comments explaining chunk deduplication behavior in query context 2025-08-15 02:19:01 +08:00
yangdx
0b45d463df Add .clinerules to .gitignore 2025-08-15 00:43:45 +08:00
yangdx
f733ac829c Remove debug logging statements from query context building 2025-08-14 23:44:34 +08:00
yangdx
4a19d0de25 Add chunk tracking system to monitor chunk sources and frequencies
• Track chunk sources (E/R/C types)
• Log frequency and order metadata
• Preserve chunk_id through processing
• Add debug logging for chunk tracking
• Handle rerank and truncation operations
2025-08-14 22:58:26 +08:00
yangdx
a8b7890470 Rename chunk selection functions for better clarity 2025-08-14 16:01:13 +08:00
yangdx
a11e8d77eb Improve missing-vector warning logic in vector similarity
- Check for any missing vectors
- Separate no-vector vs partial-vector warnings
- Ensure early return on empty vectors
2025-08-14 14:24:15 +08:00
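The warning logic listed above might look roughly like the following sketch. The function name, argument shapes, and log wording are hypothetical; cosine similarity is used here as one plausible similarity measure and is not confirmed by the commit.

```python
import logging
import math

logger = logging.getLogger("chunk_selection")


def rank_chunks_by_similarity(
    query_vec: list[float],
    chunk_ids: list[str],
    vectors: dict[str, list[float]],
    top_k: int,
) -> list[str]:
    """Rank chunk ids by cosine similarity to query_vec, tolerating missing vectors."""
    available = [cid for cid in chunk_ids if cid in vectors]

    # No vectors at all: warn and return early
    if not available:
        logger.warning("No vectors found for any of %d chunks", len(chunk_ids))
        return []

    # Some vectors missing: separate warning, then continue with what is available
    if len(available) < len(chunk_ids):
        logger.warning(
            "Vectors missing for %d of %d chunks",
            len(chunk_ids) - len(available),
            len(chunk_ids),
        )

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(available, key=lambda cid: cosine(query_vec, vectors[cid]), reverse=True)
    return ranked[:top_k]
```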
yangdx
5c7ae8721b Merge branch 'main' into pick-trunk-by-vector 2025-08-14 13:11:14 +08:00
Daniel.y
79d5210988
Merge pull request #1954 from danielaskdd/pipeline-refactor
Feat: Reprocessing of failed documents without the original file being present
2025-08-14 13:09:23 +08:00
yangdx
3bba5fc506 Fix linting 2025-08-14 13:03:23 +08:00
yangdx
772f981e7e fix: check and process queued docs even when upload directory is empty 2025-08-14 12:35:39 +08:00
yangdx
65a4437f78 Fix: Persist document data immediately after index update 2025-08-14 12:33:36 +08:00
yangdx
28fc075c59 Simplify inconsistency logging and cleanup messages 2025-08-14 11:49:58 +08:00
yangdx
17faeb2fb8 refactor: integrate document consistency validation into pipeline processing
This ensures data consistency validation is part of the main processing pipeline and provides better monitoring of inconsistent document cleanup operations.
2025-08-14 11:38:36 +08:00
yangdx
a3f7bc5b7e Merge branch 'main' into pick-trunk-by-vector 2025-08-14 06:19:57 +08:00
yangdx
b5ae84fac6 fix: Add data consistency validation to document processing pipeline
- Add _validate_and_fix_document_consistency() method to detect and fix documents with missing content in full_docs storage
- Integrate consistency check into apipeline_process_enqueue_documents() to automatically mark inconsistent documents as FAILED before processing
- Prevent processing errors caused by documents having status records but missing actual content data
2025-08-14 06:18:34 +08:00
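The consistency check described above can be sketched with plain dictionaries standing in for the two storages. The function name, the record fields (`status`, `error_msg`), and the message text are assumptions for illustration; the real method operates on the project's storage classes.

```python
def mark_inconsistent_documents(
    doc_status: dict[str, dict],
    full_docs: dict[str, str],
) -> list[str]:
    """Mark documents that have a status record but no stored content as FAILED.

    Returns the ids of the documents that were marked.
    """
    inconsistent: list[str] = []
    for doc_id, record in doc_status.items():
        # A status record without matching content cannot be processed
        if not full_docs.get(doc_id):
            record["status"] = "FAILED"
            record["error_msg"] = "content missing from full_docs"  # hypothetical field
            inconsistent.append(doc_id)
    return inconsistent
```

Running such a check before processing means a document with a dangling status record is reported as failed instead of raising an error partway through the pipeline.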
yangdx
cb122c63e4 Merge branch 'main' into pick-trunk-by-vector 2025-08-14 05:34:15 +08:00
Daniel.y
dc76ae02d6
Merge pull request #1952 from danielaskdd/fix-pipeline
Fixes crash when processing files with UTF-8 encoding error
2025-08-14 05:33:08 +08:00
yangdx
fd0ae4646f Fixes crash when processing files with UTF-8 encoding error
- Fix TypeError "cannot unpack non-iterable bool object" in document processing
- Change all error returns from `False` to `(False, "")` for consistency
- Ensure pipeline_enqueue_file always returns tuple (bool, str)
- Add missing return statement for no-content-extracted case
- Improve error handling for UTF-8 encoding issues and unsupported file types
2025-08-14 05:31:38 +08:00
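The TypeError named above arises when a caller unpacks two values from a function that returns a bare `False` on some paths. A minimal sketch of the corrected pattern follows; the function name and the meaning of the returned string are hypothetical, and only the `(bool, str)` return shape comes from the commit message.

```python
from pathlib import Path


def enqueue_file(path: Path) -> tuple[bool, str]:
    """Always return (success, identifier); the identifier is "" on failure."""
    try:
        content = path.read_bytes().decode("utf-8")
    except FileNotFoundError:
        return (False, "")  # previously a bare `False` would break unpacking
    except UnicodeDecodeError:
        return (False, "")  # UTF-8 encoding error
    if not content.strip():
        return (False, "")  # no content extracted
    return (True, f"track-{path.stem}")  # hypothetical identifier format


def handle_upload(path: Path) -> str:
    # Two-value unpacking is safe because every branch returns a tuple
    ok, ident = enqueue_file(path)
    return ident if ok else "failed"
```

Had any branch returned `False` alone, the line `ok, ident = enqueue_file(path)` would raise `TypeError: cannot unpack non-iterable bool object`.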
yangdx
042637d6a3 Merge branch 'main' into pick-trunk-by-vector 2025-08-14 05:09:14 +08:00
yangdx
3ccd10f1e4 Update webui assets 2025-08-14 05:03:43 +08:00
yangdx
6969038fd5 Update mermaid version to 11.9.0 2025-08-14 05:02:53 +08:00
yangdx
160a40dc04 Bump api version to 0201 2025-08-14 05:02:20 +08:00
yangdx
ae517181ad Bump api version to 0200 2025-08-14 05:01:13 +08:00
yangdx
f85e2aa4bf Merge branch 'main' into pick-trunk-by-vector 2025-08-14 03:54:26 +08:00
Daniel.y
2bbb19143a
Merge pull request #1951 from danielaskdd/main
Refac: uniformly protect all storage initializations with get_data_init_lock
2025-08-14 03:52:37 +08:00
yangdx
0b22ffb252 Refac: uniformly protect all storage initializations with get_data_init_lock 2025-08-14 03:46:19 +08:00
yangdx
2e5487305e Merge branch 'main' into pick-trunk-by-vector 2025-08-14 03:12:38 +08:00
Daniel.y
1be1649f75
Merge pull request #1949 from danielaskdd/main
Fix: remove query params from cache key generation for keyword extraction
2025-08-14 03:09:09 +08:00
yangdx
7fb11193b0 Fix linting 2025-08-14 03:07:29 +08:00