### Description
This PR introduces two new environment variables, `DOC_BULK_SIZE` and
`EMBEDDING_BATCH_SIZE`, to allow flexible tuning of batch sizes for
document parsing and embedding vectorization in RAGFlow. By making these
parameters configurable, users can optimize performance and resource
usage according to their hardware capabilities and workload
requirements.
### What problem does this PR solve?
Previously, the batch sizes for document parsing and embedding were
hardcoded, limiting the ability to adjust throughput and memory
consumption. This PR enables users to set these values via environment
variables (in `.env`, Helm chart, or directly in the deployment
environment), improving flexibility and scalability for both small and
large deployments.
- `DOC_BULK_SIZE`: Controls how many document chunks are processed in a
single batch during document parsing (default: 4).
- `EMBEDDING_BATCH_SIZE`: Controls how many text chunks are processed
in a single batch during embedding vectorization (default: 16).
This change updates the codebase, documentation, and configuration files
to reflect the new options.
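As a rough illustration of how such variables are typically read (a minimal sketch using the defaults from this PR; the exact settings module and parsing inside RAGFlow may differ):

```python
import os

# Hedged sketch: reading the new variables with their documented defaults.
DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", "4"))                 # chunks per parsing batch
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", "16"))  # chunks per embedding call
```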
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
### Additional context
- Updated `.env`, `helm/values.yaml`, and documentation to describe
the new variables.
- Modified relevant code paths to use the environment variables instead
of hardcoded values.
- Users can now tune these parameters to achieve better throughput or
reduce memory usage as needed.
Before:
Default value:
<img width="643" alt="image"
src="https://github.com/user-attachments/assets/086e1173-18f3-419d-a0f5-68394f63866a"
/>
After:
10x:
<img width="777" alt="image"
src="https://github.com/user-attachments/assets/5722bbc0-0bcb-4536-b928-077031e550f1"
/>
### What problem does this PR solve?
Get rid of 'RedisDB.get_unacked_iterator queue rag_flow_svr_queue_1
doesn't exist'
----
Edit: revert to original message collection logic.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
This PR investigates the cause of #7957.
TL;DR: Incorrect similarity calculations lead to too many candidates.
Since candidate selection involves interaction with the LLM, this causes
significant delays in the program.
What this PR does:
1. **Fix similarity calculation**:
When processing a 64-page government document, the corrected similarity
calculation reduces the number of candidates from over 100,000 to around
16,000. With a default batch size of 100 pairs per LLM call, this fix
reduces unnecessary LLM interactions from over 1,000 calls to around
160, a roughly 10x improvement.
2. **Add concurrency and timeout limits**:
Up to 5 entity types are processed in "parallel", each with a 180-second
timeout (a sketch of this pattern follows the list). These limits may become configurable in future updates.
3. **Improve logging**:
The candidate resolution process now reports progress in real time.
4. **Mitigate potential concurrency risks**
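A minimal sketch of the concurrency/timeout pattern from point 2, assuming trio (which the task executor already uses); `resolve_entity_type` and the constants are illustrative names, not the actual code:

```python
import trio

MAX_PARALLEL = 5       # illustrative: up to 5 entity types at once
TIMEOUT_SECONDS = 180  # illustrative: per-entity-type timeout

async def resolve_with_limit(limiter: trio.CapacityLimiter, entity_type: str) -> None:
    async with limiter:                            # hold one of the 5 slots
        with trio.move_on_after(TIMEOUT_SECONDS):  # give up on this type after 180s
            await resolve_entity_type(entity_type)  # hypothetical worker

async def resolve_all(entity_types: list[str]) -> None:
    limiter = trio.CapacityLimiter(MAX_PARALLEL)
    async with trio.open_nursery() as nursery:
        for et in entity_types:
            nursery.start_soon(resolve_with_limit, limiter, et)
```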
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?
Currently, as long as there are tasks in Redis, this loop keeps fetching
them. That leaves a single task executor with many tasks in the pending
state, and we then have to wait for those pending tasks to be put back
in the queue.
If we set `MAX_CONCURRENT_TASKS` to X, only X tasks should be picked
from the queue; the others should be left for other `task_executors`, or
picked up once a slot in the current executor frees up. This PR ensures
that behavior; a sketch of the pattern follows.
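A minimal sketch of the intended behavior under trio; `fetch_task_from_redis` and `handle_task` are hypothetical names, not the actual functions:

```python
import trio

MAX_CONCURRENT_TASKS = 4  # illustrative value of X

async def run_task(limiter: trio.CapacityLimiter, token: object, task) -> None:
    try:
        await handle_task(task)  # hypothetical task handler
    finally:
        limiter.release_on_behalf_of(token)  # free the slot for the loop

async def consumer_loop() -> None:
    limiter = trio.CapacityLimiter(MAX_CONCURRENT_TASKS)
    async with trio.open_nursery() as nursery:
        while True:
            token = object()
            # Wait for a free slot BEFORE taking the next task off the queue,
            # so at most MAX_CONCURRENT_TASKS are in flight in this executor
            # and the rest stay in Redis for other task_executors.
            await limiter.acquire_on_behalf_of(token)
            task = await fetch_task_from_redis()  # hypothetical fetcher
            nursery.start_soon(run_task, limiter, token, task)
```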
The additional changes come from the Ruff linting in pre-commit; I
believe they are expected to keep the coding style consistent.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7761
It may be difficult to achieve zero delay (that would require passing a
cancel token to all parts), so another solution is a zero-delay effect
in the UI only: the cancellation shows immediately, and the task itself
stops a little later.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Delete useless image blobs when the task executor hits edge cases.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
## Cause of the bug:
During execution, improper use of trio's CapacityLimiter made the
configuration parameter MAX_CONCURRENT_TASKS ineffective, causing the
executor to take a large number of tasks out of the Redis queue at once.
When a large number of tasks exist at the same time, this makes the task
executor occupy too much memory and get killed by the OS, suspending all
executing tasks.
## Fix:
Added a task_manager method at the entry of /rag/svr/task_executor.py to
make the CapacityLimiter effective, and deleted the ineffective async
with statement; a minimal illustration of the pitfall follows.
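This is not the exact code from the PR, but it illustrates why a misplaced `async with` leaves the limiter ineffective: `nursery.start_soon` returns immediately, so the limiter must be held inside the spawned task; `handle_task` is a hypothetical handler:

```python
import trio

MAX_CONCURRENT_TASKS = 4  # illustrative
limiter = trio.CapacityLimiter(MAX_CONCURRENT_TASKS)

# Ineffective: start_soon returns immediately, so the limiter is acquired
# and released before the spawned task does any work -- nothing is bounded.
async def spawn_all_broken(nursery, tasks):
    for t in tasks:
        async with limiter:
            nursery.start_soon(handle_task, t)  # hypothetical handler

# Effective: each spawned task holds a slot for its whole run, so at most
# MAX_CONCURRENT_TASKS task bodies execute concurrently.
async def limited_handle(t):
    async with limiter:
        await handle_task(t)

async def spawn_all_fixed(nursery, tasks):
    for t in tasks:
        nursery.start_soon(limited_handle, t)
```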
## Fix result:
After testing, the task executor behaves as expected: it executes at
most $MAX_CONCURRENT_TASKS tasks concurrently.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Information about whether to apply graph resolution and community
extraction is stored in `task["kb_parser_config"]`. However, the
previous code read `graphrag_conf` from `task["parser_config"]`, so
`with_resolution` and `with_community` were always false. The gist of
the fix is sketched below.
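A minimal sketch of the change; the nested key names are hypothetical:

```python
# Before (wrong source): graphrag settings were read from the task-level
# parser config, where they are absent, so both flags defaulted to False.
graphrag_conf = task["parser_config"].get("graphrag", {})  # hypothetical key

# After (correct source): the flags live in the knowledge-base parser config.
graphrag_conf = task["kb_parser_config"].get("graphrag", {})  # hypothetical key
with_resolution = graphrag_conf.get("resolution", False)
with_community = graphrag_conf.get("community", False)
```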
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Two cases where a local Elasticsearch tag search has results that are
then filtered out by score:
1. The document has empty tags, so the LLM is not visited.
2. The code may use empty examples in the prompt when asking the LLM to
search for tags.
Co-authored-by: huangfuqunze <huangfuqunze.hfqz@alibaba-inc.com>
### What problem does this PR solve?
When parsing documents containing images, the current code uses a
single-threaded approach to call the VL model, resulting in extremely
slow parsing speed (e.g., parsing a Word document with dozens of images
takes over 20 minutes).
By switching to a multithreaded approach to call the VL model, the
parsing speed can be improved to an acceptable level.
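A minimal sketch of the multithreaded pattern, with hypothetical names (`vl_model.describe_image`); the thread count and batching in the actual PR may differ:

```python
from concurrent.futures import ThreadPoolExecutor

def describe_images(images, vl_model, max_workers: int = 8) -> list[str]:
    """Call the VL model on many images concurrently instead of one by one.

    The VL calls are I/O-bound (network round-trips), so threads give a
    near-linear speedup up to the provider's rate limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so captions line up with images.
        return list(pool.map(vl_model.describe_image, images))
```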
### Type of change
- [x] Performance Improvement
---------
Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
### What problem does this PR solve?
Fix the Redis lock always timing out (change the logic order: release
the lock first).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
The lock is not released correctly when the task_executor exits
abnormally; a minimal sketch of the safe pattern is below.
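A sketch of the safe release pattern, assuming a Redis-backed lock object with `acquire`/`release`; `RedisDistributedLock` and `do_work` are illustrative names:

```python
lock = RedisDistributedLock("task_executor_lock")  # hypothetical lock class
lock.acquire()
try:
    do_work()  # hypothetical; may raise, or the process may be interrupted
finally:
    # Release even on exceptions, so an abnormal exit of the task_executor
    # does not leave the lock held until it times out.
    lock.release()
```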
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
If you deploy Ragflow using Kubernetes, the hostname will change during
a rolling update. This causes the consumer name of the task executor to
change, making it impossible to schedule tasks that were previously in a
pending state.
To address this, I introduced a recovery task that scans these pending
messages and re-publishes them, allowing the tasks to continue being
processed.
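A hedged sketch of such a recovery pass using redis-py's pending-entry APIs; the stream/group names, idle threshold, and the re-publish step are illustrative:

```python
import redis

r = redis.Redis()
STREAM, GROUP, CONSUMER = "rag_flow_svr_queue", "rag_flow_group", "recovery"  # illustrative

def recover_pending(min_idle_ms: int = 60_000) -> None:
    """Claim messages stuck with consumers that no longer exist
    (e.g. pods renamed by a rolling update) so they get processed again."""
    pending = r.xpending_range(STREAM, GROUP, min="-", max="+", count=100)
    stale_ids = [p["message_id"] for p in pending
                 if p["time_since_delivered"] >= min_idle_ms]
    if stale_ids:
        # Transfer ownership to a live consumer; claimed messages are
        # redelivered and can then be processed or re-published.
        r.xclaim(STREAM, GROUP, CONSUMER, min_idle_time=min_idle_ms,
                 message_ids=stale_ids)
```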
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---------
Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
## Problem Description
Multiple files in the RAGFlow project contain closure trap issues when
using lambda functions with `trio.open_nursery()`. This problem causes
concurrent tasks created in loops to reference the same variable,
resulting in all tasks processing the same data (the data from the last
iteration) rather than each task processing its corresponding data from
the loop.
## Issue Details
When using a `lambda` to create a closure function and passing it to
`nursery.start_soon()` within a loop, the lambda function captures a
reference to the loop variable rather than its value. For example:
```python
# Problematic code
async with trio.open_nursery() as nursery:
    for d in docs:
        nursery.start_soon(lambda: doc_keyword_extraction(chat_mdl, d, topn))
```
In this pattern, when concurrent tasks begin execution, `d` has already
become the value after the loop ends (typically the last element),
causing all tasks to use the same data.
## Fix Solution
Changed the way concurrent tasks are created with `nursery.start_soon()`
by leveraging Trio's API design to directly pass the function and its
arguments separately:
```python
# Fixed code
async with trio.open_nursery() as nursery:
    for d in docs:
        nursery.start_soon(doc_keyword_extraction, chat_mdl, d, topn)
```
This way, each task uses the parameter values at the time of the
function call, rather than references captured through closures.
## Fixed Files
Fixed closure traps in the following files:
1. `rag/svr/task_executor.py`: 3 fixes, involving document keyword
extraction, question generation, and tag processing
2. `rag/raptor.py`: 1 fix, involving document summarization
3. `graphrag/utils.py`: 2 fixes, involving graph node and edge
processing
4. `graphrag/entity_resolution.py`: 2 fixes, involving entity resolution
and graph node merging
5. `graphrag/general/mind_map_extractor.py`: 2 fixes, involving document
processing
6. `graphrag/general/extractor.py`: 3 fixes, involving content
processing and graph node/edge merging
7. `graphrag/general/community_reports_extractor.py`: 1 fix, involving
community report extraction
## Potential Impact
This fix resolves a serious concurrency issue that could have caused:
- Data processing errors (processing duplicate data)
- Performance degradation (all tasks working on the same data)
- Inconsistent results (some data not being processed)
After the fix, all concurrent tasks should correctly process their
respective data, improving system correctness and reliability.
### What problem does this PR solve?
Removed set_entity and set_relation to avoid accessing the doc engine
during graph computation.
Introduced GraphChange to avoid writing unchanged chunks; a guess at its
shape is sketched below.
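The shape of such a change-set, sketched as an assumption; the real `GraphChange` fields may differ:

```python
from dataclasses import dataclass, field

@dataclass
class GraphChange:
    """Track which graph pieces actually changed, so unchanged chunks
    are never rewritten to the doc engine. Field names are illustrative."""
    removed_nodes: set[str] = field(default_factory=set)
    added_updated_nodes: set[str] = field(default_factory=set)
    removed_edges: set[tuple[str, str]] = field(default_factory=set)
    added_updated_edges: set[tuple[str, str]] = field(default_factory=set)
```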
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
When using an LLM for auto-tagging, if there are no examples, the tag
format generated by the LLM may be wrong, which causes Elasticsearch
insert errors. Adding basic examples avoids this problem; a sketch of
the idea follows.
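An illustration of the idea, not the actual prompt from the PR: seed the prompt with at least one well-formed example so the model returns tags in the expected JSON shape. All names and the example text are hypothetical:

```python
# Hypothetical fallback example, used when the user supplies no examples.
DEFAULT_TAG_EXAMPLE = """
Text: RAGFlow supports GraphRAG-based retrieval over parsed documents.
Tags: {"graphrag": 8, "retrieval": 7}
"""

def build_tagging_prompt(text: str, examples: str | None = None) -> str:
    # Always include at least one well-formed example; without it the LLM
    # may emit malformed tags that later fail Elasticsearch inserts.
    return (
        "Tag the text with relevance-scored keywords, as JSON.\n"
        f"Examples:{examples or DEFAULT_TAG_EXAMPLE}\n"
        f"Text: {text}\nTags:"
    )
```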
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: signal.SIGUSR1 and signal.SIGUSR2 are not available on Windows, so
do not bind them in a Windows environment; a sketch of the guard
follows.
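A minimal sketch of the guard, assuming the handlers are only useful on POSIX; the handler body is illustrative:

```python
import signal
import sys

def debug_handler(signum, frame):  # illustrative handler body
    pass

# SIGUSR1/SIGUSR2 do not exist on Windows, so only bind them elsewhere.
if sys.platform != "win32":
    signal.signal(signal.SIGUSR1, debug_handler)
    signal.signal(signal.SIGUSR2, debug_handler)
```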
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: tangyu <1@1.com>
### What problem does this PR solve?
Refactored DocumentService.update_progress
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Close #5277 by making sure the file is closed.
### Type of change
- [x] Performance Improvement
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
This patch adds a signal handler for Ctrl+C so the process can exit
gracefully; because the codebase uses daemon threads, it could not exit
cleanly once started.
How to reproduce:
1. docker-compose -f docker/docker-compose-base.yml up
2. In another window, `bash docker/launch_backend_service.sh`
3. Stop 1 first.
4. Try to stop 2: the two threads cannot exit and must be stopped with
`kill pid`.
This patch fixes that (a minimal sketch of the handler is below) and
should resolve most of the related issues in the issue tracker.
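A minimal sketch of the idea, assuming a global stop flag that worker loops check; the actual patch may wire this differently:

```python
import signal
import sys
import threading

stop_event = threading.Event()  # illustrative: worker loops poll this flag

def handle_sigint(signum, frame):
    # Ask worker threads to wind down, then exit instead of hanging,
    # since daemon threads alone will not bring the process down cleanly.
    stop_event.set()
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)   # Ctrl+C
signal.signal(signal.SIGTERM, handle_sigint)  # docker stop / kill
```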
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
### What problem does this PR solve?
Run keyword_extraction, question_proposal, and content_tagging in
threads; a sketch of the pattern follows.
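A sketch of the pattern under trio, which the executor uses; the three helpers are assumed to be the blocking implementations named above, and the call signatures are illustrative:

```python
import trio

async def enrich_chunk(chat_mdl, chunk, topn: int):
    # Run the blocking LLM helpers in worker threads so they do not
    # stall the trio event loop, and let them overlap with each other.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.to_thread.run_sync, keyword_extraction, chat_mdl, chunk, topn)
        nursery.start_soon(trio.to_thread.run_sync, question_proposal, chat_mdl, chunk, topn)
        nursery.start_soon(trio.to_thread.run_sync, content_tagging, chat_mdl, chunk)
```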
### Type of change
- [x] Performance Improvement
### What problem does this PR solve?
Ignore the millisecond and microsecond values; a one-line example
follows.
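For example (Python stores milliseconds inside the microsecond field, so a single `replace` clears both):

```python
from datetime import datetime

ts = datetime.now().replace(microsecond=0)  # drop ms and µs in one step
```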
### Type of change
- [x] Refactoring
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
Fix xinfo_groups returning an unexpected result. Close #3545.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)