ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2025-12-05 19:39:02 +00:00

Author	SHA1	Message	Date
Yongteng Lei	2d491188b8	Refa: improve flow of GraphRAG and RAPTOR (#10709 ) ### What problem does this PR solve? Improve flow of GraphRAG and RAPTOR. ### Type of change - [x] Refactoring	2025-10-22 09:29:20 +08:00
Jin Hai	deb81810e9	Update message printout when start ingestion server (#10677 ) ### What problem does this PR solve? [ASCII-art startup banner for the ingestion server; layout lost in extraction] ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-21 09:38:20 +08:00
Liu An	8af769de41	Fix: add toc_kwd field and update page_num_int type (#10596 ) ### What problem does this PR solve? - Added new field 'toc_kwd' to infinity_mapping.json for table of contents keyword support - Changed page_num_int from integer to array type in task_executor.py to handle multiple page numbers ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-16 12:47:24 +08:00
Kevin Hu	f92a45dcc4	Feat: let toc run asynchronously... (#10513 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-14 14:14:52 +08:00
Yongteng Lei	65c3f0406c	Fix: maintain backward compatibility for KB tasks (#10508 ) ### What problem does this PR solve? Maintain backward compatibility for KB tasks ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-10-13 11:53:48 +08:00
Kevin Hu	7d2f65671f	Feat: debugging toc part. (#10486 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-11 18:45:21 +08:00
Kevin Hu	0d8791936e	Feat: TOC retrieval (#10456 ) ### What problem does this PR solve? #10436 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-10-10 17:07:55 +08:00
Jin Hai	d931c33ced	Fix typos: retrievaler -> retriever (#10372 ) ### What problem does this PR solve? Fix typos ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-10 09:17:36 +08:00
Kevin Hu	cbf04ee470	Feat: Use data pipeline to visualize the parsing configuration of the knowledge base (#10423 ) ### What problem does this PR solve? #9869 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: jinhai <haijin.chn@gmail.com> Signed-off-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: chanx <1243304602@qq.com> Co-authored-by: balibabu <cike8899@users.noreply.github.com> Co-authored-by: Lynn <lynn_inf@hotmail.com> Co-authored-by: 纷繁下的无奈 <zhileihuang@126.com> Co-authored-by: huangzl <huangzl@shinemo.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Wilmer <33392318@qq.com> Co-authored-by: Adrian Weidig <adrianweidig@gmx.net> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: buua436 <66937541+buua436@users.noreply.github.com> Co-authored-by: BadwomanCraZY <511528396@qq.com> Co-authored-by: cucusenok <31804608+cucusenok@users.noreply.github.com> Co-authored-by: Russell Valentine <russ@coldstonelabs.org> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Billy Bao <newyorkupperbay@gmail.com> Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: TensorNull <129579691+TensorNull@users.noreply.github.com> Co-authored-by: TensorNull <tensor.null@gmail.com> Co-authored-by: TeslaZY <TeslaZY@outlook.com> Co-authored-by: Ajay <160579663+aybanda@users.noreply.github.com> Co-authored-by: AB <aj@Ajays-MacBook-Air.local> Co-authored-by: 天海蒼灆 <huangaoqin@tecpie.com> Co-authored-by: He Wang <wanghechn@qq.com> Co-authored-by: Atsushi Hatakeyama <atu729@icloud.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com> Co-authored-by: Mohamed Mathari 
<155896313+melmathari@users.noreply.github.com> Co-authored-by: Mohamed Mathari <nocodeventure@Mac-mini-van-Mohamed.fritz.box> Co-authored-by: Stephen Hu <stephenhu@seismic.com> Co-authored-by: Shaun Zhang <zhangwfjh@users.noreply.github.com> Co-authored-by: zhimeng123 <60221886+zhimeng123@users.noreply.github.com> Co-authored-by: mxc <mxc@example.com> Co-authored-by: Dominik Novotný <50611433+SgtMarmite@users.noreply.github.com> Co-authored-by: EVGENY M <168018528+rjohny55@users.noreply.github.com> Co-authored-by: mcoder6425 <mcoder64@gmail.com> Co-authored-by: lemsn <lemsn@msn.com> Co-authored-by: lemsn <lemsn@126.com> Co-authored-by: Adrian Gora <47756404+adagora@users.noreply.github.com> Co-authored-by: Womsxd <45663319+Womsxd@users.noreply.github.com> Co-authored-by: FatMii <39074672+FatMii@users.noreply.github.com>	2025-10-09 12:36:19 +08:00
Jin Hai	4eb7659499	Fix bug: broken import from rag.prompts.prompts (#10217 ) ### What problem does this PR solve? Fix broken imports ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	2025-09-23 10:19:25 +08:00
Yongteng Lei	45f52e85d7	Feat: refine dataflow and initialize dataflow app (#9952 ) ### What problem does this PR solve? Refine dataflow and initialize dataflow app. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-05 18:50:46 +08:00
Stephen Hu	c461261f0b	Refactor: Improve the try logic for upload_to_minio (#9735 ) ### What problem does this PR solve? Improve the try logic for upload_to_minio ### Type of change - [x] Refactoring	2025-08-28 09:35:29 +08:00
Kevin Hu	8d8a5f73b6	Fix: meta data filter with AND logic operations. (#9687 ) ### What problem does this PR solve? Close #9648 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-25 18:29:24 +08:00
Kevin Hu	ca720bd811	Fix: save team's canvas issue. (#9518 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-18 13:05:29 +08:00
Kevin Hu	2114e966d8	Feat: add citation option to agent and enlarge the timeouts. (#9484 ) ### What problem does this PR solve? #9422 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-15 10:05:01 +08:00
Kevin Hu	153e430b00	Feat: add meta data filter. (#9405 ) ### What problem does this PR solve? #8531 #7417 #6761 #6573 #6477 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-08-12 14:12:56 +08:00
Jay Xu	ce3dd019c3	Fix broken data stream when writing image file (#9354 ) ### What problem does this PR solve? fix "broken data stream when writing image file", just log warning and ignore Close #8379 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-11 17:07:49 +08:00
Kevin Hu	9ca86d801e	Refa: add provider info while adding model. (#9273 ) ### What problem does this PR solve? #9248 ### Type of change - [x] Refactoring	2025-08-07 09:40:42 +08:00
gooodboyAo	a7eba61067	FIX: If chunk["content_with_weight"] contains one or more unpaired surrogate characters (such as incomplete emoji or other special characters), then calling .encode("utf-8") directly will raise a UnicodeEncodeError. (#9246 ) FIX: If chunk["content_with_weight"] contains one or more unpaired surrogate characters (such as incomplete emoji or other special characters), then calling .encode("utf-8") directly will raise a UnicodeEncodeError. ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-06 10:36:50 +08:00
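The failure described in the commit above can be reproduced with a lone surrogate in a Python string. The sketch below shows one possible way to encode such text without raising; the helper name `safe_utf8` is hypothetical, and the choice of `errors="replace"` is an assumption, since the listing does not show how the PR itself handles the error.

```python
def safe_utf8(text: str) -> bytes:
    """Encode text as UTF-8, substituting unpaired surrogates instead of raising."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        # Assumption: replacing each unencodable code point with "?" is acceptable.
        # The actual fix in the PR may use a different strategy.
        return text.encode("utf-8", errors="replace")


# A chunk whose content holds a lone high surrogate (half of an emoji).
chunk = {"content_with_weight": "ok \ud83d broken"}
encoded = safe_utf8(chunk["content_with_weight"])
```

Well-formed text takes the fast path and is encoded unchanged; only strings that would raise are re-encoded with the lossy handler.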
Kevin Hu	2124329e95	Fix: local variable issue. (#9255 ) ### What problem does this PR solve? #9227 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-08-05 19:24:34 +08:00
Kevin Hu	935ce872d8	Refa: remove temperature since some LLMs fail to support. (#8981 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-07-23 10:17:04 +08:00
Kevin Hu	c783d90ba3	Perf: set timeout for building chunks. (#8940 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-21 15:56:45 +08:00
Kevin Hu	ecdb1701df	Perf: test llm before RAPTOR. (#8897 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-17 16:48:50 +08:00
Kevin Hu	fbd115773b	Perf: set timeout of some steps in KG. (#8873 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-16 18:06:03 +08:00
Yongteng Lei	e9b14142a5	Fix: fixed invalid save() arguments for slide thumbnails (#8851 ) ### What problem does this PR solve? Fixed invalid save() arguments for slide thumbnails. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-15 17:19:45 +08:00
Kevin Hu	aa4a725529	Perf: use redis to check if canceled. (#8853 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-15 17:19:27 +08:00
Kevin Hu	24c41d2a61	Perf: make `do_cancel` quicker. (#8846 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-15 14:35:00 +08:00
Kevin Hu	c642dbefca	Perf: Enhance timeout handling. (#8826 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2025-07-15 09:36:45 +08:00
Stephen Hu	d5f6335f99	Fix: The data set created by API call failed to parse after uploading the file. (#8657 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8656 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-04 12:41:28 +08:00
Kevin Hu	e3edcc3064	Trivals. (#8597 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-01 14:05:18 +08:00
Tuan Le	303c6dd1a8	Fix memory leaks in PIL image and BytesIO handling during chunk processing (#8522 ) ### What problem does this PR solve? This PR addresses critical memory leaks in the task executor's image processing pipeline. The current implementation fails to properly dispose of PIL Image objects and BytesIO buffers during chunk processing, leading to progressive memory accumulation that can cause the task executor to consume excessive memory over time. ### Background context - The `upload_to_minio` function processes images from document chunks and converts them to JPEG format for storage. - PIL Image objects hold significant memory resources that must be explicitly closed to prevent memory leaks. - BytesIO objects also consume memory and should be properly disposed of after use. - In high-throughput scenarios with many image-containing documents, these memory leaks can lead to out-of-memory errors and degraded performance. ### Specific issues fixed - PIL Image objects were not being explicitly closed after processing. - BytesIO buffers lacked proper cleanup in all code paths. - Converted images (RGBA/P to RGB) were not disposing of the original image object. - Memory references to large image data were not being cleared promptly. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement ### Changes made - Added explicit `d["image"].close()` calls after image processing operations. - Implemented proper cleanup of converted images when changing formats from RGBA/P to RGB. - Enhanced BytesIO cleanup with `try/finally` blocks to ensure disposal in all code paths. - Added explicit `del d["image"]` to clear memory references after processing. This fix ensures stable memory usage during long-running document processing tasks and prevents potential out-of-memory conditions in production environments.	2025-06-27 10:23:21 +08:00
Stephen Hu	be712714af	Refactor:improve the logic to check cancel (#8524 ) ### What problem does this PR solve? improve the logic to check cancel ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-06-27 10:22:53 +08:00
Stephen Hu	8d9d2cc0a9	Fix: some cases Task return but not set progress (#8469 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8466 I went through the code. Current logic: when do_handle_task raises an exception, handle_task sets the progress, but in some cases do_handle_task just returns internally without setting the right progress. In these cases the Redis stream message is acked but the task still appears to be running. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-06-25 09:58:55 +08:00
WuWeiFlow	bc1b837616	FIX:Saving an RGBA image directly as JPEG will cause an error. If the… (#8399 ) Saving an RGBA image directly as JPEG will cause an error. If the image is in RGBA mode, convert it to RGB mode before saving it in JPG format. ### What problem does this PR solve? During document parsing in the knowledge base, we occasionally encounter the error 'cannot write mode RGBA as JPEG.' This occurs because images in RGBA mode cannot be directly saved as JPEG. They must be converted first before saving. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-24 18:01:13 +08:00
Stephen Hu	794a4102c2	Fix: Document parse via API causes a lot of problems (#8407 ) ### What problem does this PR solve? #8391 #8404 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-06-23 13:08:11 +08:00
Kevin Hu	8f3fe63d73	Fix: duplicated task (#8358 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-06-19 11:12:29 +08:00
Jin Hai	4a2ff633e0	Fix typo in code (#8327 ) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-06-18 09:41:09 +08:00
cutiechi	8f9bcb1c74	Feat: make document parsing and embedding batch sizes configurable via environment variables (#8266 ) ### Description This PR introduces two new environment variables, `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE`, to allow flexible tuning of batch sizes for document parsing and embedding vectorization in RAGFlow. By making these parameters configurable, users can optimize performance and resource usage according to their hardware capabilities and workload requirements. ### What problem does this PR solve? Previously, the batch sizes for document parsing and embedding were hardcoded, limiting the ability to adjust throughput and memory consumption. This PR enables users to set these values via environment variables (in `.env`, Helm chart, or directly in the deployment environment), improving flexibility and scalability for both small and large deployments. - `DOC_BULK_SIZE`: Controls how many document chunks are processed in a single batch during document parsing (default: 4). - `EMBEDDING_BATCH_SIZE`: Controls how many text chunks are processed in a single batch during embedding vectorization (default: 16). This change updates the codebase, documentation, and configuration files to reflect the new options. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [x] Performance Improvement - [ ] Other (please describe): ### Additional context - Updated `.env`, `helm/values.yaml`, and documentation to describe the new variables. - Modified relevant code paths to use the environment variables instead of hardcoded values. - Users can now tune these parameters to achieve better throughput or reduce memory usage as needed. 
Before: Default value: [screenshot omitted] After: 10x: [screenshot omitted]	2025-06-16 13:40:47 +08:00
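As a rough illustration of the commit above, the following sketch reads `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE` from the environment with the defaults stated in the PR (4 and 16) and uses them to slice work into batches. The helpers `env_int` and `batched` are hypothetical names introduced here, and the fallback-on-invalid-value behavior is an assumption, not something the PR description confirms.

```python
import os
from typing import Iterator, List, Sequence


def env_int(name: str, default: int) -> int:
    """Read a positive integer from the environment, else return the default."""
    raw = os.environ.get(name, "")
    try:
        value = int(raw)
    except ValueError:
        # Unset or non-numeric: fall back (assumed behavior, for illustration).
        return default
    return value if value > 0 else default


# Defaults taken from the PR description.
DOC_BULK_SIZE = env_int("DOC_BULK_SIZE", 4)
EMBEDDING_BATCH_SIZE = env_int("EMBEDDING_BATCH_SIZE", 16)


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Larger values trade memory for throughput, which is the tuning knob the PR describes; for example, `batched(chunks, EMBEDDING_BATCH_SIZE)` would hand the embedding step up to 16 chunks at a time by default.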
Yongteng Lei	8f9e7a6f6f	Refa: revert to original task message collection logic (#8251 ) ### What problem does this PR solve? Get rid of 'RedisDB.get_unacked_iterator queue rag_flow_svr_queue_1 doesn't exist' ---- Edit: revert to original message collection logic. ### Type of change - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-06-13 16:38:53 +08:00
Yongteng Lei	24ca4cc6b7	Refa: GraphRAG and explaining GraphRAG stalling behavior on large files (#8223 ) ### What problem does this PR solve? This PR investigates the cause of #7957. TL;DR: Incorrect similarity calculations lead to too many candidates. Since candidate selection involves interaction with the LLM, this causes significant delays in the program. What this PR does: 1. Fix similarity calculation: When processing a 64 pages government document, the corrected similarity calculation reduces the number of candidates from over 100,000 to around 16,000. With a default batch size of 100 pairs per LLM call, this fix reduces unnecessary LLM interactions from over 1,000 calls to around 160, a roughly 10x improvement. 2. Add concurrency and timeout limits: Up to 5 entity types are processed in "parallel", each with a 180-second timeout. These limits may be configurable in future updates. 3. Improve logging: The candidate resolution process now reports progress in real time. 4. Mitigates potential concurrency risks ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2025-06-12 19:09:50 +08:00
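The numbers in the commit above follow from simple division: at 100 pairs per LLM call, about 100,000 candidates need about 1,000 calls and about 16,000 need about 160. The sketch below works that out and also illustrates the stated limits (up to 5 entity types at once, 180 seconds each). It uses `asyncio` as a stand-in for illustration; the function names `llm_calls` and `resolve_all`, and the choice to return `None` on timeout, are assumptions and not the project's actual code.

```python
import asyncio
import math

BATCH_SIZE = 100         # pairs per LLM call (default stated in the PR)
MAX_PARALLEL_TYPES = 5   # entity types processed concurrently
TYPE_TIMEOUT_S = 180     # per-entity-type timeout, in seconds


def llm_calls(candidate_pairs: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of LLM calls needed to cover all candidate pairs."""
    return math.ceil(candidate_pairs / batch_size)


async def resolve_all(entity_types, resolve_one, timeout=TYPE_TIMEOUT_S):
    """Run resolve_one per entity type, bounded in concurrency and time."""
    limiter = asyncio.Semaphore(MAX_PARALLEL_TYPES)

    async def guarded(entity_type):
        async with limiter:
            try:
                return await asyncio.wait_for(resolve_one(entity_type), timeout)
            except asyncio.TimeoutError:
                # Assumption: a timed-out type is skipped so the others proceed.
                return None

    return await asyncio.gather(*(guarded(t) for t in entity_types))
```

Bounding both the fan-out and the per-type duration means one pathological entity type cannot stall the whole resolution step.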
Wanderson Pinto dos Santos	0e03542db5	fix: single task executor getting all tasks from Redis queue (#7330 ) ### What problem does this PR solve? Currently, as long as there are tasks in Redis, this loop will keep getting the tasks. This will lead to a single task executor with many tasks in the pending state. Then we need to wait for the pending tasks to get them back in the queue. In first place, if we set the `MAX_CONCURRENT_TASKS` to X, then only X tasks should be picked from the queue, and others should be left in the queue for other `task_executors` or be picked after 1 of the spots in the current executor gets free. This PR ensures this behavior. The additional changes were due to the Ruff linting in pre-commit. But I believe these are expected to keep the coding style. ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2025-06-06 14:32:35 +08:00
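The behavior this commit describes, taking a task off the queue only when a slot is free, can be sketched as follows. This is a minimal illustration using `asyncio` and an in-memory `asyncio.Queue` rather than the project's Redis queue and task executor; `run_executor`, `handle`, and the returned peak counter are hypothetical names added for the example.

```python
import asyncio


async def run_executor(queue: asyncio.Queue, handle, max_concurrent: int) -> int:
    """Drain the queue while holding at most max_concurrent tasks at once.

    Returns the peak number of tasks held simultaneously.
    """
    slots = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0
    jobs = []

    async def worker(task):
        nonlocal in_flight
        try:
            await handle(task)
        finally:
            in_flight -= 1
            slots.release()  # free the slot so the next task can be pulled

    while True:
        # Wait for a free slot BEFORE taking a task, so surplus tasks stay
        # in the queue where another executor could pick them up.
        await slots.acquire()
        try:
            task = queue.get_nowait()
        except asyncio.QueueEmpty:
            slots.release()
            break
        in_flight += 1
        peak = max(peak, in_flight)
        jobs.append(asyncio.create_task(worker(task)))

    await asyncio.gather(*jobs)
    return peak
```

The ordering is the point: acquiring the slot first keeps the executor from accumulating pending tasks it cannot yet run.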
Stephen Hu	273f36cc54	Perf: reduce upload to minio limiter scope (#7878 ) ### What problem does this PR solve? Reduce the scope of the upload_to_minio limiter. ### Type of change - [x] Performance Improvement	2025-05-27 17:49:37 +08:00
Kevin Hu	28cb4df127	Fix: raptor overloading (#7889 ) ### What problem does this PR solve? #7840 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-27 17:41:35 +08:00
Kevin Hu	959793e83c	Fix: task limiter issue. (#7873 ) ### What problem does this PR solve? #7869 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-27 11:16:29 +08:00
Kevin Hu	be83074131	Fix: restore task limiter. (#7844 ) ### What problem does this PR solve? Close #7828 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-26 10:59:01 +08:00
Stephen Hu	ce816edb5f	Fix: improve task cancel lag (#7765 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7761 It may be difficult to achieve zero delay, which would require passing the cancel token to all parts. Another solution is to show a zero-delay effect in the UI only and let the task stop later. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-22 09:28:08 +08:00
Stephen Hu	e3e7c7ddaa	Feat: delete useless image blobs when task executor meet edge cases (#7727 ) ### What problem does this PR solve? delete useless image blobs when the task executor meets edge cases ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-05-21 10:22:30 +08:00
S0b3Rr	5d21cc3660	fix: Fix the problem that concurrent execution limit in task executor fails and causes OOM (issue#7580) (#7700 ) ### What problem does this PR solve? ## Cause of the bug: During the execution process, due to improper use of trio CapacityLimiter, the configuration parameter MAX_CONCURRENT_TASKS is invalid, causing the executor to take out a large number of tasks from the Redis queue at one time. This behavior will cause the task executor to occupy too much memory and be killed by the OS when a large number of tasks exist at the same time. As a result, all executing tasks are suspended. ## Fix: Added the task_manager method to the entry of /rag/svr/task_executor.py to make CapacityLimiter effective. Deleted the invalid async with statement. ## Fix result: After testing, the task executor execution meets expectations, that is: concurrent execution of up to $MAX_CONCURRENT_TASKS tasks. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-05-19 10:25:56 +08:00
alkscr	4ae8f87754	Fix: missing graph resolution and community extraction in graphrag tasks (#7586 ) ### What problem does this PR solve? Whether to apply graph resolution and community extraction is stored in `task["kb_parser_config"]`. However, the previous code read `graphrag_conf` from `task["parser_config"]`, so `with_resolution` and `with_community` were always false. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2025-05-13 09:21:03 +08:00
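A small sketch of the bug class described above: reading the flags from the wrong config dictionary silently yields the defaults. Only `task["kb_parser_config"]`, `task["parser_config"]`, `graphrag_conf`, `with_resolution`, and `with_community` come from the commit message; the nested key names `"graphrag"`, `"resolution"`, and `"community"` and the function `graphrag_flags` are assumptions for illustration.

```python
from typing import Tuple


def graphrag_flags(task: dict, source_key: str = "kb_parser_config") -> Tuple[bool, bool]:
    """Return (with_resolution, with_community) read from the given config key."""
    # Assumed layout: the GraphRAG options live under a "graphrag" sub-dict.
    graphrag_conf = task.get(source_key, {}).get("graphrag", {})
    with_resolution = bool(graphrag_conf.get("resolution", False))
    with_community = bool(graphrag_conf.get("community", False))
    return with_resolution, with_community


task = {
    "parser_config": {},  # document-level config: no GraphRAG options here
    "kb_parser_config": {"graphrag": {"resolution": True, "community": True}},
}
```

With this task, `graphrag_flags(task)` reads the knowledge-base config and sees both options enabled, while `graphrag_flags(task, "parser_config")` mimics the old lookup and falls back to both flags being false.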
hfrt456	332e6ffbd4	Fix:local_es_tag (#7534 ) Two cases arise when the local ES tag search returns results that are then filtered out by score: 1: the doc ends up with an empty tag and the LLM is not visited; 2: the code may use empty examples in the prompt for the LLM tag search. Co-authored-by: huangfuqunze <huangfuqunze.hfqz@alibaba-inc.com>	2025-05-09 10:17:24 +08:00

176 Commits