Support server health check by http://localhost:<port>/v1/system/healthz (#10150)

### What problem does this PR solve?

Support server health check. Solved issue: #10106

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-25 06:26:16 +00:00 · 2025-09-19 11:11:07 +08:00 · 2025-09-19 11:11:07 +08:00 · a24547aa66
commit a24547aa66
parent a04c5247ab
5 changed files with 153 additions and 122 deletions
--- a/api/apps/HEALTHCHECK_TESTING.md
+++ b/api/apps/HEALTHCHECK_TESTING.md
@ -1,105 +0,0 @@
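-# Health Checks and Kubernetes Probes: A Brief Guide
-
-This document explains what K8s probes are, how to use `/v1/system/healthz` for health checks, and what the key terms in the use case below mean.
-
-## What is a K8s probe
- A probe is the mechanism K8s uses to check whether a container is healthy and able to serve traffic.
- There are three common types:
-  - livenessProbe: when it fails, K8s restarts the container. Used for self-healing when the application hangs or loses its connections.
-  - readinessProbe: when it fails, the endpoint is kept out of the Service's load balancing. Used to keep traffic away until the application is ready.
-  - startupProbe: gives slow-starting applications a longer initialization window, during which liveness and readiness probes are not run.
- These probes usually send an HTTP GET to a public, lightweight health endpoint (no authentication required) and judge the result by HTTP status code: 200 = pass; 5xx or timeout = fail.
-
-## Health endpoint in this project
- Implemented: `GET /v1/system/healthz` (no authentication required).
- Semantics:
-  - 200: critical dependencies are healthy.
-  - 500: any critical dependency is failing (currently DB or Chat).
-  - Response body: JSON with the minimal fields `status, db, chat`, plus observability fields such as `redis, doc_engine, storage`. Failed items include `error/elapsed` under `_meta`.
- Example (DB failure):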
-```json
-{"status":"nok","chat":"ok","db":"nok"}
-```
-
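-## Use case background (Problem/use case)
- Current state: Ragflow runs on K8s with an AWS RDS Postgres database; credentials are managed by Secret Manager and rotated every 7 days. After rotation the application's connections become invalid, and Pods must be restarted manually to reconnect.
- Goal: use K8s probes to detect and restart unhealthy Pods automatically, reducing manual work.
- Requirement: a public health endpoint that needs no authentication, returns a non-200 status (such as 500) when a dependency fails, and provides JSON details.
- Status: `/v1/system/healthz` is designed for exactly this.
-
-## Key terms (matching the description you provided)
- Ragflow instance: the Ragflow service deployed on K8s.
- AWS RDS Postgres: a managed PostgreSQL database instance.
- Secret Manager rotation: secrets are rotated periodically (every 7 days), which invalidates existing connections.
- Probes (K8s probes): liveness/readiness, used to restart or remove unhealthy instances automatically.
- Public endpoint without API key: an HTTP route that needs no Authorization header, so probes can call it directly.
- Dependencies statuses: health of each dependency (db, chat, redis, doc_engine, storage, etc.).
- HTTP 500 with JSON: when a dependency fails, return 500 with JSON indicating which subsystem failed.
-
-## Quick test
- Healthy: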
-```bash
-curl -i http://<host>/v1/system/healthz
-```
- Simulate a DB failure (docker-compose example):
-```bash
-docker compose stop db && curl -i http://<host>/v1/system/healthz
-```
-(Expected: 500, with `db:"nok"` in the JSON)
-
-## A more complete test checklist
-### 1) Check only the HTTP status code
-```bash
-curl -s -o /dev/null -w "%{http_code}\n" http://<host>/v1/system/healthz
-```
-Expected: `200` or `500`.
-
-### 2) Windows PowerShell
-```powershell
-# Status code
-(Invoke-WebRequest -Uri "http://<host>/v1/system/healthz" -Method GET -TimeoutSec 3 -ErrorAction SilentlyContinue).StatusCode
-# Full response
-Invoke-RestMethod -Uri "http://<host>/v1/system/healthz" -Method GET
-```
-
-### 3) Local testing via kubectl port-forward
-```bash
-# The exposed frontend/gateway port differs by environment; adjust as needed
-kubectl port-forward deploy/<your-deploy> 8080:80 -n <ns>
-curl -i http://127.0.0.1:8080/v1/system/healthz
-```
-
-### 4) Simulate common failure scenarios
- DB failure (recommended):
-```bash
-docker compose stop db
-curl -i http://<host>/v1/system/healthz   # expect 500
-```
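- Chat failure (optional): set `factory`/`base_url` in `CHAT_CFG` to an invalid value and restart the backend; the next request should return 500 with `chat:"nok"`.
- Redis/storage/document engine: stop the corresponding service and request again; the JSON shows the matching field as `"nok"` (this does not affect the 200/500 decision).
-
-### 5) Browser verification
- Open `http://<host>/v1/system/healthz` directly and check for 200/500 in the DevTools Network tab; the page body is the JSON itself.
- Reverse proxy note: if a custom 500 error page is configured, disable error-page interception for `/healthz` (e.g. `proxy_intercept_errors off;`).
-
-## K8s probe example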
-```yaml
-readinessProbe:
-  httpGet:
-    path: /v1/system/healthz
-    port: 80
-  initialDelaySeconds: 5
-  periodSeconds: 10
-  timeoutSeconds: 2
-  failureThreshold: 1
-livenessProbe:
-  httpGet:
-    path: /v1/system/healthz
-    port: 80
-  initialDelaySeconds: 10
-  periodSeconds: 10
-  timeoutSeconds: 2
-  failureThreshold: 3
-```
-
-Tip: if a reverse proxy (Nginx) serves a custom 500 error page, disable error-page interception for `/healthz` so the JSON is preserved.
--- a/api/apps/system_app.py
+++ b/api/apps/system_app.py
@ -37,7 +37,7 @@ from timeit import default_timer as timer

 from rag.utils.redis_conn import REDIS_CONN
 from flask import jsonify
-from api.utils.health import run_health_checks
+from api.utils.health_utils import run_health_checks

@manager.route("/version", methods=["GET"])  # noqa: F821
@login_required
--- a/api/utils/health_utils.py
+++ b/api/utils/health_utils.py
@ -48,31 +48,16 @@ def check_storage() -> tuple[bool, dict]:
        return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}


-def check_chat() -> tuple[bool, dict]:
-    st = timer()
-    try:
-        cfg = getattr(settings, "CHAT_CFG", None)
-        ok = bool(cfg and cfg.get("factory"))
-        return ok, {"elapsed": f"{(timer() - st) * 1000.0:.1f}"}
-    except Exception as e:
-        return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}


 def run_health_checks() -> tuple[dict, bool]:
    result: dict[str, str | dict] = {}

    db_ok, db_meta = check_db()
-    chat_ok, chat_meta = check_chat()
-
    result["db"] = _ok_nok(db_ok)
    if not db_ok:
        result.setdefault("_meta", {})["db"] = db_meta

-    result["chat"] = _ok_nok(chat_ok)
-    if not chat_ok:
-        result.setdefault("_meta", {})["chat"] = chat_meta
-
-    # Optional probes (do not change minimal contract but exposed for observability)
    try:
        redis_ok, redis_meta = check_redis()
        result["redis"] = _ok_nok(redis_ok)
@ -97,7 +82,8 @@ def run_health_checks() -> tuple[dict, bool]:
    except Exception:
        result["storage"] = "nok"

-    all_ok = (result.get("db") == "ok") and (result.get("chat") == "ok")
+
+    all_ok = (result.get("db") == "ok") and (result.get("redis") == "ok") and (result.get("doc_engine") == "ok") and (result.get("storage") == "ok")
    result["status"] = "ok" if all_ok else "nok"
    return result, all_ok
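
The hunk above changes the overall status so that it depends on `db`, `redis`, `doc_engine`, and `storage` rather than on `db` and `chat`. As an illustration only, the same aggregation rule can be written in loop form. This is a minimal sketch, not the code in `health_utils.py`: the `aggregate` helper and the `Check` alias are hypothetical names, and it assumes every check returns `(ok, meta)` the way `check_db()` does.

```python
from timeit import default_timer as timer
from typing import Callable

# Hypothetical alias: a check takes no arguments and returns (ok, meta).
Check = Callable[[], tuple[bool, dict]]


def aggregate(checks: dict[str, Check]) -> tuple[dict, bool]:
    """Run each check, report "ok"/"nok" per name, and derive the overall status."""
    result: dict = {}
    for name, check in checks.items():
        st = timer()
        try:
            ok, meta = check()
        except Exception as e:
            # A check that raises is treated as a failed dependency.
            ok, meta = False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)}
        result[name] = "ok" if ok else "nok"
        if not ok:
            # Details are only recorded for failing checks.
            result.setdefault("_meta", {})[name] = meta
    # Overall health requires every listed dependency to be "ok".
    all_ok = all(result[name] == "ok" for name in checks)
    result["status"] = "ok" if all_ok else "nok"
    return result, all_ok
```

With this shape, adding or removing a dependency from the overall status means editing the `checks` mapping rather than extending a chained `and` expression.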

--- a/docs/guides/run_health_check.md
+++ b/docs/guides/run_health_check.md
@ -31,3 +31,79 @@ You can click on a specific 30-second time interval to view the details of compl
 ![done_tasks](https://github.com/user-attachments/assets/49b25ec4-03af-48cf-b2e5-c892f6eaa261)

 ![done_vs_failed](https://github.com/user-attachments/assets/eaa928d0-a31c-4072-adea-046091e04599)
+
+## API Health Check
+
+In addition to checking the system dependencies from the **avatar > System** page in the UI, you can directly query the backend health check endpoint:
+
+```bash
+http://IP_OF_YOUR_MACHINE:<port>/v1/system/healthz
+```
+
+Here `<port>` refers to the actual port of your backend service (e.g., `7897`, `9222`, etc.).
+
+Key points:
+- **No login required** (no `@login_required` decorator)
+- Returns results in JSON format
+- If all dependencies are healthy → HTTP **200 OK**
+- If any dependency fails → HTTP **500 Internal Server Error**
+
+### Example 1: All services healthy (HTTP 200)
+
+```bash
+http://127.0.0.1/v1/system/healthz
+```
+
+Response:
+
+```http
+HTTP/1.1 200 OK
+Content-Type: application/json
+Content-Length: 120
+
+{
+  "db": "ok",
+  "redis": "ok",
+  "doc_engine": "ok",
+  "storage": "ok",
+  "status": "ok"
+}
+```
+
+Explanation:
+- Database (MySQL/Postgres), Redis, document engine (Elasticsearch/Infinity), and object storage (MinIO) are all healthy.
+- The `status` field returns `"ok"`.
+
+### Example 2: One service unhealthy (HTTP 500)
+
+For example, if Redis is down:
+
+Response:
+
+```http
+HTTP/1.1 500 INTERNAL SERVER ERROR
+Content-Type: application/json
+Content-Length: 300
+
+{
+  "db": "ok",
+  "redis": "nok",
+  "doc_engine": "ok",
+  "storage": "ok",
+  "status": "nok",
+  "_meta": {
+    "redis": {
+      "elapsed": "5.2",
+      "error": "Lost connection!"
+    }
+  }
+}
+```
+
+Explanation:
+- `redis` is marked as `"nok"`, with detailed error info under `_meta.redis.error`.
+- The overall `status` is `"nok"`, so the endpoint returns 500.
+
+---
+
+This endpoint allows you to monitor RAGFlow’s core dependencies programmatically in scripts or external monitoring systems, without relying on the frontend UI.
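
For scripts that consume this endpoint, one detail worth noting is that Python's `urllib` raises `HTTPError` on a 500 response, so the JSON body has to be read from the exception. The following is a minimal sketch under that assumption; `fetch_health` and `failing_services` are hypothetical helper names, and the payload shape is taken from the examples in the guide above.

```python
import json
import urllib.error
import urllib.request


def fetch_health(url: str, timeout: float = 3.0) -> tuple[int, dict]:
    """Return (HTTP status code, parsed JSON body) for a healthz URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error object still carries the body.
        return err.code, json.loads(err.read().decode("utf-8"))


def failing_services(body: dict) -> dict[str, dict]:
    """Map each service reported as "nok" to its `_meta` details, if any."""
    meta = body.get("_meta", {})
    return {
        name: meta.get(name, {})
        for name, state in body.items()
        if name not in ("status", "_meta") and state == "nok"
    }
```

A script might call `fetch_health("http://localhost:7897/v1/system/healthz")` (the address is an assumption; use your backend's host and port) and then log or alert on whatever `failing_services` returns.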
--- a/docs/references/http_api_reference.md
+++ b/docs/references/http_api_reference.md
@ -4102,3 +4102,77 @@ Failure:
 ```

 ---
+
+### System
+---
+### Check system health
+
+**GET** `/v1/system/healthz`
+
+Check the health status of RAGFlow’s dependencies (database, Redis, document engine, object storage).
+
+#### Request
+
+- Method: GET
+- URL: `/v1/system/healthz`
+- Headers:
+  - `'Content-Type: application/json'`
+  - No `Authorization` header is required.
+
+##### Request example
+
+```bash
+curl --request GET \
+     --url http://{address}/v1/system/healthz \
+     --header 'Content-Type: application/json'
+```
+
+##### Request parameters
+
+- `address`: (*Path parameter*), string  
+  The host and port of the backend service (e.g., `localhost:7897`).
+
+---
+
+#### Responses
+
+- **200 OK** – All services healthy
+
+```http
+HTTP/1.1 200 OK
+Content-Type: application/json
+
+{
+  "db": "ok",
+  "redis": "ok",
+  "doc_engine": "ok",
+  "storage": "ok",
+  "status": "ok"
+}
+```
+
+- **500 Internal Server Error** – At least one service unhealthy
+
+```http
+HTTP/1.1 500 INTERNAL SERVER ERROR
+Content-Type: application/json
+
+{
+  "db": "ok",
+  "redis": "nok",
+  "doc_engine": "ok",
+  "storage": "ok",
+  "status": "nok",
+  "_meta": {
+    "redis": {
+      "elapsed": "5.2",
+      "error": "Lost connection!"
+    }
+  }
+}
+```
+
+Explanation:  
+- Each service is reported as "ok" or "nok".  
+- The top-level `status` reflects overall health.  
+- If any service is "nok", detailed error info appears in `_meta`.
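
Because the endpoint signals health through both the status code and the top-level `status` field, a deployment script can gate on it before sending traffic. The sketch below shows one way to do that; `wait_until_healthy` and the `Probe` alias are hypothetical names, and the probe callable is assumed to return `(status_code, json_body)` for a single request to `/v1/system/healthz`.

```python
import time
from typing import Callable

# Hypothetical alias: one request to the health endpoint, returning (code, body).
Probe = Callable[[], tuple[int, dict]]


def wait_until_healthy(probe: Probe, attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `probe` until it reports HTTP 200 with status "ok", or give up."""
    for attempt in range(attempts):
        try:
            code, body = probe()
        except OSError:
            # Connection refused, timeout, or DNS failure: treat as not ready yet.
            code, body = 0, {}
        if code == 200 and body.get("status") == "ok":
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next attempt
    return False
```

Passing the probe in as a callable keeps the retry logic independent of the HTTP client, which also makes it easy to exercise with canned responses.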