diff --git a/api/apps/HEALTHCHECK_TESTING.md b/api/apps/HEALTHCHECK_TESTING.md deleted file mode 100644 index a97a03c0e..000000000 --- a/api/apps/HEALTHCHECK_TESTING.md +++ /dev/null @@ -1,105 +0,0 @@ -# 健康检查与 Kubernetes 探针简明说明 - -本文件说明:什么是 K8s 探针、如何用 `/v1/system/healthz` 做健康检查,以及下文用例中的关键词含义。 - -## 什么是 K8s 探针(Probe) -- 探针是 K8s 用来“探测”容器是否健康/可对外服务的机制。 -- 常见三类: - - livenessProbe:活性探针。失败时 K8s 会重启容器,用于“应用卡死/失去连接时自愈”。 - - readinessProbe:就绪探针。失败时 Endpoint 不会被加入 Service 负载均衡,用于“应用尚未准备好时不接流量”。 - - startupProbe:启动探针。给慢启动应用更长的初始化窗口,期间不执行 liveness/readiness。 -- 这些探针通常通过 HTTP GET 访问一个公开且轻量的健康端点(无需鉴权),以 HTTP 状态码判定结果:200=通过;5xx/超时=失败。 - -## 本项目健康端点 -- 已实现:`GET /v1/system/healthz`(无需认证)。 -- 语义: - - 200:关键依赖正常。 - - 500:任一关键依赖异常(当前判定为 DB 或 Chat)。 - - 响应体:JSON,最小字段 `status, db, chat`;并包含 `redis, doc_engine, storage` 等可观测项。失败项会在 `_meta` 中包含 `error/elapsed`。 -- 示例(DB 故障): -```json -{"status":"nok","chat":"ok","db":"nok"} -``` - -## 用例背景(Problem/use case) -- 现状:Ragflow 跑在 K8s,数据库是 AWS RDS Postgres,凭证由 Secret Manager 管理并每 7 天轮换。轮换后应用连接失效,需要手动重启 Pod 才能重新建立连接。 -- 目标:通过 K8s 探针自动化检测并重启异常 Pod,减少人工操作。 -- 需求:一个“无需鉴权”的公共健康端点,能在依赖异常时返回非 200(如 500)且提供 JSON 详情。 -- 现已满足:`/v1/system/healthz` 正是为此设计。 - -## 关键术语解释(对应你提供的描述) -- Ragflow instance:部署在 K8s 的 Ragflow 服务。 -- AWS RDS Postgres:托管的 PostgreSQL 数据库实例。 -- Secret Manager rotation:Secrets 定期轮换(每 7 天),会导致旧连接失效。 -- Probes(K8s 探针):liveness/readiness,用于自动重启或摘除不健康实例。 -- Public endpoint without API key:无需 Authorization 的 HTTP 路由,便于探针直接访问。 -- Dependencies statuses:依赖健康状态(db、chat、redis、doc_engine、storage 等)。 -- HTTP 500 with JSON:当依赖异常时返回 500,并附带 JSON 说明哪个子系统失败。 - -## 快速测试 -- 正常: -```bash -curl -i http:///v1/system/healthz -``` -- 制造 DB 故障(docker-compose 示例): -```bash -docker compose stop db && curl -i http:///v1/system/healthz -``` -(预期 500,JSON 中 `db:"nok"`) - -## 更完整的测试清单 -### 1) 仅查看 HTTP 状态码 -```bash -curl -s -o /dev/null -w "%{http_code}\n" http:///v1/system/healthz -``` -期望:`200` 或 `500`。 - -### 2) Windows PowerShell -```powershell -# 状态码 -(Invoke-WebRequest -Uri "http:///v1/system/healthz" -Method GET -TimeoutSec 3 -ErrorAction SilentlyContinue).StatusCode -# 完整响应 -Invoke-RestMethod -Uri "http:///v1/system/healthz" -Method GET -``` - -### 3) 通过 kubectl 端口转发本地测试 -```bash -# 前端/网关暴露端口不同环境自行调整 -kubectl port-forward deploy/ 8080:80 -n -curl -i http://127.0.0.1:8080/v1/system/healthz -``` - -### 4) 制造常见失败场景 -- DB 失败(推荐): -```bash -docker compose stop db -curl -i http:///v1/system/healthz # 预期 500 -``` -- Chat 失败(可选):将 `CHAT_CFG` 的 `factory`/`base_url` 设为无效并重启后端,再请求应为 500,且 `chat:"nok"`。 -- Redis/存储/文档引擎:停用对应服务后再次请求,可在 JSON 中看到相应字段为 `"nok"`(不影响 200/500 判定)。 - -### 5) 浏览器验证 -- 直接打开 `http:///v1/system/healthz`,在 DevTools Network 查看 200/500;页面正文就是 JSON。 -- 反向代理注意:若有自定义 500 错页,需对 `/healthz` 关闭错误页拦截(如 `proxy_intercept_errors off;`)。 - -## K8s 探针示例 -```yaml -readinessProbe: - httpGet: - path: /v1/system/healthz - port: 80 - initialDelaySeconds: 5 - periodSeconds: 10 - timeoutSeconds: 2 - failureThreshold: 1 -livenessProbe: - httpGet: - path: /v1/system/healthz - port: 80 - initialDelaySeconds: 10 - periodSeconds: 10 - timeoutSeconds: 2 - failureThreshold: 3 -``` - -提示:如有反向代理(Nginx)自定义 500 错页,需对 `/healthz` 关闭错误页拦截,以便保留 JSON。 diff --git a/api/apps/system_app.py b/api/apps/system_app.py index df17e4b57..0ec808fb9 100644 --- a/api/apps/system_app.py +++ b/api/apps/system_app.py @@ -37,7 +37,7 @@ from timeit import default_timer as timer from rag.utils.redis_conn import REDIS_CONN from flask import jsonify -from api.utils.health import run_health_checks +from api.utils.health_utils import run_health_checks @manager.route("/version", methods=["GET"]) # noqa: F821 @login_required diff --git a/api/utils/health.py b/api/utils/health_utils.py similarity index 79% rename from api/utils/health.py rename to api/utils/health_utils.py index 394154b9a..0d47df081 100644 --- a/api/utils/health.py +++ b/api/utils/health_utils.py @@ -48,31 +48,16 @@ def check_storage() -> tuple[bool, dict]: return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)} -def check_chat() -> tuple[bool, dict]: - st = timer() - try: - cfg = getattr(settings, "CHAT_CFG", None) - ok = bool(cfg and cfg.get("factory")) - return ok, {"elapsed": f"{(timer() - st) * 1000.0:.1f}"} - except Exception as e: - return False, {"elapsed": f"{(timer() - st) * 1000.0:.1f}", "error": str(e)} def run_health_checks() -> tuple[dict, bool]: result: dict[str, str | dict] = {} db_ok, db_meta = check_db() - chat_ok, chat_meta = check_chat() - result["db"] = _ok_nok(db_ok) if not db_ok: result.setdefault("_meta", {})["db"] = db_meta - result["chat"] = _ok_nok(chat_ok) - if not chat_ok: - result.setdefault("_meta", {})["chat"] = chat_meta - - # Optional probes (do not change minimal contract but exposed for observability) try: redis_ok, redis_meta = check_redis() result["redis"] = _ok_nok(redis_ok) @@ -97,7 +82,8 @@ def run_health_checks() -> tuple[dict, bool]: except Exception: result["storage"] = "nok" - all_ok = (result.get("db") == "ok") and (result.get("chat") == "ok") + + all_ok = (result.get("db") == "ok") and (result.get("redis") == "ok") and (result.get("doc_engine") == "ok") and (result.get("storage") == "ok") result["status"] = "ok" if all_ok else "nok" return result, all_ok diff --git a/docs/guides/run_health_check.md b/docs/guides/run_health_check.md index f07538460..5ed7cfdf6 100644 --- a/docs/guides/run_health_check.md +++ b/docs/guides/run_health_check.md @@ -31,3 +31,79 @@ You can click on a specific 30-second time interval to view the details of compl ![done_tasks](https://github.com/user-attachments/assets/49b25ec4-03af-48cf-b2e5-c892f6eaa261) ![done_vs_failed](https://github.com/user-attachments/assets/eaa928d0-a31c-4072-adea-046091e04599) + +## API Health Check + +In addition to checking the system dependencies from the **avatar > System** page in the UI, you can directly query the backend health check endpoint: + +```bash +http://IP_OF_YOUR_MACHINE/v1/system/healthz +``` + +Here `` refers to the actual port of your backend service (e.g., `7897`, `9222`, etc.). + +Key points: +- **No login required** (no `@login_required` decorator) +- Returns results in JSON format +- If all dependencies are healthy → HTTP **200 OK** +- If any dependency fails → HTTP **500 Internal Server Error** + +### Example 1: All services healthy (HTTP 200) + +```bash +http://127.0.0.1/v1/system/healthz +``` + +Response: + +```http +HTTP/1.1 200 OK +Content-Type: application/json +Content-Length: 120 + +{ + "db": "ok", + "redis": "ok", + "doc_engine": "ok", + "storage": "ok", + "status": "ok" +} +``` + +Explanation: +- Database (MySQL/Postgres), Redis, document engine (Elasticsearch/Infinity), and object storage (MinIO) are all healthy. +- The `status` field returns `"ok"`. + +### Example 2: One service unhealthy (HTTP 500) + +For example, if Redis is down: + +Response: + +```http +HTTP/1.1 500 INTERNAL SERVER ERROR +Content-Type: application/json +Content-Length: 300 + +{ + "db": "ok", + "redis": "nok", + "doc_engine": "ok", + "storage": "ok", + "status": "nok", + "_meta": { + "redis": { + "elapsed": "5.2", + "error": "Lost connection!" + } + } +} +``` + +Explanation: +- `redis` is marked as `"nok"`, with detailed error info under `_meta.redis.error`. +- The overall `status` is `"nok"`, so the endpoint returns 500. + +--- + +This endpoint allows you to monitor RAGFlow’s core dependencies programmatically in scripts or external monitoring systems, without relying on the frontend UI. diff --git a/docs/references/http_api_reference.md b/docs/references/http_api_reference.md index b112e8618..062c942d7 100644 --- a/docs/references/http_api_reference.md +++ b/docs/references/http_api_reference.md @@ -4102,3 +4102,77 @@ Failure: ``` --- + +### System +--- +### Check system health + +**GET** `/v1/system/healthz` + +Check the health status of RAGFlow’s dependencies (database, Redis, document engine, object storage). + +#### Request + +- Method: GET +- URL: `/v1/system/healthz` +- Headers: + - 'Content-Type: application/json' + (no Authorization required) + +##### Request example + +```bash +curl --request GET + --url http://{address}/v1/system/healthz + --header 'Content-Type: application/json' +``` + +##### Request parameters + +- `address`: (*Path parameter*), string + The host and port of the backend service (e.g., `localhost:7897`). + +--- + +#### Responses + +- **200 OK** – All services healthy + +```http +HTTP/1.1 200 OK +Content-Type: application/json + +{ + "db": "ok", + "redis": "ok", + "doc_engine": "ok", + "storage": "ok", + "status": "ok" +} +``` + +- **500 Internal Server Error** – At least one service unhealthy + +```http +HTTP/1.1 500 INTERNAL SERVER ERROR +Content-Type: application/json + +{ + "db": "ok", + "redis": "nok", + "doc_engine": "ok", + "storage": "ok", + "status": "nok", + "_meta": { + "redis": { + "elapsed": "5.2", + "error": "Lost connection!" + } + } +} +``` + +Explanation: +- Each service is reported as "ok" or "nok". +- The top-level `status` reflects overall health. +- If any service is "nok", detailed error info appears in `_meta`.