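Use OpenClaw's diagnostics-prometheus plugin to expose runtime metrics in the standard Prometheus format. Install and enable the plugin, make sure diagnostics.enabled: true is set, then scrape /api/diagnostics/prometheus with a token that carries operator permissions. By default the exporter keeps at most 2048 time series in memory; series beyond that cap are dropped, and each drop increments the openclaw_prometheus_series_dropped_total counter. This page also covers PromQL examples, the label policy, and how to choose between this plugin and OpenTelemetry export.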

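OpenClaw Prometheus metrics: configuration and troubleshooting

OpenClaw can expose diagnostics metrics through the official diagnostics-prometheus plugin. The plugin listens for internal diagnostics events and serves a Prometheus text endpoint at: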

GET /api/diagnostics/prometheus

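The content type is text/plain; version=0.0.4; charset=utf-8, the standard Prometheus text exposition format.

::: warning This route uses Gateway authentication (operator scope). Do not expose it as a public, unauthenticated /metrics endpoint. Scrape it through the same authentication path you use for other operator APIs. :::

For traces, logs, OTLP push, and OpenTelemetry GenAI semantic attributes, see OpenTelemetry export.

Quick start

Install the plugin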

```bash
openclaw plugins install clawhub:@openclaw/diagnostics-prometheus
```

Enable the plugin

Config file

    ```json5
    {
      plugins: {
        allow: ["diagnostics-prometheus"],
        entries: {
          "diagnostics-prometheus": { enabled: true },
        },
      },
      diagnostics: {
        enabled: true,
      },
    }
    ```

CLI

    ```bash
    openclaw plugins enable diagnostics-prometheus
    ```

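Restart the Gateway

HTTP routes are registered when the plugin starts, so a restart is required after enabling it.

Scrape the protected route

Use the same gateway authentication token that operator clients use: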

```bash
curl -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" \
  http://127.0.0.1:18789/api/diagnostics/prometheus
```
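The same request can be made from a script. The following is a minimal Python sketch, not part of OpenClaw: it reuses the route from the curl example above, and the names `build_scrape_request` and `fetch_metrics` are illustrative assumptions.

```python
import urllib.request

# Route documented above; the base URL is supplied by the caller.
METRICS_PATH = "/api/diagnostics/prometheus"


def build_scrape_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the metrics route with a Bearer token."""
    url = base_url.rstrip("/") + METRICS_PATH
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_metrics(base_url: str, token: str, timeout: float = 5.0) -> str:
    """Fetch the exposition text.

    urllib raises urllib.error.HTTPError for 4xx responses, so a missing
    or wrong token shows up as an exception with code 401.
    """
    request = build_scrape_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics("http://127.0.0.1:18789", token)` with the value of `$OPENCLAW_GATEWAY_TOKEN` should return the same text as the curl command.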

Configure Prometheus

```yaml
# prometheus.yml
scrape_configs:
  - job_name: openclaw
    scrape_interval: 30s
    metrics_path: /api/diagnostics/prometheus
    authorization:
      credentials_file: /etc/prometheus/openclaw-gateway-token
    static_configs:
      - targets: ["openclaw-gateway:18789"]
```

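::: info diagnostics.enabled: true is required. Without it, the plugin still registers the HTTP route, but no diagnostics events reach the exporter and the response is empty. :::

Exported metrics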

| Metric | Type | Labels |
| --- | --- | --- |
| `openclaw_run_completed_total` | counter | `channel`, `model`, `outcome`, `provider`, `trigger` |
| `openclaw_run_duration_seconds` | histogram | `channel`, `model`, `outcome`, `provider`, `trigger` |
| `openclaw_model_call_total` | counter | `api`, `error_category`, `model`, `outcome`, `provider`, `transport` |
| `openclaw_model_call_duration_seconds` | histogram | `api`, `error_category`, `model`, `outcome`, `provider`, `transport` |
| `openclaw_model_tokens_total` | counter | `agent`, `channel`, `model`, `provider`, `token_type` |
| `openclaw_gen_ai_client_token_usage` | histogram | `model`, `provider`, `token_type` |
| `openclaw_model_cost_usd_total` | counter | `agent`, `channel`, `model`, `provider` |
| `openclaw_tool_execution_total` | counter | `error_category`, `outcome`, `params_kind`, `tool` |
| `openclaw_tool_execution_duration_seconds` | histogram | `error_category`, `outcome`, `params_kind`, `tool` |
| `openclaw_harness_run_total` | counter | `channel`, `error_category`, `harness`, `model`, `outcome`, `phase`, `plugin`, `provider` |
| `openclaw_harness_run_duration_seconds` | histogram | `channel`, `error_category`, `harness`, `model`, `outcome`, `phase`, `plugin`, `provider` |
| `openclaw_message_received_total` | counter | `channel`, `source` |
| `openclaw_message_dispatch_started_total` | counter | `channel`, `source` |
| `openclaw_message_dispatch_completed_total` | counter | `channel`, `outcome`, `reason`, `source` |
| `openclaw_message_dispatch_duration_seconds` | histogram | `channel`, `outcome`, `reason`, `source` |
| `openclaw_message_processed_total` | counter | `channel`, `outcome`, `reason` |
| `openclaw_message_processed_duration_seconds` | histogram | `channel`, `outcome`, `reason` |
| `openclaw_message_delivery_started_total` | counter | `channel`, `delivery_kind` |
| `openclaw_message_delivery_total` | counter | `channel`, `delivery_kind`, `error_category`, `outcome` |
| `openclaw_message_delivery_duration_seconds` | histogram | `channel`, `delivery_kind`, `error_category`, `outcome` |
| `openclaw_talk_event_total` | counter | `brain`, `event_type`, `mode`, `provider`, `transport` |
| `openclaw_talk_event_duration_seconds` | histogram | `brain`, `event_type`, `mode`, `provider`, `transport` |
| `openclaw_talk_audio_bytes` | histogram | `brain`, `event_type`, `mode`, `provider`, `transport` |
| `openclaw_queue_lane_size` | gauge | `lane` |
| `openclaw_queue_lane_wait_seconds` | histogram | `lane` |
| `openclaw_session_state_total` | counter | `reason`, `state` |
| `openclaw_session_queue_depth` | gauge | `state` |
| `openclaw_session_turn_created_total` | counter | `agent`, `channel`, `trigger` |
| `openclaw_session_recovery_total` | counter | `action`, `active_work_kind`, `state`, `status` |
| `openclaw_session_recovery_age_seconds` | histogram | `action`, `active_work_kind`, `state`, `status` |
| `openclaw_memory_bytes` | gauge | `kind` |
| `openclaw_memory_rss_bytes` | histogram | none |
| `openclaw_memory_pressure_total` | counter | `level`, `reason` |
| `openclaw_telemetry_exporter_total` | counter | `exporter`, `reason`, `signal`, `status` |
| `openclaw_prometheus_series_dropped_total` | counter | none |

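Label policy

Bounded, low-cardinality labels

Prometheus labels are kept bounded and low-cardinality. The exporter does not emit raw diagnostics identifiers such as `runId`, `sessionKey`, `sessionId`, `callId`, `toolCallId`, message IDs, chat IDs, or provider request IDs.

Label values are redacted and must conform to OpenClaw's low-cardinality character policy. Values that fail the policy are replaced with `unknown`, `other`, or `none`, depending on the metric.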
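To make the replacement behaviour concrete, here is a small Python sketch of a label-value filter. The exact character policy is not documented on this page, so the allowed pattern, the 64-character limit, and the function name `sanitize_label_value` are all assumptions chosen for illustration; this is not the plugin's implementation.

```python
import re

# Hypothetical policy: 1 to 64 characters drawn from letters, digits,
# and a few separators. The real OpenClaw policy may differ.
_ALLOWED_VALUE = re.compile(r"[A-Za-z0-9_.:/-]{1,64}")


def sanitize_label_value(value, fallback="unknown"):
    """Return the value if it passes the assumed policy, else the fallback."""
    if isinstance(value, str) and _ALLOWED_VALUE.fullmatch(value):
        return value
    return fallback
```

A character check like this only rejects malformed values. It cannot bound the number of distinct values on its own, so a separate limit on the total number of series is still needed.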

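Series cap and overflow accounting

The exporter keeps at most **2048** time series in memory (counters, gauges, and histograms combined). New series beyond the cap are dropped, and each drop increments the `openclaw_prometheus_series_dropped_total` counter by 1.

If this counter keeps growing, some upstream attribute is leaking high-cardinality values. The exporter never raises the cap automatically; fix the source instead of disabling the cap.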
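The cap-and-drop behaviour can be pictured with a short Python sketch. It is an illustration under assumptions, not the plugin's code: the class name `BoundedCounterRegistry` is hypothetical, only counters are modelled, the drop counter is kept outside the cap, and every rejected update to an unstored series counts as one drop.

```python
class BoundedCounterRegistry:
    """Counter store that refuses new series once a cap is reached."""

    def __init__(self, max_series=2048):
        self.max_series = max_series
        self.series = {}        # (name, sorted label pairs) -> value
        self.dropped_total = 0  # mirrors the role of the dropped-series counter

    def inc(self, name, labels, amount=1.0):
        """Increment a series; return False if the update was dropped."""
        key = (name, tuple(sorted(labels.items())))
        if key not in self.series:
            if len(self.series) >= self.max_series:
                # New series beyond the cap: drop it and count the drop.
                self.dropped_total += 1
                return False
            self.series[key] = 0.0
        # Existing series keep updating even when the registry is full.
        self.series[key] += amount
        return True
```

With `max_series=2`, a third distinct label set is rejected while the first two continue to accumulate, which is why a steadily rising drop counter points at a label that keeps producing new values.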

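What never appears in Prometheus output

- Prompt text, response text, tool inputs, tool outputs, system prompts
- Talk transcripts, audio payloads, call IDs, room IDs, handoff tokens, turn IDs, and raw session IDs
- Raw provider request IDs (used only as bounded hashes on spans, never in metrics)
- Session keys and session IDs
- Hostnames, file paths, secret values

PromQL examples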

# Token rate by provider (per second, averaged over 1 minute; multiply by 60 for tokens per minute)
sum by (provider) (rate(openclaw_model_tokens_total[1m]))

# Spend in USD by model over the last hour
sum by (model) (increase(openclaw_model_cost_usd_total[1h]))

# p95 run duration by provider and model
histogram_quantile(
  0.95,
  sum by (le, provider, model)
    (rate(openclaw_run_duration_seconds_bucket[5m]))
)

# Queue wait time SLO (p95 under 2 seconds)
histogram_quantile(
  0.95,
  sum by (le, lane) (rate(openclaw_queue_lane_wait_seconds_bucket[5m]))
) < 2

# Dropped Prometheus series (cardinality alert)
increase(openclaw_prometheus_series_dropped_total[15m]) > 0

::: tip For cross-provider dashboards, prefer `openclaw_gen_ai_client_token_usage`: it follows the OpenTelemetry GenAI semantic conventions and lines up with metrics from non-OpenClaw GenAI services. :::
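The p95 queries above rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside a bucket. The Python sketch below shows the idea in simplified form. It assumes positive bucket bounds (as with durations) and cumulative counts, and it omits the edge cases the real PromQL function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    upper, count = buckets[index]
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Interpolate linearly between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets cover the range of interest; an SLO threshold such as 2 seconds is most reliable when it sits on a bucket boundary.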

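Choosing between Prometheus and OpenTelemetry export

OpenClaw supports the two independently. You can enable either one, both, or neither.

diagnostics-prometheus

- **Pull** model: Prometheus scrapes `/api/diagnostics/prometheus`.
- No external collector required.
- Authenticated through normal Gateway authentication.
- Metrics only (no traces or logs).
- Best for monitoring stacks already standardized on Prometheus + Grafana.

diagnostics-otel

- **Push** model: OpenClaw sends data over OTLP/HTTP to a collector or an OTLP-compatible backend.
- Includes metrics, traces, and logs.
- Can be bridged to Prometheus through the OpenTelemetry Collector (`prometheus` or `prometheusremotewrite` exporter).
- See [OpenTelemetry export](/ai/ai-tools/openclaw/gateway/opentelemetry).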

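Troubleshooting

Empty response body

- Check that the config contains `diagnostics.enabled: true`.
- Run `openclaw plugins list --enabled` to confirm the plugin is enabled and loaded.
- Generate some traffic; counters and histograms emit data lines only after at least one event.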
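When checking a scrape by hand, it helps to tell a truly empty body from one that has no sample lines yet. The Python sketch below counts sample lines in exposition text; the function names and the three result strings are illustrative, and whether the exporter ever returns metadata without samples is an assumption rather than documented behaviour.

```python
def count_sample_lines(body):
    """Count exposition lines that carry a sample (not blank, not a # comment)."""
    return sum(
        1
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def diagnose_scrape(body):
    """Classify a scrape response body for troubleshooting."""
    if not body.strip():
        # Nothing at all: check diagnostics.enabled and that the plugin is loaded.
        return "empty"
    if count_sample_lines(body) == 0:
        # Only HELP/TYPE comments: no events have been recorded yet.
        return "metadata-only"
    return "ok"
```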

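401 Unauthorized

This endpoint requires the Gateway operator scope (`auth: "gateway"` with `gatewayRuntimeScopeSurface: "trusted-operator"`). Use the same token or password as for other Gateway operator routes. There is no public unauthenticated mode.

openclaw_prometheus_series_dropped_total keeps growing

The exporter has hit the **2048**-series cap, usually because one attribute is producing high-cardinality label values. Check recent metrics for an unexpectedly high-cardinality label and fix it at the source. The exporter deliberately drops new series rather than silently rewriting labels.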

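Prometheus shows stale series after a restart

The plugin keeps state in memory only. After a Gateway restart, counters reset to zero and gauges start again from their next reported value. Use the PromQL `rate()` and `increase()` functions to handle the resets cleanly.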
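The reason `rate()` and `increase()` cope with restarts is that they treat any drop in a counter's value as a reset to zero. The Python sketch below shows that reset handling on a list of raw samples; it is a simplified model that leaves out the extrapolation to window boundaries that PromQL also performs.

```python
def increase_with_resets(samples):
    """Total increase of a counter across samples, tolerating resets."""
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The value went down, so the counter restarted from zero:
                # everything counted since the restart is new increase.
                total += value
        previous = value
    return total
```

For the samples `[10, 15, 2, 5]`, where the Gateway restarted between the second and third scrape, the function returns 10.0 (5 before the restart, then 2 and 3 after it) instead of the negative result a plain last-minus-first subtraction would give.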

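Related docs

Diagnostics export: local diagnostics zip for support bundles
Health and readiness checks: /healthz and /readyz probes
Logging: file-based logs
OpenTelemetry export: OTLP push (traces, metrics, logs)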

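FAQ

Scraping `/api/diagnostics/prometheus` returns an empty body. How do I fix it?

First confirm that `diagnostics.enabled: true` is set in the config. Then run `openclaw plugins list --enabled` to check that the plugin is loaded. Finally, generate some message traffic and retry; metrics are emitted only after at least one event.

I get 401 Unauthorized. What should I do?

The endpoint enforces Gateway authentication with operator scope. Use the correct token in your curl command (the `$OPENCLAW_GATEWAY_TOKEN` environment variable) and make sure the `authorization` block in your Prometheus config points to a valid token file. There is no public unauthenticated mode.

Why does `openclaw_prometheus_series_dropped_total` keep increasing?

A growing counter means the exporter is at its 2048-series cap and is dropping new series, typically because a label is taking high-cardinality values. Check recent metrics for high-cardinality labels (for example session IDs, user IDs, or other information that should not be a label), then fix the source rather than turning off the cap.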
