
Prometheus-Grafana Monitoring Stack

1. Concepts

Prometheus uses a pull model: every monitored target exposes a /metrics endpoint, and Prometheus scrapes it at a regular interval.

┌──── Application / Node Exporter / kube-state-metrics ────┐
│                     exposes /metrics                     │
└────────────────────────────┬─────────────────────────────┘
                             ↓ scrape
┌──────────────────────────────────────────────────────────┐
│               Prometheus (storage + query)               │
└────────────────────────────┬─────────────────────────────┘
                   ┌─────────┴─────────┐
                   ↓                   ↓
                Grafana           Alertmanager
             (dashboards)       (alert routing)

2. Data Model

http_requests_total{method="GET", status="200"} 12345 @timestamp
└────────┬────────┘└────────────┬─────────────┘ └─┬─┘
    metric name               labels            value
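
To make the data model concrete, the following is a minimal sketch of how one sample could be rendered as a line of the Prometheus text exposition format. The helper names `escapeLabelValue`, `seriesKey`, and `formatSample` are hypothetical and not part of any library, and the sketch ignores details such as HELP and TYPE lines.

```javascript
// Hypothetical helpers for illustration; not part of prom-client or Prometheus.

// Escape backslash, double quote, and newline, which the text format requires
// to be escaped inside label values.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
}

// A time series is identified by its metric name plus its full label set.
// Sorting label names makes the key independent of insertion order.
function seriesKey(name, labels) {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${escapeLabelValue(labels[k])}"`)
  return pairs.length > 0 ? `${name}{${pairs.join(',')}}` : name
}

// Render one sample line: series identity, value, optional millisecond timestamp.
function formatSample(name, labels, value, timestampMs) {
  const base = `${seriesKey(name, labels)} ${value}`
  return timestampMs === undefined ? base : `${base} ${timestampMs}`
}

console.log(formatSample('http_requests_total', { method: 'GET', status: '200' }, 12345))
// prints: http_requests_total{method="GET",status="200"} 12345
```

Two samples belong to the same series only if `seriesKey` is identical for both, which is why adding or changing a single label value creates a new series.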

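Four metric types:

Type       Purpose
Counter    Monotonically increasing (request count, error count)
Gauge      Can go up or down (CPU usage, connection count)
Histogram  Distribution of observations (request latency)
Summary    Similar to a histogram, but quantiles are computed on the client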
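
The difference between Histogram and Summary is easier to see with a sketch. A histogram only stores cumulative bucket counts, and a quantile is estimated from those counts afterwards. The class `SketchHistogram` below is hypothetical and simplified (no labels, no `_sum` or `_count`); its `quantile` method uses linear interpolation inside a bucket, which is the same general idea that `histogram_quantile()` applies on the server.

```javascript
// Hypothetical minimal histogram for illustration; real client libraries also
// track labels and expose _sum and _count series.
class SketchHistogram {
  constructor(upperBounds) {
    // Bucket upper bounds in ascending order, with +Inf appended as the catch-all bucket.
    this.upperBounds = [...upperBounds].sort((a, b) => a - b).concat(Infinity)
    this.counts = new Array(this.upperBounds.length).fill(0)
  }

  // Buckets are cumulative: an observation increments every bucket whose bound is >= value.
  observe(value) {
    this.upperBounds.forEach((le, i) => {
      if (value <= le) this.counts[i] += 1
    })
  }

  // Estimate a quantile by linear interpolation inside the bucket holding the target rank.
  quantile(q) {
    const total = this.counts[this.counts.length - 1]
    if (total === 0) return NaN
    const rank = q * total
    const i = this.counts.findIndex((c) => c >= rank)
    // If the rank falls in the +Inf bucket, report the highest finite bound.
    if (this.upperBounds[i] === Infinity) return i > 0 ? this.upperBounds[i - 1] : NaN
    const lower = i === 0 ? 0 : this.upperBounds[i - 1]
    const below = i === 0 ? 0 : this.counts[i - 1]
    const inBucket = this.counts[i] - below
    if (inBucket === 0) return lower
    return lower + (this.upperBounds[i] - lower) * ((rank - below) / inBucket)
  }
}

const latency = new SketchHistogram([1, 2, 4])
for (const seconds of [0.5, 1.5, 1.8, 3]) latency.observe(seconds)
console.log(latency.counts)        // cumulative counts for le=1, 2, 4, +Inf
console.log(latency.quantile(0.5)) // estimated median
```

Because only bucket counts are kept, the estimate can never be more precise than the bucket boundaries allow, so bucket bounds should bracket the latencies you care about.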

3. Exposing Metrics from an Application (Node.js example)

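const express = require('express')
const client = require('prom-client')

// Dedicated registry that holds the metrics exposed on /metrics.
const register = new client.Registry()
// Collect default process metrics (CPU, memory, event loop lag, and so on).
client.collectDefaultMetrics({ register })

// Request latency histogram; bucket upper bounds are in seconds.
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
})
register.registerMetric(httpDuration)

const app = express()

// Time every request. Register this middleware before the routes.
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    // Label with the route template (for example /users/:id), not the raw URL,
    // to keep label cardinality low.
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode })
  })
  next()
})

// Endpoint that Prometheus scrapes.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.send(await register.metrics())
})

// Port 3000 is an assumption; use the port your scrape config targets.
app.listen(3000)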

4. PromQL Cheat Sheet

# Current value
http_requests_total
http_requests_total{status="500"}

# Rate (per-second average over the last 5 minutes)
rate(http_requests_total[5m])

# Sum (aggregate by label)
sum by (status) (rate(http_requests_total[5m]))

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# P95 latency broken down by route
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

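# Day-over-day: current request rate relative to the same time yesterday
# (compare rates, not raw counter values, because counters reset on restart)
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1d)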

# Node CPU utilization
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
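
Most of the queries above are built on `rate()`, so it helps to see roughly what it computes. The sketch below is a simplified model, not Prometheus's implementation: it handles counter resets but omits the extrapolation to window boundaries that Prometheus performs. The function names and the `{ t, v }` sample shape are assumptions made for illustration.

```javascript
// Simplified sketch of rate(): per-second increase of a counter over a window.
// Assumption: samples are { t: seconds, v: value } objects sorted by time.
function simpleRate(samples) {
  // Prometheus returns no result with fewer than two samples; NaN stands in for that here.
  if (samples.length < 2) return NaN
  let increase = 0
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v
    // A drop means the counter was reset (for example, a process restart):
    // count the new value as the increase since the reset.
    increase += delta >= 0 ? delta : samples[i].v
  }
  const seconds = samples[samples.length - 1].t - samples[0].t
  return increase / seconds
}

// Error ratio in the style of the query above: 5xx rate divided by total rate.
function errorRatio(errorSamples, totalSamples) {
  return simpleRate(errorSamples) / simpleRate(totalSamples)
}

// The counter restarts between t=60 and t=120, yet the rate stays non-negative.
const window = [{ t: 0, v: 100 }, { t: 60, v: 130 }, { t: 120, v: 30 }]
console.log(simpleRate(window)) // (30 + 30) / 120 = 0.5
```

This is also why subtracting raw counter values is fragile: a restart makes the difference negative, while a rate-based query absorbs the reset.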

5. Recording Rules (pre-aggregation)

Precompute expensive queries as recording rules so that Grafana dashboards load faster:

# rules.yml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: api:http_requests:rate5m
        expr: sum by (route, status) (rate(http_requests_total[5m]))
      - record: api:http_errors:ratio5m
        expr: |
          sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (route) (rate(http_requests_total[5m]))
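
To illustrate why pre-aggregation helps, here is a small sketch of what a `sum by (route, status)` expression does to a set of series: labels outside the `by` list are dropped and the values of series that collapse together are added. The in-memory series model and the `sumBy` helper are hypothetical.

```javascript
// Hypothetical in-memory model: each series is { labels: {...}, value: number }.
// Groups series by the given label names and sums their values, like `sum by (...)`.
function sumBy(series, byLabels) {
  const groups = new Map()
  for (const { labels, value } of series) {
    const kept = {}
    // A label missing from a series is treated as the empty string.
    for (const name of byLabels) kept[name] = labels[name] ?? ''
    const key = JSON.stringify(kept)
    const group = groups.get(key) ?? { labels: kept, value: 0 }
    group.value += value
    groups.set(key, group)
  }
  return [...groups.values()]
}

// Made-up per-pod request rates.
const perPodRates = [
  { labels: { route: '/api', status: '200', pod: 'a' }, value: 3 },
  { labels: { route: '/api', status: '200', pod: 'b' }, value: 4 },
  { labels: { route: '/api', status: '500', pod: 'a' }, value: 1 },
]
const recorded = sumBy(perPodRates, ['route', 'status'])
console.log(recorded.length) // 2 series instead of 3
```

A recording rule stores the smaller result as a new series at every evaluation interval, so a dashboard panel reads a handful of precomputed series instead of re-aggregating every pod on each refresh.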

6. Alerting Rules

groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: api:http_errors:ratio5m > 0.01
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "{{ $labels.route }} error rate > 1%"
          description: "Current value: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
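
The `for:` field in the rules above means the expression must stay true for that long before the alert fires. A minimal sketch of that behavior as a state machine follows; the state names (inactive, pending, firing) match the ones Prometheus displays, while the function `nextAlertState` and its arguments are hypothetical.

```javascript
// Sketch of how `for:` delays firing. One call represents one rule evaluation.
function nextAlertState(prev, conditionTrue, nowMs, forMs) {
  // As soon as the expression stops matching, the alert goes back to inactive.
  if (!conditionTrue) return { state: 'inactive', activeSince: null }
  // Remember when the condition first became true so the `for` duration can be measured.
  const activeSince = prev.activeSince ?? nowMs
  const state = nowMs - activeSince >= forMs ? 'firing' : 'pending'
  return { state, activeSince }
}

const FIVE_MINUTES = 5 * 60 * 1000
let alertState = { state: 'inactive', activeSince: null }
alertState = nextAlertState(alertState, true, 0, FIVE_MINUTES)            // pending
alertState = nextAlertState(alertState, true, 4 * 60 * 1000, FIVE_MINUTES) // still pending
alertState = nextAlertState(alertState, true, FIVE_MINUTES, FIVE_MINUTES)  // firing
console.log(alertState.state)
```

A short spike that clears before the `for` duration elapses never leaves the pending state, which is what keeps brief blips from paging anyone.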

7. Alertmanager Routing

route:
  receiver: default
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers: [severity="critical"]
      receiver: pager
      continue: true
    - matchers: [team="frontend"]
      receiver: frontend-slack

receivers:
  # The top-level route references this receiver, so it must be defined.
  # It is left empty here as a placeholder; add a notification config as needed.
  - name: default

  - name: pager
    webhook_configs:
      - url: 'https://events.pagerduty.com/...'

  - name: frontend-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#frontend-alerts'

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, namespace]
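
The effect of `continue: true` in the routing tree is easiest to see by walking the tree by hand. The sketch below models the matching logic under a simplifying assumption: matchers are exact-equality objects such as `{ severity: 'critical' }`, whereas real Alertmanager matchers also support `!=`, `=~`, and `!~`. The functions `matches` and `resolveReceivers` are hypothetical.

```javascript
// True if every matcher label equals the alert's label value.
function matches(labels, matchers = {}) {
  return Object.entries(matchers).every(([name, value]) => labels[name] === value)
}

// Returns the receivers that would be notified for an alert with the given labels.
function resolveReceivers(labels, route, inherited) {
  // A child route without its own receiver inherits the parent's.
  const receiver = route.receiver ?? inherited
  const found = []
  for (const child of route.routes ?? []) {
    if (!matches(labels, child.matchers)) continue
    found.push(...resolveReceivers(labels, child, receiver))
    // Without `continue: true`, the first matching sibling ends the search.
    if (!child.continue) break
  }
  // If no child route matched, the current node handles the alert.
  return found.length > 0 ? found : [receiver]
}

// Same shape as the routing config above.
const tree = {
  receiver: 'default',
  routes: [
    { matchers: { severity: 'critical' }, receiver: 'pager', continue: true },
    { matchers: { team: 'frontend' }, receiver: 'frontend-slack' },
  ],
}
console.log(resolveReceivers({ severity: 'critical', team: 'frontend' }, tree))
// prints: [ 'pager', 'frontend-slack' ]
```

With `continue: true` removed from the first route, a critical frontend alert would reach only `pager`, because matching stops at the first sibling that matches.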

8. Grafana

Add Prometheus as a data source: Data sources → enter the Prometheus URL → save.

8.1 Dashboard

Import community dashboards:

  • Node Exporter Full: ID 1860
  • Kubernetes Cluster Monitoring: ID 7249
  • NGINX Ingress: ID 9614
  • Loki Logs: ID 13639

Or build your own: add a panel → pick a metric → configure the visualization.

8.2 Variables (dynamic filtering)

namespace: label_values(kube_pod_info, namespace)
pod: label_values(kube_pod_info{namespace="$namespace"}, pod)

Selecting a namespace in the dropdown updates the pod dropdown to match.
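
As a rough model of the chained variables, `label_values(selector, label)` can be thought of as "the distinct values of one label among the series that match the selector". The sketch below uses made-up `kube_pod_info` label sets and a hypothetical `labelValues` helper with equality-only selectors; it does not call Grafana or Prometheus.

```javascript
// Distinct, sorted values of `label` among the label sets that match `selector`.
function labelValues(series, selector, label) {
  const values = series
    .filter((labels) => Object.entries(selector).every(([k, v]) => labels[k] === v))
    .map((labels) => labels[label])
    .filter((v) => v !== undefined)
  return [...new Set(values)].sort()
}

// Made-up label sets standing in for kube_pod_info series.
const kubePodInfo = [
  { namespace: 'prod', pod: 'api-1' },
  { namespace: 'prod', pod: 'api-2' },
  { namespace: 'staging', pod: 'api-1' },
]

// First variable: all namespaces.
const namespaces = labelValues(kubePodInfo, {}, 'namespace')
// Second variable: pods, filtered by the namespace currently selected in the first.
const podsInProd = labelValues(kubePodInfo, { namespace: 'prod' }, 'pod')
console.log(namespaces, podsInProd)
```

The second query depends on the value chosen for the first, which is what makes the two dropdowns update together.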

8.3 Reusing Dashboards

Export dashboards as JSON and import them elsewhere to share them across environments.

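9. The High-Cardinality Trap

Every unique combination of label values is a separate time series. Using userId or orderId as a label can create hundreds of millions of series and drive Prometheus out of memory.

Labels should be low-cardinality dimensions: method, status, route, namespace, pod name (pod names are medium cardinality, so use them with care).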
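
A quick way to sanity-check a label design is to multiply the number of distinct values each label can take, which gives an upper bound on the series count for one metric. The helper `maxSeries` and the counts below are made-up numbers for illustration.

```javascript
// Upper bound on the series count for one metric: the product of the number
// of distinct values per label.
function maxSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1)
}

// Assumed cardinalities for illustration only.
const bounded = maxSeries({ method: 5, status: 10, route: 50 })
const withUserId = maxSeries({ method: 5, status: 10, route: 50, userId: 1000000 })
console.log(bounded)    // 2500
console.log(withUserId) // 2500000000
```

For a histogram, each label combination is further multiplied by the number of buckets, so a single unbounded label is enough to dominate the total.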

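10. Long-Term Storage

A single Prometheus server suits short-term retention (the default is 15 days). Options for long-term storage:

  • Thanos: Prometheus plus object storage, with a global query view
  • VictoriaMetrics: performance-focused; a single-node deployment is also viable
  • Mimir (Grafana): the successor to Cortex
  • Managed cloud services: Alibaba Cloud Managed Service for Prometheus, Amazon Managed Service for Prometheus (AMP)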
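
When deciding whether local storage is enough, a back-of-the-envelope estimate helps. The sketch below applies the sizing rule given in the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the workload numbers are assumptions for illustration.

```javascript
// Rough disk estimate: retention seconds * samples per second * bytes per sample.
function estimateDiskBytes({ retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample = 2 }) {
  // Each active series produces one sample per scrape interval.
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds
  return retentionDays * 24 * 60 * 60 * samplesPerSecond * bytesPerSample
}

// Assumed workload: 1 million active series scraped every 15 seconds, kept for 15 days.
const bytes = estimateDiskBytes({ retentionDays: 15, activeSeries: 1000000, scrapeIntervalSeconds: 15 })
console.log((bytes / 1e9).toFixed(1) + ' GB')
```

Rerunning the estimate with a retention of a year or more shows quickly when a single local disk stops being practical and one of the options above becomes worth the operational cost.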

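11. Common Anti-Patterns

  • userId as a label: cardinality explosion
  • Retention not configured: the disk fills up
  • Arbitrary alert thresholds: noisy alerts or missed incidents
  • Alerts without a runbook: the on-call engineer does not know how to respond
  • Every alert marked critical: on-call fatigue
  • Grafana without authentication: dashboards are exposed
  • Prometheus exposed to the public internet: anyone can query it
  • No recording rules: expensive queries are recomputed in full every time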

12. Further Reading