Logging, Monitoring, and Alerting Integration
1. Overall architecture
┌────────────────────────────────────────────────────┐
│ Application Pod                                    │
│ stdout/stderr → container runtime → node /var/log/ │
└────────────────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────┐
│ DaemonSet log collector (one per node)             │
│ Fluentd / Fluent Bit / Promtail / Vector           │
└────────────────────────────────────────────────────┘
                          ↓
           ┌──────────────┴──────────────┐
           ↓                             ↓
     Elasticsearch                      Loki
           ↓                             ↓
        Kibana                        Grafana

┌────────────────────────────────────────────────────┐
│ Monitoring (metrics)                               │
│ Pod / Node exposes /metrics → Prometheus scrapes   │
└────────────────────────────────────────────────────┘
                          ↓
           ┌──────────────┴──────────────┐
           ↓                             ↓
  Grafana dashboards            Alertmanager alerts
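2. Choosing a logging stack
| Stack | Characteristics |
|---|---|
| EFK (Elasticsearch + Fluentd/Fluent Bit + Kibana) | Long established; full-text indexing; resource-heavy |
| PLG (Promtail + Loki + Grafana) | Lightweight; indexes labels only, not log content; cheap to run |
| Vector + a backend of your choice | High-performance collector written in Rust |
| Commercial: Datadog / Alibaba Cloud SLS / Tencent Cloud CLS | Managed; low operational effort |
For small and medium-sized clusters, Loki usually gives the best return on investment.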
3. Promtail + Loki example
3.1 Deploy Loki + Grafana (Helm)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring --create-namespace \
--set grafana.enabled=true \
--set promtail.enabled=true
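3.2 Application logs
The application writes JSON to stdout:
// pino example. By default pino emits numeric levels (e.g. 30 for info);
// this formatter writes the label instead so that level="error" can be matched in LogQL.
const logger = require('pino')({
  formatters: { level: (label) => ({ level: label }) },
})
logger.info({ userId: 123, requestId: 'abc' }, 'user logged in')
The kubelet and container runtime write each container's stdout to /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log on the node. Promtail tails these files and attaches labels:
{namespace="frontend", pod="my-frontend-abc", container="web"}
Queries in Grafana (LogQL):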
{namespace="frontend"} |= "ERROR"
{namespace="frontend"} | json | level="error" | __error__=""
sum by (pod) (rate({namespace="frontend"} |~ "ERROR"[5m]))
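To make the second query more concrete, here is a rough JavaScript sketch of what `| json | level="error" | __error__=""` does to each log line: parse it as JSON, drop lines that fail to parse, and keep those whose `level` field equals `error`. It only illustrates the semantics and says nothing about how Loki is implemented; `filterErrorLines` and the sample lines are made up for this example.

```javascript
// Illustrative only: mimics `| json | level="error" | __error__=""` on raw log lines.
// filterErrorLines and the sample lines are hypothetical, not part of Loki.
function filterErrorLines(lines) {
  const kept = []
  for (const line of lines) {
    let fields
    try {
      fields = JSON.parse(line)   // `| json`: extract fields from the line
    } catch {
      continue                    // `__error__=""`: drop lines that failed to parse
    }
    if (fields !== null && typeof fields === 'object' && fields.level === 'error') {
      kept.push(line)             // `level="error"`: keep matching lines
    }
  }
  return kept
}

const sample = [
  '{"level":"info","msg":"user logged in","userId":123}',
  '{"level":"error","msg":"upstream timeout","requestId":"abc"}',
  'plain text line that is not JSON',
]
console.log(filterErrorLines(sample))
```

Only the second sample line survives. This is also why the log level needs to be emitted as a string label for the query to match.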
4. Prometheus monitoring stack
4.1 kube-prometheus-stack (Helm)
The chart bundles Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics, all managed by the Prometheus Operator.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prom prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace \
-f values.yaml
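Key settings in values.yaml:
grafana:
  adminPassword: "yourpassword"   # placeholder; use a strong secret rather than a committed literal
  ingress:
    enabled: true
    hosts: [grafana.example.com]
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ssd
          resources:
            requests:
              storage: 100Gi
alertmanager:
  config:
    route:
      receiver: slack
    receivers:
      - name: "null"   # keep: the chart's default routes still reference this receiver
      - name: slack
        slack_configs:   # Slack webhooks expect Slack's payload format, so use slack_configs, not webhook_configs
          - api_url: 'https://hooks.slack.com/...'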
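Whether 100Gi is enough for 30 days of retention depends on ingestion volume. The Prometheus storage documentation gives a rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula; the series count and scrape interval are assumptions for illustration, and `estimateDiskBytes` is a hypothetical helper.

```javascript
// Rough disk estimate based on the rule of thumb in the Prometheus storage docs:
//   needed_bytes = retention_seconds * ingested_samples_per_second * bytes_per_sample
// The workload numbers below are assumptions for illustration, not measurements.
function estimateDiskBytes({ retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample }) {
  const retentionSeconds = retentionDays * 24 * 60 * 60
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds
  return retentionSeconds * samplesPerSecond * bytesPerSample
}

const bytes = estimateDiskBytes({
  retentionDays: 30,           // matches retention: 30d
  activeSeries: 500000,        // assumed; prometheus_tsdb_head_series reports the real value
  scrapeIntervalSeconds: 30,   // assumed scrape interval
  bytesPerSample: 2,           // upper end of the 1-2 bytes quoted in the docs
})
console.log(`${(bytes / 1024 ** 3).toFixed(1)} GiB`)
```

Under these assumptions the estimate comes out at roughly 80 GiB, so a 100Gi volume leaves only modest headroom. Setting `retentionSize` in `prometheusSpec` alongside `retention` is one way to cap usage before the disk fills.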
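4.2 Exposing metrics from the application
Node.js:
const express = require('express')
const client = require('prom-client')
const app = express()
client.collectDefaultMetrics()   // process-level metrics: CPU, memory, event loop lag
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
})
// Time every request; label with the matched route pattern, not the raw URL, to keep cardinality low.
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => end({ method: req.method, route: req.route?.path ?? 'unmatched', status: res.statusCode }))
  next()
})
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.send(await client.register.metrics())
})
app.listen(3000)   // port is an assumption; it must back the Service port named "http"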
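Create a ServiceMonitor so that the Prometheus Operator scrapes the Service in front of the Pods (this is a custom resource, not a Pod annotation):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-frontend
  namespace: frontend
  labels:
    release: prom        # must match the Helm release name so Prometheus selects it
spec:
  selector:
    matchLabels:
      app: my-frontend   # labels on the Service, not on the Pod
  endpoints:
    - port: http         # name of the Service port
      path: /metrics
      interval: 30s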
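A common reason for a target not showing up is a label mismatch: `selector.matchLabels` has to match the labels on the Service, and with the chart's default settings the `release: prom` label on the ServiceMonitor has to match what Prometheus selects. The matching rule itself is simple, as this small sketch shows; `matchesLabels` and the sample labels are hypothetical.

```javascript
// Sketch of matchLabels semantics: every key/value in the selector must be
// present on the object; extra labels on the object are ignored.
function matchesLabels(objectLabels, matchLabels) {
  return Object.entries(matchLabels).every(([key, value]) => objectLabels[key] === value)
}

// Hypothetical Service labels, for illustration only.
const serviceLabels = { app: 'my-frontend', tier: 'web' }
console.log(matchesLabels(serviceLabels, { app: 'my-frontend' }))   // true
console.log(matchesLabels(serviceLabels, { app: 'my-backend' }))    // false
```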
4.3 Key metrics
Application layer:
- Request rate / error rate / latency (the RED method)
- http_requests_total{status=~"5.."}: error rate
- histogram_quantile(0.95, ...): P95 latency
Kubernetes layer (from kube-state-metrics):
- kube_pod_status_phase{phase="Failed"}
- kube_deployment_status_replicas_unavailable
- kube_node_status_condition{condition="Ready", status="false"}
Node layer (from node-exporter):
- node_cpu_seconds_total
- node_memory_MemAvailable_bytes
- node_filesystem_avail_bytes
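The P95 from `histogram_quantile` is an estimate derived from bucket counts, not an exact percentile. The sketch below shows the idea for a classic histogram: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration with made-up counts; in PromQL the input would be the per-bucket `rate(...)`, and edge cases are handled more carefully than here.

```javascript
// Simplified sketch of histogram_quantile for classic histograms.
// buckets: cumulative counts sorted by upper bound `le`; the last one is +Inf.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN
  const rank = q * total
  // First bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex(b => b.count >= rank)
  if (buckets[i].le === Infinity) {
    // Quantile falls in the +Inf bucket: report the highest finite bound.
    return buckets[buckets.length - 2].le
  }
  const lower = i === 0 ? 0 : buckets[i - 1].le
  const below = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - below
  // Linear interpolation inside the bucket (assumes a uniform distribution).
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket)
}

// Made-up sample: 100 requests, cumulative count per bucket.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
]
console.log(histogramQuantile(0.95, buckets))
```

With this sample the P95 lands at roughly 0.81 s, inside the 0.5 to 1 s bucket. The accuracy depends entirely on where the bucket boundaries sit, so they should be chosen around the latencies that matter for the service.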
5. Alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: frontend-alerts
  namespace: monitoring
  labels:
    release: prom   # needed for the chart's default rule selector to pick this rule up
spec:
  groups:
    - name: frontend
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="frontend",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{namespace="frontend"}[5m]))
            > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Frontend 5xx ratio above 1%"
            description: "Current 5xx ratio: {{ $value | humanizePercentage }}"
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{namespace="frontend"}[15m]) > 0
          for: 5m
          labels:
            severity: critical
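The `for: 5m` clause means the expression has to stay true across consecutive evaluations for five minutes before the alert moves from pending to firing; a single failed check resets the clock. The sketch below simulates that state machine for the HighErrorRate rule. The one-minute evaluation interval and the sample data are assumptions, and `alertStates` is a hypothetical helper.

```javascript
// Sketch of how `expr` and `for:` interact: the condition must hold on every
// evaluation for the whole `for` duration before the alert fires.
// The evaluation interval and sample data are assumptions for illustration.
function alertStates(samples, { threshold, forSeconds }) {
  let activeSince = null
  return samples.map(({ t, errors, total }) => {
    const ratio = total === 0 ? NaN : errors / total
    if (!(ratio > threshold)) {   // NaN compares false, so no traffic means inactive
      activeSince = null
      return 'inactive'
    }
    if (activeSince === null) activeSince = t
    return t - activeSince >= forSeconds ? 'firing' : 'pending'
  })
}

// One evaluation per minute; t is in seconds.
const samples = [
  { t: 0, errors: 0, total: 1000 },
  { t: 60, errors: 20, total: 1000 },
  { t: 120, errors: 25, total: 1000 },
  { t: 180, errors: 30, total: 1000 },
  { t: 240, errors: 30, total: 1000 },
  { t: 300, errors: 30, total: 1000 },
  { t: 360, errors: 30, total: 1000 },
  { t: 420, errors: 2, total: 1000 },
]
console.log(alertStates(samples, { threshold: 0.01, forSeconds: 300 }))
```

Here the alert stays pending from t=60 to t=300, fires at t=360, and resolves at t=420. A longer `for` cuts noise from short spikes at the cost of slower detection.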
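6. Distributed tracing
See Module 08 (OpenTelemetry) for details. A typical Kubernetes setup combines:
- Jaeger or Tempo as the trace backend
- OpenTelemetry Collector, deployed as a DaemonSet, to collect spans
- Application SDKs that inject the traceId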
7. Essential Grafana dashboards
Import these community dashboards directly by ID:
- Node Exporter Full: 1860
- Kubernetes Cluster (Prometheus): 315
- Kubernetes Pods: 6417
- NGINX Ingress Controller: 9614
- Loki Logs: 13639
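8. Alert channels in practice
Do not rely on email alone. Route important alerts through several channels:
- Feishu / DingTalk bot webhooks
- PagerDuty / OpsGenie for on-call rotation
- Slack channels
- SMS (for critical services)
# Alertmanager routing; the receivers default, pager, and slack must be defined under receivers:
route:
  group_by: [alertname, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
    - matchers: [severity="critical"]
      receiver: pager
    - matchers: [severity="warning"]
      receiver: slack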
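The routing tree is evaluated top down: an alert goes to the first child route whose matchers all match, and falls back to the parent's receiver if none do. The sketch below models only that selection step, with equality matchers and without `continue: true`, grouping, or timing; `pickReceiver` and the `match` field are simplifications for this example, not Alertmanager's actual data model.

```javascript
// Simplified sketch of Alertmanager route selection: depth-first, first matching
// child wins (no `continue: true`), otherwise the parent's receiver applies.
// Only equality matchers are modelled here.
function pickReceiver(route, labels) {
  for (const child of route.routes ?? []) {
    const matches = Object.entries(child.match).every(([k, v]) => labels[k] === v)
    if (matches) return pickReceiver(child, labels)
  }
  return route.receiver
}

const root = {
  receiver: 'default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pager' },
    { match: { severity: 'warning' }, receiver: 'slack' },
  ],
}
console.log(pickReceiver(root, { alertname: 'PodCrashLooping', severity: 'critical' })) // pager
console.log(pickReceiver(root, { alertname: 'HighErrorRate', severity: 'warning' }))    // slack
console.log(pickReceiver(root, { alertname: 'Watchdog', severity: 'none' }))            // default
```

Because the first match wins, more specific routes belong above more general ones.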
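9. Troubleshooting checklist
# Is Prometheus scraping the target?
#   Open the Prometheus UI → Status → Targets
# Are the alerting rules loaded?
#   Status → Rules
# What state are the alerts in?
#   Alerts
# Loki returns no logs
#   Check the promtail Pods (older chart versions label them app=promtail)
kubectl logs -n monitoring -l app.kubernetes.io/name=promtail
# Grafana shows no data
#   Data sources → open the data source → Test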
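10. Common anti-patterns
- Writing application logs to files instead of stdout: the logs are lost when the container is destroyed
- Logging every SQL query at info level: log volume explodes
- Not bounding Prometheus retention to the size of the disk: the volume fills up quickly
- Too much alert noise: everyone becomes numb and mutes the alerts
- Leaving Grafana on admin/admin: an exposed instance means leaked data
- Metric cardinality explosion (high cardinality): using userId as a label produces tens of millions of time series
- No dashboards, querying Prometheus directly: a usability disaster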
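The cardinality problem is easy to underestimate because series counts multiply across labels. As a back-of-the-envelope sketch, the upper bound for one metric is the product of the number of distinct values of each label; the cardinalities below are assumptions for illustration, and `maxSeries` is a hypothetical helper.

```javascript
// Upper bound on time series for one metric: the product of distinct values per label.
// seriesPerCombination covers metric types that emit several series per label set
// (for example a histogram: one per bucket, plus _sum and _count).
function maxSeries(labelCardinalities, seriesPerCombination = 1) {
  return Object.values(labelCardinalities).reduce((acc, n) => acc * n, seriesPerCombination)
}

// Assumed cardinalities, for illustration only.
const bounded = maxSeries({ method: 5, route: 40, status: 10 })
const withUserId = maxSeries({ method: 5, route: 40, status: 10, userId: 100000 })
console.log(bounded)      // 2000
console.log(withUserId)   // 200000000
```

Adding a single unbounded label turns a couple of thousand series into hundreds of millions in the worst case. Identifiers such as userId or requestId belong in logs or traces, not in metric labels.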
11. Further reading
- kube-prometheus-stack chart
- Loki documentation
- Prometheus best practices
- Module 08: Monitoring and Observability (goes deeper)