Pod Lifecycle and Scheduling Policies
1. Pod state machine
Pending → Running → Succeeded
                  ↘ Failed
                  ↘ Unknown
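| Phase | Meaning |
|---|---|
| Pending | Created but not yet running (being scheduled or pulling images) |
| Running | Bound to a node, with at least one container running, starting, or restarting |
| Succeeded | All containers exited successfully and will not be restarted (typical for Jobs) |
| Failed | All containers have terminated and at least one exited with an error |
| Unknown | The Pod's state cannot be determined, usually because the kubelet is unreachable |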
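The STATUS column of kubectl get pods also shows values that are not phases:
- CrashLoopBackOff: the container keeps crashing and K8s restarts it with a growing back-off delay
- ImagePullBackOff: the image pull failed and is being retried with back-off
- Init:0/2: the init containers have not all finished (0 of 2 done)
- Terminating: the Pod is being deleted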
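To make the CrashLoopBackOff delay concrete, here is a minimal sketch of the restart back-off the kubelet documents: the delay doubles after each crash, starting at 10 seconds and capped at 5 minutes. The function name `crashLoopDelays` and its parameters are illustrative, not a Kubernetes API.

```javascript
// Sketch of the documented restart back-off: 10s, 20s, 40s, ... capped at 300s.
// baseSeconds and capSeconds are assumptions matching the documented defaults.
function crashLoopDelays(restarts, baseSeconds = 10, capSeconds = 300) {
  const delays = [];
  for (let i = 0; i < restarts; i++) {
    delays.push(Math.min(baseSeconds * 2 ** i, capSeconds));
  }
  return delays;
}

console.log(crashLoopDelays(7)); // [ 10, 20, 40, 80, 160, 300, 300 ]
```

After about six crashes the container is retried only once every 5 minutes, which is why a Pod can appear stuck in CrashLoopBackOff long after the underlying problem has been fixed.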
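2. Container lifecycle
[init container 1] → [init container 2] → ...
        ↓
[main container starts]
        ↓
[postStart hook]  ← runs concurrently with the entrypoint; the container is not marked Running until the hook returns
        ↓
[readinessProbe passes] → the Service starts sending traffic
[livenessProbe keeps checking]
        ↓
[Pod deletion requested; the grace period starts]
        ↓
[preStop hook]  ← synchronous, blocks
        ↓
[SIGTERM sent to the container process; waits for the rest of terminationGracePeriodSeconds]
        ↓
[on timeout: forced kill with SIGKILL]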
2.1 Probes
spec:
  containers:
  - name: web
    readinessProbe:        # decides whether the Pod receives traffic
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:         # decides whether the container is restarted
      httpGet:
        path: /health
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
    startupProbe:          # for slow-starting applications
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30
      periodSeconds: 10
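How the three probes differ:
| Probe | Consequence of failure |
|---|---|
| readinessProbe | Pod is removed from the Service endpoints (no new traffic); the container is not restarted |
| livenessProbe | Container is restarted |
| startupProbe | Startup only: the other two probes are disabled until it succeeds, and the container is restarted if it never does |
Important: a readiness failure does not trigger a restart. This distinction matters for front-end SSR: when a downstream dependency is briefly unavailable, the app should stop taking traffic, not restart itself.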
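The thresholds in the probe manifest translate into time budgets. The sketch below is a rough approximation that assumes one attempt every `periodSeconds` and ignores per-attempt timeouts and `initialDelaySeconds`. The helper `failureWindowSeconds` is a hypothetical name, and its defaults mirror the Kubernetes defaults (`periodSeconds: 10`, `failureThreshold: 3`).

```javascript
// Approximate time until a probe is considered failed:
// failureThreshold consecutive failures, one attempt every periodSeconds.
function failureWindowSeconds({ periodSeconds = 10, failureThreshold = 3 } = {}) {
  return periodSeconds * failureThreshold;
}

// Values taken from the manifest above.
const startupWindow = failureWindowSeconds({ periodSeconds: 10, failureThreshold: 30 });
const livenessWindow = failureWindowSeconds({ periodSeconds: 30, failureThreshold: 3 });
const readinessWindow = failureWindowSeconds({ periodSeconds: 10, failureThreshold: 3 });

console.log({ startupWindow, livenessWindow, readinessWindow });
// { startupWindow: 300, livenessWindow: 90, readinessWindow: 30 }
```

With these settings the application gets up to about 5 minutes to start, a hung container is restarted after roughly 90 seconds, and an unready Pod stops receiving traffic after roughly 30 seconds.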
2.2 Probe types
# HTTP
httpGet:
  path: /health
  port: 80
# TCP
tcpSocket:
  port: 80
# Command
exec:
  command: ["sh", "-c", "test -f /tmp/healthy"]
# gRPC (K8s 1.24+)
grpc:
  port: 9000
  service: health
2.3 Hooks
lifecycle:
  postStart:
    exec:
      command: ["sh", "-c", "echo started"]
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10 && nginx -s quit"]
preStop is the key to graceful shutdown: sleep 5-10 seconds first so the Pod is removed from the Service endpoints, then stop the application.
2.4 Graceful shutdown
spec:
  terminationGracePeriodSeconds: 30   # default is 30s
  containers:
  - name: web
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5 && nginx -s quit"]
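The application must handle SIGTERM correctly:
process.on('SIGTERM', async () => {
  // Stop accepting new connections and wait for in-flight requests to finish
  await new Promise((resolve) => server.close(resolve))
  await closeDbConnections()
  process.exit(0)
})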
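The time spent in the preStop hook counts against terminationGracePeriodSeconds, so the SIGTERM handler only gets what is left. A minimal sketch of that arithmetic follows; `remainingAfterPreStop` is a hypothetical helper name, not part of any Kubernetes or Node.js API.

```javascript
// Seconds left for the SIGTERM handler once the preStop hook has finished.
// The grace-period countdown starts when deletion is requested, before preStop runs.
function remainingAfterPreStop(terminationGracePeriodSeconds, preStopSeconds) {
  return Math.max(0, terminationGracePeriodSeconds - preStopSeconds);
}

console.log(remainingAfterPreStop(30, 5));  // 25
console.log(remainingAfterPreStop(30, 40)); // 0
```

With the manifest above (30s grace period, 5s sleep) the application has about 25 seconds to drain requests and close connections. If the longest request can exceed that, raise terminationGracePeriodSeconds rather than shortening the sleep.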
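3. Init containers
Tasks that must complete before the main containers start:
spec:
  initContainers:
  - name: wait-for-db
    image: busybox
    command: ["sh", "-c", "until nc -z db 5432; do sleep 1; done"]
  - name: migrate
    image: myapp:v1
    command: ["npm", "run", "migrate"]
  containers:
  - name: web
    image: myapp:v1
Init containers run one at a time, in order, and the main containers start only after all of them succeed. A failed init container is retried according to restartPolicy; with Never, the whole Pod is marked Failed.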
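The ordering rule can be modeled in a few lines. This is a toy simulation, not kubelet code: each init container is represented by a hypothetical `run` function returning an exit code, and retries are left out.

```javascript
// Toy model: run init containers in order and stop at the first non-zero exit code.
// The main containers start only if every init container succeeded.
function runInitContainers(initContainers) {
  const completed = [];
  for (const { name, run } of initContainers) {
    if (run() !== 0) {
      return { mainStarted: false, completed, failed: name };
    }
    completed.push(name);
  }
  return { mainStarted: true, completed, failed: null };
}

const initResult = runInitContainers([
  { name: "wait-for-db", run: () => 0 },
  { name: "migrate", run: () => 1 }, // simulated migration failure
]);
console.log(initResult);
// { mainStarted: false, completed: [ 'wait-for-db' ], failed: 'migrate' }
```

Because a failed step may be retried from the beginning of that step, each init container should be safe to run more than once.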
4. Scheduling policies
4.1 nodeSelector (simplest)
spec:
  nodeSelector:
    disktype: ssd
    region: cn-hangzhou
The node must carry the matching labels:
kubectl label nodes node-1 disktype=ssd
4.2 affinity / anti-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values: [frontend]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-frontend
          topologyKey: kubernetes.io/hostname
required is a hard requirement, preferred is a soft preference. podAntiAffinity spreads the app's own Pods across different nodes (high availability).
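How a required nodeAffinity rule is evaluated against node labels can be sketched as follows: expressions inside one term are ANDed, and the terms themselves are ORed. This is a simplified illustration with hypothetical function names; the `Gt`/`Lt` operators and `matchFields` are deliberately left out.

```javascript
// Evaluate one matchExpressions entry against a node's labels.
function matchesExpression(labels, { key, operator, values = [] }) {
  const has = Object.prototype.hasOwnProperty.call(labels, key);
  switch (operator) {
    case "In": return has && values.includes(labels[key]);
    case "NotIn": return !has || !values.includes(labels[key]);
    case "Exists": return has;
    case "DoesNotExist": return !has;
    default: throw new Error(`unsupported operator: ${operator}`);
  }
}

// A node qualifies when ANY term matches; a term matches when ALL its expressions match.
function matchesNodeAffinity(labels, nodeSelectorTerms) {
  return nodeSelectorTerms.some((term) =>
    (term.matchExpressions ?? []).every((expr) => matchesExpression(labels, expr)));
}

const affinityTerms = [
  { matchExpressions: [{ key: "node-type", operator: "In", values: ["frontend"] }] },
];
console.log(matchesNodeAffinity({ "node-type": "frontend" }, affinityTerms)); // true
console.log(matchesNodeAffinity({ "node-type": "backend" }, affinityTerms));  // false
```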
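4.3 topologySpreadConstraints (recommended)
More intuitive than affinity:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-frontend
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-frontend
The first constraint is hard: the number of matching Pods on any two nodes may differ by at most 1. The second is a soft preference for spreading evenly across availability zones.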
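The meaning of maxSkew is easier to see as arithmetic: placing a Pod in a domain is allowed when that domain's new count minus the smallest count across all domains stays within maxSkew. The sketch below assumes every eligible domain appears in `counts` (including empty ones) and ignores `minDomains` and node filtering. `canPlace` is a hypothetical name.

```javascript
// counts: number of matching Pods per topology domain (e.g. per node or per zone).
// Returns true when adding one Pod to `domain` keeps the skew within maxSkew.
function canPlace(counts, domain, maxSkew) {
  const globalMin = Math.min(...Object.values(counts));
  const skewAfterPlacement = (counts[domain] ?? 0) + 1 - globalMin;
  return skewAfterPlacement <= maxSkew;
}

const podsPerNode = { "node-a": 1, "node-b": 1, "node-c": 0 };
console.log(canPlace(podsPerNode, "node-c", 1)); // true  (1 - 0 = 1)
console.log(canPlace(podsPerNode, "node-a", 1)); // false (2 - 0 = 2)
```

So maxSkew: 1 does not cap a node at one Pod. Once every node has one, a second Pod per node becomes schedulable again.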
4.4 taint / toleration
A tainted node only accepts Pods that tolerate the taint:
# Node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Pod
tolerations:
- key: gpu
  operator: Equal
  value: "true"
  effect: NoSchedule
Commonly used for GPU nodes and control-plane (master) nodes.
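The matching rules between a toleration and a taint can be sketched as below. The rules encoded here are the documented ones: the operator defaults to Equal, Exists ignores the value, an empty key with Exists matches every taint, and an empty effect matches every effect. The function names are illustrative, and `canScheduleOn` only covers the scheduling side (PreferNoSchedule never blocks).

```javascript
// Does a single toleration match a single taint?
function tolerates(toleration, taint) {
  const operator = toleration.operator ?? "Equal";
  const effectOk = !toleration.effect || toleration.effect === taint.effect;
  const keyOk = toleration.key ? toleration.key === taint.key : operator === "Exists";
  const valueOk = operator === "Exists" || (toleration.value ?? "") === (taint.value ?? "");
  return effectOk && keyOk && valueOk;
}

// A Pod can be scheduled when every blocking taint is matched by some toleration.
function canScheduleOn(taints, tolerations) {
  return taints
    .filter((t) => t.effect === "NoSchedule" || t.effect === "NoExecute")
    .every((t) => tolerations.some((tol) => tolerates(tol, t)));
}

const gpuTaints = [{ key: "gpu", value: "true", effect: "NoSchedule" }];
const gpuTolerations = [{ key: "gpu", operator: "Equal", value: "true", effect: "NoSchedule" }];
console.log(canScheduleOn(gpuTaints, gpuTolerations)); // true
console.log(canScheduleOn(gpuTaints, []));             // false
```

A toleration only permits scheduling onto the tainted node; it does not attract the Pod there. To pin GPU workloads to GPU nodes, combine it with nodeSelector or nodeAffinity.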
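5. Resource requests and limits
resources:
  requests:
    cpu: 100m        # 0.1 core (reserved)
    memory: 128Mi
  limits:
    cpu: 500m        # 0.5 core (maximum)
    memory: 256Mi
| Concept | Role |
|---|---|
| requests | Scheduling input: a node must have at least this much unreserved capacity to accept the Pod |
| limits | Runtime cap: exceeding the CPU limit causes throttling, exceeding the memory limit causes an OOM kill |
A Pod with no requests is treated as reserving nothing, so it is among the first to be evicted when the node is heavily overcommitted.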
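The quantity suffixes can be decoded with a small helper. This sketch handles only the common cases (`m` for millicores; `Ki`/`Mi`/`Gi` and `k`/`M`/`G` for memory) and is not a full implementation of the Kubernetes quantity format. Both function names are hypothetical.

```javascript
// CPU: "100m" means 100 millicores; a plain number such as "0.5" means cores.
function cpuToMillicores(quantity) {
  const q = String(quantity);
  return q.endsWith("m") ? Number(q.slice(0, -1)) : Math.round(Number(q) * 1000);
}

// Memory: binary suffixes are powers of 1024, decimal suffixes are powers of 1000.
const MEMORY_UNITS = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30, k: 1e3, M: 1e6, G: 1e9 };

function memoryToBytes(quantity) {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi|k|M|G)?$/.exec(String(quantity));
  if (!match) throw new Error(`unsupported quantity: ${quantity}`);
  const [, number, unit] = match;
  return Number(number) * (unit ? MEMORY_UNITS[unit] : 1);
}

console.log(cpuToMillicores("100m")); // 100
console.log(cpuToMillicores("0.5"));  // 500
console.log(memoryToBytes("128Mi"));  // 134217728
```

Note that 128Mi (134,217,728 bytes) and 128M (128,000,000 bytes) are different amounts, which is an easy mistake when copying values between manifests.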
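5.1 QoS classes
- Guaranteed: every container sets CPU and memory requests equal to its limits (highest priority)
- Burstable: some requests or limits are set, but the Guaranteed criteria are not met (middle)
- BestEffort: no requests or limits at all (lowest, evicted first)
Use Guaranteed for critical production services.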
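The classification rules can be written as a short function. This is a sketch of the documented rules, not the kubelet's code: it compares quantities as raw strings (so "500m" and "0.5" would count as different) and relies on the fact that an omitted request defaults to the limit. `qosClass` is a hypothetical name.

```javascript
// Derive the QoS class from the resources of all containers in a Pod.
function qosClass(containers) {
  let anySet = false;
  let guaranteed = true;
  for (const container of containers) {
    const limits = container.resources?.limits ?? {};
    const requests = container.resources?.requests ?? {};
    for (const resource of ["cpu", "memory"]) {
      const limit = limits[resource];
      const request = requests[resource] ?? limit; // omitted request defaults to the limit
      if (limit !== undefined || request !== undefined) anySet = true;
      if (limit === undefined || request !== limit) guaranteed = false;
    }
  }
  if (!anySet) return "BestEffort";
  return guaranteed ? "Guaranteed" : "Burstable";
}

const burstablePod = [{
  resources: {
    requests: { cpu: "100m", memory: "128Mi" },
    limits: { cpu: "500m", memory: "256Mi" },
  },
}];
console.log(qosClass(burstablePod));        // Burstable
console.log(qosClass([{ resources: {} }])); // BestEffort
```

The requests/limits example from section 5 is therefore Burstable. Raising its requests to match the limits would make it Guaranteed.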
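6. Eviction and preemption
When a node runs short of resources, the kubelet evicts Pods. It ranks them by whether usage exceeds requests, then by Pod priority, then by how far usage is above requests, which in practice gives this order:
- BestEffort
- Burstable (those furthest above their requests go first)
- Guaranteed (last)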
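The kubelet does not read the QoS class directly when choosing a victim. Its documented ranking is: Pods whose usage exceeds their requests first, then lower Pod priority first, then the largest overage first. The sketch below models that ranking for a single resource (memory, in Mi). The Pod objects and field names are hypothetical.

```javascript
// Rank Pods for node-pressure eviction; the first name is evicted first.
function evictionOrder(pods) {
  return [...pods]
    .sort((a, b) => {
      const aOver = a.usage > a.request;
      const bOver = b.usage > b.request;
      if (aOver !== bOver) return aOver ? -1 : 1;                    // over-request Pods first
      if (a.priority !== b.priority) return a.priority - b.priority; // lower priority first
      return (b.usage - b.request) - (a.usage - a.request);          // larger overage first
    })
    .map((pod) => pod.name);
}

const podsUnderPressure = [
  { name: "guaranteed", request: 256, usage: 200, priority: 0 },
  { name: "burstable", request: 128, usage: 178, priority: 0 },
  { name: "best-effort", request: 0, usage: 200, priority: 0 },
];
console.log(evictionOrder(podsUnderPressure));
// [ 'best-effort', 'burstable', 'guaranteed' ]
```

A BestEffort Pod has no requests, so any usage at all puts it over, which is why it usually goes first. A Guaranteed Pod can never exceed its requests and therefore goes last.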
A PriorityClass lets important Pods preempt lower-priority Pods. A Pod opts in through spec.priorityClassName:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
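7. Restart policy
spec:
  restartPolicy: Always  # Always | OnFailure | Never
- Pods managed by a controller such as a Deployment can only use Always
- Jobs typically use OnFailure (Always is not allowed for Jobs)
- One-off tasks that must not be restarted use Never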
8. Troubleshooting
# Find out why a Pod is Pending
kubectl describe pod <pod>
# Check the Events section: FailedScheduling, ImagePullBackOff
# CrashLoopBackOff
kubectl logs <pod> --previous   # logs from the previous (crashed) container
kubectl describe pod <pod>      # exit code
# Repeated restarts caused by liveness failures
kubectl describe pod <pod> | grep -A 5 "Last State"
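The exit code shown by kubectl describe is often enough to tell who killed the container. By shell convention, codes above 128 mean "terminated by signal (code - 128)". The helper below is a small illustrative decoder with a hypothetical name; the signal table covers only a few common cases.

```javascript
// Interpret a container exit code as reported under "Last State".
function describeExitCode(code) {
  if (code === 0) return "exited normally";
  if (code > 128) {
    const signal = code - 128;
    const names = { 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM" };
    return `killed by signal ${signal}` + (names[signal] ? ` (${names[signal]})` : "");
  }
  return `application error (exit code ${code})`;
}

console.log(describeExitCode(137)); // killed by signal 9 (SIGKILL)
console.log(describeExitCode(143)); // killed by signal 15 (SIGTERM)
console.log(describeExitCode(1));   // application error (exit code 1)
```

137 usually means an OOM kill or a forced kill after the grace period expired (check whether Reason is OOMKilled), while 143 means the process exited on SIGTERM without handling it.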
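9. Common anti-patterns
- No readinessProbe: the Pod receives traffic as soon as it starts, before the application is ready
- Liveness check depends on external services: a brief DB outage makes Pods restart over and over
- terminationGracePeriodSeconds too short: long-running requests are cut off
- No preStop sleep: the application stops before traffic is removed, and in-flight requests get 502s
- requests = limits set too small: the container is OOM-killed frequently
- No podAntiAffinity / topologySpread: Pods pile onto one node, and all go down when that node does
- Same endpoint for liveness and readiness: a dependency failure triggers an avalanche of restarts
- Non-idempotent init containers: a migration script that runs twice causes problems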