4.1 Kubernetes Cluster Monitoring and Logging in Detail
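Key aspects of the monitoring system
Main monitoring dimensions
- Node-level monitoring: CPU, memory, disk, and network usage
- Pod/container-level monitoring: resource utilization, status, restart count
- Application performance monitoring: response time, throughput, error rate
- Kubernetes component monitoring: health of core components such as the API Server, etcd, and kubelet
Implementation and configuration essentials
Prometheus deployment configuration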
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.37.0
        args:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.path=/prometheus/'
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-server-conf
      - name: prometheus-storage-volume
        emptyDir: {}
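Monitoring resource configuration
ServiceMonitor is a Prometheus Operator custom resource (API group monitoring.coreos.com), so the manifest below only takes effect in clusters where the operator and its CRDs are installed.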
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  jobLabel: component
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes
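The selector.matchLabels block above picks the Services to scrape: a Service matches only if it carries every listed key/value pair. The following Python sketch illustrates that matching rule; the function name matches_labels and the sample label sets are made up for illustration, and this is not the Prometheus Operator's own code.

```python
def matches_labels(match_labels: dict[str, str], object_labels: dict[str, str]) -> bool:
    """Return True if every key/value pair in match_labels is present in object_labels."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


# Selector taken from the ServiceMonitor above.
selector = {"component": "apiserver", "provider": "kubernetes"}

# Hypothetical label sets on two Services.
kubernetes_service = {"component": "apiserver", "provider": "kubernetes"}
other_service = {"component": "apiserver"}

print(matches_labels(selector, kubernetes_service))  # True
print(matches_labels(selector, other_service))       # False
```

Extra labels on the Service do not prevent a match; a missing or different value for any selector key does.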
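Main architecture components
Prometheus ecosystem components
- Prometheus Server: scrapes and stores metrics
- Alertmanager: handles alert notifications
- Pushgateway: accepts metrics pushed by short-lived jobs
- Node Exporter: collects node-level metrics
- kube-state-metrics: exports metrics about the state of Kubernetes objects
- cAdvisor: agent that analyzes container resource usage and performance
Architecture diagram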
+------------------+     +-------------------+     +------------------+
|   Applications   | --> |   Node Exporter   | --> |                  |
|    + cAdvisor    |     |                   |     |                  |
+------------------+     +-------------------+     |                  |
                                                   |    Prometheus    |
+------------------+     +-------------------+     |      Server      |
|  K8s Components  | --> | kube-state-metrics| --> |                  |
| (API Server etc) |     |                   |     |                  |
+------------------+     +-------------------+     +------------------+
                                                            |
                                                            v
                                                  +-------------------+
                                                  |   Alertmanager    |
                                                  +-------------------+
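Each exporter in the diagram serves its metrics as plain text over HTTP, and Prometheus Server scrapes that text on a schedule. As a rough illustration of what one scraped sample looks like, the Python sketch below renders a single line in the Prometheus text exposition format. The function name render_sample and the metric demo_requests_total are hypothetical; a real exporter would normally use a client library rather than building strings by hand.

```python
def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a line of the Prometheus text exposition format."""

    def escape(text: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if labels:
        pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


# Hypothetical metric name and labels, for illustration only.
print(render_sample("demo_requests_total", {"method": "GET", "code": "200"}, 12.5))
# demo_requests_total{code="200",method="GET"} 12.5
```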
Log collection methods
Mainstream log collection solutions
Fluentd approach
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.14.6-debian-elasticsearch7-1.0
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
Filebeat approach
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      paths:
        - '/var/log/containers/*.log'
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
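Log types and format requirements
Log categories
- System component logs
  - Logs from core components such as the API Server, etcd, and kubelet
  - Storage path: /var/log/kube-apiserver.log and similar files
- Node system logs
  - Operating-system-level logs
  - Includes kernel logs, service process logs, and so on
- Application container logs
  - Standard output and standard error written by applications in Pods
  - Stored by default under /var/log/pods/ or /var/lib/docker/containers/ (Docker runtime)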
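The on-disk format of application container logs depends on the container runtime. The Python sketch below parses one line in the CRI logging format (timestamp, stream, partial/full tag, message) that runtimes such as containerd and CRI-O write under /var/log/pods/; Docker's json-file driver writes JSON objects instead and is not handled here. The names CriLogLine and parse_cri_line are hypothetical, and collectors such as Fluentd and Filebeat ship their own parsers for this.

```python
from typing import NamedTuple


class CriLogLine(NamedTuple):
    timestamp: str  # RFC 3339 timestamp written by the runtime
    stream: str     # "stdout" or "stderr"
    partial: bool   # True when the runtime split a long line ("P" tag)
    message: str


def parse_cri_line(line: str) -> CriLogLine:
    """Parse one CRI-format log line: '<timestamp> <stream> <P|F> <message>'."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) < 3:
        raise ValueError(f"not a CRI log line: {line!r}")
    timestamp, stream, tag = parts[:3]
    if stream not in ("stdout", "stderr"):
        raise ValueError(f"unexpected stream {stream!r}")
    message = parts[3] if len(parts) == 4 else ""
    return CriLogLine(timestamp, stream, tag == "P", message)


entry = parse_cri_line("2023-07-20T10:30:00.123456789Z stdout F User login successful")
print(entry.stream, entry.partial, entry.message)
# stdout False User login successful
```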
Field format requirements
Example JSON log structure
{
  "timestamp": "2023-07-20T10:30:00.123Z",
  "level": "INFO",
  "message": "User login successful",
  "service": "user-service",
  "pod": "user-service-7d5b8c9c4-xl2v9",
  "namespace": "production",
  "trace_id": "abc123def456",
  "user_id": "user123"
}
Recommended fields for structured logs
- timestamp: timestamp
- level: log level (DEBUG/INFO/WARN/ERROR)
- message: main log message
- service: service name
- pod: Pod name
- namespace: namespace
- trace_id: distributed tracing ID
- error_code: error code (if any)
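One way an application could emit logs with these recommended fields is a custom formatter for Python's standard logging module, sketched below. The class name JsonFormatter and the way service, pod, and namespace are passed in are assumptions for illustration; in a real Pod those values would more likely come from environment variables populated through the Downward API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line with the recommended fields."""

    # Python names this level WARNING; the field list above uses WARN.
    LEVEL_NAMES = {"WARNING": "WARN"}

    def __init__(self, service: str, pod: str, namespace: str) -> None:
        super().__init__()
        self.static_fields = {"service": service, "pod": pod, "namespace": namespace}

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname),
            "message": record.getMessage(),
            **self.static_fields,
        }
        # Optional fields are supplied per call through logging's `extra=` argument.
        for key in ("trace_id", "error_code"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry, ensure_ascii=False)


logger = logging.getLogger("user-service")
handler = logging.StreamHandler()  # container logs go to stdout/stderr
handler.setFormatter(JsonFormatter("user-service", "user-service-7d5b8c9c4-xl2v9", "production"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("User login successful", extra={"trace_id": "abc123def456"})
```

Writing one JSON object per line to the standard streams lets a node-level collector pick the entries up without any application-specific parsing rules.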
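Intelligent alerting capabilities
Alert dimensions
- Infrastructure level
  - CPU usage above threshold
  - Low-memory alerts
  - Insufficient disk space
  - Node unreachable
- Platform level
  - Too many Pod restarts
  - Insufficient available replicas in a Deployment
  - Service unreachable
  - High API Server response latency
- Application level
  - Rising HTTP error rate
  - Increasing response time
  - Abnormal QPS fluctuations
  - Anomalies in key business metrics
Alertmanager installation and deployment
Deployment configuration
Alertmanager configuration file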
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alert@example.com'
      smtp_auth_username: 'alert@example.com'
      smtp_auth_password: 'password'
    route:
      group_by: ['alertname', 'cluster']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'default-receiver'
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'admin@example.com'
        send_resolved: true
    - name: 'slack-notifications'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        send_resolved: true
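The route block above batches alerts that share the same alertname and cluster labels into one notification. Alertmanager waits group_wait (30s) before the first notification for a new group, group_interval (5m) before notifying about alerts added to an existing group, and repeat_interval (3h) before re-sending an unchanged one. The Python sketch below shows only the grouping step. The function name group_alerts and the sample alerts are hypothetical, and treating a missing label as an empty string is an assumption of this sketch.

```python
from collections import defaultdict


def group_alerts(
    alerts: list[dict[str, str]], group_by: list[str]
) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alert label sets by the values of the group_by labels."""
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for labels in alerts:
        # Assumption: a label missing from an alert counts as an empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts, for illustration only.
alerts = [
    {"alertname": "HighPodCPUUsage", "cluster": "prod", "pod": "a"},
    {"alertname": "HighPodCPUUsage", "cluster": "prod", "pod": "b"},
    {"alertname": "PodRestartTooFrequent", "cluster": "prod", "pod": "a"},
]

for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    print(key, len(members))
# ('HighPodCPUUsage', 'prod') 2
# ('PodRestartTooFrequent', 'prod') 1
```

With this grouping, the two HighPodCPUUsage alerts would arrive as a single notification instead of two.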
Alertmanager Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
        - '--config.file=/etc/alertmanager/config.yml'
        - '--storage.path=/alertmanager'
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: storage-volume
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: storage-volume
        emptyDir: {}
Alert rule configuration
Example Prometheus alert rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    # The 0.8 threshold is in CPU cores (CPU seconds consumed per second), not a percentage.
    - alert: HighPodCPUUsage
      expr: sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod, namespace) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected for pod {{ $labels.pod }}"
        description: "{{ $labels.pod }} in namespace {{ $labels.namespace }} is using high CPU for more than 10 minutes."
    # rate() is per second; multiplying by 60 * 5 approximates the restart count over the 5m window.
    - alert: PodRestartTooFrequent
      expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod restarting too frequently"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted more than 5 times in the last 5 minutes."
    # 99th-percentile request latency per verb/resource, in seconds.
    - alert: APIServerLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, resource, subresource, le)) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "API Server latency high"
        description: "API Server latency for {{ $labels.verb }} {{ $labels.resource }} is higher than 1s."
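The PodRestartTooFrequent expression multiplies a per-second rate by 60 * 5 to turn it back into an approximate restart count for the 5-minute window. The Python sketch below works through that arithmetic on made-up counter samples. It is a simplification: the helper per_second_rate is hypothetical, and unlike the real PromQL rate() it neither handles counter resets nor extrapolates to the window boundaries.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase between the first and last (timestamp, value) sample."""
    (first_time, first_value), (last_time, last_value) = samples[0], samples[-1]
    return (last_value - first_value) / (last_time - first_time)


# Hypothetical restart counter scraped every 60 s over a 5-minute window.
samples = [(0.0, 2.0), (60.0, 3.0), (120.0, 5.0), (180.0, 6.0), (240.0, 8.0), (300.0, 9.0)]

rate = per_second_rate(samples)      # 7 restarts spread over 300 seconds
restarts_in_window = rate * 60 * 5   # scale back to the 5-minute window
print(restarts_in_window > 5)        # True
```

Because the counter grew by 7 within the window, the scaled value exceeds the threshold of 5, which is the condition the rule alerts on once it has held for the configured 5 minutes.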
Alert content templates
Custom alert template
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  # Group-level data exposes .Status, .CommonLabels, .CommonAnnotations and .Alerts;
  # per-alert fields (.Labels, .StartsAt, .EndsAt) are only available inside "range .Alerts".
  default.tmpl: |
    {{ define "__alert_severity_prefix" }}{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}[Critical] {{ else if eq .CommonLabels.severity "warning" }}[Warning] {{ end }}{{ end }}{{ end }}
    {{ define "alert_title" }}{{ template "__alert_severity_prefix" . }}{{ .CommonLabels.alertname }} @ {{ .Status | toUpper }}{{ end }}
    {{ define "alert_body" }}
    {{- if eq .Status "resolved" -}}
    ✅ Alert resolved: {{ .CommonLabels.alertname }}
    {{- else -}}
    🔴 Alert firing: {{ .CommonLabels.alertname }}
    {{- end }}
    Severity: {{ .CommonLabels.severity }}
    Description: {{ .CommonAnnotations.description }}
    {{- if .CommonAnnotations.summary }}
    Summary: {{ .CommonAnnotations.summary }}
    {{- end }}
    Affected resources:
    {{- range .Alerts }}
    - {{ .Labels.namespace }}/{{ .Labels.pod }} (started: {{ .StartsAt }}{{ if eq .Status "resolved" }}, resolved: {{ .EndsAt }}{{ end }})
    {{- end }}
    {{ end }}
Example Slack alert template
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    title: '{{ template "alert_title" . }}'
    text: '{{ template "alert_body" . }}'
    send_resolved: true
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
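With sensibly chosen metric thresholds and alerting policies, potential failures can be caught before they escalate, which improves the reliability and stability of the system.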