Key Aspects of the Monitoring System

Main Monitoring Dimensions

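  • Node-level monitoring: CPU, memory, disk, and network usage
  • Pod/container-level monitoring: resource utilization, status, and restart counts
  • Application performance monitoring: response time, throughput, and error rate
  • Kubernetes component monitoring: health of core components such as the API Server, etcd, and kubelet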

Implementation and Configuration Essentials

Prometheus Deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.37.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus/'
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-server-conf
      - name: prometheus-storage-volume
        emptyDir: {}

ServiceMonitor configuration (Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: https
    scheme: https
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      serverName: kubernetes
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  jobLabel: component
  selector:
    matchLabels:
      component: apiserver
      provider: kubernetes

Main Architecture Components

Prometheus Ecosystem Components

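  1. Prometheus Server: scrapes metrics and stores them as time series
  2. Alertmanager: groups, deduplicates, and routes alert notifications
  3. Pushgateway: accepts metrics pushed by short-lived jobs
  4. Node Exporter: collects node-level metrics
  5. kube-state-metrics: exports metrics about the state of Kubernetes objects
  6. cAdvisor: agent that analyzes container resource usage and performance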
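
The exporters listed above expose their metrics over HTTP in the Prometheus text exposition format, which Prometheus Server then scrapes. The following Python sketch shows roughly what one rendered sample line looks like. The helper name `render_sample` and the label values are hypothetical, and a real exporter would normally rely on an official client library instead of building strings by hand.

```python
def render_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""

    def escape(label_value: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    if labels:
        # Sort labels only to make the output deterministic for this sketch.
        body = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


# Hypothetical sample, similar in shape to what Node Exporter publishes.
line = render_sample("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}, 1234.5)
print(line)
```

Prometheus attaches further labels such as `job` and `instance` at scrape time, so the exporter itself only needs to describe the measurement.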

Architecture Diagram

+------------------+     +-------------------+     +------------------+
|   Applications   | --> |  Node Exporter    | --> |                  |
|   + cAdvisor     |     |                   |     |                  |
+------------------+     +-------------------+     |                  |
                                                  |  Prometheus      |
+------------------+     +-------------------+     |  Server          |
|  K8s Components  | --> | kube-state-metrics| --> |                  |
| (API Server etc) |     |                   |     |                  |
+------------------+     +-------------------+     +------------------+
                                                              |
                                                              v
                                                    +-------------------+
                                                    |   Alertmanager    |
                                                    +-------------------+

Log Collection Methods

Mainstream Log Collection Solutions

Fluentd approach
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.14.6-debian-elasticsearch7-1.0
        env:
          - name: FLUENT_ELASTICSEARCH_HOST
            value: "elasticsearch.logging.svc.cluster.local"
          - name: FLUENT_ELASTICSEARCH_PORT
            value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Filebeat approach
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
data:
  filebeat.yml: |-
    filebeat.inputs:
    - type: container
      paths:
        - '/var/log/containers/*.log'
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
    
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
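
The Filebeat input above reads `/var/log/containers/*.log`. On a typical node the files in that directory are symlinks named `<pod>_<namespace>_<container>-<container id>.log`. The Python sketch below shows how that name could be split into its parts. The function name `parse_container_log_name` and the sample container ID are hypothetical, and the 64-character hexadecimal ID is an assumption about the container runtime.

```python
import re
from typing import Optional

# Assumed layout: <pod>_<namespace>_<container>-<64 hex chars>.log
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_name(filename: str) -> Optional[dict]:
    """Split a /var/log/containers file name into pod, namespace, container, and ID."""
    match = LOG_NAME.match(filename)
    return match.groupdict() if match else None


# Made-up container ID, used only for illustration.
sample = "user-service-7d5b8c9c4-xl2v9_production_app-" + "0123456789abcdef" * 4 + ".log"
print(parse_container_log_name(sample))
```

This only illustrates the information carried in the file name. Filebeat's `add_kubernetes_metadata` processor enriches events through the Kubernetes API rather than by parsing names like this.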

Log Types and Format Requirements

Log Categories

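  1. System component logs

    • Logs from core components such as the API Server, etcd, and kubelet
    • Storage path, for example: /var/log/kube-apiserver.log
  2. Node system logs

    • Operating-system-level logs
    • Includes logs from the kernel and service processes
  3. Application container logs

    • Standard output and standard error written by applications in Pods
    • Stored by default under /var/log/pods/ and /var/lib/docker/containers/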

Field Format Requirements

Example JSON log structure
{
  "timestamp": "2023-07-20T10:30:00.123Z",
  "level": "INFO",
  "message": "User login successful",
  "service": "user-service",
  "pod": "user-service-7d5b8c9c4-xl2v9",
  "namespace": "production",
  "trace_id": "abc123def456",
  "user_id": "user123"
}

Recommended fields for structured logs
  • timestamp: event time (ISO 8601 in UTC, as in the example above)
  • level: log level (DEBUG/INFO/WARN/ERROR)
  • message: main log message
  • service: service name
  • pod: Pod name
  • namespace: namespace
  • trace_id: distributed tracing ID
  • error_code: error code (if any)
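
One way an application might emit the recommended fields is a custom formatter for Python's standard `logging` module. This is a minimal sketch, not a prescribed implementation: the class name `JsonFormatter` and the environment variables `SERVICE_NAME`, `POD_NAME`, and `POD_NAMESPACE` are assumptions, and they would have to be injected into the container, for example through the Downward API.

```python
import json
import logging
import os
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            # Assumed environment variables; fall back to "unknown" when unset.
            "service": os.environ.get("SERVICE_NAME", "unknown"),
            "pod": os.environ.get("POD_NAME", "unknown"),
            "namespace": os.environ.get("POD_NAMESPACE", "unknown"),
        }
        # Optional fields passed via `extra=` are copied only when present.
        for key in ("trace_id", "error_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, ensure_ascii=False)


handler = logging.StreamHandler()  # writes to stderr, which the container runtime captures
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful", extra={"trace_id": "abc123def456"})
```

Note that Python names its warning level `WARNING` rather than `WARN`, so a mapping step may be needed if the log pipeline expects the exact level names listed above.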

Intelligent Alerting Capabilities

Alerting Dimensions

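  1. Infrastructure level

    • CPU usage exceeds the threshold
    • Low memory
    • Low disk space
    • Node unreachable
  2. Platform level

    • Pod restarts too often
    • Deployment has too few available replicas
    • Service unreachable
    • High API Server response latency
  3. Application level

    • Rising HTTP error rate
    • Longer response times
    • Abnormal QPS fluctuations
    • Anomalies in key business metrics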

Alertmanager Installation and Deployment

Deployment Configuration

Alertmanager configuration file
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alert@example.com'
      smtp_auth_username: 'alert@example.com'
      smtp_auth_password: 'password'
    
    route:
      group_by: ['alertname', 'cluster']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'default-receiver'
    
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'admin@example.com'
        send_resolved: true
    
    - name: 'slack-notifications'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        send_resolved: true
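
In the route above, `group_by: ['alertname', 'cluster']` means that alerts sharing the same values for those two labels are batched into one notification. The Python sketch below illustrates only that grouping key. It is not Alertmanager's implementation, and the function name `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster")  # mirrors route.group_by in the config above


def group_alerts(alerts: list) -> dict:
    """Bucket alert label sets by the values of the GROUP_BY labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # In this sketch a missing label counts as an empty value.
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


# Made-up alerts: the first two share alertname and cluster, the third does not.
alerts = [
    {"alertname": "HighPodCPUUsage", "cluster": "prod", "pod": "api-1"},
    {"alertname": "HighPodCPUUsage", "cluster": "prod", "pod": "api-2"},
    {"alertname": "PodRestartTooFrequent", "cluster": "prod", "pod": "api-1"},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

The timing settings (`group_wait`, `group_interval`, `repeat_interval`) then control when each group is first sent, updated, and repeated.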

Alertmanager Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        args:
          - '--config.file=/etc/alertmanager/config.yml'
          - '--storage.path=/alertmanager'
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: storage-volume
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: storage-volume
        emptyDir: {}

Alert Rule Configuration

Example Prometheus alert rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    - alert: HighPodCPUUsage
      # 0.8 is measured in CPU cores (CPU seconds consumed per second), not 80% of a limit
      expr: sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m])) by (pod, namespace) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected for pod {{ $labels.pod }}"
        description: "{{ $labels.pod }} in namespace {{ $labels.namespace }} is using high CPU for more than 10 minutes."
        
    - alert: PodRestartTooFrequent
      expr: rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod restarting too frequently"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted more than 5 times in the last 5 minutes."
        
    - alert: APIServerLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, resource, subresource, le)) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "API Server latency high"
        description: "API Server latency for {{ $labels.verb }} {{ $labels.resource }} is higher than 1s."
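
The `PodRestartTooFrequent` expression multiplies a per-second rate by 60 * 5 = 300 seconds to approximate the number of restarts inside the 5-minute window. The Python sketch below walks through that arithmetic with made-up counter values. It is deliberately simplified: it ignores the extrapolation and counter-reset handling that Prometheus applies, and the function names are hypothetical.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range and the "* 60 * 5" factor in the rule


def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Simplified rate(): counter growth divided by the window length."""
    return (last_value - first_value) / window_seconds


def should_alert(first_value: float, last_value: float, threshold: float = 5) -> bool:
    """Mirror `rate(...) * 60 * 5 > 5` for a single counter series."""
    restarts_in_window = (
        per_second_rate(first_value, last_value, WINDOW_SECONDS) * WINDOW_SECONDS
    )
    return restarts_in_window > threshold


print(should_alert(10, 17))  # 7 restarts in the window, above the threshold
print(should_alert(10, 12))  # 2 restarts in the window, below the threshold
```

Because the rule also sets `for: 5m`, the condition has to hold continuously for five minutes before the alert actually fires.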

Alert Content Templates

Custom alert templates

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  default.tmpl: |
    {{ define "__alert_severity_prefix" }}{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}[Critical] {{ else if eq .CommonLabels.severity "warning" }}[Warning] {{ end }}{{ end }}{{ end }}
    
    {{ define "alert_title" }}{{ template "__alert_severity_prefix" . }}{{ .CommonLabels.alertname }} @ {{ .Status | toUpper }}{{ end }}
    
    {{ define "alert_body" }}
    {{- if eq .Status "resolved" -}}
    ✅ Alert resolved: {{ .CommonLabels.alertname }}
    {{- else -}}
    🔴 Alert firing: {{ .CommonLabels.alertname }}
    {{- end }}
    
    Severity: {{ .CommonLabels.severity }}
    Description: {{ .CommonAnnotations.description }}
    
    {{- if .CommonAnnotations.summary }}
    Summary: {{ .CommonAnnotations.summary }}
    {{- end }}
    
    Affected resources:
    {{- range .Alerts }}
    - {{ .Labels.namespace }}/{{ .Labels.pod }}
      Started: {{ .StartsAt }}
    {{- if eq .Status "resolved" }}
      Resolved: {{ .EndsAt }}
    {{- end }}
    {{- end }}
    {{ end }}

Example Slack alert template

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    title: '{{ template "alert_title" . }}'
    text: '{{ template "alert_body" . }}'
    send_resolved: true
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

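With well-chosen metric thresholds and alerting policies, potential failures can be caught early, which improves system reliability and stability.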
