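Building a Prometheus Monitoring Platform in a Kubernetes (k8s) Cluster

Basic Architecture

Prometheus was released by SoundCloud. It is an open-source system written in Go that combines monitoring, alerting, and a time-series database.

Prometheus works by periodically scraping the state of monitored components over HTTP. Any component that exposes a suitable HTTP endpoint can be monitored, with no SDK or other integration work required. This makes it a good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes.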
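To make the pull model concrete, here is a minimal sketch in Go that uses only the standard library. It shows what "exposing an HTTP endpoint" amounts to: a handler that returns values in the Prometheus text exposition format, plus a single GET standing in for one scrape. The metric name demo_requests_total and the helper names are assumptions made up for this illustration. Real services normally use the official client library instead, as in the framework examples later in this article.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// renderMetrics formats one counter in the Prometheus text exposition format.
// The metric name demo_requests_total is a made-up example.
func renderMetrics(count int64) string {
	return "# HELP demo_requests_total Total number of handled requests.\n" +
		"# TYPE demo_requests_total counter\n" +
		fmt.Sprintf("demo_requests_total %d\n", count)
}

// metricsHandler serves the metrics the way an instrumented component would.
func metricsHandler(count int64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, renderMetrics(count))
	})
}

// scrapeOnce starts a throwaway local server and performs a single HTTP GET
// against it, which is what Prometheus repeats on every scrape interval.
func scrapeOnce(count int64) (string, error) {
	srv := httptest.NewServer(metricsHandler(count))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := scrapeOnce(42)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(body)
}
```

The test server exists only so that the sketch terminates on its own. A real component would keep listening on a fixed port and let Prometheus do the scraping.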

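The main Prometheus components and their functions are:

Prometheus Server: periodically pulls data from targets that are either statically configured or found through service discovery (mainly DNS, Consul, Kubernetes, Mesos, and so on).
Exporter: exposes monitoring data for the Prometheus server to collect. Different kinds of data are handled by different exporters, for example node_exporter for hosts and the MySQL server exporter for MySQL.
Pushgateway: besides pulling from an exporter, Prometheus can also obtain data that a service first pushes to the Pushgateway; the server then pulls it from the Pushgateway.
Alertmanager: implements Prometheus alerting.
Web UI: provided mainly through Grafana.

The basic workflow in practice is:
Each service publishes its monitoring data through a metrics endpoint (for example, the Exporters described below) --> Prometheus Server scrapes and stores the data on a schedule --> Grafana is configured to display the data, and alerting rules are configured to raise alerts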

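Deploying the Prometheus Platform with Helm

Deploy kube-prometheus-stack with Helm
Helm chart: link
GitHub repository: link

First install the helm tool on the server. Installation is not covered here, since many tutorials are available online. To install Prometheus with Helm, run: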

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack

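Exporter

To collect monitoring data from a target, you first install a collection component on that target. Such a component is called an Exporter. The prometheus.io website lists many of them; see the official exporter list.

How does the collected data reach Prometheus?

An Exporter exposes an HTTP endpoint, and Prometheus pulls the data from it, periodically scraping the monitored component over HTTP.
Prometheus also offers a way to support a push model: you push data to the Push Gateway, and Prometheus pulls it from the Push Gateway.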
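As a rough illustration of the push path, the sketch below sends one gauge sample to a Pushgateway using only the Go standard library. The Pushgateway accepts samples in the text exposition format at /metrics/job/<job name>. The address 127.0.0.1:9091, the job name nightly_batch, and the metric name are assumptions chosen for this example, and job names containing slashes need extra handling that is not covered here.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"strings"
)

// pushURL builds the Pushgateway endpoint for a job: <base>/metrics/job/<job>.
func pushURL(base, job string) string {
	return strings.TrimRight(base, "/") + "/metrics/job/" + url.PathEscape(job)
}

// push sends one gauge sample in the text exposition format.
func push(base, job, metric string, value float64) error {
	body := fmt.Sprintf("# TYPE %s gauge\n%s %g\n", metric, metric, value)
	resp, err := http.Post(pushURL(base, job), "text/plain; version=0.0.4", strings.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("pushgateway returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Assumption: a Pushgateway is reachable at this address.
	err := push("http://127.0.0.1:9091", "nightly_batch", "batch_last_success_timestamp_seconds", 1.69e9)
	if err != nil {
		log.Println("push failed:", err)
	}
}
```

Prometheus then scrapes the Pushgateway like any other target, so the push model is layered on top of the pull model rather than replacing it.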

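Integrating the Collection Component into Go Applications

The Kratos Framework

An example of integrating Prometheus instrumentation into the Kratos microservice framework (see the official Kratos tutorial):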

package main

import (
    "context"
    "fmt"
    "log"

    prom "github.com/go-kratos/kratos/contrib/metrics/prometheus/v2"
    "github.com/go-kratos/kratos/v2/middleware/metrics"
    "github.com/prometheus/client_golang/prometheus/promhttp"

    "github.com/go-kratos/examples/helloworld/helloworld"
    "github.com/go-kratos/kratos/v2"
    "github.com/go-kratos/kratos/v2/transport/grpc"
    "github.com/go-kratos/kratos/v2/transport/http"
    "github.com/prometheus/client_golang/prometheus"
)

// go build -ldflags "-X main.Version=x.y.z"
var (
    // Name is the name of the compiled software.
    Name = "metrics"
    // Version is the version of the compiled software.
    // Version = "v1.0.0"

    _metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: "server",
        Subsystem: "requests",
        Name:      "duration_sec",
        Help:      "server requests duration(sec).",
        Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
    }, []string{"kind", "operation"})

    _metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "client",
        Subsystem: "requests",
        Name:      "code_total",
        Help:      "The total number of processed requests",
    }, []string{"kind", "operation", "code", "reason"})
)

// server is used to implement helloworld.GreeterServer.
type server struct {
    helloworld.UnimplementedGreeterServer
}

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *helloworld.HelloRequest) (*helloworld.HelloReply, error) {
    return &helloworld.HelloReply{Message: fmt.Sprintf("Hello %+v", in.Name)}, nil
}

func init() {
    prometheus.MustRegister(_metricSeconds, _metricRequests)
}

func main() {
    grpcSrv := grpc.NewServer(
        grpc.Address(":9000"),
        grpc.Middleware(
            metrics.Server(
                metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
                metrics.WithRequests(prom.NewCounter(_metricRequests)),
            ),
        ),
    )
    httpSrv := http.NewServer(
        http.Address(":8000"),
        http.Middleware(
            metrics.Server(
                metrics.WithSeconds(prom.NewHistogram(_metricSeconds)),
                metrics.WithRequests(prom.NewCounter(_metricRequests)),
            ),
        ),
    )
    httpSrv.Handle("/metrics", promhttp.Handler())

    s := &server{}
    helloworld.RegisterGreeterServer(grpcSrv, s)
    helloworld.RegisterGreeterHTTPServer(httpSrv, s)

    app := kratos.New(
        kratos.Name(Name),
        kratos.Server(
            httpSrv,
            grpcSrv,
        ),
    )
    if err := app.Run(); err != nil {
        log.Fatal(err)
    }
}

This exposes an HTTP endpoint at http://127.0.0.1:8000/metrics, from which Prometheus can pull monitoring data.

The Gin Framework

An example of integrating Prometheus instrumentation into Gin, a lightweight HTTP framework:

package main

import (
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    handler = promhttp.Handler()

    _metricSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Namespace: "server",
        Subsystem: "requests",
        Name:      "duration_sec",
        Help:      "server requests duration(sec).",
        Buckets:   []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.250, 0.5, 1},
    }, []string{"method", "path"})

    _metricRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "client",
        Subsystem: "requests",
        Name:      "code_total",
        Help:      "The total number of processed requests",
    }, []string{"method", "path", "code"})
)

func init() {
    prometheus.MustRegister(_metricSeconds, _metricRequests)
}

// HandlerMetrics adapts the promhttp handler to a Gin handler.
func HandlerMetrics() func(c *gin.Context) {
    return func(c *gin.Context) {
        handler.ServeHTTP(c.Writer, c.Request)
    }
}

// WithProm is a middleware that records the duration and status code of each request.
func WithProm() gin.HandlerFunc {
    return func(c *gin.Context) {
        var (
            method string
            path   string
            code   int
        )
        startTime := time.Now()
        method = c.Request.Method
        // Note: raw URL paths can produce many distinct label values;
        // the route template (c.FullPath()) is a possible alternative.
        path = c.Request.URL.Path
        c.Next()
        code = c.Writer.Status()
        _metricSeconds.WithLabelValues(method, path).Observe(time.Since(startTime).Seconds())
        _metricRequests.WithLabelValues(method, path, strconv.Itoa(code)).Inc()
    }
}

func main() {
    r := gin.Default()
    r.Use(WithProm())
    r.GET("/ping", func(c *gin.Context) {
        c.JSON(200, gin.H{
            "message": "pong",
        })
    })
    r.GET("/metrics", HandlerMetrics())
    r.Run() // listen and serve on 0.0.0.0:8080
}

This exposes an HTTP endpoint at http://127.0.0.1:8080/metrics, from which Prometheus can pull monitoring data.

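Scraping Data Sources Outside the Cluster

Background: kube-prometheus-stack has been deployed with helm in an existing Kubernetes cluster to monitor servers and services. The nodes, pods, and other components inside the cluster are already connected to Prometheus. Application services deployed outside the cluster still need to be connected.

Prometheus can scrape data from outside the Kubernetes cluster in the following ways:

ServiceMonitor
Additional Scrape Configuration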

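ServiceMonitor

ServiceMonitor is a CRD that defines the service endpoints Prometheus should scrape and the scrape interval.
Monitoring a service outside the cluster through ServiceMonitor requires configuring a Service, an Endpoints object, and a ServiceMonitor.

Suppose a backend service is already deployed at 192.168.1.100:8000 and exposes its metrics at /metrics. To connect it to Prometheus, proceed as follows.

Run on the command line:

$ touch external-application.yaml
$ vim external-application.yaml

Then copy the following YAML into the file: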

---
apiVersion: v1
kind: Service
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
spec:
  type: ClusterIP
  ports:
  - name: metrics
    port: 9101
    protocol: TCP
    targetPort: 9101
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    app.kubernetes.io/name: application-exporter
subsets:
- addresses:
  - ip: 192.168.1.100  # external resource list
  ports:
  - name: metrics
    port: 8000
- addresses:
  - ip: 192.168.1.100  # external resource list 2
  ports:
  - name: metrics
    port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-application-exporter
  namespace: monitoring
  labels:
    app: external-application-exporter
    release: prometheus
spec:
  selector:
    matchLabels:          # Service selector
      app: external-application-exporter
  namespaceSelector:      # Namespace selector
    matchNames:
    - monitoring
  endpoints:
  - port: metrics           # scrape port name (defined in the Service)
    interval: 10s           # scrape frequency, set as needed; if omitted, Prometheus' global scrape interval is used
    path: /metrics          # defaults to /metrics

After saving the file, run:

kubectl apply -f external-application.yaml

Then open the Prometheus console and go to the Targets page. The newly added external-application-exporter appears in the list.

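Additional Scrape Configuration

Besides the HTTP service reachable by IP and port, I also have HTTPS services deployed on other servers that are reachable by domain name. I wanted to connect them in the same way.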

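I first tried modifying the Endpoints object. According to the official Kubernetes documentation, however, Endpoints supports only ip addresses and offers no place to configure the HTTPS protocol.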
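The limitation can be checked mechanically before writing any manifest. The following is a small sketch in Go, with a hypothetical helper named usableInEndpoints, that tests whether the host part of a target is an IP literal and therefore a candidate for the Service + Endpoints + ServiceMonitor approach. It is only an illustration of the constraint described above, not part of any Kubernetes tooling.

```go
package main

import (
	"fmt"
	"net"
)

// usableInEndpoints reports whether the host part of target is an IP literal,
// which is the only address form the Endpoints object accepts.
func usableInEndpoints(target string) bool {
	host, _, err := net.SplitHostPort(target)
	if err != nil {
		host = target // no port present; treat the whole string as the host
	}
	return net.ParseIP(host) != nil
}

func main() {
	for _, t := range []string{"192.168.1.100:8000", "www.baidu.com:443"} {
		fmt.Printf("%-20s endpoints-capable=%v\n", t, usableInEndpoints(t))
	}
}
```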
So let's try a different approach.

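Method One

The official documentation on Prometheus scrape configuration shows that the keyword for scrape configuration is scrape_config.

Our Prometheus was deployed through the kube-prometheus-stack Helm chart, so let's check the chart's values.yaml for a matching setting.

Run:

$ cat values.yaml | grep -C 20 scrape_config

The comments around the matches show that kube-prometheus-stack configures additional scrape jobs through additionalScrapeConfigs.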

So we write a configuration file to update the Prometheus release that Helm has already deployed.

$ touch prometheus.yaml
$ vim prometheus.yaml

Write the following content into it:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: external-application-exporter-https
      scrape_interval: 10s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: https
      tls_config:
        insecure_skip_verify: true
      static_configs:
      - targets: ["www.baidu.com:443"]

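Finally, upgrade the release:

$ helm upgrade -n monitoring -f prometheus.yaml prometheus kube-prometheus-stack-40.0.0.tgz

This upgrades the release using prometheus.yaml. kube-prometheus-stack-40.0.0.tgz is the chart file that I had already fetched locally with helm pull when deploying Prometheus.

The newly added data source now shows up on the Targets page of the Prometheus console.

We could stop here, but there is one drawback: every time a new domain is added for monitoring, the Helm release has to be upgraded again, which is not very convenient.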

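Method Two

Browsing the prometheus-operator source repository, I found a guide in its documentation on hot-reloading the scrape configuration. In short, the scrape targets are controlled through a Secret. When the content of the Secret changes, the Prometheus scrape configuration is reloaded on the fly.

Step 1: create the prometheus-additional.yaml file

$ touch prometheus-additional.yaml
$ vim prometheus-additional.yaml

Content of prometheus-additional.yaml: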

- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]

Step 2: create the Secret

Generate the manifest used to create the Secret:

$ kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml > additional-scrape-configs.yaml
$ cat additional-scrape-configs.yaml

The generated additional-scrape-configs.yaml looks like this:

apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K
kind: Secret
metadata:
  creationTimestamp: null
  name: additional-scrape-configs

Decode the encoded string to check its content:

$ echo "LSBqb2JfbmFtZTogZXh0ZXJuYWwtYXBwbGljYXRpb24tZXhwb3J0ZXItaHR0cHMKICBzY3JhcGVfaW50ZXJ2YWw6IDEwcwogIHNjcmFwZV90aW1lb3V0OiAxMHMKICBtZXRyaWNzX3BhdGg6IC9tZXRyaWNzCiAgc2NoZW1lOiBodHRwcwogIHRsc19jb25maWc6CiAgICBpbnNlY3VyZV9za2lwX3ZlcmlmeTogdHJ1ZQogIHN0YXRpY19jb25maWdzOgogICAgLSB0YXJnZXRzOiBbImNpYW10ZXN0LnNtb2EuY2M6NDQzIl0K" | base64 -d

Output:

- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]

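This confirms that the manifest was generated correctly. Next, create the Secret:

$ kubectl apply -f additional-scrape-configs.yaml -n monitoring

monitoring is the namespace where Prometheus is deployed; the Secret is placed in the same namespace.

Confirm that the Secret was created:

$ kubectl get secret -n monitoring

The additional-scrape-configs Secret should appear in the output.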

Finally, modify the CRD

Finally, reference this additional configuration in your prometheus.yaml CRD.

The official documentation asks us to modify the Prometheus configuration.
First find the Prometheus custom resource:

$ kubectl get prometheus -n monitoring
NAME                                    VERSION   REPLICAS   AGE
prometheus-kube-prometheus-prometheus   v2.38.0   1          2d18h

Then edit it:

$ kubectl edit prometheus prometheus-kube-prometheus-prometheus -n monitoring

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: prometheus
spec:
  ...
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  ...

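Finally, check the result in the Prometheus console: the domain-based service is now being monitored. To monitor additional domains in the future, only the Secret needs to be modified. Great!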

Alerting

For alerting we use the Prometheus + Alertmanager combination. The main flow runs from detecting an alert condition to handling the alert event.

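Our business requirement is to be notified when a service goes down so that we can respond promptly. The alerting rule we need therefore collects the liveness of the application. When the application is detected as down, the alert state is set to pending. Once the pending state has lasted for a configured time threshold, the state becomes firing. At that point the alert is triggered and submitted to alertmanager, which then sends the alert message, according to its rules, to the message receivers, such as WeCom (Enterprise WeChat), DingTalk, or email.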
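The pending-to-firing transition described above can be sketched as a small state tracker in Go. This is a simplified model written for this article, not Prometheus' actual implementation: the type names, the two-minute hold time, and the one-minute evaluation step are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// tracker remembers when the alert condition first became true.
type tracker struct {
	hold        time.Duration // how long the condition must hold before firing
	activeSince time.Time     // zero while the condition is false
}

// evaluate is called once per rule evaluation with the current check result.
func (t *tracker) evaluate(now time.Time, down bool) alertState {
	if !down {
		t.activeSince = time.Time{}
		return inactive
	}
	if t.activeSince.IsZero() {
		t.activeSince = now
	}
	if now.Sub(t.activeSince) >= t.hold {
		return firing
	}
	return pending
}

// simulate feeds one liveness sample per minute into a fresh tracker.
func simulate(holdMinutes int, samples []bool) []alertState {
	t := &tracker{hold: time.Duration(holdMinutes) * time.Minute}
	start := time.Date(2023, 7, 29, 9, 0, 0, 0, time.UTC)
	states := make([]alertState, 0, len(samples))
	for i, down := range samples {
		states = append(states, t.evaluate(start.Add(time.Duration(i)*time.Minute), down))
	}
	return states
}

func main() {
	// true means the service was detected as down at that evaluation.
	fmt.Println(simulate(2, []bool{false, true, true, true, false}))
	// prints: [inactive pending pending firing inactive]
}
```

The hold time is what keeps a single failed check from paging anyone: the alert only reaches alertmanager after the condition has stayed true for the whole threshold.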

The concrete steps are as follows:

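Step 1: Prometheus alert triggers

Reference: kube-prometheus-stack alerting configuration

Because I deployed kube-prometheus-stack with helm, I downloaded the chart kube-prometheus-stack-40.0.0.tgz to the local machine in advance (helm pull prometheus-community/kube-prometheus-stack --version=40.0.0) to keep the version consistent. After unpacking it, the PrometheusRules settings can be found in the values.yaml of kube-prometheus-stack.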

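Step 2: Alertmanager alert notifications

Reference: kube-prometheus-stack: configuring AlertManager

Once the Prometheus alert triggers have produced alert messages, they are sent to alertmanager for centralized management. alertmanager is configured with rules that distribute the alert messages to different receivers.
The alertmanager.config section can be found in the values.yaml of kube-prometheus-stack.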

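Step 3: Set up the WeCom group robot webhook

Reference: Prometheus alerting through a WeCom robot

Create a WeCom robot

In the group settings, open the group robot feature.
Then add a group robot and copy its Webhook address.

Write the deployment manifest wx-webhook-deployment.yaml: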

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wx-webhook
  template:
    metadata:
      labels:
        app: wx-webhook
    spec:
      containers:
      - name: wx-webhook
        image: guyongquan/webhook-adapter:latest
        imagePullPolicy: IfNotPresent
        args: ["--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxxxxxxxxxxx"]
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: wx-webhook
  labels:
    app: wx-webhook
spec:
  selector:
    app: wx-webhook
  ports:
    - name: wx-webhook
      port: 80
      protocol: TCP
      targetPort: 80
      nodePort: 30904
  type: NodePort

Here the URL in args: ["--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxxxxxxxxxxx"] is the Webhook address of the WeCom robot created in the previous step.
Then run:

$ kubectl apply -f wx-webhook-deployment.yaml -n monitoring
$ kubectl get pod -n monitoring | grep wx-webhook
wx-webhook-78d4dc95fc-9nsjn                              1/1     Running   0                26d
$ kubectl get service -n monitoring | grep wx-webhook
wx-webhook          NodePort    10.106.111.183   <none>        80:30904/TCP                 27d

This completes the setup of the WeCom group robot webhook.
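To give an idea of what such an adapter does, the sketch below converts an Alertmanager webhook body into the text message body that a WeCom group robot accepts. It is an illustration only and not the code of the guyongquan/webhook-adapter image: the type names, the message layout, and the sample alert (ServiceDown with a summary annotation) are assumptions, and only a subset of the Alertmanager payload fields is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// amPayload holds the subset of the Alertmanager webhook body used here.
type amPayload struct {
	Status string `json:"status"`
	Alerts []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
	} `json:"alerts"`
}

// wecomText is the text message body accepted by a WeCom group robot.
type wecomText struct {
	MsgType string `json:"msgtype"`
	Text    struct {
		Content string `json:"content"`
	} `json:"text"`
}

// convert turns an Alertmanager webhook body into a WeCom text message body,
// one line per alert.
func convert(body []byte) ([]byte, error) {
	var p amPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	var lines []string
	for _, a := range p.Alerts {
		lines = append(lines, fmt.Sprintf("[%s] %s: %s",
			a.Status, a.Labels["alertname"], a.Annotations["summary"]))
	}
	var msg wecomText
	msg.MsgType = "text"
	msg.Text.Content = strings.Join(lines, "\n")
	return json.Marshal(msg)
}

func main() {
	// Hypothetical sample payload with a single firing alert.
	sample := []byte(`{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ServiceDown"},"annotations":{"summary":"demo service is unreachable"}}]}`)
	out, err := convert(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A complete adapter would additionally listen for POST requests from alertmanager and forward the converted body to the robot's Webhook address.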

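Here I use WeCom as the receiver of alert messages; alertmanager supports other receivers as well. See this article: kube-promethues monitoring and alerting in detail (email, DingTalk, WeChat, WeCom robot, in-house platform)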

Problems Encountered

After updating the scrape-configuration Secret, the change did not show up in the Prometheus console. I tried restarting the pod prometheus-prometheus-kube-prometheus-prometheus-0, which failed with:

ts=2023-07-29T09:30:54.188Z caller=main.go:454 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" file=/etc/prometheus/config_out/prometheus.env.yaml err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"external-application-exporter-https\""

The cause is an error in the custom scrape configuration that prevents Prometheus from starting: scrape_timeout is greater than scrape_interval.

- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 30s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]

It needs to be changed to:

- job_name: external-application-exporter-https
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  static_configs:
    - targets: ["www.baidu.com:443"]
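A mistake like this can be caught before the Secret is updated. The following is a minimal sketch in Go of such a pre-check, with a hypothetical function name checkScrapeTiming. It relies on time.ParseDuration, which covers values such as 10s or 1m but not every unit that Prometheus accepts (for example d), so treat it as an approximation of the rule rather than Prometheus' own validation.

```go
package main

import (
	"fmt"
	"time"
)

// checkScrapeTiming mirrors the rule behind the error above:
// scrape_timeout must not be greater than scrape_interval.
func checkScrapeTiming(job, interval, timeout string) error {
	i, err := time.ParseDuration(interval)
	if err != nil {
		return fmt.Errorf("job %q: bad scrape_interval: %w", job, err)
	}
	t, err := time.ParseDuration(timeout)
	if err != nil {
		return fmt.Errorf("job %q: bad scrape_timeout: %w", job, err)
	}
	if t > i {
		return fmt.Errorf("job %q: scrape timeout %s greater than scrape interval %s", job, t, i)
	}
	return nil
}

func main() {
	job := "external-application-exporter-https"
	fmt.Println(checkScrapeTiming(job, "10s", "30s")) // the broken configuration
	fmt.Println(checkScrapeTiming(job, "10s", "10s")) // the corrected configuration
}
```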

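References

Getting started with Grafana & Prometheus
Prometheus monitoring + Grafana + Alertmanager alerting: installation and usage (illustrated guide)
Official Prometheus tutorial
Helm repository
GitHub repository of the kube-prometheus project
Official Kratos tutorial
Official Kubernetes documentation
prometheus-operator source code
kube-prometheus-stack alerting configuration
kube-prometheus-stack: configuring AlertManager
Prometheus alerting through a WeCom robot
kube-promethues monitoring and alerting in detail (email, DingTalk, WeChat, WeCom robot, in-house platform)

Tags: kubernetes prometheus cloud-native

Reposted from: https://blog.csdn.net/qq_26356861/article/details/131997852
Copyright belongs to the original author, 谢小鱼. In case of infringement, please contact us for removal.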