Setting up Prometheus + Alertmanager + Grafana: Study Notes

Aoi Komiya

Preface

Zabbix suffers from performance bottlenecks, an unfriendly UI, and large amounts of stale data, and it has limitations when it comes to monitoring containerized applications.

Here I try Thanos + Prometheus + Grafana as the operations monitoring stack.

Architecture

(architecture diagram)

Prometheus alerting rules are good at determining what is wrong right now, but they are not a complete notification solution. Another layer is needed on top of the bare alert definitions to add grouping, notification rate limiting, silencing, and alert dependencies. In the Prometheus ecosystem, Alertmanager fills this role.

Details

Prometheus is written in Go.
Two candidate versions for the setup; I settled on the LTS release for the lab environment:

  • 2.53.2 LTS
  • 3.0.0 latest

Installing Prometheus

Docker install

Pull the 2.53.2 LTS image:

docker pull prom/prometheus:v2.53.2

Use a named volume for persistent storage

When deploying Prometheus in production under Docker, named volumes are preferred over bind mounts:

  • Easier to back up or migrate than bind mounts, and portable across platforms
  • Decoupled from the container lifecycle
  • Better performance
  • Suited to data that must persist (for non-persistent data, use tmpfs)
  • Can be shared between containers

Volume-related Docker commands:

# CRUD
docker volume create/ls/rm [volume-name]

Create the Prometheus container

# Create the volume
docker volume create prometheus-data

# Start the service detached, with the named volume and a local config file
docker run -d \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus

Handy Docker commands

# Inspect containers
docker ps       # running containers
docker ps -a    # all containers, including stopped ones
docker ps -aq   # IDs of all containers
docker rm xxx   # remove a container
docker rmi xxx  # remove an image

# Quickly remove all containers
docker rm $(docker ps -aq)

Internals

The Dockerfile behind the default image:

ARG ARCH="amd64"
ARG OS="linux"
FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
LABEL maintainer="The Prometheus Authors <[email protected]>"
LABEL org.opencontainers.image.source="https://github.com/prometheus/prometheus"

ARG ARCH="amd64"
ARG OS="linux"
COPY .build/${OS}-${ARCH}/prometheus /bin/prometheus
COPY .build/${OS}-${ARCH}/promtool /bin/promtool
COPY documentation/examples/prometheus.yml /etc/prometheus/prometheus.yml
COPY LICENSE /LICENSE
COPY NOTICE /NOTICE
COPY npm_licenses.tar.bz2 /npm_licenses.tar.bz2

WORKDIR /prometheus
RUN chown -R nobody:nobody /etc/prometheus /prometheus

USER nobody
EXPOSE 9090
VOLUME [ "/prometheus" ]
ENTRYPOINT [ "/bin/prometheus" ]
CMD [ "--config.file=/etc/prometheus/prometheus.yml", \
"--storage.tsdb.path=/prometheus" ]

Output of docker inspect:

[
{
"Id": "sha256:4f7c13071e390a3d3a59041d8e702f08ee184e1099aa15cd82c7913d7b6fde8d",
"RepoTags": [
"prom/prometheus:latest"
],
"RepoDigests": [
"prom/prometheus@sha256:3b9b2a15d376334da8c286d995777d3b9315aa666d2311170ada6059a517b74f"
],
"Parent": "",
"Comment": "buildkit.dockerfile.v0",
"Created": "2024-11-14T17:00:12.207533746Z",
"Container": "",
"ContainerConfig": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": null,
"Cmd": null,
"Image": "",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": null
},
"DockerVersion": "",
"Author": "The Prometheus Authors <[email protected]>",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "nobody",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"9090/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"--config.file=/etc/prometheus/prometheus.yml",
"--storage.tsdb.path=/prometheus"
],
"ArgsEscaped": true,
"Image": "",
"Volumes": {
"/prometheus": {}
},
"WorkingDir": "/prometheus",
"Entrypoint": [
"/bin/prometheus"
],
"OnBuild": null,
"Labels": {
"maintainer": "The Prometheus Authors <[email protected]>"
}
},
"Architecture": "amd64",
"Os": "linux",
"Size": 291785425,
"VirtualSize": 291785425,
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/72d8ad201355568c5e7927197af0dbde4854730d6482115b386373bbc0528203/diff:/var/lib/docker/overlay2/6897eb2494ea920c32f6a50334b09902b2143c531ead91682743d463fe5fa115/diff:/var/lib/docker/overlay2/a7f20f114282ee84b0d7694d1ce7f74cc36168a34aaeb9f6f019fd955f7d1ca8/diff:/var/lib/docker/overlay2/3b7e8cae6d68e4b570ebe9568f0cb3964bd1c16d5a69c914f65ca0040cd16df3/diff:/var/lib/docker/overlay2/b787c85b914238d1f20cab24d1ef32e3a6a79273c16c42bbabc28af107bca8a0/diff:/var/lib/docker/overlay2/485523c83750f065007f98ccd82991c5393d7692908f91efad140fba102bd433/diff:/var/lib/docker/overlay2/b82bc26e96c2eee065a136db5be025cbf60fec8fc5e3890f5d0c808d5119d654/diff:/var/lib/docker/overlay2/b7d255da88543eca6cba90968278095e2d32ef0dff0f675fbd5a35d8bef9ca39/diff:/var/lib/docker/overlay2/c34b168594998b7eb393e2168c803a443d7d2facaa79ef4cec51a64672ca5cb6/diff",
"MergedDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/merged",
"UpperDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/diff",
"WorkDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/work"
},
"Name": "overlay2"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:1e604deea57dbda554a168861cff1238f93b8c6c69c863c43aed37d9d99c5fed",
"sha256:6b83872188a9e8912bee1d43add5e9bc518601b02a14a364c0da43b0d59acf33",
"sha256:c41589936910264dc000f410db334b5c6ad1067f7b2e5940a8c9ce4ef3f68a70",
"sha256:06ff10b1fcccc440f50777ddaa2031b799dca8eb01d60f03dd2404d4b8ba2942",
"sha256:49d1f2743e2e6cdabd42a0a04fa93c8fd4b944f0fce5f0f89d49cfbdf7ae2e22",
"sha256:739ef8c6e23f487e5aadbe7ef79968b0ba5cb67840f87f7924b0818448ce0040",
"sha256:fd3dddd88c270d27c933467b37bc9ea547781376ceee3e9723fb3c61c2216a0f",
"sha256:12f8e0bd6ece7c52626a715ef722e2ccfc7a46c1287b6cf34b4dcab2b0252aa6",
"sha256:f46e2af56f782ca6eb14a5c2f3937221599c9d5ecda3f0d25a48ef2a60bb8c90",
"sha256:5ed64e0f662e204eb04ed846f496d2c0a7821cd06f809bd6063e683c2f5d8c22"
]
},
"Metadata": {
"LastTagTime": "0001-01-01T00:00:00Z"
}
}
]

The prometheus.yml inside the image

It can serve as a reference when writing your own config file; feel free to skip it, as the configuration file is covered in detail later.
The config file is YAML, which is very sensitive to indentation, so read the docs carefully or you will waste time chasing errors.

# my global config
global:
  scrape_interval: 15s
  # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s
  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

Install from the binary tarball

After going back and forth, I ended up with the tarball anyway.
Download, unpack, and run.

systemd unit file

For software installed via the package manager, unit files usually live under /usr/lib/systemd/system/xxx.service.
Hand-written unit files go in /etc/systemd/system/xxx.service, which takes precedence over a same-named file in the path above.

[Unit]
# Human-readable description of the service
Description=prometheus

[Service]
# Command run to start the service: the binary plus config-file and data-dir flags
ExecStart=/home/aoi/prometheus/prometheus-2.53.3.linux-amd64/prometheus \
  --config.file=/home/aoi/prometheus/prometheus-2.53.3.linux-amd64/prometheus.yml \
  --storage.tsdb.path=/home/aoi/prometheus/data
# Reload the config by sending SIGHUP to the main process
ExecReload=/bin/kill -HUP $MAINPID
# On stop, kill only the main process
KillMode=process
# Restart automatically if the service fails
Restart=on-failure

[Install]
# Start when the multi-user target is reached
WantedBy=multi-user.target


Adding a monitoring target

Two steps are needed:

  1. Add the target to the Prometheus config file
  2. Install and start node-exporter on the client machine

1. Add the target

Add the target under scrape_configs:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"] # monitor the local Prometheus service itself

  - job_name: "node"
    static_configs:
      - targets: ["ip:9100"] # monitor a machine running node_exporter

2. Install node-exporter

  1. Download the tarball from the official site and scp it to the host, or wget it directly (with some performance loss on the VM)
  2. Unpack it and run the binary (it normally ships with exec permission already)

PS. The server has no desktop environment, only a console; switch between ttys with Alt.

Configure the node-exporter systemd unit

Write /etc/systemd/system/node_exporter.service:

[Unit]
Description=node_exporter

[Service]
ExecStart=/path/to/binary/file
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Reload systemd units
sudo systemctl daemon-reload

# Enable the service at boot
sudo systemctl enable node_exporter

# Start the node_exporter service
sudo systemctl restart node_exporter

Problem: the LAN IP keeps changing

The service itself runs fine:

sudo systemctl status node_exporter.service

The problem is that every time the cluster machines reboot, their LAN IPs change.

issue1
The IP-change problem is solved by editing the NIC configuration file and setting a static IP.
The cause was random assignment from the DHCP address pool.

issue2
The machine runs Ubuntu Server. Even after I edited the NIC configuration file and set a static IP, the IP changed again after a reboot.

On inspection, the NIC configuration file was being rolled back. The reason:
the NIC configuration is generated by cloud-init, which re-renders it (e.g. on boot), so manual edits get reverted.

Fix:
write network: {config: disabled} into /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

While being lazy and trying to do this in a single command, I ran into a small issue worth recording.

Why can't the following one-liner do the job?

sudo echo "network: {config: disabled}" >> /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

When the shell executes an output redirection, it first opens the target file, and only performs the write once the file has been opened successfully.
Even with sudo in front of the command, the shell still performs the redirection as the ordinary user, so it cannot open the root-owned file.

This can be done with a pipe and tee:

echo "network: {config: disabled}" | sudo tee -a /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
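To sanity-check that the line actually landed, a quick grep works. The sketch below uses a scratch file under /tmp instead of the real cloud-init path, so it can be run anywhere without sudo:

```shell
# Stand-in for /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
cfg=/tmp/99-disable-network-config.cfg

# Same append as above, minus sudo since /tmp is writable
echo "network: {config: disabled}" | tee -a "$cfg" > /dev/null

# grep -q exits 0 only when the line is present
if grep -q 'network: {config: disabled}' "$cfg"; then
    echo "override in place"   # → override in place
fi
```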

After this change, booting gets slower: the machine waits on "wait for network configuration" during startup.

Installing Grafana

Install via deb package

sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_11.2.4_amd64.deb
sudo dpkg -i grafana-enterprise_11.2.4_amd64.deb

Service setup

Since Grafana was installed from a package rather than from a tarball, there is no need to hand-write a unit file; just set up the service with systemctl:

sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Installing Alertmanager

So far Prometheus, node-exporter, and Grafana are installed, covering alert evaluation, client-side metrics collection, and data visualization respectively.
Alertmanager, installed next, is the alert manager: it adds grouping, notification rate limiting, silencing, and alert dependencies on top of the bare alert definitions.

Installation

From the binary tarball again; nothing new to add.

The systemd unit file:

[Unit]
Description=prometheus-alertmanager

[Service]
ExecStart=/home/aoi/prometheus/alertmanager-0.27.0.linux-amd64/alertmanager \
--config.file=/home/aoi/prometheus/alertmanager-0.27.0.linux-amd64/alertmanager.yml
# --storage.tsdb.path=/home/aoi/prometheus/data/alertmanager-data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

When the service refused to start with this config, a careful read of the logs pointed at the line I commented out above: Alertmanager has no --storage.tsdb.path flag, Prometheus does (Alertmanager's data directory flag is --storage.path instead).

When something goes wrong, just read the unit's logs:

journalctl --unit=alertmanager.service

Configuration

Alertmanager's configuration file is separate from Prometheus's.
First, add the following to prometheus.yml so Prometheus knows where to send alerts:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

Then write alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]' # alerts are delivered by email, so SMTP must be configured
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxx' # use the authorization code from your mail provider's settings
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  #routes:
  #  - receiver: 'dev-mail'
  #  - receiver: 'web.hook'
  receiver: 'dev-mail'

receivers:
  #- name: 'web.hook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001/'
  - name: 'dev-mail'
    email_configs:
      - to: '[email protected]' # alert recipient (the on-call operator)

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
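One piece is still missing for alerts to exist at all: Alertmanager only routes what Prometheus sends it, so Prometheus needs at least one alerting rule loaded via rule_files in prometheus.yml. A minimal, hypothetical rule file (file name, threshold, and labels are my own choices, not from the original setup) could look like this:

```yaml
# node_rules.yml - reference it from prometheus.yml with:
#   rule_files:
#     - "node_rules.yml"
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        # fires when average CPU usage on an instance stays above 80% for 1 minute
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```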

Manually driving up CPU to test

cat /dev/zero > /dev/null

The configured email alert arrives as expected,
and the alert can be watched going pending and then firing in the Prometheus web UI.
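If you'd rather not remember to Ctrl-C the burner, timeout can bound it; a small sketch (the 5-second duration is arbitrary):

```shell
# Burn one core for 5 seconds, then stop automatically.
# timeout kills the command when the limit is hit; its exit status is then 124.
status=0
timeout 5 sh -c 'cat /dev/zero > /dev/null' || status=$?
echo "exit status: $status"   # → exit status: 124
```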

Querying target data

node_exporter collects the raw data; Prometheus's built-in PromQL then combines it into queries. Each expression below can be tested interactively in the Prometheus web UI.

  1. CPU metrics

    # CPU usage (%)
    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  2. Memory metrics

    # Memory usage (%)
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

    # Available memory (MB)
    node_memory_MemAvailable_bytes / 1024 / 1024

  3. Disk metrics

    # Total filesystem size (GB)
    node_filesystem_size_bytes{fstype=~"ext4|xfs"} / 1024 / 1024 / 1024

    # Free filesystem space (GB)
    node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / 1024 / 1024 / 1024

    # Disk usage (%)
    (1 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) * 100

  4. Network metrics

    # Inbound traffic (for a given NIC)
    irate(node_network_receive_bytes_total{device='ens32'}[5m])

    # Outbound traffic (for a given NIC)
    irate(node_network_transmit_bytes_total{device='ens32'}[5m])
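The expressions above need a running node_exporter, but the arithmetic itself can be sanity-checked locally with plain shell tools. A rough, Linux-only sketch (it reads /proc and df directly; the CPU figure is simplified and ignores iowait/irq, so none of this replaces the PromQL):

```shell
# CPU usage % over a 1-second window: 100 minus the idle share, like the irate() expression.
# Reads the aggregate "cpu" line of /proc/stat twice; user/nice/system/idle only.
read -r _ u n s idle _ < /proc/stat
t1=$((u + n + s + idle)); i1=$idle
sleep 1
read -r _ u n s idle _ < /proc/stat
t2=$((u + n + s + idle)); i2=$idle
echo "cpu used: $((100 - 100 * (i2 - i1) / (t2 - t1)))%"

# Memory usage %, same formula as above: (1 - MemAvailable/MemTotal) * 100
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END {printf "mem used: %.1f%%\n", (1 - a/t) * 100}' /proc/meminfo

# Disk usage % of the root filesystem, mirroring (1 - free/size) * 100
df -P / | awk 'NR==2 {printf "disk used: %.1f%%\n", (1 - $4/$2) * 100}'
```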

References

https://www.cnblogs.com/tchua/category/1494538.html

  • Title: Setting up Prometheus + Alertmanager + Grafana: Study Notes
  • Author: Aoi Komiya
  • Created at: 2024-11-23 11:46:25
  • Updated at: 2024-11-23 12:17:00
  • Link: https://blog.komiya.monster/2024/11/23/prometheus/
  • License: This work is licensed under CC BY-NC-SA 4.0.