Setting up Prometheus + Alertmanager + Grafana: Study Notes

Aoi Komiya

Preface

Zabbix suffers from performance bottlenecks, an unfriendly UI, and large amounts of stale data, and it has limitations when it comes to monitoring containerized applications.

Here I try Thanos + Prometheus + Grafana as the operations monitoring stack.

Architecture

(architecture diagram)

Prometheus alerting rules are good at determining what is wrong right now, but they are not a complete notification solution. Another layer is needed on top of the bare alert definitions to add grouping, notification rate limiting, silencing, and alert dependencies. In the Prometheus ecosystem, Alertmanager fills this role.

Details

Prometheus is written in Go.
Two candidate versions for the setup; I settled on the LTS release for the lab environment:

  • 2.53.2 LTS
  • 3.0.0 latest

Installing Prometheus

Docker install

Pull the 2.53.2 LTS image:

docker pull prom/prometheus:v2.53.2

Use a named volume for persistent storage

When deploying Prometheus in production under Docker, named volumes are preferred over bind mounts:

  • Easier to back up or migrate than bind mounts, and portable across platforms
  • Decoupled from the container lifecycle
  • Better performance
  • Suited to data that must persist (for non-persistent data, use tmpfs)
  • Can be shared between containers

Volume-related Docker commands:

# CRUD
docker volume create/ls/rm [volume-name]

Create the Prometheus container

# Create the volume
docker volume create prometheus-data

# Start the service detached, with the named volume and a local config file
docker run -d \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus

Handy Docker commands

# Inspect containers
docker ps       # running containers
docker ps -a    # all containers, including stopped ones
docker ps -aq   # IDs of all containers
docker rm xxx   # remove a container
docker rmi xxx  # remove an image

# Quickly remove all containers
docker rm $(docker ps -aq)

Internals

The Dockerfile behind the default image:

ARG ARCH="amd64"
ARG OS="linux"
FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
LABEL maintainer="The Prometheus Authors <[email protected]>"
LABEL org.opencontainers.image.source="https://github.com/prometheus/prometheus"

ARG ARCH="amd64"
ARG OS="linux"
COPY .build/${OS}-${ARCH}/prometheus /bin/prometheus
COPY .build/${OS}-${ARCH}/promtool /bin/promtool
COPY documentation/examples/prometheus.yml /etc/prometheus/prometheus.yml
COPY LICENSE /LICENSE
COPY NOTICE /NOTICE
COPY npm_licenses.tar.bz2 /npm_licenses.tar.bz2

WORKDIR /prometheus
RUN chown -R nobody:nobody /etc/prometheus /prometheus

USER nobody
EXPOSE 9090
VOLUME [ "/prometheus" ]
ENTRYPOINT [ "/bin/prometheus" ]
CMD [ "--config.file=/etc/prometheus/prometheus.yml", \
"--storage.tsdb.path=/prometheus" ]

Output of docker inspect:

[
{
"Id": "sha256:4f7c13071e390a3d3a59041d8e702f08ee184e1099aa15cd82c7913d7b6fde8d",
"RepoTags": [
"prom/prometheus:latest"
],
"RepoDigests": [
"prom/prometheus@sha256:3b9b2a15d376334da8c286d995777d3b9315aa666d2311170ada6059a517b74f"
],
"Parent": "",
"Comment": "buildkit.dockerfile.v0",
"Created": "2024-11-14T17:00:12.207533746Z",
"Container": "",
"ContainerConfig": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": null,
"Cmd": null,
"Image": "",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": null
},
"DockerVersion": "",
"Author": "The Prometheus Authors <[email protected]>",
"Config": {
"Hostname": "",
"Domainname": "",
"User": "nobody",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"9090/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"--config.file=/etc/prometheus/prometheus.yml",
"--storage.tsdb.path=/prometheus"
],
"ArgsEscaped": true,
"Image": "",
"Volumes": {
"/prometheus": {}
},
"WorkingDir": "/prometheus",
"Entrypoint": [
"/bin/prometheus"
],
"OnBuild": null,
"Labels": {
"maintainer": "The Prometheus Authors <[email protected]>"
}
},
"Architecture": "amd64",
"Os": "linux",
"Size": 291785425,
"VirtualSize": 291785425,
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/72d8ad201355568c5e7927197af0dbde4854730d6482115b386373bbc0528203/diff:/var/lib/docker/overlay2/6897eb2494ea920c32f6a50334b09902b2143c531ead91682743d463fe5fa115/diff:/var/lib/docker/overlay2/a7f20f114282ee84b0d7694d1ce7f74cc36168a34aaeb9f6f019fd955f7d1ca8/diff:/var/lib/docker/overlay2/3b7e8cae6d68e4b570ebe9568f0cb3964bd1c16d5a69c914f65ca0040cd16df3/diff:/var/lib/docker/overlay2/b787c85b914238d1f20cab24d1ef32e3a6a79273c16c42bbabc28af107bca8a0/diff:/var/lib/docker/overlay2/485523c83750f065007f98ccd82991c5393d7692908f91efad140fba102bd433/diff:/var/lib/docker/overlay2/b82bc26e96c2eee065a136db5be025cbf60fec8fc5e3890f5d0c808d5119d654/diff:/var/lib/docker/overlay2/b7d255da88543eca6cba90968278095e2d32ef0dff0f675fbd5a35d8bef9ca39/diff:/var/lib/docker/overlay2/c34b168594998b7eb393e2168c803a443d7d2facaa79ef4cec51a64672ca5cb6/diff",
"MergedDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/merged",
"UpperDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/diff",
"WorkDir": "/var/lib/docker/overlay2/669e2c1776525fbd3480c72711acd34e561bece15f2c0430ba3eb7130c48a196/work"
},
"Name": "overlay2"
},
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:1e604deea57dbda554a168861cff1238f93b8c6c69c863c43aed37d9d99c5fed",
"sha256:6b83872188a9e8912bee1d43add5e9bc518601b02a14a364c0da43b0d59acf33",
"sha256:c41589936910264dc000f410db334b5c6ad1067f7b2e5940a8c9ce4ef3f68a70",
"sha256:06ff10b1fcccc440f50777ddaa2031b799dca8eb01d60f03dd2404d4b8ba2942",
"sha256:49d1f2743e2e6cdabd42a0a04fa93c8fd4b944f0fce5f0f89d49cfbdf7ae2e22",
"sha256:739ef8c6e23f487e5aadbe7ef79968b0ba5cb67840f87f7924b0818448ce0040",
"sha256:fd3dddd88c270d27c933467b37bc9ea547781376ceee3e9723fb3c61c2216a0f",
"sha256:12f8e0bd6ece7c52626a715ef722e2ccfc7a46c1287b6cf34b4dcab2b0252aa6",
"sha256:f46e2af56f782ca6eb14a5c2f3937221599c9d5ecda3f0d25a48ef2a60bb8c90",
"sha256:5ed64e0f662e204eb04ed846f496d2c0a7821cd06f809bd6063e683c2f5d8c22"
]
},
"Metadata": {
"LastTagTime": "0001-01-01T00:00:00Z"
}
}
]

The prometheus.yml inside the image

It can serve as a reference when writing your own config file; feel free to skip it, as the configuration file is covered in detail later.
The config file is YAML, which is very sensitive to indentation, so read the docs carefully or you will waste time chasing errors.

# my global config
global:
  scrape_interval: 15s
  # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s
  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

Install from the binary tarball

After going back and forth, I ended up with the tarball anyway.
Download, unpack, and run.

systemd unit file

For software installed via the package manager, unit files usually live under /usr/lib/systemd/system/xxx.service.
Hand-written unit files go in /etc/systemd/system/xxx.service, which takes precedence over a same-named file in the path above.

[Unit]
# Human-readable description of the service
Description=prometheus

[Service]
# Command run to start the service: the binary plus config-file and data-dir flags
ExecStart=/home/aoi/prometheus/prometheus-2.53.3.linux-amd64/prometheus \
  --config.file=/home/aoi/prometheus/prometheus-2.53.3.linux-amd64/prometheus.yml \
  --storage.tsdb.path=/home/aoi/prometheus/data
# Reload the config by sending SIGHUP to the main process
ExecReload=/bin/kill -HUP $MAINPID
# On stop, kill only the main process
KillMode=process
# Restart automatically if the service fails
Restart=on-failure

[Install]
# Start when the multi-user target is reached
WantedBy=multi-user.target


Adding a monitoring target

Two steps are needed:

  1. Add the target to the Prometheus config file
  2. Install and start node-exporter on the client machine

1. Add the target

Add the target under scrape_configs:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"] # monitor the local Prometheus service itself

  - job_name: "node"
    static_configs:
      - targets: ["ip:9100"] # monitor a machine running node_exporter

2. Install node-exporter

  1. Download the tarball from the official site and scp it to the host, or wget it directly (with some performance loss on the VM)
  2. Unpack it and run the binary (it normally ships with exec permission already)

PS. The server has no desktop environment, only a console; switch between ttys with Alt.

Configure the node-exporter systemd unit

Write /etc/systemd/system/node_exporter.service:

[Unit]
Description=node_exporter

[Service]
ExecStart=/path/to/binary/file
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Reload systemd units
sudo systemctl daemon-reload

# Enable the service at boot
sudo systemctl enable node_exporter

# Start the node_exporter service
sudo systemctl restart node_exporter

Problem: the LAN IP keeps changing

The service itself runs fine:

sudo systemctl status node_exporter.service

The problem is that every time the cluster machines reboot, their LAN IPs change.

issue1
The IP-change problem is solved by editing the NIC configuration file and setting a static IP.
The cause was random assignment from the DHCP address pool.

issue2
The machine runs Ubuntu Server. Even after I edited the NIC configuration file and set a static IP, the IP changed again after a reboot.

On inspection, the NIC configuration file was being rolled back. The reason:
the NIC configuration is generated by cloud-init, which re-renders it (e.g. on boot), so manual edits get reverted.

Fix:
write network: {config: disabled} into /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

While being lazy and trying to do this in a single command, I ran into a small issue worth recording.

Why can't the following one-liner do the job?

sudo echo "network: {config: disabled}" >> /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

When the shell executes an output redirection, it first opens the target file, and only performs the write once the file has been opened successfully.
Even with sudo in front of the command, the shell still performs the redirection as the ordinary user, so it cannot open the root-owned file.

This can be done with a pipe and tee:

echo "network: {config: disabled}" | sudo tee -a /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
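To sanity-check that the line actually landed, a quick grep works. The sketch below uses a scratch file under /tmp instead of the real cloud-init path, so it can be run anywhere without sudo:

```shell
# Stand-in for /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
cfg=/tmp/99-disable-network-config.cfg

# Same append as above, minus sudo since /tmp is writable
echo "network: {config: disabled}" | tee -a "$cfg" > /dev/null

# grep -q exits 0 only when the line is present
if grep -q 'network: {config: disabled}' "$cfg"; then
    echo "override in place"   # → override in place
fi
```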

After this change, booting gets slower: the machine waits on "wait for network configuration" during startup.

Installing Grafana

Install via deb package

sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_11.2.4_amd64.deb
sudo dpkg -i grafana-enterprise_11.2.4_amd64.deb

Service setup

Since Grafana was installed from a package rather than from a tarball, there is no need to hand-write a unit file; just set up the service with systemctl:

sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Installing Alertmanager

So far Prometheus, node-exporter, and Grafana are installed, covering alert evaluation, client-side metrics collection, and data visualization respectively.
Alertmanager, installed next, is the alert manager: it adds grouping, notification rate limiting, silencing, and alert dependencies on top of the bare alert definitions.

Installation

From the binary tarball again; nothing new to add.

The systemd unit file:

[Unit]
Description=prometheus-alertmanager

[Service]
ExecStart=/home/aoi/prometheus/alertmanager-0.27.0.linux-amd64/alertmanager \
--config.file=/home/aoi/prometheus/alertmanager-0.27.0.linux-amd64/alertmanager.yml
# --storage.tsdb.path=/home/aoi/prometheus/data/alertmanager-data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

When the service refused to start with this config, a careful read of the logs pointed at the line I commented out above: Alertmanager has no --storage.tsdb.path flag, Prometheus does (Alertmanager's data directory flag is --storage.path instead).

When something goes wrong, just read the unit's logs:

journalctl --unit=alertmanager.service

Configuration

Alertmanager's configuration file is separate from Prometheus's.
First, add the following to prometheus.yml so Prometheus knows where to send alerts:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

Then write alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]' # alerts are delivered by email, so SMTP must be configured
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxx' # use the authorization code from your mail provider's settings
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  #routes:
  #  - receiver: 'dev-mail'
  #  - receiver: 'web.hook'
  receiver: 'dev-mail'

receivers:
  #- name: 'web.hook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001/'
  - name: 'dev-mail'
    email_configs:
      - to: '[email protected]' # alert recipient (the on-call operator)

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
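One piece is still missing for alerts to exist at all: Alertmanager only routes what Prometheus sends it, so Prometheus needs at least one alerting rule loaded via rule_files in prometheus.yml. A minimal, hypothetical rule file (file name, threshold, and labels are my own choices, not from the original setup) could look like this:

```yaml
# node_rules.yml - reference it from prometheus.yml with:
#   rule_files:
#     - "node_rules.yml"
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        # fires when average CPU usage on an instance stays above 80% for 1 minute
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```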

Manually driving up CPU to test

cat /dev/zero > /dev/null

The configured email alert arrives as expected,
and the alert can be watched going pending and then firing in the Prometheus web UI.
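If you'd rather not remember to Ctrl-C the burner, timeout can bound it; a small sketch (the 5-second duration is arbitrary):

```shell
# Burn one core for 5 seconds, then stop automatically.
# timeout kills the command when the limit is hit; its exit status is then 124.
status=0
timeout 5 sh -c 'cat /dev/zero > /dev/null' || status=$?
echo "exit status: $status"   # → exit status: 124
```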

Querying target data

node_exporter collects the raw data; Prometheus's built-in PromQL then combines it into queries. Each expression below can be tested interactively in the Prometheus web UI.

  1. CPU metrics

    # CPU usage (%)
    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  2. Memory metrics

    # Memory usage (%)
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

    # Available memory (MB)
    node_memory_MemAvailable_bytes / 1024 / 1024

  3. Disk metrics

    # Total filesystem size (GB)
    node_filesystem_size_bytes{fstype=~"ext4|xfs"} / 1024 / 1024 / 1024

    # Free filesystem space (GB)
    node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / 1024 / 1024 / 1024

    # Disk usage (%)
    (1 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) * 100

  4. Network metrics

    # Inbound traffic (for a given NIC)
    irate(node_network_receive_bytes_total{device='ens32'}[5m])

    # Outbound traffic (for a given NIC)
    irate(node_network_transmit_bytes_total{device='ens32'}[5m])
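The expressions above need a running node_exporter, but the arithmetic itself can be sanity-checked locally with plain shell tools. A rough, Linux-only sketch (it reads /proc and df directly; the CPU figure is simplified and ignores iowait/irq, so none of this replaces the PromQL):

```shell
# CPU usage % over a 1-second window: 100 minus the idle share, like the irate() expression.
# Reads the aggregate "cpu" line of /proc/stat twice; user/nice/system/idle only.
read -r _ u n s idle _ < /proc/stat
t1=$((u + n + s + idle)); i1=$idle
sleep 1
read -r _ u n s idle _ < /proc/stat
t2=$((u + n + s + idle)); i2=$idle
echo "cpu used: $((100 - 100 * (i2 - i1) / (t2 - t1)))%"

# Memory usage %, same formula as above: (1 - MemAvailable/MemTotal) * 100
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END {printf "mem used: %.1f%%\n", (1 - a/t) * 100}' /proc/meminfo

# Disk usage % of the root filesystem, mirroring (1 - free/size) * 100
df -P / | awk 'NR==2 {printf "disk used: %.1f%%\n", (1 - $4/$2) * 100}'
```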

References

https://www.cnblogs.com/tchua/category/1494538.html

  • Title: Setting up Prometheus + Alertmanager + Grafana: Study Notes
  • Author: Aoi Komiya
  • Created at: 2024-11-23 11:46:25
  • Updated at: 2024-11-23 12:17:00
  • Link: https://blog.komiya.monster/2024/11/23/prometheus/
  • License: This work is licensed under CC BY-NC-SA 4.0.