
Prometheus + Thanos High-Availability Deployment

Author: 只想摆烂的运维 | 2025-10-22 18:37:05 | Category: prometheus

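I. Introduction to Prometheus + Thanos

Thanos is one of the high-availability solutions for Prometheus. It integrates seamlessly with Prometheus and provides several advanced features that cover the needs for long-term storage, unlimited scaling, a global query view, and non-intrusive integration.

Prometheus + Thanos architecture

Thanos Sidecar:

Connects to Prometheus and exposes its data to Thanos Query, and/or uploads it to object storage for long-term retention.

Thanos Query:

Implements the Prometheus API and provides a global query view. It aggregates the data served by the StoreAPI endpoints and returns the result to the querying client (such as Grafana).

Thanos Store Gateway:

Exposes the data in object storage to Thanos Query.

Thanos Ruler:

Evaluates rules against the monitoring data and fires alerts. It can also compute new series, expose them to Thanos Query, and/or upload them to object storage for long-term retention.

Thanos Compact:

Compacts and downsamples the data in object storage, which speeds up queries over long time ranges.

Thanos Receiver:

Receives the data that Prometheus sends through remote write (read from its WAL), then exposes it and/or uploads it to cloud storage.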

II. Deployment Nodes

IP	Components
172.25.230.56	prometheus server1, alertmanager1, thanos-receive1, thanos-store1, thanos-query1, thanos-rule1, grafana
172.25.230.57	prometheus server2, alertmanager2, thanos-receive2, thanos-store2, thanos-query2, thanos-rule2, thanos-compact, minio
172.25.230.58	nginx

III. MinIO Deployment

1. Download the MinIO binary

Download page: https://www.minio.org.cn/download.shtml

Choose the build that matches your platform

2. Deploy the binary

Directory: /opt/minio

Create the service user:

]# useradd minio

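Create the directories:

]# mkdir -p /opt/minio/{data,logs,script}

data: data directory

logs: log directory

script: script directory

Upload the minio binary to /opt/minio. The directories above were created as root, so make sure the minio user owns /opt/minio (for example with chown -R minio:minio /opt/minio) before starting the service.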

3. Starting and stopping MinIO

Manual start and stop

Start MinIO:

]# su - minio
]$ cd /opt/minio/
]$ ./minio server --address ":9500" --console-address ":9600"  ./data/ >> logs/minio.log 2>&1 &

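--address: the S3 API port that the monitoring components (Thanos in this guide) use to store and query data in MinIO

--console-address: the port of the MinIO web console

Stop MinIO:

]# su - minio
## Find the PID of the minio process
]$ ps -ef | grep minio | grep -v grep

## Terminate the minio process (replace $pid with the PID found above)
]$ kill -9 $pid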

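4. Configure MinIO

Open the MinIO console: http://172.25.230.57:9600

The default credentials are minioadmin/minioadmin

Create a MinIO bucket

[Buckets] ==> [Create Bucket]

Enter the bucket name in [Bucket Name] and click [Create Bucket] (the Thanos configuration later in this guide uses the bucket "prometheus-data")

Create Access Keys, the credentials Thanos uses to connect to MinIO

[Access Keys] ==> [Create Access Key]

Set both the Access Key and the Secret Key to "monitor@123", then click [Create]

The Access Key has now been created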

III. Prometheus Server Deployment

1. Download the package

Download the Prometheus release that matches your target version

Download page: https://github.com/prometheus/prometheus/releases/tag/v2.55.1

2. Deploy the package

Directory: /opt/prometheus

Create the service user

]# useradd prometheus

Extract the package

]# tar xvf prometheus-2.55.1.linux-amd64.tar.gz -C /opt
]# cd /opt
]# mv prometheus-2.55.1.linux-amd64 prometheus

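Create the directories

]# cd /opt/prometheus
]# mkdir {logs,config,script}

logs: log directory

config: configuration directory

script: script directory

Set ownership of the prometheus directory (done after creating the subdirectories, so that they belong to the prometheus user as well)

]# chown -R prometheus:prometheus /opt/prometheus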

3. Prometheus configuration files

Main configuration file: prometheus.yml

Move prometheus.yml into the config directory

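global:
  ## How often Prometheus scrapes its targets
  scrape_interval: 1m
  ## How often Prometheus evaluates rules, that is, how often alerting rules are computed and alert states updated
  evaluation_interval: 1m
  ## Timeout for scraping a single target; a scrape that returns no data within this time is treated as timed out. Increase it if the network is slow, but it must not exceed scrape_interval
  scrape_timeout: 1m
  ## Labels attached by this Prometheus instance. The replica label identifies the node within the Prometheus cluster and must differ on every node: 0 on the first node, 1 on the second, and so on
  external_labels:
    env: dev
    replica: 0

## Scrape targets; explained later in the node_exporter section
scrape_configs:
  - job_name: "node-exporter"      ## Job name, user-defined
    file_sd_configs:
      - files:
          - 'node/*.yml'           ## Path of the target files listing the monitored servers (node-exporter)
    basic_auth:                    ## node-exporter username and password
      username: 'exporter'
      password: '123456'
    relabel_configs:
      ## Strip the node-exporter port so that the instance label shows only the server IP
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:\d+)?'
        replacement: '$1'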

Prometheus password file: web.yml

Place web.yml in the config directory

basic_auth_users:
  ## Username and its bcrypt password hash
  admin: ************************************************************

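For how to generate the password hash, see "Appendix: generating the password hash"

4. Starting and stopping Prometheus

Manual start and stop

Start the Prometheus service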

]# su - prometheus
]$ cd /opt/prometheus
]$ nohup ./prometheus \
--config.file=config/prometheus.yml \
--web.config.file=config/web.yml \
--web.enable-lifecycle \
--storage.tsdb.retention.time=15d \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.max-block-duration=2h

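--config.file: path of the Prometheus configuration file

--web.config.file: path of the Prometheus web configuration file, which holds the users and passwords for HTTP access

--web.enable-lifecycle: enables the lifecycle HTTP endpoints, so the configuration and rule files can be reloaded at runtime (POST /-/reload)

--storage.tsdb.retention.time: how long Prometheus keeps time-series data (TSDB) locally; 15 days here

--storage.tsdb.min-block-duration: minimum duration of a TSDB block; 2h here

--storage.tsdb.max-block-duration: maximum duration of a TSDB block; 2h here

In a Thanos high-availability deployment, "--storage.tsdb.min-block-duration" and "--storage.tsdb.max-block-duration" are usually both set to 2h, the default minimum block duration, which disables local compaction. "--storage.tsdb.retention.time" can also be kept short, because long-term data lives in object storage; this guide keeps 15d locally.

Stop the Prometheus service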

]# su - prometheus

## Find the PID of the prometheus process
]$ ps -ef | grep prometheus | grep -v grep

## Terminate the prometheus process (replace $pid with the PID found above)
]$ kill -9 $pid
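
kill -9 ends the process immediately and gives it no chance to shut down cleanly. A minimal sketch of a gentler stop, assuming bash: it sends SIGTERM first, waits for the process to exit, and uses SIGKILL only as a last resort. The function name stop_gracefully and its timeout argument are hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to exit with SIGTERM, wait up to
# <timeout> seconds, and fall back to SIGKILL only if it is still alive.
# Usage: stop_gracefully <pid> [timeout-seconds]
stop_gracefully() {
  local pid="$1" timeout="${2:-30}" waited=0
  kill -0 "$pid" 2>/dev/null || return 0   # process is already gone
  kill -TERM "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null        # last resort after the timeout
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example (assumed usage): stop_gracefully "$pid" 30
```

The return code tells the caller whether the process exited on its own (0) or had to be killed (1), which a stop script could log.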

Script-based start and stop

Start script: prometheus_start.sh

#!/bin/bash

pro_dir='/opt/prometheus'
user='prometheus'

ps -ef | grep -w "${pro_dir}/prometheus" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: prometheus server running"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: execute user not ${user}"
  exit 10
fi

if [ ! -f "${pro_dir}/prometheus" ];then
  echo "error: prometheus binary file not exist"
  exit 10
fi

cd ${pro_dir}

start_para=`cat ${pro_dir}/script/running_parameters`
value=()

for i in ${start_para};do
  value+=($i)
done

all_value=`echo ${value[*]}`

nohup ${pro_dir}/prometheus \
--config.file=config/prometheus.yml \
${all_value} >> logs/prometheus.log 2>&1 &

sleep 2

ps -ef | grep -w "${pro_dir}/prometheus" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: prometheus server start"
else
  echo "error: prometheus server start failed"
  exit 10
fi

Stop script: prometheus_stop.sh

#!/bin/bash

pro_dir='/opt/prometheus'
user='prometheus'

pid=`ps -ef | grep -w "${pro_dir}/prometheus" | grep -v grep | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: prometheus server not running"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
  echo "error: execute user not ${user}"
  exit 10
fi

kill $pid
sleep 2
ps -ef | grep -w "${pro_dir}/prometheus" | grep -v grep &>/dev/null
if [ $? -ne 0 ];then
  echo "success: prometheus process exit"
else
  echo "error: prometheus process exit failed"
  exit 10
fi

Reload script: prometheus_reload.sh

#!/bin/bash

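## Assumption: Prometheus listens on port 19090 (set with --web.listen-address); the default port is 9090
host=127.0.0.1
port=19090

read -p 'please enter user: ' user
read -s -p 'please enter password: ' pass
echo ''

if [ -n "$user" ] && [ -n "$pass" ];then
  authorize="-u ${user}:${pass}"
fi

## -f makes curl return a non-zero exit code on HTTP errors such as 401
curl -f ${authorize} -XPOST http://${host}:${port}/-/reload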
if [ $? -eq 0 ];then
  echo "success: prometheus server reload"
else
  echo "error: prometheus server reload failed"
fi

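When run, the script prompts for the Prometheus username and password; if no account is configured, just press Enter at both prompts

Prometheus start parameter file: running_parameters

Lists the command-line flags passed to Prometheus at startup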

--web.config.file=config/web.yml 
--web.enable-lifecycle
--log.level=debug
--storage.tsdb.retention.time=15d
--storage.tsdb.min-block-duration=2h 
--storage.tsdb.max-block-duration=2h

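Upload the scripts and the parameter file to the script directory. To start or stop the service, switch to the prometheus user and run the corresponding script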
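
The start script expands the parameter file through unquoted word splitting, which works as long as no flag contains spaces. A minimal sketch of an alternative, assuming bash: it reads one flag per line into an array and skips blank lines and lines starting with '#'. The function name load_params and the array name PARAMS are hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: read one flag per line from a parameter file into the
# global array PARAMS, trimming surrounding whitespace and skipping blank
# lines and comment lines.
# Usage: load_params <file>
load_params() {
  local file="$1" line
  PARAMS=()
  [ -f "$file" ] || return 1
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line#"${line%%[![:space:]]*}"}"   # trim leading whitespace
    line="${line%"${line##*[![:space:]]}"}"   # trim trailing whitespace
    [ -z "$line" ] && continue
    case "$line" in
      \#*) continue ;;
    esac
    PARAMS+=("$line")
  done < "$file"
}

# Example (assumed usage inside a start script):
# load_params script/running_parameters
# nohup ./prometheus --config.file=config/prometheus.yml "${PARAMS[@]}" >> logs/prometheus.log 2>&1 &
```

Passing "${PARAMS[@]}" quoted keeps every flag as a single argument, even if a value contains spaces or glob characters.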

5. Prometheus directory layout

├── config                      Prometheus configuration directory
│   ├── prometheus.yml          main Prometheus configuration file
│   └── web.yml                 web configuration file (users, passwords, TLS certificates)
├── console_libraries
├── consoles
├── data                        Prometheus data directory
├── LICENSE
├── logs                        Prometheus log directory
│   └── prometheus.log          Prometheus runtime log
├── NOTICE
├── prometheus
├── promtool
├── rules                       Prometheus rule file directory
│   └── system.yml              alerting threshold rules
└── script                      Prometheus script directory
    ├── prometheus_reload.sh    reload script
    ├── prometheus_start.sh     start script
    ├── prometheus_stop.sh      stop script
    └── running_parameters      start parameters

IV. Alertmanager Deployment

1. Download the package

Download the Alertmanager release that matches your target version

Download page: https://github.com/prometheus/alertmanager/releases/tag/v0.26.0

2. Deploy the package

Directory: /opt/alertmanager

Create the service user

]# useradd alertmanager

Extract the package

]# tar xvf alertmanager-0.26.0.linux-amd64.tar.gz -C /opt
]# cd /opt
]# mv alertmanager-0.26.0.linux-amd64 alertmanager

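Create the directories

]# cd /opt/alertmanager
]# mkdir {logs,config,script}

logs: log directory

config: configuration directory

script: script directory

Set ownership of the alertmanager directory (done after creating the subdirectories, so that they belong to the alertmanager user as well)

]# chown -R alertmanager:alertmanager /opt/alertmanager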

3. Alertmanager configuration files

Main configuration file: alertmanager.yml

Move alertmanager.yml into the config directory

route:
  ## Labels used to group alerts; here alerts are grouped by alertname and instance
  group_by: ['alertname','instance']
  ## How long to wait before sending the first notification for a new group; alerts that fire during group_wait are added to the same group
  group_wait: 30s
  ## How long to wait before sending a notification about new alerts added to a group that has already been notified
  group_interval: 10s
  ## How long to wait before re-sending a notification for alerts that are still firing
  repeat_interval: 1h
  ## Name of the receiver that gets the alerts; here they go to the receiver named "email-receiver"
  receiver: 'email-receiver'

## Location of the notification template files
templates:
  - '/opt/alertmanager/templates/*.tmpl'

## Receivers
receivers:
  ## Receiver named email-receiver
  - name: 'email-receiver'
    ## Email notification settings
    email_configs:
      ## Address the notifications are sent to; multiple recipients can be given, separated by commas
      - to: 'your_email@example.com'
        ## Sender address of the notification; it may be shown in the recipient's mail client
        from: 'alertmanager@example.com'
        ## Whether to send a mail when the alert is resolved
        send_resolved: true
        ## Mail headers
        headers:
          ## Mail subject, taken from the "email.default.subject" template
          Subject: '{{ template "email.default.subject" . }}'
        ## Mail body, taken from the "email.to.html" template
        html: '{{ template "email.to.html" . }}'
        ## SMTP server address and port
        smarthost: 'smtp.example.com:587'
        ## Username and password for SMTP authentication
        auth_username: 'your_smtp_username'
        auth_password: 'your_smtp_password'
        ## Whether TLS is required for the SMTP connection; when true, Alertmanager encrypts the connection with TLS
        require_tls: true
        tls_config:
          ## Whether to skip verification of the SMTP server certificate
          insecure_skip_verify: false

Alertmanager password file: web.yml

Place web.yml in the config directory

basic_auth_users:
  ## Username and its bcrypt password hash
  alert: ************************************************************

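For how to generate the password hash, see "Appendix: generating the password hash"

4. Starting and stopping Alertmanager

Manual start and stop

Start the Alertmanager service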

]# su - alertmanager
]$ cd /opt/alertmanager
]$ nohup ./alertmanager \
--config.file=config/alertmanager.yml \
--web.external-url=http://172.25.230.56:9093 \
--cluster.advertise-address=172.25.230.56:9094 \
--cluster.peer=172.25.230.57:9094 \
--web.config.file=config/web.yml

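--web.external-url: the externally reachable URL of Alertmanager, used to generate relative and absolute links back to Alertmanager. If the URL has a path component, that path prefixes all HTTP endpoints served by Alertmanager

--cluster.advertise-address: the address this node advertises to the Alertmanager cluster

--cluster.peer: the address of another node in the Alertmanager cluster; the flag can be repeated

--web.config.file: path of the Alertmanager web configuration file, which holds the users and passwords for HTTP access

Script-based start and stop

Start script: alertmanager_start.sh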

#!/bin/bash

script_dir=$(dirname "$(readlink -f "$0")")
alert_dir=$(dirname "$script_dir")
user='alertmanager'
log_dir="logs/alertmanager.log"

ps -ef | grep -w "${alert_dir}/alertmanager" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!Alertmanager server running!!!"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${alert_dir}/alertmanager" ];then
  echo "error: !!!Alertmanager binary file not exist!!!"
  exit 10
fi

cd ${alert_dir}

start_para=`cat ${alert_dir}/script/running_paraments`

nohup ${alert_dir}/alertmanager \
--config.file=config/alertmanager.yml \
${start_para} >> ${log_dir} 2>&1 &

sleep 2

ps -ef | grep -w "${alert_dir}/alertmanager" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Alertmanager server start"
else
  echo "error: Alertmanager server start failed"
  exit 10
fi

Stop script: alertmanager_stop.sh

#!/bin/bash

script_dir=$(dirname "$(readlink -f "$0")")
alert_dir=$(dirname "$script_dir")
user='alertmanager'
log_dir="logs/alertmanager.log"

pid=`ps -ef | grep -w "${alert_dir}/alertmanager" | grep -v grep | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!Alertmanager server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

kill $pid
sleep 2
ps -ef | grep -w "${alert_dir}/alertmanager" | grep -v grep &>/dev/null
if [ $? -ne 0 ];then
  echo "success: !!!Alertmanager process exit!!!"
else
  echo "error: !!!Alertmanager process exit failed!!!"
  exit 10
fi

Reload script for the Alertmanager service: alertmanager_reload.sh

#!/bin/bash


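## Assumption: Alertmanager listens on port 19093 (set with --web.listen-address); the default port is 9093
host=127.0.0.1
port=19093

read -p 'please enter user: ' user
read -s -p 'please enter password: ' pass
echo ''

if [ -n "$user" ] && [ -n "$pass" ];then
  authorize="-u ${user}:${pass}"
fi

## -f makes curl return a non-zero exit code on HTTP errors such as 401
curl -f ${authorize} -XPOST http://${host}:${port}/-/reload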
if [ $? -eq 0 ];then
  echo "success: Alertmanager server reload"
else
  echo "error: Alertmanager server reload failed"
fi

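When run, the script prompts for the Alertmanager username and password; if no account is configured, just press Enter at both prompts

Alertmanager start parameter file: running_paraments

Lists the command-line flags passed to Alertmanager at startup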

--web.external-url=http://172.25.230.56:9093
--cluster.advertise-address=172.25.230.56:9094
--cluster.peer=172.25.230.57:9094
--log.level=debug
--web.config.file=config/web.yml

Upload the scripts and the parameter file to the script directory. To start or stop the service, switch to the alertmanager user and run the corresponding script

5. Alertmanager directory layout

├── alertmanager
├── amtool
├── config
│   ├── alertmanager.yml
│   └── web.yml
├── data
├── LICENSE
├── logs
│   └── alertmanager.log
├── NOTICE
├── script
│   ├── alertmanager_reload.sh
│   ├── alertmanager_start.sh
│   ├── alertmanager_stop.sh
│   └── running_paraments
└── templates
    └── end.tmpl

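V. Thanos Deployment

Thanos download page: https://github.com/thanos-io/thanos/releases/tag/v0.37.0

1. thanos-receive deployment

(1) Deploy the package

Create the service user

]# useradd receive
]# mkdir /opt/thanos-receive

Extract the package (--strip-components=1 drops the top-level directory of the archive, so the thanos binary lands directly in /opt/thanos-receive)

]# tar xvf thanos-0.37.0.linux-amd64.tar.gz --strip-components=1 -C /opt/thanos-receive

Create the directories

]# cd /opt/thanos-receive
]# mkdir {logs,script,config}

(2) Configure thanos-receive

Object storage connection file: config/receive-objstore.yml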

type: S3
config:
  bucket: "prometheus-data"
  endpoint: "172.25.230.57:9500"
  access_key: "monitor@123"
  insecure: true
  secret_key: "monitor@123"

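bucket: the bucket created in MinIO

endpoint: the MinIO address

access_key: the access key configured in MinIO

secret_key: the secret key configured in MinIO

Configure the other node

Copy the thanos-receive directory to the other node

Set ownership

On both hosts, set ownership of the thanos-receive directory

]# chown -R receive:receive /opt/thanos-receive

(3) Starting and stopping thanos-receive

Manual start and stop

Start thanos-receive

Run the following on each of the two thanos-receive nodes: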

]# su - receive
]$ cd /opt/thanos-receive
]$ ./thanos receive \
--tsdb.path=./data \
--grpc-address=0.0.0.0:10906 \
--http-address=0.0.0.0:10907 \
--tsdb.retention=2h \
--label=replica="1" \
--label=receive_cluster="receiver1" \
--remote-write.address=0.0.0.0:10908 \
--receive.tenant-label-name=monitortenant \
--objstore.config-file=config/receive-objstore.yml \
--log.level=debug

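--tsdb.path: local data path of thanos-receive

--grpc-address: gRPC listen address of thanos-receive

--http-address: HTTP listen address of thanos-receive

--tsdb.retention: how long thanos-receive keeps data locally

--label: labels attached by thanos-receive; here "replica" is the node number and "receive_cluster" is the cluster name

--remote-write.address: the address that accepts remote write requests from Prometheus

The "--label=replica" value must differ between the two nodes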
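
Because the replica value has to be unique per node, it can help to derive it from a shared node list rather than edit it by hand on each host. A minimal sketch, assuming bash; replica_index is a hypothetical helper, not part of Thanos, and the node list in the example is the two receive hosts from this guide.

```shell
#!/bin/bash
# Hypothetical helper: print the zero-based position of an IP in a node list.
# Usage: replica_index <this-node-ip> <node-ip>...
replica_index() {
  local self="$1" idx=0 node
  shift
  for node in "$@"; do
    if [ "$node" = "$self" ]; then
      echo "$idx"
      return 0
    fi
    idx=$((idx + 1))
  done
  return 1   # this node is not in the list
}

# Example (assumed wiring): build the flag line for the parameter file
# echo "--label=replica=\"$(replica_index 172.25.230.57 172.25.230.56 172.25.230.57)\""
```

Keeping the same ordered list on every node means each host computes a distinct value without manual bookkeeping.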

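Stop thanos-receive

]# su - receive

## Find the PID of the thanos-receive process
]$ ps -ef | grep 'thanos receive' | grep -v grep

## Terminate the thanos-receive process (replace $pid with the PID found above)
]$ kill -9 $pid

Script-based start and stop

Start script: receive_start.sh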

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='receive'
module='receive'
log_dir="logs/thanos-${module}.log"


ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!Thanos-${module} server running!!!"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${thanos_dir}/thanos" ];then
  echo "error: !!!Thanos binary file not exist!!!"
  exit 10
fi

cd ${thanos_dir}

start_para=`cat ${thanos_dir}/script/running_paraments`


if [ -n "${start_para}" ];then
  nohup ${thanos_dir}/thanos ${module} \
  ${start_para} >> ${log_dir} 2>&1 &
else
  nohup ${thanos_dir}/thanos ${module} >> ${log_dir}  2>&1 &
fi

sleep 2

ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Thanos-${module} server start"
else
  echo "error: !!!Thanos-${module} server start failed!!!"
  exit 10
fi

Stop script: receive_stop.sh

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='receive'
module='receive'
log_dir="logs/thanos-${module}.log"

pid=`ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!Thanos-${module} server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
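  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

## SIGTERM (the default signal) lets Thanos shut down gracefully; SIGQUIT makes the Go runtime dump its goroutines and exit
kill $pid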
sleep 2
ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: Thanos-${module} process exit"
else
  echo "error: !!!Thanos-${module} process exit failed!!!"
  exit 10
fi

thanos-receive start parameter file: running_paraments (the file name the start script reads)

Lists the command-line flags passed to thanos-receive at startup

--tsdb.path=./data 
--grpc-address=0.0.0.0:10906 
--http-address=0.0.0.0:10907 
--tsdb.retention=1h
--label=replica="0" 
--label=receive_cluster="receiver1" 
--remote-write.address=0.0.0.0:10908 
--receive.tenant-label-name=monitortenant
--objstore.config-file=config/receive-objstore.yml
--log.level=debug

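Upload the scripts and the parameter file to the script directory. To start or stop the service, switch to the receive user and run the corresponding script

(4) Configure Prometheus

Configure Prometheus to send its data to thanos-receive

config/prometheus.yml

remote_write:
  ## thanos-receive address that Prometheus writes to (172.25.230.58 is the nginx host in the node table; it is assumed to forward port 10908 to the receive nodes)
  - url: http://172.25.230.58:10908/api/v1/receive
    ## Timeout for requests to thanos-receive
    remote_timeout: 60s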

(5) thanos-receive directory layout

├── config                     thanos-receive configuration directory
│   └── receive-objstore.yml
├── data                       thanos-receive data directory
├── logs                       thanos-receive log directory
│   └── thanos-receive.log
├── script                     thanos-receive script directory
│   ├── receive_start.sh
│   ├── receive_stop.sh
│   └── running_paraments
└── thanos

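2. thanos-store deployment

(1) Deploy the package

Create the service user

]# useradd store
]# mkdir /opt/thanos-store

Extract the package (--strip-components=1 places the thanos binary directly in the target directory)

]# tar xvf thanos-0.37.0.linux-amd64.tar.gz --strip-components=1 -C /opt/thanos-store

Create the directories

]# cd /opt/thanos-store
]# mkdir {config,logs,script}

(2) Configure thanos-store

Object storage connection file: config/store-objstore.yml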

  type: S3
  config:
    bucket: "prometheus-data"
    endpoint: "172.25.230.57:9500"
    access_key: "monitor@123"
    insecure: true
    secret_key: "monitor@123"

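Configure the other node

Copy the thanos-store directory to the other node

On both hosts, set ownership of the thanos-store directory

]# chown -R store:store /opt/thanos-store

(3) Starting and stopping thanos-store

Manual start and stop

Start thanos-store

Run the following on both thanos-store nodes:

]# su - store
]$ cd /opt/thanos-store
]$ nohup ./thanos store \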
--data-dir=./data \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:10902 \
--objstore.config-file=config/store-objstore.yml >> logs/thanos-store.log 2>&1 &

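Stop thanos-store

]# su - store

## Find the PID of the thanos-store process
]$ ps -ef | grep 'thanos store' | grep -v grep

## Terminate the thanos-store process (replace $pid with the PID found above)
]$ kill -9 $pid

Script-based start and stop

Start script: store_start.sh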

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='store'
module='store'
log_dir="logs/thanos-${module}.log"


ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!Thanos-${module} server running!!!"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${thanos_dir}/thanos" ];then
  echo "error: !!!Thanos binary file not exist!!!"
  exit 10
fi

cd ${thanos_dir}

start_para=`cat ${thanos_dir}/script/running_paraments`


if [ -n "${start_para}" ];then
  nohup ${thanos_dir}/thanos ${module} \
  ${start_para} >> ${log_dir} 2>&1 &
else
  nohup ${thanos_dir}/thanos ${module} >> ${log_dir}  2>&1 &
fi

sleep 2

ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Thanos-${module} server start"
else
  echo "error: !!!Thanos-${module} server start failed!!!"
  exit 10
fi

Stop script: store_stop.sh

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='store'
module='store'
log_dir="logs/thanos-${module}.log"

pid=`ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!Thanos-${module} server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
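  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

## SIGTERM (the default signal) lets Thanos shut down gracefully; SIGQUIT makes the Go runtime dump its goroutines and exit
kill $pid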
sleep 2
ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: Thanos-${module} process exit"
else
  echo "error: !!!Thanos-${module} process exit failed!!!"
  exit 10
fi

thanos-store start parameter file: running_paraments (the file name the start script reads)

Lists the command-line flags passed to thanos-store at startup

--data-dir=./data
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--request.logging-config-file=logs/request-logging.json
--objstore.config-file=config/store-objstore.yml

(4) Open the thanos-store page

Open the thanos-store page on each of the two nodes

http://172.25.230.56:10902

http://172.25.230.57:10902

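3. thanos-compact deployment

Note: thanos-compact should be deployed on a single node only

(1) Deploy the package

Create the service user

]# useradd compact
]# mkdir /opt/thanos-compact

Extract thanos (--strip-components=1 places the thanos binary directly in the target directory)

]# tar xvf thanos-0.37.0.linux-amd64.tar.gz --strip-components=1 -C /opt/thanos-compact

Create the directories

]# cd /opt/thanos-compact
]# mkdir {config,logs,script}

(2) Configure thanos-compact

Object storage connection file: config/compact-objstore.yml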

type: S3
config:
  bucket: "prometheus-data"
  endpoint: "172.25.230.57:9500"
  access_key: "monitor@123"
  insecure: true
  secret_key: "monitor@123"

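Set ownership of the thanos-compact directory

]# chown -R compact:compact /opt/thanos-compact

(3) Starting and stopping thanos-compact

Manual start and stop

Start thanos-compact

Run the following on the thanos-compact node (--wait keeps the compactor running and re-checking the bucket instead of exiting after one pass; the start script below passes it as well):

]# su - compact
]$ cd /opt/thanos-compact
]$ ./thanos compact \
--wait \
--http-address=0.0.0.0:10905 \
--data-dir=./data \
--objstore.config-file=config/compact-objstore.yml \
--log.level=debug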

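Stop thanos-compact

]# su - compact

## Find the PID of the thanos-compact process
]$ ps -ef | grep 'thanos compact' | grep -v grep

## Terminate the thanos-compact process (replace $pid with the PID found above)
]$ kill -9 $pid

Script-based start and stop

Start script: compact_start.sh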

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='compact'
module='compact'
log_dir="logs/thanos-${module}.log"


ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!Thanos-${module} server running!!!"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${thanos_dir}/thanos" ];then
  echo "error: !!!Thanos binary file not exist!!!"
  exit 10
fi

cd ${thanos_dir}

start_para=`cat ${thanos_dir}/script/running_paraments`


if [ -n "${start_para}" ];then
  nohup ${thanos_dir}/thanos ${module} \
  --wait \
  ${start_para} >> ${log_dir} 2>&1 &
else
  nohup ${thanos_dir}/thanos ${module} \
  --wait >> ${log_dir}  2>&1 &
fi

sleep 2

ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Thanos-${module} server start"
else
  echo "error: !!!Thanos-${module} server start failed!!!"
  exit 10
fi

Stop script: compact_stop.sh

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='compact'
module='compact'
log_dir="logs/thanos-${module}.log"

pid=`ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!Thanos-${module} server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
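  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

## SIGTERM (the default signal) lets Thanos shut down gracefully; SIGQUIT makes the Go runtime dump its goroutines and exit
kill $pid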
sleep 2
ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: Thanos-${module} process exit"
else
  echo "error: !!!Thanos-${module} process exit failed!!!"
  exit 10
fi

thanos-compact start parameter file: running_paraments (the file name the start script reads)

Lists the command-line flags passed to thanos-compact at startup

--http-address=0.0.0.0:10905 
--data-dir=./data 
--objstore.config-file=config/compact-objstore.yml
--log.level=debug

(4) Open the thanos-compact page

Open the page of the single thanos-compact node

http://172.25.230.57:10905

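4. thanos-query deployment

(1) Deploy the package

Create the service user

]# useradd query
]# mkdir /opt/thanos-query

Extract thanos (--strip-components=1 places the thanos binary directly in the target directory)

]# tar xvf thanos-0.37.0.linux-amd64.tar.gz --strip-components=1 -C /opt/thanos-query

Create the directories

]# cd /opt/thanos-query
]# mkdir {config,logs,script}

(2) Configure thanos-query

Endpoint file listing the data sources (thanos-store and thanos-receive): config/endpoint.yml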

- targets:
   ## gRPC ports of thanos-receive
   - 172.25.230.56:10906
   - 172.25.230.57:10906

   ## gRPC ports of thanos-store
   - 172.25.230.56:10901
   - 172.25.230.57:10901

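Configure the other node

Copy the thanos-query directory to the other node

On both hosts, set ownership of the thanos-query directory

]# chown -R query:query /opt/thanos-query

(3) Starting and stopping thanos-query

Manual start and stop

Start thanos-query

Run the following on both thanos-query nodes: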

]# su - query
]$ cd /opt/thanos-query
]$ ./thanos query \
--grpc-address=0.0.0.0:10903 \
--http-address 0.0.0.0:10904  \
--query.replica-label=replica  \
--store.sd-files=./config/endpoint.yml \
--log.level=debug

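--query.replica-label: the label used for deduplication at query time. The two Prometheus nodes scrape the same targets, so without deduplication every series would show up twice. With "replica" configured here, series whose labels are identical apart from the replica label are merged into one.

--store.sd-files: the file listing the data sources to connect to

Stop thanos-query

]# su - query

## Find the PID of the thanos-query process
]$ ps -ef | grep 'thanos query' | grep -v grep

## Terminate the thanos-query process (replace $pid with the PID found above)
]$ kill -9 $pid

Script-based start and stop

Upload the scripts and the start parameter file to the script directory

Start script: query_start.sh

Stop script: query_stop.sh

thanos-query start parameter file: running_parameters

Lists the command-line flags passed to thanos-query at startup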

--grpc-address=0.0.0.0:10903 
--http-address 0.0.0.0:10904 
--query.replica-label=replica 
--store.sd-files=./config/endpoint.yml
--log.level=debug

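(4) Open the thanos-query page

Open the thanos-query page on each of the two nodes

http://172.25.230.56:10904

http://172.25.230.57:10904

(5) Configure load balancing

nginx load-balances across the two thanos-query nodes and serves as the single entry point to the thanos-query page, so that Grafana, when it later connects to thanos-query to display monitoring data, fails over automatically.

Main nginx configuration file: conf/nginx.conf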

http {
    ## Access log format
    log_format serform '$remote_addr - $remote_user [$time_local] '
                            '"$request" "$status" "$body_bytes_sent"'
                            '"$http_referer" "$http_user_agent" "$http_x_forwarded_for"'
                            '"$upstream_addr" "$request_time" "$upstream_response_time"'
                            '"$upstream_status"';

    ## Included per-site configuration files
    include vhosts/*.conf;
}

Per-site configuration file: vhosts/thanos-query.conf

upstream query {
        ## thanos-query nodes to balance across
        server 172.25.230.56:10904;
        server 172.25.230.57:10904;
}

server {
        ## Listen port
        listen 10904;
        add_header X-Frame-Options SAMEORIGIN;
        proxy_set_header Host $host:$server_port;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        ## Access and error log paths
        access_log  /opt/nginx/log/access_query.log serform;
        error_log   /opt/nginx/log/error_query.log;

        location / {
            proxy_pass http://query;
        }
}

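After the configuration is complete, reload nginx

]# /opt/nginx/sbin/nginx -s reload

Open the thanos-query page through the load balancer address

http://172.25.230.58:10904

The nginx access log shows requests being distributed to both thanos-query nodes.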

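5. thanos-rule deployment

(1) Deploy the package

Create the service user

]# useradd ruler
]# mkdir /opt/thanos-rule

Extract thanos (--strip-components=1 places the thanos binary directly in the target directory)

]# tar xvf thanos-0.37.0.linux-amd64.tar.gz --strip-components=1 -C /opt/thanos-rule

Create the directories

]# cd /opt/thanos-rule
]# mkdir {config,logs,script}

(2) Configure thanos-rule

Object storage connection file: config/rule-objstore.yml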

  type: S3
  config:
    bucket: "prometheus-data"
    endpoint: "172.25.230.57:9500"
    access_key: "monitor@123"
    insecure: true
    secret_key: "monitor@123"

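Alertmanager connection file: config/alert-send.yml

alertmanagers:
- http_config:
    ## Username and password for Alertmanager
    basic_auth:
      username: "alert"
      password: "alert123"
  ## Alertmanager addresses (the port must match the port Alertmanager actually listens on; its default is 9093)
  static_configs: ["172.25.230.56:19093","172.25.230.57:19093"]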

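Configure the other node

Copy the thanos-rule directory to the other node

On both hosts, set ownership of the thanos-rule directory

]# chown -R ruler:ruler /opt/thanos-rule

(3) Starting and stopping thanos-rule

Manual start and stop

Start thanos-rule

Run the following on both thanos-rule nodes:

]# su - ruler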
]$ cd /opt/thanos-rule
]$ ./thanos rule \
--grpc-address=0.0.0.0:10909 \
--http-address=0.0.0.0:10910  \
--data-dir=./data  \
--eval-interval=30s  \
--rule-file=./rules/*.yml  \
--alert.query-url=http://172.25.230.58:10904  \
--alertmanagers.config-file=config/alert-send.yml  \
--query=http://172.25.230.58:10904  \
--objstore.config-file=config/rule-objstore.yml  \
--log.level=debug  \
--label=monitor_cluster="rule-cluster" \
--label=replica="0"  \
--alert.label-drop=replica

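--eval-interval: how often the configured alerting rules are evaluated; every 30s here

--rule-file: path of the alerting rule files

--alert.query-url: the query address used in links back to the data behind a fired alert; set to the load-balanced thanos-query entry point

--alertmanagers.config-file: path of the Alertmanager connection file

--query: the thanos-query address the rules are evaluated against

--label=monitor_cluster: name of the thanos-rule cluster, user-defined

--label=replica: node number within the thanos-rule cluster, user-defined; it only has to differ between nodes

--alert.label-drop: labels removed from an alert before it is sent to Alertmanager. Every thanos-rule node in the cluster sends the same alert, and Alertmanager deduplicates alerts whose labels are identical. Dropping "replica" here means the alerts sent to Alertmanager do not carry the replica label, so the copies from the different nodes match

Stop thanos-rule

]# su - ruler

## Find the PID of the thanos-rule process
]$ ps -ef | grep 'thanos rule' | grep -v grep

## Terminate the thanos-rule process (replace $pid with the PID found above)
]$ kill -9 $pid

Script-based start and stop

Upload the scripts and the start parameter file to the script directory

Start script: rule_start.sh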

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='ruler'
module='rule'
log_dir="logs/thanos-${module}.log"


ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!Thanos-${module} server running!!!"
  exit 10
fi


if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${thanos_dir}/thanos" ];then
  echo "error: !!!Thanos binary file not exist!!!"
  exit 10
fi

cd ${thanos_dir}

start_para=`cat ${thanos_dir}/script/running_paraments | grep -v '^#'`


if [ -n "${start_para}" ];then
  nohup ${thanos_dir}/thanos ${module} \
  ${start_para} >> ${log_dir} 2>&1 &
else
  nohup ${thanos_dir}/thanos ${module} >> ${log_dir}  2>&1 &
fi

sleep 2

ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Thanos-${module} server start"
else
  echo "error: !!!Thanos-${module} server start failed!!!"
  exit 10
fi

Stop script: rule_stop.sh

#!/bin/bash

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
thanos_dir=$(dirname "$SCRIPT_DIR")
user='ruler'
module='rule'
log_dir="logs/thanos-${module}.log"

pid=`ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!Thanos-${module} server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

#kill -s QUIT $pid
kill $pid
sleep 2
ps -ef | grep -w "${thanos_dir}/thanos[[:space:]]*${module}" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: Thanos-${module} process exit"
else
  echo "error: !!!Thanos-${module} process exit failed!!!"
  exit 10
fi

Reload script: rule_reload.sh

#!/bin/bash

host=127.0.0.1
port=10910

## -f makes curl return a non-zero exit code on HTTP errors
curl -f -XPOST http://${host}:${port}/-/reload
if [ $? -eq 0 ];then
  echo "success: thanos-rule server reload"
else
  echo "error: thanos-rule server reload failed"
fi

thanos-rule start parameter file: running_paraments (the file name the start script reads)

Lists the command-line flags passed to thanos-rule at startup

--grpc-address=0.0.0.0:10909 
--http-address=0.0.0.0:10910 
--data-dir=./data 
--eval-interval=30s 
--rule-file=./rules/*.yml 
--alert.query-url=http://172.25.230.58:10904 
--alertmanagers.config-file=config/alert-send.yml
--query=http://172.25.230.58:10904
--objstore.config-file=config/rule-objstore.yml
--log.level=debug
--label=monitor_cluster="rule-cluster"
--label=replica="0"
--alert.label-drop=replica

(4) Connect thanos-query to thanos-rule (optional)

Connecting thanos-query to thanos-rule makes the alerts fired by thanos-rule visible in thanos-query

Add the thanos-rule addresses to config/endpoint.yml of thanos-query

- targets:
   ## gRPC ports of thanos-rule
   - 172.25.230.56:10909
   - 172.25.230.57:10909

(5) Open the thanos-rule page

Open the thanos-rule page on each of the two nodes

http://172.25.230.56:10910

http://172.25.230.57:10910
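
Besides opening each page in a browser, readiness can be checked from the command line. A minimal sketch, assuming bash and curl, and assuming each component answers the /-/ready probe on its HTTP port; check_ready is a hypothetical helper, and the name=host:port pairs in the example reuse the HTTP ports from this guide.

```shell
#!/bin/bash
# Hypothetical helper: probe the readiness endpoint of each "name=host:port"
# argument and print one status line per component.
# Returns 0 only if every component answered successfully.
check_ready() {
  local failed=0 item name addr
  for item in "$@"; do
    name="${item%%=*}"
    addr="${item#*=}"
    if curl -sf -o /dev/null --max-time 5 "http://${addr}/-/ready"; then
      echo "ready: ${name} (${addr})"
    else
      echo "not ready: ${name} (${addr})"
      failed=1
    fi
  done
  return "$failed"
}

# Example (assumed usage on node 172.25.230.56):
# check_ready receive=172.25.230.56:10907 store=172.25.230.56:10902 \
#             query=172.25.230.56:10904 rule=172.25.230.56:10910
```

The non-zero return code makes the helper usable in a cron job or a post-deployment check that should fail when any component is down.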

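Server Monitoring Configuration

Install and configure the following on every server to be monitored

1. Install node_exporter

Version: 1.8.0

Download page: https://github.com/prometheus/node_exporter/releases/tag/v1.8.0

Basic setup

Download the node_exporter package

Extract the package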

]# tar xvf node_exporter-1.8.0.linux-amd64.tar.gz -C /opt
]# cd /opt
]# mv node_exporter-1.8.0.linux-amd64 node_exporter

Create the user and directories

]# useradd exporter
]# mkdir /opt/node_exporter/{config,logs,script}
]# chown -R exporter:exporter /opt/node_exporter

node_exporter configuration

node_exporter access password file: config/http.yml

Account: exporter, password: 123456

For how to generate the password hash, see "Appendix: generating the password hash"

basic_auth_users:
  exporter: ***************************************

Start node_exporter

]# su - exporter
]$ cd /opt/node_exporter
]$ ./node_exporter \
--web.listen-address=:19100 \
--web.config.file=config/http.yml

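--web.listen-address: listening port of node-exporter
--web.config.file: file holding the node-exporter account and password hash

Start and stop scripts:

Place the scripts in the script directory and run them from there.

Start script: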

#!/bin/bash

##main directory paths
script_dir=$(dirname "$(readlink -f "$0")")
node_dir=$(dirname "$script_dir")
node_logs="${node_dir}/logs/node_exporter.log"
##optional startup parameters, e.g. the --web.* flags shown above, read from script/running_parameters
node_parament=`cat ${node_dir}/script/running_parameters 2>/dev/null`
user='exporter'

##check whether the process is already running
ps -ef | grep -w "${node_dir}/node_exporter" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: Node_exporter server running"
  exit 10
fi

##check the executing user
if [ "`whoami`" != "${user}" ];then
  echo "error: !!!execute user not ${user}!!!"
  exit 10
fi

##check that the binary exists
if [ ! -f "${node_dir}/node_exporter" ];then
  echo "error: !!!Node_exporter binary file not exist!!!"
  exit 10
fi

cd "${node_dir}"

##start the process
if [ -n "${node_parament}" ];then
  nohup ${node_dir}/node_exporter \
  ${node_parament} >> ${node_logs} 2>&1 &
else
  nohup ${node_dir}/node_exporter >> ${node_logs} 2>&1 &
fi

sleep 2

##after the start command, check whether the process exists
ps -ef | grep -w "${node_dir}/node_exporter" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: Node_exporter server start"
else
  echo "error: Node_exporter server start failed"
  exit 10
fi

node_exporter_start.sh

Stop script:

#!/bin/bash

##main directory paths
script_dir=$(dirname "$(readlink -f "$0")")
node_dir=$(dirname "$script_dir")
user='exporter'

##check whether the process is running
pid=`ps -ef | grep -w "${node_dir}/node_exporter" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: Node_exporter server not running"
  exit 10
fi

##check the executing user
if [ "`whoami`" != "${user}" ];then
  echo "error: !!!execute user not ${user}!!!"
  exit 10
fi

##terminate the process
kill $pid

sleep 2

##after the kill, check whether the process is gone
ps -ef | grep -w "${node_dir}/node_exporter" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: Node_exporter server stop"
else
  echo "error: Node_exporter server stop failed"
  exit 10
fi

node_exporter_stop.sh
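Both scripts locate the process by filtering `ps -ef` output with `grep -w` on the full binary path, dropping the `grep` line itself, and taking the second column as the PID. The sketch below reproduces that filter in Python on a hand-written sample so the matching rule is easier to see. The function name `find_pids`, the sample process table and the `node_exporter_helper` entry are all made up for illustration.

```python
import re


def find_pids(ps_output, binary_path):
    """Return the PIDs of `ps -ef` lines that mention binary_path as a whole word."""
    whole_word = re.compile(r"(?<!\w)" + re.escape(binary_path) + r"(?!\w)")
    pids = []
    for line in ps_output.splitlines():
        if "grep" in line:           # same effect as: grep -v 'grep'
            continue
        if whole_word.search(line):  # same effect as: grep -w "<path>"
            pids.append(int(line.split()[1]))  # second column of ps -ef is the PID
    return pids


# Hypothetical process table; the last entry shares the path prefix but is a different binary.
sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
exporter  4321     1  0 10:00 ?        00:00:03 /opt/node_exporter/node_exporter --web.listen-address=:19100
exporter  4400  4390  0 10:05 pts/0    00:00:00 grep -w /opt/node_exporter/node_exporter
root      5000     1  0 10:06 ?        00:00:00 /opt/node_exporter/node_exporter_helper
"""

print(find_pids(sample, "/opt/node_exporter/node_exporter"))
```

Whole-word matching is what keeps a binary such as the hypothetical `node_exporter_helper` from being mistaken for node_exporter, because the underscore after the path counts as a word character.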

Open the node_exporter metrics page

http://172.25.230.57:19100/metrics

Enter the account and password: exporter/123456
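The same check can be scripted instead of done in a browser. The sketch below, offered as one possible approach, builds the HTTP basic-auth header from the account and password above and fetches the metrics page with the Python standard library; `basic_auth_header` and `fetch_metrics` are illustrative names, not part of node_exporter.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def fetch_metrics(url, username, password, timeout=5):
    """Fetch the metrics page as text, sending the basic-auth credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(basic_auth_header("exporter", "123456"))
```

For example, `fetch_metrics("http://172.25.230.57:19100/metrics", "exporter", "123456")` should return the metrics text when the credentials match config/http.yml, and raise an HTTP 401 error otherwise.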

Configure prometheus

Perform this configuration on every prometheus node.

Create the directory

]# su - prometheus
]$ cd /opt/prometheus
]$ mkdir config/node

Edit the prometheus configuration file: config/prometheus.yml

scrape_configs:
  ##job name, user-defined
  - job_name: "node-exporter"
    file_sd_configs:
      ##path of the node_exporter target files
      - files:
          - 'node/*.yml'
    ##account and password used to connect to the node_exporter targets
    basic_auth:
      username: 'exporter'
      password: '123456'
    ##show monitored servers by bare ip, without the port
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:\d+)?'
        replacement: '$1'
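To see what the relabel rule does to a target address, the sketch below applies the same regular expression in Python. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mirrors; `instance_from_address` is an illustrative helper, not a Prometheus API.

```python
import re

# Prometheus anchors relabel regexes at both ends, so fullmatch is used here.
RELABEL_REGEX = re.compile(r"([^:]+)(:\d+)?")


def instance_from_address(address):
    """Return the relabelled instance value, or None when the regex does not match."""
    match = RELABEL_REGEX.fullmatch(address)
    if match is None:
        return None  # no replacement takes place in this case
    return match.group(1)  # corresponds to replacement: '$1'


for address in ["172.25.230.56:19100", "172.25.230.57"]:
    print(address, "->", instance_from_address(address))
```

The first capture group holds everything before the colon, so the port is dropped and the `instance` label shows only the ip.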

node-exporter targets: config/node/node-exporter.yml

##target addresses: the ip of each server running node-exporter, plus the node-exporter port
- targets:
    - 172.25.230.56:19100
    - 172.25.230.57:19100
    - 172.25.230.55:19100
    - 172.25.230.49:19100
    - 172.25.230.52:19100
    - 172.25.230.47:19100 
    - 172.25.230.58:19100
    - 172.25.230.59:19100
  ##labels attached to the monitored servers, user-defined
  labels:
    System: test host
    Devicetype: server
    Servertype: virtual machine

Reload prometheus

Reload prometheus with its reload script.

Verify server monitoring

Open the load-balanced thanos-query entry point:

http://172.25.230.58:10904/

Enter "up" in the query box to list every connected server. A value of 1 means the prometheus server is scraping that server's node-exporter successfully.
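Because Thanos Query implements the Prometheus HTTP API, the same check can be automated. The sketch below is a rough illustration that runs an instant query against the load-balanced address and lists the instances whose `up` value is not 1; `run_query` and `down_instances` are hypothetical helpers, and the sample response is hand-written in the instant-query response format rather than captured from this cluster.

```python
import json
import urllib.parse
import urllib.request

QUERY_URL = "http://172.25.230.58:10904/api/v1/query"  # load-balanced thanos-query entry


def run_query(expr, timeout=10):
    """Run an instant query and return the decoded JSON response."""
    url = QUERY_URL + "?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def down_instances(response):
    """Return the instance labels whose `up` sample is not 1."""
    return [
        sample["metric"].get("instance", "")
        for sample in response["data"]["result"]
        if float(sample["value"][1]) != 1.0
    ]


# Hand-written sample in the instant-query response format.
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector", "result": [
   {"metric": {"__name__": "up", "instance": "172.25.230.56", "job": "node-exporter"},
    "value": [1730000000.0, "1"]},
   {"metric": {"__name__": "up", "instance": "172.25.230.59", "job": "node-exporter"},
    "value": [1730000000.0, "0"]}
 ]}}
""")

print(down_instances(sample_response))
```

Against the live cluster, `down_instances(run_query("up"))` would return an empty list when every target is being scraped successfully.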

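Configure alert delivery rules

See the alertmanager deployment and configuration document for details.

Configure alert rules

Perform this configuration on every thanos-rule node.

Write the alert rule file

Rule file: thanos-rule/rules/system.yml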

groups:
- name: upstatus
  rules:
  - alert: HostOffline
    expr: up == 0
    for: 5m
    labels:
      severity: critical
      category: host offline
    annotations:
      summary: "Host {{ $labels.instance }} is offline"
      description: "Host {{ $labels.instance }} is offline"
      value: "{{ $value }}"

- name: CpuPerUsed
  rules:
  - alert: CpuUsage
    expr: round (100 -avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance,job,System,env)* 100,0.01) > 90
    labels:
      severity: major
      category: CPU usage
    annotations:
      summary: "Host: {{ $labels.instance }}, CPU usage has reached {{ $value }}%, above the 90% threshold over the last 5 minutes"
      description: "Host: {{ $labels.instance }}, CPU usage has reached {{ $value }}%, above the 90% threshold over the last 5 minutes"
      value: "{{ $value }}%"

- name: MemPerUsed
  rules:
  - alert: MemoryUsage
    expr: round((node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100,0.01) > 80
    for: 5m
    labels:
      severity: major
      category: memory usage
    annotations:
      summary: "Host: {{ $labels.instance }}, memory usage has reached {{ $value }}%, above the 80% threshold over the last 5 minutes"
      description: "Host: {{ $labels.instance }}, memory usage has reached {{ $value }}%, above the 80% threshold over the last 5 minutes"
      value: "{{ $value }}%"

- name: DiskPerUsed
  rules:
  - alert: DiskPartitionUsage
    expr: round (100 - ((node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'}),0.01) > 90
    for: 5m
    labels:
      severity: major
      category: disk usage ({{ $labels.mountpoint }})
    annotations:
      summary: "Host: {{ $labels.instance }}, disk usage ({{ $labels.mountpoint }}) has reached {{ $value }}%, above the 90% threshold over the last 5 minutes"
      description: "Host: {{ $labels.instance }}, disk usage ({{ $labels.mountpoint }}) has reached {{ $value }}%, above the 90% threshold over the last 5 minutes"
      value: "{{ $value }}%"
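The memory and disk expressions are plain percentage arithmetic wrapped in PromQL's `round(v, 0.01)`, which rounds to the nearest multiple of 0.01 with ties rounded up. The sketch below works through both formulas in Python with made-up byte counts; the function names are illustrative, and `promql_round` is an approximation of the PromQL function written for this example.

```python
import math


def promql_round(value, to_nearest=1.0):
    """Round to the nearest multiple of to_nearest, with ties rounded up."""
    inverse = 1.0 / to_nearest
    return math.floor(value * inverse + 0.5) / inverse


def disk_used_percent(avail_bytes, size_bytes):
    """Mirror: round(100 - (avail * 100) / size, 0.01)."""
    return promql_round(100 - (avail_bytes * 100) / size_bytes, 0.01)


def memory_used_percent(total, free, buffers, cached):
    """Mirror: round((total - (free + buffers + cached)) / total * 100, 0.01)."""
    return promql_round((total - (free + buffers + cached)) / total * 100, 0.01)


# Made-up readings: a 50 GiB filesystem with 4 GiB available, and a 16 GiB host.
GIB = 1024 ** 3
disk = disk_used_percent(avail_bytes=4 * GIB, size_bytes=50 * GIB)
memory = memory_used_percent(total=16 * GIB, free=1 * GIB, buffers=1 * GIB, cached=2 * GIB)
print(disk, disk > 90)      # the disk rule compares against 90
print(memory, memory > 80)  # the memory rule compares against 80
```

With these readings the disk value exceeds its threshold while the memory value does not, so only the disk rule would start its 5-minute `for` countdown.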

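Reload thanos-rule

Run the reload script (rule_reload.sh) so that thanos-rule loads the alert rules.

View the alerts

Open thanos-rule:

http://172.25.230.56:10910

The page shows which rules have fired.

Check the alert e-mails that were received

Open alertmanager to see the alerts that are currently active: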

http://172.25.230.56:19093

Grafana

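Grafana is deployed on a single node.

1. Basic setup

Download the package and upload it to the server. Download page: https://grafana.com/grafana/download. Pick the package you need; version 11.5.1 is installed here.

Extract the package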

]# tar xvf grafana-enterprise-11.5.1.linux-amd64.tar.gz -C /opt
]# cd /opt/
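##rename the extracted directory to grafana; the extracted name may differ from the archive name, adjust if needed
]# mv grafana-enterprise-11.5.1.linux-amd64 grafana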

Create the user

]# useradd grafana

Create the directories

]# cd /opt/grafana
]# mkdir {logs,script}

2. Configure grafana

Copy the configuration template defaults.ini to grafana.ini

]# cd /opt/grafana/conf
]# cp defaults.ini grafana.ini

Set the port in conf/grafana.ini

http_port = 13000

3. Start and stop grafana

Start grafana

]# su - grafana
]$ cd /opt/grafana
]$ ./bin/grafana server \
--config=conf/grafana.ini \
--homepath=/opt/grafana

--config: path of the grafana configuration file

--homepath: grafana home directory

Stop grafana

]# su - grafana

##find the pid of the grafana process
]$ ps -ef | grep grafana | grep -v grep

##terminate the grafana process
]$ kill -9 $pid

Start and stop scripts:

Place the scripts in the script directory and run them from there.

Start script:

#!/bin/bash

script_dir=$(dirname "$(readlink -f "$0")")
grafana_dir=$(dirname "$script_dir")
user='grafana'
log_dir="logs/grafana.log"

ps -ef | grep -w "${grafana_dir}/bin/grafana" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "error: !!!grafana server running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

if [ ! -f "${grafana_dir}/bin/grafana" ];then
  echo "error: !!!grafana binary file not exist!!!"
  exit 10
fi

cd ${grafana_dir}

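##startup parameters are read from script/running_parameters; an empty or missing file means no extra flags
start_para=`cat ${grafana_dir}/script/running_parameters 2>/dev/null`

if [ -n "${start_para}" ];then
  nohup ${grafana_dir}/bin/grafana server \
  ${start_para} >> ${log_dir} 2>&1 &
else
  nohup ${grafana_dir}/bin/grafana server >> ${log_dir} 2>&1 &
fi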

sleep 2

ps -ef | grep -w "${grafana_dir}/bin/grafana" | grep -v 'grep' &>/dev/null
if [ $? -eq 0 ];then
  echo "success: grafana server start"
else
  echo "error: grafana server start failed"
  exit 10
fi

grafana_start.sh

停止脚本：

#!/bin/bash

script_dir=$(dirname "$(readlink -f "$0")")
grafana_dir=$(dirname "$script_dir")
user='grafana'
log_dir="logs/grafana.log"

pid=`ps -ef | grep -w "${grafana_dir}/bin/grafana" | grep -v 'grep' | awk '{print $2}'`
if [ ! -n "$pid" ];then
  echo "error: !!!grafana server not running!!!"
  exit 10
fi

if [ "`whoami`" != "${user}" ];then
  echo "error: !!!Execute user not ${user}!!!"
  exit 10
fi

kill -9 $pid
sleep 2

ps -ef | grep -w "${grafana_dir}/bin/grafana" | grep -v 'grep' &>/dev/null
if [ $? -ne 0 ];then
  echo "success: grafana process exit"
else
  echo "error: !!!grafana process exit failed!!!"
  exit 10
fi

grafana_stop.sh

grafana startup parameter file: running_parameters

Specifies the startup parameters of grafana.

--config=conf/grafana.ini 
--homepath=/opt/grafana

4. Configure the grafana web UI

(1) Log in to grafana

Login page: http://172.25.230.56:13000/

The initial account/password is admin/admin
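Before logging in, it can be useful to confirm that grafana is actually answering on the configured port. Grafana exposes a health endpoint at /api/health that reports the database state as JSON; the sketch below queries it with the standard library. The helper names are illustrative, and the example body is hand-written rather than captured from this deployment.

```python
import json
import urllib.request

GRAFANA_URL = "http://172.25.230.56:13000"  # address and port used in this article


def health_url(base_url=GRAFANA_URL):
    """Build the URL of Grafana's health endpoint."""
    return base_url.rstrip("/") + "/api/health"


def database_ok(body):
    """Return True when the health response body reports a healthy database."""
    return json.loads(body).get("database") == "ok"


def check_grafana(timeout=5):
    """Fetch /api/health and report whether Grafana considers itself healthy."""
    with urllib.request.urlopen(health_url(), timeout=timeout) as response:
        return database_ok(response.read())


# Hand-written example body; real responses may carry additional fields.
print(database_ok('{"database": "ok", "version": "11.5.1"}'))
```

`check_grafana()` raises an exception when the port is closed, which distinguishes "not started" from "started but unhealthy".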

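Set the UI language

Change the UI language to Chinese

In the left sidebar, go to 【Administration】==> 【General】==> 【Default preferences】

The grafana UI language is now Chinese

Configure the grafana data source

【Connections】 ==> 【Data sources】

Select 【Add data source】, then 【Prometheus】

In 【Name】, enter a data source name of your choice

In 【Connection】, enter the load-balanced thanos-query address (http://172.25.230.58:10904)

Click 【Save & test】 to submit, and check whether any error is reported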

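Import the server monitoring dashboard

Download the grafana dashboard

URL: https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Open 【Dashboards】

Click 【New】==> 【Import】

Import the grafana dashboard file

Open the dashboard to view the monitoring data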

Appendix: generating the password hash

Python script

pass.py

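import bcrypt

# Password entered by the user
password = input("Enter password: ")

# Generate a salt and hash the password; bcrypt works on bytes, so encode first
salt = bcrypt.gensalt()
hashed_password = bcrypt.hashpw(password.encode("utf-8"), salt)

# Print the hash as text so it can be pasted into config/http.yml
print(f"Hashed Password: {hashed_password.decode('utf-8')}")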

Run the Python script and enter the password to be hashed.

Article link: https://opszzfwordpress.club/post/232.html
