Ceph Monitoring and DingTalk Alerting with Grafana and Prometheus
Getting the packages
The latest packages are available at https://prometheus.io/download/
Prometheus
1. Download Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.6.0/prometheus-2.6.0.linux-amd64.tar.gz
2. Extract the package
$ tar xf prometheus-2.6.0.linux-amd64.tar.gz
3. Configure the Prometheus service
Move the extracted directory to /usr/local/ and rename it to prometheus:
$ mv prometheus-2.6.0.linux-amd64 /usr/local/prometheus
Create the systemd unit file:
$ vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus: the monitoring system
Documentation=http://prometheus.io/docs/
[Service]
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/usr/local/prometheus/consoles \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 --web.external-url=
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
Create the directory that stores the monitoring data:
$ mkdir /var/lib/prometheus
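For scripted installs, the unit file can be written without opening an editor. The following is a minimal sketch, not part of the original procedure: write_prometheus_unit is a hypothetical helper name, and the unit content simply mirrors the file described above.

```shell
# Hypothetical helper: write the Prometheus unit file to the given path.
write_prometheus_unit() {
  local unit_path="$1"
  # The quoted 'EOF' delimiter keeps the trailing backslashes literal.
  cat > "$unit_path" <<'EOF'
[Unit]
Description=Prometheus: the monitoring system
Documentation=http://prometheus.io/docs/

[Service]
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/usr/local/prometheus/consoles \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 --web.external-url=
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root):
#   write_prometheus_unit /usr/lib/systemd/system/prometheus.service
```

Writing to a path passed as an argument makes it possible to generate the file in a scratch directory first and inspect it before installing it.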
4. Start Prometheus
$ systemctl daemon-reload
$ systemctl enable prometheus
$ systemctl start prometheus
5. Check the listening port
Prometheus listens on port 9090. Once it has started, use netstat to check the state of the port:
$ netstat -antpu | grep 9090
tcp 0 0 127.0.0.1:33270 127.0.0.1:9090 ESTABLISHED 6426/prometheus
tcp6 0 0 :::9090 :::* LISTEN 6426/prometheus
tcp6 0 0 ::1:9090 ::1:51821 ESTABLISHED 6426/prometheus
tcp6 0 0 ::1:51821 ::1:9090 ESTABLISHED 6426/prometheus
tcp6 0 0 127.0.0.1:9090 127.0.0.1:33270 ESTABLISHED 6426/prometheus
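A plain grep for the port number also matches ESTABLISHED connections and remote ports, as the output above shows. The sketch below filters for a LISTEN socket on the local address only. port_listening is a hypothetical helper, and it assumes the column layout that netstat -antpu prints for TCP sockets (local address in column 4, state in column 6).

```shell
# Hypothetical helper: succeed if stdin (netstat -antpu output) contains
# a socket in LISTEN state whose local address ends in ":<port>".
port_listening() {
  local port="$1"
  awk -v port="$port" '
    $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}

# Usage:
#   netstat -antpu | port_listening 9090 && echo "Prometheus is listening"
```

The same helper can be reused for the other ports in this walkthrough by changing the argument.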
6. Access through a browser
Once Prometheus is running, its status and configuration can be viewed in a browser.

Ceph_exporter
Ceph_exporter has to be compiled with Go. Alternatively, a prebuilt Ceph_exporter binary can be downloaded and used directly. Link: https://pan.baidu.com/s/1AEF_pdDvSJ5gMPapaBuBrA Extraction code: jkuh
1. Install the Go environment
$ yum -y install golang
2. Check the Go environment variables
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/golang"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build359765015=/tmp/go-build -gno-record-gcc-switches"
3. Set the Go environment variables
$ vim /etc/profile.d/go.sh
export GOROOT=/usr/lib/golang
export GOBIN=$GOROOT/bin
export GOPATH=/root/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
$ source /etc/profile.d/go.sh
4. Download and build Ceph_exporter
$ mkdir -p ~/go/src/github.com/digitalocean/
$ cd ~/go/src/github.com/digitalocean/
$ git clone https://github.com/digitalocean/ceph_exporter
$ cd ceph_exporter
$ go build
5. Create the Ceph_exporter service
$ mkdir -p ~/go/bin/
$ cp ~/go/src/github.com/digitalocean/ceph_exporter/ceph_exporter ~/go/bin/
$ vim /usr/lib/systemd/system/ceph_exporter.service
[Unit]
Description=Prometheus's ceph metrics exporter
[Service]
User=root
Group=root
ExecStart=/root/go/bin/ceph_exporter
[Install]
WantedBy=multi-user.target
Alias=ceph_exporter.service
6. Start Ceph_exporter
$ systemctl daemon-reload
$ systemctl enable ceph_exporter
$ systemctl start ceph_exporter
7. Check the listening port
Ceph_exporter uses port 9128. Use netstat to check the state of the port:
$ netstat -antpu | grep 9128
tcp6 0 0 :::9128 :::* LISTEN 6839/ceph_exporter
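An open port only shows that the process is running, not that it can reach the cluster. A quick additional check is to count the Ceph samples the exporter returns. This is a sketch under two assumptions: count_ceph_metrics is a hypothetical helper, and the exporter serves its metrics at the /metrics path on port 9128.

```shell
# Hypothetical helper: count sample lines whose metric name starts with
# "ceph_" in Prometheus text-format output read from stdin.
# Comment lines (# HELP, # TYPE) are not counted.
count_ceph_metrics() {
  # grep -c exits non-zero when the count is 0; keep the function's status clean.
  grep -c '^ceph_' || true
}

# Usage (assumes the exporter runs on this host):
#   curl -s http://localhost:9128/metrics | count_ceph_metrics
```

A count of 0 would suggest that the exporter is up but is not collecting anything from Ceph.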
8. Update the Prometheus configuration
Add the Ceph_exporter endpoint as a job under scrape_configs in the Prometheus configuration:
$ vim /usr/local/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    static_configs:
      - targets: ['192.168.1.10:9128']
        labels:
          instance: Ceph test cluster
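Indentation mistakes in prometheus.yml keep Prometheus from starting, so it can help to validate the file before restarting. The sketch below uses promtool, which ships in the same tarball as the prometheus binary. safe_restart_prometheus and the RESTART_CMD override are hypothetical names introduced for illustration.

```shell
# Path to promtool; assumed to sit next to the prometheus binary.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

# Hypothetical helper: restart Prometheus only if the configuration
# passes "promtool check config". RESTART_CMD can be overridden for dry runs.
safe_restart_prometheus() {
  local config="${1:-/usr/local/prometheus/prometheus.yml}"
  if "$PROMTOOL" check config "$config"; then
    ${RESTART_CMD:-systemctl restart prometheus}
  else
    echo "config check failed; not restarting" >&2
    return 1
  fi
}

# Usage:
#   safe_restart_prometheus
#   RESTART_CMD="echo would restart" safe_restart_prometheus /tmp/prometheus.yml
```

promtool check config also checks any rule files listed under rule_files, so the same helper stays useful once alert rules are added.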
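9. Restart Prometheus
$ systemctl restart prometheus
10. Verify in a browser
Open the Targets page (Status > Targets) and check that the ceph job is listed as UP.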
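When the restart is part of a script, the browser check can be preceded by a wait on the Prometheus health endpoint. This is a sketch, assuming Prometheus is reachable on localhost:9090 and exposes /-/healthy. wait_until_healthy is a hypothetical helper.

```shell
# Hypothetical helper: poll a URL until curl gets a successful response,
# giving up after the given number of attempts.
wait_until_healthy() {
  local url="$1"
  local attempts="${2:-10}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (status >= 400).
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage:
#   systemctl restart prometheus
#   wait_until_healthy http://localhost:9090/-/healthy || echo "Prometheus did not come up"
```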

Grafana
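1. Download the package
$ wget https://dl.grafana.com/oss/release/grafana-5.4.3-1.x86_64.rpm
The latest packages for other systems are listed on the Grafana website: https://grafana.com/grafana/download
2. Install Grafana
$ yum -y install grafana-5.4.3-1.x86_64.rpm
3. Start Grafana
$ systemctl enable grafana-server
$ systemctl start grafana-server
4. Check the listening port
Grafana listens on port 3000. Use netstat to check the state of the port:
$ netstat -antpu | grep 3000
tcp6 0 0 :::3000 :::* LISTEN 7147/grafana-server
5. Log in through a browser
The address is http://$IP:3000
The initial username and password are both admin. Grafana prompts for a new password after the first login.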

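6. Configure the dashboard
Click Add data source to add a data source
Select Prometheus
Set the URL to the Prometheus address, http://$IP:9090
Import the dashboard using template ID 917. If the server has no Internet access, the template can also be downloaded from the Grafana website and imported manually: https://grafana.com/dashboards/917
View the monitoring status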
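The data source can also be created through Grafana's HTTP API instead of the web UI. The sketch below only builds the JSON request body. datasource_payload is a hypothetical helper, the 192.168.1.10 address in the usage comment is an assumption borrowed from the other configuration examples, and the admin/admin credentials only work until the password has been changed.

```shell
# Hypothetical helper: print the JSON body for creating a Prometheus
# data source via POST /api/datasources.
datasource_payload() {
  local name="$1"
  local url="$2"
  printf '{"name":"%s","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' \
    "$name" "$url"
}

# Usage:
#   curl -s -u admin:admin -H 'Content-Type: application/json' \
#     -X POST -d "$(datasource_payload Prometheus http://192.168.1.10:9090)" \
#     http://192.168.1.10:3000/api/datasources
```

The helper does no JSON escaping, so it is only suitable for simple names and URLs like the ones shown.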

Alertmanager
1. Install Alertmanager
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.16.0/alertmanager-0.16.0.linux-amd64.tar.gz
$ tar xf alertmanager-0.16.0.linux-amd64.tar.gz
$ cd alertmanager-0.16.0.linux-amd64
$ cp alertmanager amtool /usr/bin/
$ cp alertmanager.yml /usr/local/prometheus/
2. Create the systemd unit file
$ vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Prometheus: the alerting system
Documentation=http://prometheus.io/docs/
After=prometheus.service
[Service]
ExecStart=/usr/bin/alertmanager --config.file=/usr/local/prometheus/alertmanager.yml
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
3. Start Alertmanager
$ systemctl enable alertmanager
$ systemctl start alertmanager
4. Check the listening port
Alertmanager listens on port 9093. Use netstat to check the state of the port:
$ netstat -antpu | grep 9093
tcp6 0 0 :::9093 :::* LISTEN 7381/alertmanager
5. Configure Prometheus with the Alertmanager endpoint
$ vim /usr/local/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.1.10:9093"]
6. Restart Prometheus
$ systemctl restart prometheus
Configuring DingTalk alerts
1. Set up the webhook
$ mkdir -p /usr/lib/golang/src/github.com/timonwong/
$ cd /usr/lib/golang/src/github.com/timonwong/
$ git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
$ cd prometheus-webhook-dingtalk
$ make
$ nohup ./prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=8fe12c1a58b0769d7fcbf6ebf3bcd2cfcba825f2c45b4b39055890fd705df543" &> /var/log/dingding.log &
2. Add the webhook receiver
The profile name given to --ding.profile (webhook) is the path segment used in the receiver URL below.
$ vim /usr/local/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://192.168.1.10:8060/dingtalk/webhook/send'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
3. Add the alert rule file
$ vim /usr/local/prometheus/prometheus.yml
rule_files:
  - /usr/local/prometheus/ceph.yml
4. Configure the alert rules
$ vim /usr/local/prometheus/ceph.yml
groups:
  - name: ceph-rule
    rules:
      - alert: Ceph OSD Down
        expr: ceph_osd_down > 0
        for: 2m
        labels:
          product: Ceph test cluster
        annotations:
          Warn: "{{$labels.instance}}: {{ $value }} OSD(s) are down"
          Description: "{{$labels.instance}}: {{ $labels.osd }} is currently in state {{ $labels.status }}"
      - alert: Cluster space usage
        expr: ceph_cluster_used_bytes / ceph_cluster_capacity_bytes * 100 > 80
        for: 2m
        labels:
          product: Ceph test cluster
        annotations:
          Warn: "{{$labels.instance}}: cluster is running out of space"
          Description: "{{$labels.instance}}: current space usage is {{ $value }}"
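The space usage rule fires when used bytes divided by capacity, times 100, stays above 80 for two minutes. The arithmetic can be reproduced outside Prometheus as a sanity check. In this sketch usage_percent and exceeds_threshold are hypothetical helpers, and the byte counts in the usage comment are made-up values, not readings from a real cluster.

```shell
# Hypothetical helper: print used/capacity as a percentage with two decimals,
# mirroring ceph_cluster_used_bytes / ceph_cluster_capacity_bytes * 100.
usage_percent() {
  awk -v used="$1" -v cap="$2" 'BEGIN { printf "%.2f\n", used / cap * 100 }'
}

# Hypothetical helper: succeed when the usage percentage is above the limit
# (default 80, the threshold used in the rule).
exceeds_threshold() {
  awk -v used="$1" -v cap="$2" -v limit="${3:-80}" \
    'BEGIN { exit !(used / cap * 100 > limit) }'
}

# Usage with made-up byte counts (850 GB used of 1000 GB):
#   usage_percent 850000000000 1000000000000
#   exceeds_threshold 850000000000 1000000000000 && echo "rule condition is true"
```

Note that the rule additionally requires the condition to hold for the whole 2m window before the alert fires, which this one-off calculation does not model.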
5. Restart the services to apply the configuration
$ systemctl restart alertmanager
$ systemctl restart prometheus.service
6. Verify in DingTalk
After an OSD is stopped, DingTalk receives an alert.

After the OSD is started again, a recovery notification arrives.
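Stopping an OSD is not always acceptable on a cluster that is in use. As an alternative, a synthetic alert can be pushed straight to Alertmanager to exercise the route to DingTalk. This is a sketch under assumptions: test_alert_json is a hypothetical helper, the alert name and label values are made up, and the /api/v1/alerts endpoint is the one available in Alertmanager 0.16 (newer releases use /api/v2/alerts).

```shell
# Hypothetical helper: print a one-element JSON array of alerts in the
# format accepted by Alertmanager's alert-ingestion endpoint.
test_alert_json() {
  local instance="${1:-Ceph test cluster}"
  printf '[{"labels":{"alertname":"PipelineTest","severity":"warning","instance":"%s"},"annotations":{"Description":"synthetic alert for testing the DingTalk route"}}]\n' \
    "$instance"
}

# Usage (address taken from the configuration examples above):
#   curl -s -H 'Content-Type: application/json' \
#     -d "$(test_alert_json)" http://192.168.1.10:9093/api/v1/alerts
```

Because the synthetic alert is never re-sent, Alertmanager treats it as resolved once resolve_timeout (5m in the configuration above) has passed.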

