How to monitor ES with supervisor

0x00 Preface

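Our ES cluster has recently gone down too many times because of JVM crashes. To recover quickly and be notified right away, I decided to set up an ops tool that restarts the process on failure and sends an alert. I evaluated three programs: systemd, monit, and supervisor. systemd ships with CentOS 7, is very stable, and its configuration is familiar and easy to pick up, but it has no built-in alerting. monit is powerful: besides process monitoring it can watch process resource usage, directories, files, and more, and it is minimally invasive because processes do not have to be started by monit. Its alerting, however, does not support custom code. supervisor is a process-monitoring framework written in Python. It can only supervise foreground processes, so a process that runs in the background cannot be managed by it, but its event/listener mechanism lets alerts be implemented in custom code.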

Because the alerts go to a DingTalk robot and ES can run in the foreground, I chose supervisor.

0x01 Installing supervisor

If EPEL is already enabled on CentOS 7, Python can be installed through yum. If it is not, enable it with the following command.


yum install -y epel-release

Install Python 3.6. That is the newest version EPEL provides; a newer Python can also be compiled locally.


yum install -y python36

Once Python is installed, install supervisor with pip.


python3 -m pip install supervisor -i https://mirrors.aliyun.com/pypi/simple

0x02 Managing supervisor with systemd

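Each node of the ES cluster runs with a 32 GB JVM, so the four Linux system settings below are required. Of these, the only one supervisor inherits by default is vm.max_map_count, because it is a kernel-wide sysctl. The nofile and memlock limits are not read from the system configuration (/etc/security/limits.conf is applied by PAM to login sessions, not to systemd services), so they have to be set in the systemd unit that starts supervisor.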


echo 'es  soft  nofile  655350' >> /etc/security/limits.conf && \
echo 'es  hard  nofile  655350' >> /etc/security/limits.conf && \
echo 'es  -     memlock unlimited' >> /etc/security/limits.conf && \
echo 'vm.max_map_count=2621440' >> /etc/sysctl.conf && \
sysctl -p
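To confirm that the sysctl change is live, the current value can be read back from /proc. The sketch below is only an illustration: the helper names are made up for this example, and the 262144 threshold is the minimum that Elasticsearch's bootstrap check enforces, not the larger value written above.

```python
from pathlib import Path

# Minimum vm.max_map_count enforced by Elasticsearch's bootstrap check.
ES_MIN_MAX_MAP_COUNT = 262144


def parse_sysctl_value(text):
    """Parse the single integer stored in a /proc/sys file."""
    return int(text.strip())


def max_map_count_ok(text, minimum=ES_MIN_MAX_MAP_COUNT):
    """Return True when the value read from /proc meets the minimum."""
    return parse_sysctl_value(text) >= minimum


if __name__ == "__main__":
    proc_file = Path("/proc/sys/vm/max_map_count")
    if proc_file.exists():
        raw = proc_file.read_text()
        status = "ok" if max_map_count_ok(raw) else "too low"
        print(parse_sysctl_value(raw), status)
```

Running it on the ES host after sysctl -p should print the configured value followed by "ok".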

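The limits and the supervisord service are configured as follows. Create the file /usr/lib/systemd/system/supervisord.service. supervisor cannot be started yet, because its own configuration file has not been written.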


[Unit]
Description=Supervisor daemon

[Service]
Type=forking
ExecStart=/usr/local/bin/supervisord -c /etc/supervisord.conf
ExecStop=/usr/local/bin/supervisorctl -c /etc/supervisord.conf $OPTIONS shutdown
ExecReload=/usr/local/bin/supervisorctl -c /etc/supervisord.conf $OPTIONS reload
KillMode=process
Restart=on-failure
RestartSec=42s
LimitNOFILE=655350
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target
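Once supervisord is running under this unit, one way to check that LimitNOFILE and LimitMEMLOCK actually reached the process is to read /proc/<pid>/limits. The following is a minimal sketch, assuming the pidfile path /var/run/supervisord.pid used later in supervisord.conf; parse_proc_limits is a hypothetical helper name, not part of supervisor.

```python
import re
from pathlib import Path


def parse_proc_limits(text):
    """Map each limit name to its (soft, hard) pair from /proc/<pid>/limits text."""
    limits = {}
    for line in text.splitlines()[1:]:             # skip the header row
        parts = re.split(r"\s{2,}", line.strip())  # columns are padded with spaces
        if len(parts) >= 3:
            limits[parts[0]] = (parts[1], parts[2])
    return limits


if __name__ == "__main__":
    pid_file = Path("/var/run/supervisord.pid")    # assumed pidfile location
    if pid_file.exists():
        limits_file = Path("/proc") / pid_file.read_text().strip() / "limits"
        if limits_file.exists():
            limits = parse_proc_limits(limits_file.read_text())
            for name in ("Max open files", "Max locked memory"):
                print(name, limits.get(name))
```

The same check can be pointed at an ES child process by substituting its PID, which shows whether the limits were inherited.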

0x03 Configuring supervisor

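Create the file /etc/supervisord.conf. The supervisor configuration file has roughly three parts:

The first part configures the supervisord service itself: logging, the system limits applied when child processes are started, the unix socket, and so on.

The second part configures the processes to be started: working directory, restart policy, logging, environment variables, and so on.

The third part configures the event listener, using the same kinds of options as the second part. The complete configuration is as follows: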


[unix_http_server]
file=/var/run/supervisor.sock   ; the path to the socket file

[supervisord]
logfile=/var/log/supervisord.log ; main log file; default $CWD/supervisord.log
logfile_maxbytes=50MB            ; max main logfile bytes b4 rotation; default 50MB
logfile_backups=10               ; # of main logfile backups; 0 means none, default 10
loglevel=info                    ; log level; default info; others: debug,warn,trace
pidfile=/var/run/supervisord.pid ; supervisord pidfile; default supervisord.pid
nodaemon=false                   ; start in foreground if true; default false
silent=false                     ; no logs to stdout if true; default false
minfds=655350                    ; min. avail startup file descriptors; default 1024
minprocs=4096                    ; min. avail process descriptors; default 200

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///var/run/supervisor.sock ; use a unix:// URL for a unix socket

[program:es_node_a]
command=/data1/es7/elasticsearch-7.4.0/bin/elasticsearch ; the program (relative uses PATH, can take args)
process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1                    ; number of processes copies to start (def 1)
directory=/data1/es7/elasticsearch-7.4.0/bin ; directory to cwd to before exec (def no cwd)
priority=10                   ; the relative start priority (default 999)
autostart=true                ; start at supervisord start (default: true)
startsecs=120                 ; # of secs prog must stay up to be running (def. 1)
startretries=3                ; max # of serial start failures when starting (default 3)
autorestart=true              ; when to restart if exited after running (def: unexpected)
exitcodes=0                   ; 'expected' exit codes used with autorestart (default 0)
stopsignal=TERM               ; signal used to kill process (default TERM)
stopwaitsecs=60               ; max num secs to wait b4 SIGKILL (default 10)
user=es                       ; setuid to this UNIX account to run the program
redirect_stderr=false         ; redirect proc stderr to stdout (default false)
stdout_logfile=NONE
stderr_logfile=/data1/es_supervisor_error.log ; stderr log path, NONE for none; default AUTO
environment=JAVA_HOME=""

[program:es_node_b]
command=/data1/es7b/elasticsearch-7.4.0/bin/elasticsearch ; the program (relative uses PATH, can take args)
process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1                    ; number of processes copies to start (def 1)
directory=/data1/es7b/elasticsearch-7.4.0/bin ; directory to cwd to before exec (def no cwd)
priority=10                   ; the relative start priority (default 999)
autostart=true                ; start at supervisord start (default: true)
startsecs=120                 ; # of secs prog must stay up to be running (def. 1)
startretries=3                ; max # of serial start failures when starting (default 3)
autorestart=true              ; when to restart if exited after running (def: unexpected)
exitcodes=0                   ; 'expected' exit codes used with autorestart (default 0)
stopsignal=TERM               ; signal used to kill process (default TERM)
stopwaitsecs=60               ; max num secs to wait b4 SIGKILL (default 10)
user=es                       ; setuid to this UNIX account to run the program
redirect_stderr=false         ; redirect proc stderr to stdout (default false)
stdout_logfile=NONE
stderr_logfile=/data1/es_supervisor_error.log ; stderr log path, NONE for none; default AUTO
environment=JAVA_HOME=""

[eventlistener:es_event_listener]
command=/data1/es_event_monitor.py ; the program (relative uses PATH, can take args)
process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1                    ; number of processes copies to start (def 1)
events=PROCESS_STATE_FATAL    ; event notif. types to subscribe to (req'd)
buffer_size=10                ; event buffer queue size (default 10)
directory=/data1              ; directory to cwd to before exec (def no cwd)
priority=-1                   ; the relative start priority (default -1)
autostart=true                ; start at supervisord start (default: true)
startsecs=10                  ; # of secs prog must stay up to be running (def. 1)
startretries=3                ; max # of serial start failures when starting (default 3)
autorestart=unexpected        ; autorestart if exited after running (def: unexpected)
exitcodes=0                   ; 'expected' exit codes used with autorestart (default 0)
stopsignal=TERM               ; signal used to kill process (default TERM)
stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
user=root                     ; setuid to this UNIX account to run the program
redirect_stderr=false         ; redirect_stderr=true is not allowed for eventlisteners
stdout_logfile=event.log      ; stdout log path, NONE for none; default AUTO
stderr_logfile=event.log
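Before starting supervisord it can help to sanity-check the file. The sketch below uses Python's standard configparser, which is an approximation: supervisor has its own parser, so this only catches obvious mistakes. The function names and the 20-second threshold (taken from the observation that ES needs more than 20 seconds to start) are assumptions made for this example.

```python
import configparser


def load_supervisor_conf(text):
    """Parse supervisord.conf text with the standard-library INI parser."""
    parser = configparser.ConfigParser(
        interpolation=None,              # keep %(program_name)s as a literal string
        inline_comment_prefixes=(";",),  # drop trailing "; comment" text
    )
    parser.read_string(text)
    return parser


def short_startsecs(parser, minimum=20):
    """Return the [program:x] sections whose startsecs is below `minimum`."""
    flagged = []
    for section in parser.sections():
        if section.startswith("program:"):
            # supervisor's default startsecs is 1 when the option is absent
            if parser.getint(section, "startsecs", fallback=1) < minimum:
                flagged.append(section)
    return flagged


if __name__ == "__main__":
    import os
    path = "/etc/supervisord.conf"
    if os.path.exists(path):
        with open(path) as handle:
            conf = load_supervisor_conf(handle.read())
        print("programs with short startsecs:", short_startsecs(conf))
```

With the configuration above the list should come back empty, since both ES programs set startsecs=120.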

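A few points deserve attention:

First, minfds and minprocs. The systemd unit above sets the file-descriptor and memlock limits for the supervisor process itself. Here the minimum number of file descriptors and the minimum number of processes must also be configured for the processes that supervisor starts.

Second, startsecs=120. After supervisor starts ES, the process is kept in the STARTING state for 120 seconds and only then moves to RUNNING. The default is 1 second, while ES usually needs more than 20 seconds to start. With the default (or any value that is too small) and autorestart=true, an ES instance that fails during startup is restarted by supervisor over and over, startretries=3 is never applied, and no alert is triggered. The reason is that with a short startsecs the process reaches RUNNING before it dies, so it keeps bouncing between STARTING and RUNNING. startretries=3 only counts failures that go through the BACKOFF state, that is, exits that happen while the process is still STARTING. Only when those retries are exhausted does the process become FATAL, which is the event the listener subscribes to.

Third, stopsignal=TERM. ES exits cleanly on SIGTERM. (Ctrl+C in a terminal sends SIGINT rather than SIGTERM; the JVM runs its shutdown hooks for both.) The sample configuration generated by echo_supervisord_conf shows stopsignal=QUIT, and with that value ES does not exit properly.

Fourth, environment=JAVA_HOME="". In some environments JAVA_HOME is set, which prevents ES from using its bundled JDK. supervisor cannot unset an environment variable, but setting the variable to an empty string works.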
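The effect of startsecs described in the second point can be illustrated with a toy model. This is a deliberately simplified sketch of the state transitions, not supervisor's actual implementation: it assumes a process that always dies a fixed number of seconds after being spawned and autorestart=true, and the function name simulate is invented for this example.

```python
def simulate(crash_after, startsecs, startretries=3, max_spawns=10):
    """Return the state trace of a process that dies `crash_after` seconds after each spawn."""
    trace = []
    failures = 0
    for _ in range(max_spawns):
        trace.append("STARTING")
        if crash_after < startsecs:
            # Died while still STARTING: this counts as a failed start.
            failures += 1
            trace.append("BACKOFF")
            if failures > startretries:
                trace.append("FATAL")   # retries exhausted, PROCESS_STATE_FATAL fires
                return trace
        else:
            # Survived startsecs: promoted to RUNNING and the failure count resets.
            failures = 0
            trace.extend(["RUNNING", "EXITED"])  # autorestart=true spawns it again
    return trace


if __name__ == "__main__":
    # ES dies 30 seconds into startup in both runs; only startsecs differs.
    print(simulate(crash_after=30, startsecs=120)[-1])
    print("FATAL" in simulate(crash_after=30, startsecs=1))
```

In this model, startsecs=120 ends in FATAL after the retries are used up, while startsecs=1 loops through RUNNING and EXITED until max_spawns is reached and never produces the FATAL event the alert depends on.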

0x04 Writing the event notifier

An event listener can be in one of three states:

Name	Description
ACKNOWLEDGED	The event listener has acknowledged (accepted or rejected) an event send.
READY	Event notifications may be sent to this event listener
BUSY	Event notifications may not be sent to this event listener.

When an event listener process first starts, supervisor automatically places it into the ACKNOWLEDGED state to allow for startup activities or guard against startup failures (hangs). Until the listener sends a READY\n string to its stdout, it will stay in this state.
When supervisor sends an event notification to a listener in the READY state, the listener will be placed into the BUSY state until it receives an OK or FAIL response from the listener, at which time, the listener will be transitioned back into the ACKNOWLEDGED state.¹

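Put simply: after the event listener starts, it is in the ACKNOWLEDGED state. Once it writes READY to its stdout it becomes READY and blocks on readline, waiting for a message from supervisor. When supervisor sends an event the listener becomes BUSY, and it stays BUSY until it has finished handling that event and reported the result.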


#!/usr/bin/env python3
import sys
import json
import socket
import time
import requests
from loguru import logger

_local_ip = None
# Debug webhook
# _dd_url = 'x'

# Production webhook
_dd_url = 'x'


def get_host_ip():
    """
    Return this server's IP address, caching it after the first lookup.

    Connecting a UDP socket to a remote address makes the kernel pick the
    outgoing interface, and getsockname() then reports that interface's IP.
    This needs no extra dependencies and no guessing about network devices.
    connect() on a UDP socket does not send any packet, so nothing shows up
    in a packet capture. It does allocate a UDP port, so frequent calls are
    relatively slow, which is why the result is cached in _local_ip.
    :return: the local IP address as a string
    """
    global _local_ip
    s = None
    try:
        if not _local_ip:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.connect(('223.5.5.5', 80))
            _local_ip = s.getsockname()[0]
        return _local_ip
    finally:
        if s:
            s.close()


def _dd_send(data):
    # Build the DingTalk robot text message.
    message = json.dumps({
        "msgtype": "text",
        "at": {
            "atMobiles": [xxx],  # redacted in the original post: phone numbers to @-mention
            "atUserIds": [],
            "isAtAll": False
        },
        "text": {"content": "{}".format(data)}
    })
    headers = {
        'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405',
        'Content-Type': 'application/json'
    }
    try:
        # The timeout keeps a hung webhook from blocking the listener forever.
        requests.post(_dd_url, headers=headers, data=message, timeout=10)
        # logger.info('Sends Successfully with {}'.format(data))
        return True
    except Exception as e:
        logger.error('Sends Failed with {}'.format(e))
        return False


def write_stdout(s):
    # only eventlistener protocol messages may be sent to stdout
    sys.stdout.write(s)
    sys.stdout.flush()


def write_stderr(s):
    sys.stderr.write(s)
    sys.stderr.flush()


def main():
    get_host_ip()
    logger.remove()
    log_format = '<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | <level>{level}</level> | <cyan>{file}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>'
    logger.add('/data1/es_event_monitor.log', format=log_format,
               rotation='50 MB', colorize=True)
    while 1:
        # transition from ACKNOWLEDGED to READY
        write_stdout('READY\n')

        # block until supervisor sends the header line of the next event
        line = sys.stdin.readline()

        # parse the "key:value" header tokens, then read the payload of len characters
        headers = dict([x.split(':') for x in line.split()])
        notify_data = sys.stdin.read(int(headers['len']))
        if 'es_node' in notify_data:
            now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
            message = f'\nES node down alert\nTime: {now}\nNode: {_local_ip}\nEvent: {notify_data}'
            logger.info(message)
            _dd_send(message)

        # report the result: transition from BUSY back to ACKNOWLEDGED
        write_stdout('RESULT 2\nOK')


if __name__ == '__main__':
    main()
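Because the loop in the listing reads sys.stdin directly, it is awkward to exercise without a running supervisord. The following sketch shows the same READY / event / RESULT cycle with the streams passed in as arguments, so it can be driven from in-memory buffers. The names parse_tokens and serve are invented for this example, and the sample header values (serial numbers, pool name) are made up; only the token layout follows supervisor's documented event format.

```python
import io


def parse_tokens(text):
    """Turn space-separated 'key:value' tokens into a dict of strings."""
    return dict(token.split(":", 1) for token in text.split())


def serve(stdin, stdout, handle):
    """Run the READY / event / RESULT cycle until stdin reaches EOF.

    handle(headers, payload) returns True to answer OK and False to answer FAIL.
    Returns the number of events processed.
    """
    handled = 0
    while True:
        stdout.write("READY\n")          # ACKNOWLEDGED -> READY
        stdout.flush()
        line = stdin.readline()          # blocks until a header line arrives
        if not line:                     # EOF: the other side closed the pipe
            return handled
        headers = parse_tokens(line)
        payload = stdin.read(int(headers["len"]))
        ok = handle(headers, payload)
        stdout.write("RESULT 2\nOK" if ok else "RESULT 4\nFAIL")  # BUSY -> ACKNOWLEDGED
        stdout.flush()
        handled += 1


if __name__ == "__main__":
    # Simulated PROCESS_STATE_FATAL notification for one of the ES programs.
    body = "processname:es_node_a groupname:es_node_a from_state:BACKOFF"
    head = ("ver:3.0 server:supervisor serial:1 pool:es_event_listener "
            "poolserial:1 eventname:PROCESS_STATE_FATAL len:{}\n".format(len(body)))

    def show(headers, payload):
        fields = parse_tokens(payload)
        print(headers["eventname"], fields["processname"], fields["from_state"])
        return True

    serve(io.StringIO(head + body), io.StringIO(), show)
```

Parsing the payload into fields also makes it possible to put just the process name and previous state into the alert text instead of the raw payload string.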

0x05 References

1. Event Listener States (supervisor documentation)
