mount nfs目录挂载导致目标服务器宕机
宕机背景
由于业务规模一般,现业务由两台互相备大容量存储服务器BIG42、BIG43 向 8台WEB服务器提供文件存储服务。
BIG42、BIG43部署情况
BIG42、BIG43采用sersync实现互备(基于rsync互相同步数据,BIG42新增数据会向BIG43同步,BIG43新增数据会向BIG42同步)
WEB服务器部署情况
WEB服务器通过mount -t nfs方式向BIG42、BIG43挂载远程目录使用
mount -t nfs xx.xxx.xxx.42:/data/uploads/www_xxxx_com/data /home/wwwroot/xxx.xxxxx.com/data
...
宕机现象
- WEB服务器 : load average达到15~18
- WEB服务器 : df -h 无法查看到挂载目录
- WEB服务器 : nginx服务停止,站点无法访问(挂载BIG42的WEB均出现故障)
- DATA服务器 : 晚上20点BIG43宕机ssh无法连接、BIG42正常load average在0.5以下
- DATA服务器 : 紧急将BIG43的挂载切换至BIG42,业务暂时恢复,次日凌晨02点BIG42出现宕机
BIG42日志
Jun 21 02:43:47 szwg-m91-store-webmedia02 xinetd[9600]: Deactivating service rsync due to excessive incoming connections. Restarting in 10 seconds.
Jun 21 02:43:47 szwg-m91-store-webmedia02 xinetd[9600]: FAIL: rsync connections per second from=10.199.146.43
Jun 21 02:43:47 szwg-m91-store-webmedia02 xinetd[9600]: EXIT: rsync status=0 pid=1317 duration=0(sec)
Jun 21 02:43:57 szwg-m91-store-webmedia02 xinetd[9600]: Activating service rsync
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: select reported EBADF but no bad file descriptors were found
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: START: rsync pid=30137 from=10.199.146.43
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: select reported EBADF but no bad file descriptors were found
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: START: rsync pid=30138 from=10.199.146.43
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: select reported EBADF but no bad file descriptors were found
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: START: rsync pid=30162 from=10.199.146.43
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: select reported EBADF but no bad file descriptors were found
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: START: rsync pid=30175 from=10.199.146.43
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: select reported EBADF but no bad file descriptors were found
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: START: rsync pid=30176 from=10.199.146.43
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: 1 descriptors still set
Jun 21 02:44:00 szwg-m91-store-webmedia02 xinetd[9600]: 1 descriptors still set
Jun 21 02:44:00 szwg-m91-store-webmedia02 rsyslogd-2177: imuxsock begins to drop messages from pid 9600 due to rate-limiting
...
Jun 21 02:44:06 szwg-m91-store-webmedia02 rsyslogd-2177: imuxsock lost 423120 messages from pid 9600 due to rate-limiting
...
故障分析
原因:
xinetd当最大连接数到达xx数据量后,xinetd会将配置的服务停止yy秒,用于防止DDOS攻击.
“Deactivating service rsync due to excessive incoming connections. Restarting in 10 seconds.”
相关配置:
/etc/xinetd.d/rsync配置中没有定义per_source、instances参数,此时将继承defaults部分参数设置。
cat /etc/xinetd.conf
defaults
{
...
cps = 50 10
instances = 50
per_source = 10
...
cat /etc/xinetd.d/rsync
service rsync
{
disable = yes
flags = IPv6
socket_type = stream
wait = no
user = root
server = /usr/bin/rsync
server_args = --daemon
log_on_failure += USERID
}
参数说明:
cps :用来设定连接速率。它需要两个参数,第一个参数表示每秒可以处理的连接数,如果超过了这个连接数时,之后进入的连接将被暂时停止处理;第二个参数表示停止处理多少秒后,继续处理先前暂停处理的连接
per_source:参数值可以为整数或者UNLIMITED关键字。它表示每一个IP地址上最多可以建立的实例数目。本属性也可以定义在defaults部分
instances :接受一个大于或等于1的整数或UNLIMITED。设置可同时运行的最大进程数。UNLIMITED意味着xinetd对该数没有限制
故障修复
不限制per_source和instances
调整配置:
cat /etc/xinetd.d/rsync
service rsync
{
per_source = UNLIMITED
instances = UNLIMITED
disable = no
flags = IPv6
socket_type = stream
wait = no
user = root
server = /usr/bin/rsync
server_args = --daemon
log_on_failure += USERID
}
参考资料
http://server.it168.com/a2009/0624/594/000000594998_2.shtml
http://blog.sina.com.cn/s/blog_5d3da3280100btjh.html
http://blog.chinaunix.net/uid-20269419-id-4192611.html