What Happens If a Network Failure Occurs During a MySQL Transaction?
The following article is from yangyidba Author yangyidba
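1 Introduction
When operating MySQL, we inevitably run into situations in which the session between an application and MySQL is interrupted abnormally, for example:
The application is forcibly shut down
The client machine suddenly crashes or loses power
The network jitters or a network card fails
The entire data center loses network connectivity
How does a transaction that is executing in MySQL behave in these cases?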
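2 Experiment
We designed a test case that simulates a client executing a transaction in MySQL when the client machine suddenly goes down, so that the session is interrupted abnormally.
client 192.168.56.102
MySQL 192.168.56.101
The client connects to the database and runs SELECT ... FOR UPDATE to lock a record. In practice, an UPDATE or DELETE would lock the record just as well.
Then the client machine is shut down to simulate a power loss or network outage.
At this point, on the server side the connection is still in the ESTABLISHED state at the network layer, and the transaction in the database is still in the running state.
Opening another session and trying to lock rows in table t1 results in a lock wait, which shows that the transaction is still active after the network was cut.
That concludes the demonstration. Next, let's analyze why the transaction did not exit after the network was disconnected.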
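3 Analysis
3.1 Why doesn't the server end the transaction?
An ordinary MySQL session connection has no application-level keepalive: no heartbeat is exchanged, and no socket option is set to shorten failure detection. If the network connection is broken abnormally, the server cannot detect the failure promptly. Going one step further, consider TCP's four-way connection teardown.
When the network fails, the TCP connection remains ESTABLISHED, which means neither the server nor the client has sent a FIN packet. The server is still waiting for the client to send data, so the MySQL transaction cannot simply exit.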
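3.2 How the transaction is handled after the network is disconnected
A statement is still executing
If a statement in the transaction is still executing when the network goes down, it is rolled back once the statement finishes executing. Because the execution status packet cannot be delivered to the client, the server detects the network disconnection as an error. See Stack 1 for the debug stack trace.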
if (thd->is_error() || (thd->variables.option_bits & OPTION_MASTER_SQL_ERROR))
trans_rollback_stmt(thd);
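Statements have finished but the transaction is not committed
If the SQL statements in the transaction have finished executing but have not been committed when the network goes down, the transaction remains on the server until it is killed manually or a timeout expires (see 3.3). The connection path from client to server is:
socket->listen->poll(socketfd)->accept->newthread->poll(newfd,wait_timeout)
Once new data arrives, any read or write that is blocked by a network problem still waits in poll until it times out. The parameters read_timeout/write_timeout apply to reading and writing network data: if the network is unavailable, the session is kept for as long as the server waits for the network to become usable again. Both wait_timeout and read_timeout/write_timeout are implemented through the poll timeout. See Stack 2, where the vio_io_wait function handles the various timeouts, mainly by calling poll.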
net_write_packet
->net_write_raw_loop (packet size 16M)
->vio_write
->mysql_socket_send (if sending fails; inline_mysql_socket_send calls send)
->on SOCKET_EAGAIN, vio_socket_io_wait decides how long to wait
/* Wait for the output buffer to become writable.*/
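The timeout-driven wait described above can be illustrated outside MySQL. The following is a minimal Python sketch, not MySQL's implementation: `wait_for_io` is a hypothetical helper that plays the role the article attributes to vio_io_wait, blocking in poll until the socket is ready or the timeout expires. A local socket pair stands in for the client connection, and the 200 ms timeout is an arbitrary value chosen for the demo.

```python
import select
import socket


def wait_for_io(sock, event, timeout_ms):
    """Block in poll() until `sock` is ready for `event` or the timeout expires.

    Returns True if the socket became ready, False if the wait timed out.
    """
    poller = select.poll()
    poller.register(sock.fileno(), event)
    ready = poller.poll(timeout_ms)  # empty list means the timeout expired
    return bool(ready)


# A connected pair of local sockets stands in for server and client.
server_side, client_side = socket.socketpair()

# The "client" is silent, so the server-side wait ends only when the timeout expires.
timed_out = not wait_for_io(server_side, select.POLLIN, 200)

# Once the "client" sends data, the same wait returns immediately.
client_side.sendall(b"COMMIT")
readable = wait_for_io(server_side, select.POLLIN, 200)

server_side.close()
client_side.close()

print("timed out while peer was silent:", timed_out)
print("readable after peer sent data:", readable)
```

The point of the sketch is that a silent peer and a vanished peer look the same to the waiting side: in both cases nothing arrives, and only the timeout ends the wait.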
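3.3 When does the session exit?
This depends on the situation.
Idle connection
In this case the session lifetime is determined by the MySQL parameter wait_timeout, which defaults to 8 hours: a session that stays idle for more than 8 hours is closed automatically by MySQL. For details, see the following article:
浅析interactive_timeout和wait_timeout (A Brief Analysis of interactive_timeout and wait_timeout)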
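Waiting for the TCP timeout
By default the session is kept for 2 hours + 9 probes * 75 seconds, about 2 hours and 11 minutes (determined by the TCP keepalive settings):
/proc/sys/net/ipv4/tcp_keepalive_time = 7200 (idle time before the first probe, in seconds)
/proc/sys/net/ipv4/tcp_keepalive_intvl = 75 (interval between probes, in seconds)
/proc/sys/net/ipv4/tcp_keepalive_probes = 9 (number of probes)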
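As a worked example of the arithmetic, the sketch below adds the idle time to the probing time using the three values listed above. The function name is hypothetical, and the model (one idle period followed by `probes` unanswered probes spaced `interval` seconds apart) is a simplification of what the kernel does.

```python
def keepalive_detection_seconds(idle=7200, interval=75, probes=9):
    """Approximate time until TCP keepalive gives up on a dead peer.

    idle     -- tcp_keepalive_time: idle seconds before the first probe
    interval -- tcp_keepalive_intvl: seconds between probes
    probes   -- tcp_keepalive_probes: unanswered probes before giving up
    """
    return idle + interval * probes


total = keepalive_detection_seconds()
hours, remainder = divmod(total, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{total} s = {hours} h {minutes} min {seconds} s")
# prints: 7875 s = 2 h 11 min 15 s
```

Lowering any of the three values shortens the window during which a dead client can keep its locks, at the cost of more probe traffic on idle connections.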
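Actively killing the abnormal session
kill $thread_id;
We once experienced a data-center-wide network outage. After the applications reconnected, they ran into a large number of SQL lock waits, so we wrote a script that periodically kills transactions that have stayed open for a long time. It is provided for reference only.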
#!/bin/bash
# For system operations use: kill transactions whose session has been idle for 60 s or more.
Kill_SQL="select concat('kill ', trx_mysql_thread_id ,' ;') from information_schema.processlist a,information_schema.INNODB_TRX b where a.ID=b.trx_mysql_thread_id and trx_started <=SUBDATE(now(),interval 60 second) and STATE='' and COMMAND='Sleep' and TIME>=60"
LOGFILE='/data/logs/zandb_agent/kill_long_trx.log'
run_user=$(whoami)
# Append a timestamped info message to the log file.
ret_log()
{
    msg="$1"
    echo "$(date +%Y%m%d_%H:%M:%S) [info]  $msg" >> "${LOGFILE}"
}
# Use the port passed as the first argument; otherwise discover the ports of all local mysqld processes.
if [ $# -lt 1 ]; then
    ports=$(ps -ef | grep mysqld | grep -v mysqld_safe | grep port= | awk -F"port=" '{print $NF}' | awk '{print $1}' | sort)
else
    ports=$1
fi
for port in ${ports};
do
    # A lock file means this instance should be skipped.
    if [ -f /tmp/long_trx_${port}.lock ]; then
        msg="there is a lock file, so we skip instance ${port}.."
        ret_log "${msg}"
        continue
    fi
    if [ -S /srv/my${port}/run/mysql.sock ]; then
        SOCKET="/srv/my${port}/run/mysql.sock"
    else
        msg="socket file does not exist for instance ${port}, please check."
        ret_log "${msg}"
        # Without a socket there is nothing to connect to, so skip this instance.
        continue
    fi
    MYSQL="/opt/mysql/bin/mysql -uroot -S ${SOCKET} "
    # Generate the kill statements into a per-port, per-user file.
    ${MYSQL} --skip-column-names -e "$Kill_SQL" > /tmp/kill_trx_${port}.sql.${run_user} 2>/tmp/kill_trx_${port}.log.${run_user}
    num=$(grep -c kill /tmp/kill_trx_${port}.sql.${run_user})
    if [[ ${num} -gt 0 ]]; then
        msg="${port} ${num} long trx was killed "
        ret_log "${msg}"
        ${MYSQL} -e "source /tmp/kill_trx_${port}.sql.${run_user}"
    fi
done
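The selection rule encoded in Kill_SQL can be read as a simple predicate: the session has an open transaction older than 60 seconds, is in the Sleep command with an empty state, and has been idle for at least 60 seconds. The sketch below restates that rule in Python on in-memory sample data. The `Session` record and `build_kill_statements` function are hypothetical names introduced for illustration; the code does not connect to MySQL, and the sample rows are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    thread_id: int
    command: str                     # processlist COMMAND, e.g. "Sleep" or "Query"
    state: str                       # processlist STATE
    idle_seconds: int                # processlist TIME
    trx_age_seconds: Optional[int]   # None when the session has no open transaction


def build_kill_statements(sessions: List[Session], threshold: int = 60) -> List[str]:
    """Return kill statements for sessions holding an idle, long-open transaction."""
    statements = []
    for s in sessions:
        has_old_trx = s.trx_age_seconds is not None and s.trx_age_seconds >= threshold
        is_idle = s.command == "Sleep" and s.state == "" and s.idle_seconds >= threshold
        if has_old_trx and is_idle:
            statements.append(f"kill {s.thread_id};")
    return statements


sample = [
    Session(11, "Sleep", "", 3600, 3600),           # idle with an open transaction
    Session(12, "Query", "Sending data", 90, 120),  # still executing a statement
    Session(13, "Sleep", "", 7200, None),           # idle but holds no transaction
    Session(14, "Sleep", "", 5, 10),                # transaction too young
]
print(build_kill_statements(sample))
# prints: ['kill 11;']
```

Only the first sample session matches, which is the intended behavior: sessions that are actively running a statement, or that are idle without holding a transaction, are left alone.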
Stack 1: rollback stack
(gdb) bt
#0 innobase_rollback (hton=0x2e12440, thd=0x7ffefc000950, rollback_trx=false) at /home/mysql/soft/percona-server-5.7.29-32/storage/innobase/handler/ha_innodb.cc:5452
#1 0x0000000000ea6ab8 in ha_rollback_low (thd=0x7ffefc000950, all=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/handler.cc:2019
#2 0x00000000017f0f23 in MYSQL_BIN_LOG::rollback (this=0x2d668a0 <mysql_bin_log>, thd=0x7ffefc000950, all=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/binlog.cc:2532
#3 0x0000000000ea6d40 in ha_rollback_trans (thd=0x7ffefc000950, all=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/handler.cc:2106
#4 0x00000000015c6a13 in trans_rollback_stmt (thd=0x7ffefc000950) at /home/mysql/soft/percona-server-5.7.29-32/sql/transaction.cc:515
#5 0x00000000014c08de in mysql_execute_command (thd=0x7ffefc000950, first_level=true) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:5325
#6 0x00000000014c2025 in mysql_parse (thd=0x7ffefc000950, parser_state=0x7fffe88824a0, update_userstat=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:5927
#7 0x00000000014b6c5f in dispatch_command (thd=0x7ffefc000950, com_data=0x7fffe8882c90, command=COM_QUERY) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:1539
#8 0x00000000014b5a94 in do_command (thd=0x7ffefc000950) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:1060
#9 0x00000000015e9d32 in handle_connection (arg=0x3c09eb0) at /home/mysql/soft/percona-server-5.7.29-32/sql/conn_handler/connection_handler_per_thread.cc:325
#10 0x00000000018b97f2 in pfs_spawn_thread (arg=0x3b784b0) at /home/mysql/soft/percona-server-5.7.29-32/storage/perfschema/pfs.cc:2198
#11 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff5fa08dd in clone () from /lib64/libc.so.6
Stack 2: write wait with timeout=60000, i.e. the default write_timeout of 60 s
#0 0x00007ffff5f95c3d in poll () from /lib64/libc.so.6
#1 0x0000000001da0e3b in vio_io_wait (vio=0x7ffefc005690, event=VIO_IO_EVENT_WRITE, timeout=60000) at /home/mysql/soft/percona-server-5.7.29-32/vio/viosocket.c:1173
#2 0x0000000001d9f352 in vio_socket_io_wait (vio=0x7ffefc005690, event=VIO_IO_EVENT_WRITE) at /home/mysql/soft/percona-server-5.7.29-32/vio/viosocket.c:127
#3 0x0000000001d9f74a in vio_write (vio=0x7ffefc005690, buf=0x7ffefc013838 "\001\061\001\060\004", size=12872) at /home/mysql/soft/percona-server-5.7.29-32/vio/viosocket.c:260
#4 0x00000000016f92a3 in net_write_raw_loop (net=0x7ffefc002528, buf=0x7ffefc013838 "\001\061\001\060\004", count=12872) at /home/mysql/soft/percona-server-5.7.29-32/sql/net_serv.cc:522
#5 0x00000000016f9588 in net_write_packet (net=0x7ffefc002528, packet=0x7ffefc012a80 "\001\061\001\060\004", length=16384) at /home/mysql/soft/percona-server-5.7.29-32/sql/net_serv.cc:661
#6 0x00000000016f9177 in net_write_buff (net=0x7ffefc002528, packet=0x7ffefc936f70 "\001\061\001\060", len=4) at /home/mysql/soft/percona-server-5.7.29-32/sql/net_serv.cc:474
#7 0x00000000016f8e1c in my_net_write (net=0x7ffefc002528, packet=0x7ffefc936f70 "\001\061\001\060", len=4) at /home/mysql/soft/percona-server-5.7.29-32/sql/net_serv.cc:347
#8 0x0000000001745635 in Protocol_classic::end_row (this=0x7ffefc001c68) at /home/mysql/soft/percona-server-5.7.29-32/sql/protocol_classic.cc:1204
#9 0x00000000014571d7 in Query_result_send::send_data (this=0x7ffefc0097a0, items=...) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_class.cc:2937
#10 0x0000000001477a0d in end_send (join=0x7ffefc009a70, qep_tab=0x7ffefc93b290, end_of_records=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_executor.cc:2946
#11 0x000000000147461b in evaluate_join_record (join=0x7ffefc009a70, qep_tab=0x7ffefc93b118) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_executor.cc:1652
#12 0x0000000001473a5b in sub_select (join=0x7ffefc009a70, qep_tab=0x7ffefc93b118, end_of_records=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_executor.cc:1304
#13 0x00000000014732dc in do_select (join=0x7ffefc009a70) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_executor.cc:957
#14 0x0000000001471243 in JOIN::exec (this=0x7ffefc009a70) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_executor.cc:206
#15 0x000000000150d2d5 in handle_query (thd=0x7ffefc000950, lex=0x7ffefc003000, result=0x7ffefc0097a0, added_options=0, removed_options=0)
at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_select.cc:192
#16 0x00000000014c1097 in execute_sqlcom_select (thd=0x7ffefc000950, all_tables=0x7ffefc009160) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:5490
#17 0x00000000014ba323 in mysql_execute_command (thd=0x7ffefc000950, first_level=true) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:3016
#18 0x00000000014c2025 in mysql_parse (thd=0x7ffefc000950, parser_state=0x7fffe88824a0, update_userstat=false) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:5927
#19 0x00000000014b6c5f in dispatch_command (thd=0x7ffefc000950, com_data=0x7fffe8882c90, command=COM_QUERY) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:1539
#20 0x00000000014b5a94 in do_command (thd=0x7ffefc000950) at /home/mysql/soft/percona-server-5.7.29-32/sql/sql_parse.cc:1060
#21 0x00000000015e9d32 in handle_connection (arg=0x3c09eb0) at /home/mysql/soft/percona-server-5.7.29-32/sql/conn_handler/connection_handler_per_thread.cc:325
#22 0x00000000018b97f2 in pfs_spawn_thread (arg=0x3b784b0) at /home/mysql/soft/percona-server-5.7.29-32/storage/perfschema/pfs.cc:2198
#23 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007ffff5fa08dd in clone () from /lib64/libc.so.6
(gdb)
Reference
https://www.jianshu.com/p/735fdf9673e4