引言

本文介绍一个主从 GTID 不一致导致 MHA 切换失败的案例，并且在主从切换过程中发生复制延迟切换。

为解释主从切换异常的中间状态的原因，以及发生延迟切换的原因，分别介绍了主从切换的流程与从库延迟切换的触发条件。

注意本次切换为 swtichover，不是 failover。

现象

服务器由于硬件故障进行手动主从切换时失败，最终主从都有复制关系，并且复制都停止。

经分析，该集群的复制基于 GTID 搭建，因此 MHA 切换时新从也是基于 GTID 搭建，切换过程中会检测新从的复制状态。由于主从 GTID 不一致导致复制停止，因此 MHA 报错切换失败。

也就是说切换失败的原因是从库复制停止，从库复制停止的原因是主从 GTID 不一致。

处理方式是将新从的复制由 GTID 改为位点模式后复制开始正常运行。

过程

切换前准备

VIP

分别查看主从 VIP 的 vrrp 进程，显示启动时间分别为 2017 与 2021 年。

# x.x.x.127
[2023/1/16 12:20:28] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:20:28] root      20762  20732  0 12:20 pts/1    00:00:00 grep vrrp
[2023/1/16 12:20:28] root      44237      1  0  2017 ?        05:05:19 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 127 -p 100 x.x.x.34

# x.x.x.128
[2023/1/16 12:20:50] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:20:50] root       3921   3893  0 12:20 pts/1    00:00:00 grep vrrp
[2023/1/16 12:20:50] root      43313      1  0  2021 ?        01:48:57 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 35 -p 100 x.x.x.35

因此切换前主从 IP 与 VIP 的绑定关系见下表。

	IP	VIP
主	x.x.x.127	x.x.x.34
从	x.x.x.128	x.x.x.35

显示主 VIP 绑定在主 IP，从 VIP 绑定在从 IP。

复制

从库，查看当前复制状态，显示 IO 线程与 SQL 线程运行正常，主从延迟等于 0。

[2023/1/16 12:20:56] MySQL> SHOW SLAVE STATUS G
[2023/1/16 12:20:56] *************************** 1. row ***************************
[2023/1/16 12:20:56]                Slave_IO_State: Waiting for master to send event
[2023/1/16 12:20:56]                   Master_Host: x.x.x.127
[2023/1/16 12:20:56]                   Master_User: wms_rep
[2023/1/16 12:20:56]                   Master_Port: 3358
[2023/1/16 12:20:56]                 Connect_Retry: 60
[2023/1/16 12:20:56]               Master_Log_File: mysql-bin.018127
[2023/1/16 12:20:56]           Read_Master_Log_Pos: 304231357
[2023/1/16 12:20:56]                Relay_Log_File: relay-log.054380
[2023/1/16 12:20:56]                 Relay_Log_Pos: 304229802
[2023/1/16 12:20:56]         Relay_Master_Log_File: mysql-bin.018127
[2023/1/16 12:20:56]              Slave_IO_Running: Yes
[2023/1/16 12:20:56]             Slave_SQL_Running: Yes
[2023/1/16 12:20:56]               Replicate_Do_DB: 
[2023/1/16 12:20:56]           Replicate_Ignore_DB: test
[2023/1/16 12:20:56]            Replicate_Do_Table: 
[2023/1/16 12:20:56]        Replicate_Ignore_Table: 
[2023/1/16 12:20:56]       Replicate_Wild_Do_Table: 
[2023/1/16 12:20:56]   Replicate_Wild_Ignore_Table: 
[2023/1/16 12:20:56]                    Last_Errno: 0
[2023/1/16 12:20:56]                    Last_Error: 
[2023/1/16 12:20:56]                  Skip_Counter: 0
[2023/1/16 12:20:56]           Exec_Master_Log_Pos: 304229629
[2023/1/16 12:20:56]               Relay_Log_Space: 304230681
[2023/1/16 12:20:56]               Until_Condition: None
[2023/1/16 12:20:56]                Until_Log_File: 
[2023/1/16 12:20:56]                 Until_Log_Pos: 0
[2023/1/16 12:20:56]            Master_SSL_Allowed: No
[2023/1/16 12:20:56]            Master_SSL_CA_File: 
[2023/1/16 12:20:56]            Master_SSL_CA_Path: 
[2023/1/16 12:20:56]               Master_SSL_Cert: 
[2023/1/16 12:20:56]             Master_SSL_Cipher: 
[2023/1/16 12:20:56]                Master_SSL_Key: 
[2023/1/16 12:20:56]         Seconds_Behind_Master: 0
[2023/1/16 12:20:56] Master_SSL_Verify_Server_Cert: No
[2023/1/16 12:20:56]                 Last_IO_Errno: 0
[2023/1/16 12:20:56]                 Last_IO_Error: 
[2023/1/16 12:20:56]                Last_SQL_Errno: 0
[2023/1/16 12:20:56]                Last_SQL_Error: 
[2023/1/16 12:20:56]   Replicate_Ignore_Server_Ids: 
[2023/1/16 12:20:56]              Master_Server_Id: 14364127
[2023/1/16 12:20:56]                   Master_UUID: 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9
[2023/1/16 12:20:56]              Master_Info_File: mysql.slave_master_info
[2023/1/16 12:20:56]                     SQL_Delay: 0
[2023/1/16 12:20:56]           SQL_Remaining_Delay: NULL
[2023/1/16 12:20:56]       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
[2023/1/16 12:20:56]            Master_Retry_Count: 86400
[2023/1/16 12:20:56]                   Master_Bind: 
[2023/1/16 12:20:56]       Last_IO_Error_Timestamp: 
[2023/1/16 12:20:56]      Last_SQL_Error_Timestamp: 
[2023/1/16 12:20:56]                Master_SSL_Crl: 
[2023/1/16 12:20:56]            Master_SSL_Crlpath: 
[2023/1/16 12:20:56]            Retrieved_Gtid_Set: 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:3817208629-7191310254
[2023/1/16 12:20:56]             Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553773:771884-1101141,
[2023/1/16 12:20:56] 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310254
[2023/1/16 12:20:56]                 Auto_Position: 1
[2023/1/16 12:20:56]          Replicate_Rewrite_DB: 
[2023/1/16 12:20:56]                  Channel_Name: 
[2023/1/16 12:20:56]            Master_TLS_Version: 
[2023/1/16 12:20:56] 1 row in set (0.00 sec)

其中：

Master_User: wms_rep
Auto_Position: 1，表示已开启 GTID；
GTID，仅供参考，原因是在切换前的查询结果。

切换异常

2023/1/16 12:21:07 开始切换，主从切换程序阻塞。查看切换进度，显示主从都有复制关系，并且复制都停止，因此切换异常，下面介绍详情。

新主

查看复制关系，显示新主切换后还是从库。

[2023/1/16 12:22:17] mysql> SHOW SLAVE STATUS G
[2023/1/16 12:22:17] *************************** 1. row ***************************
[2023/1/16 12:22:17]                Slave_IO_State: Waiting for master to send event
[2023/1/16 12:22:17]                   Master_Host: x.x.x.127
[2023/1/16 12:22:17]                   Master_User: wms_rep
[2023/1/16 12:22:17]                   Master_Port: 3358
[2023/1/16 12:22:17]                 Connect_Retry: 60
[2023/1/16 12:22:17]               Master_Log_File: mysql-bin.018127
[2023/1/16 12:22:17]           Read_Master_Log_Pos: 305455668
[2023/1/16 12:22:17]                Relay_Log_File: relay-log.054380
[2023/1/16 12:22:17]                 Relay_Log_Pos: 305455841
[2023/1/16 12:22:17]         Relay_Master_Log_File: mysql-bin.018127
[2023/1/16 12:22:17]              Slave_IO_Running: Yes
[2023/1/16 12:22:17]             Slave_SQL_Running: No
[2023/1/16 12:22:17]               Replicate_Do_DB: 
[2023/1/16 12:22:17]           Replicate_Ignore_DB: test
[2023/1/16 12:22:17]            Replicate_Do_Table: 
[2023/1/16 12:22:17]        Replicate_Ignore_Table: 
[2023/1/16 12:22:17]       Replicate_Wild_Do_Table: 
[2023/1/16 12:22:17]   Replicate_Wild_Ignore_Table: 
[2023/1/16 12:22:17]                    Last_Errno: 0
[2023/1/16 12:22:17]                    Last_Error: 
[2023/1/16 12:22:17]                  Skip_Counter: 0
[2023/1/16 12:22:17]           Exec_Master_Log_Pos: 305455668
[2023/1/16 12:22:17]               Relay_Log_Space: 305454992
[2023/1/16 12:22:17]               Until_Condition: None
[2023/1/16 12:22:17]                Until_Log_File: 
[2023/1/16 12:22:17]                 Until_Log_Pos: 0
[2023/1/16 12:22:17]            Master_SSL_Allowed: No
[2023/1/16 12:22:17]            Master_SSL_CA_File: 
[2023/1/16 12:22:17]            Master_SSL_CA_Path: 
[2023/1/16 12:22:17]               Master_SSL_Cert: 
[2023/1/16 12:22:17]             Master_SSL_Cipher: 
[2023/1/16 12:22:17]                Master_SSL_Key: 
[2023/1/16 12:22:17]         Seconds_Behind_Master: NULL
[2023/1/16 12:22:17] Master_SSL_Verify_Server_Cert: No
[2023/1/16 12:22:17]                 Last_IO_Errno: 0
[2023/1/16 12:22:17]                 Last_IO_Error: 
[2023/1/16 12:22:17]                Last_SQL_Errno: 0
[2023/1/16 12:22:17]                Last_SQL_Error: 
[2023/1/16 12:22:17]   Replicate_Ignore_Server_Ids: 
[2023/1/16 12:22:17]              Master_Server_Id: 14364127
[2023/1/16 12:22:17]                   Master_UUID: 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9
[2023/1/16 12:22:17]              Master_Info_File: mysql.slave_master_info
[2023/1/16 12:22:17]                     SQL_Delay: 0
[2023/1/16 12:22:17]           SQL_Remaining_Delay: NULL
[2023/1/16 12:22:17]       Slave_SQL_Running_State: 
[2023/1/16 12:22:17]            Master_Retry_Count: 86400
[2023/1/16 12:22:17]                   Master_Bind: 
[2023/1/16 12:22:17]       Last_IO_Error_Timestamp: 
[2023/1/16 12:22:17]      Last_SQL_Error_Timestamp: 
[2023/1/16 12:22:17]                Master_SSL_Crl: 
[2023/1/16 12:22:17]            Master_SSL_Crlpath: 
[2023/1/16 12:22:17]            Retrieved_Gtid_Set: 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:3817208629-7191310859
[2023/1/16 12:22:17]             Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553999:771884-1101141,
[2023/1/16 12:22:17] 1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859
[2023/1/16 12:22:17]                 Auto_Position: 1
[2023/1/16 12:22:17]          Replicate_Rewrite_DB: 
[2023/1/16 12:22:17]                  Channel_Name: 
[2023/1/16 12:22:17]            Master_TLS_Version: 
[2023/1/16 12:22:17] 1 row in set (0.00 sec)

其中：

Slave_IO_Running: Yes
Slave_SQL_Running: No，表明 SQL 线程停止，但是复制没有报错；
Seconds_Behind_Master: NULL

这里有两个问题：

为什么新主 SQL 线程停止？
为什么新主复制关系没有被清除？

查看错误日志，显示 12:21:11 SQL 线程在读取 relay log 时报错停止。

2023-01-16T12:21:11.864092+08:00 11 [Note] Error reading relay log event for channel '': slave SQL thread was killed

分别列出主从的 server_id 与 uuid。

	server_id	uuid
127，老主	14364127	1a7d571b-d996-11e7-962d-ecf4bbc4fdf9
128，新主	14364128	13351c5e-d8dd-11e7-9704-b8ca3a5c1739

对比新主 12:20:56 与 12:22:17 的 GTID，即 Executed_Gtid_Set 参数。

# 12:20:56
Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553773:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310254

# 12:22:17
Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553999:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859

其中：

在切换前老主有写入并同步到了从库，具体是从 7191310254 写到了 7191310859；
在切换后新主也有写入，具体是从 771884 写到了 553999。

查看只读参数，显示新主已关闭只读，因此业务账号可以写入新主。

[2023/1/16 12:29:00] mysql> select @@read_only;
[2023/1/16 12:29:00] +-------------+
[2023/1/16 12:29:00] | @@read_only |
[2023/1/16 12:29:00] +-------------+
[2023/1/16 12:29:00] |           0 |
[2023/1/16 12:29:00] +-------------+
[2023/1/16 12:29:00] 1 row in set (0.00 sec)

查看 VIP，显示新主没有绑定 VIP，主从 VIP 都没有绑定。

[2023/1/16 12:24:46] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:24:46] root       4535   3893  0 12:24 pts/1    00:00:00 grep vrrp

新主切换前有从 VIP，主从切换后理论上主从 VIP 都绑定在新主，因此新主切换异常。

老主

查看复制关系，显示切换后老主复制停止，当然实际上是作为新从复制启动失败。

mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: x.x.x.128
                  Master_User: replicater
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-log.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 154
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 14364128
                  Master_UUID: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 230116 12:21:14
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

其中：

Slave_IO_Running: No，表明 IO 线程停止；
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: ‘The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.’

报错显示复制停止的原因是主库已清理从库需要的 GTID。

对比 12:22:17 老主与新主的 GTID，显示老主与新主的 GTID 不一致，进而导致复制线程启动失败。

# 新主，12:22:17
Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553999:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859

# 老主，12:22:17
Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859

查看 VIP，显示老主上已绑定从 VIP。

[2023/1/16 12:21:35] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:21:36] root      20912      1  0 12:21 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 35 -p 150 x.x.x.35
[2023/1/16 12:21:36] root      20960  20732  0 12:21 pts/1    00:00:00 grep vrrp

老主切换前有主 VIP，主从切换后理论上主从 VIP 都绑定在新主，老主没有绑定 VIP，因此老主也切换异常。

查看只读参数，显示老主只读开启。

[2023/1/16 12:26:20] mysql> select @@read_only;
[2023/1/16 12:26:20] +-------------+
[2023/1/16 12:26:20] | @@read_only |
[2023/1/16 12:26:20] +-------------+
[2023/1/16 12:26:20] |           1 |
[2023/1/16 12:26:20] +-------------+
[2023/1/16 12:26:20] 1 row in set (0.00 sec)

因此，切换异常主要表现为：

老主新主都有复制关系，而且复制停止；

新主，IO 线程正常，SQL 线程停止，复制未报错；
老主，IO 线程停止，SQL 线程正常，复制报错。

老主新主的 VIP 绑定关系均异常，其中：

新主，切换前有从 VIP，主从切换后理论上主从 VIP 都绑定在新主，实际上切换后没有绑定 VIP；
老主，切换前有主 VIP，主从切换后理论上主从 VIP 都绑定在新主，实际上切换后绑定从 VIP。

下面将详细分析原因。

首先经查看切换记录，老主绑定从 VIP 的原因是在主从切换开始后完成前再次触发器从库的延迟切换。

从库延迟切换

2023-01-16 12:21:07，手动主从切换，理论上会将主 VIP 切换到从库；

2023-01-16 12:21:35，自动从库延迟切换，理论上会将从 VIP 切换到主库。

MySQL 新从复制停止导致 MHA 切换失败 — image-20230227103538027

根据切换日志，在主从切换开始后完成前由于触发器了从库的延迟切换，因此将从 VIP 绑定到了老主。

从库延迟(DB/x.x.x.128)，发生切换！
Vip:[u'x.x.x.35']已切换至x.x.x.127,切换成功 ！
Vip在原服务器x.x.x.128 上停止 成功 ！

那么从库延迟的判断条件又是什么呢？

处理

新主开启只读

新主，为防止切换异常导致双写，因此手动开启只读。

[2023/1/16 12:29:04] mysql> set global read_only=1;
[2023/1/16 12:29:04] Query OK, 0 rows affected (0.00 sec)

新主挂载 VIP

新主，手动挂载主从 VIP。

[2023/1/16 12:29:59] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 34 -p 150 x.x.x.34
[2023/1/16 12:32:22] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 35 -p 150 x.x.x.35

检查 VIP，显示主从 VIP 已绑定到新主，执行ip a命令的结果相同，因此不再展示。

[2023/1/16 12:32:33] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:32:33] root       5347      1  0 12:29 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 34 -p 150 x.x.x.34
[2023/1/16 12:32:33] root       5798      1  0 12:32 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 35 -p 150 x.x.x.35
[2023/1/16 12:32:33] root       6004   5811  0 12:32 pts/4    00:00:00 grep vrrp

但是过段时间再次查看发现绑定的 vip 被摘除。

[2023/1/16 12:39:03] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:39:03] root       6949   5811  0 12:39 pts/4    00:00:00 grep vrrp

为确认 vip 挂载与摘除的真实记录查看系统日志。

messages

查看系统日志 /var/log/messages。

[2023/1/16 12:39:34] [root@DB ~]# ll -h /var/log/messages
[2023/1/16 12:39:34] -rw------- 1 root root 36M Jan 16 12:39 /var/log/messages

其中：

12:21:17，主 VIP 挂载到新主，原因是主从切换；
12:21:37，主从 VIP 从新主摘除，原因是从库延迟切换；
12:30:02、12:32:25，主从 VIP 挂载到新主，原因是手动挂载。

# 主 VIP 挂载到新主，原因是主从切换
367641 Jan 16 12:21:13 DB vrrpd: vrrpd version 0.4 starting...
367642 Jan 16 12:21:13 DB vrrpd: VRRP ID 34 on eth1: we are now a backup router.
367643 Jan 16 12:21:17 DB vrrpd: VRRP ID 34 on eth1: we are now the master router.
367644 Jan 16 12:21:19 DB ntpd[61975]: Listen normally on 7 eth1 x.x.x.34 UDP 123
367645 Jan 16 12:21:19 DB ntpd[61975]: peers refreshed

# 主从 VIP 从新主摘除，原因是延迟切换
367789 Jan 16 12:21:37 DB ntpd[61975]: Deleting interface #7 eth1, x.x.x.34#123, interface stats: received=0, sent=0, dropp       ed=0, active_time=18 secs
367790 Jan 16 12:21:37 DB ntpd[61975]: Deleting interface #4 eth1, x.x.x.35#123, interface stats: received=0, sent=0, dropp       ed=0, active_time=8128517 secs
367791 Jan 16 12:21:37 DB ntpd[61975]: peers refreshed

# 手动挂载主从 VIP
369382 Jan 16 12:29:59 DB vrrpd: vrrpd version 0.4 starting...
369383 Jan 16 12:29:59 DB vrrpd: VRRP ID 34 on eth1: we are now a backup router.
369384 Jan 16 12:30:02 DB vrrpd: VRRP ID 34 on eth1: we are now the master router.
369385 Jan 16 12:30:04 DB ntpd[61975]: Listen normally on 8 eth1 x.x.x.34 UDP 123
369386 Jan 16 12:30:04 DB ntpd[61975]: peers refreshed
369772 Jan 16 12:32:22 DB vrrpd: vrrpd version 0.4 starting...
369773 Jan 16 12:32:22 DB vrrpd: VRRP ID 35 on eth1: we are now a backup router.
369774 Jan 16 12:32:25 DB vrrpd: VRRP ID 35 on eth1: we are now the master router.
369775 Jan 16 12:32:27 DB ntpd[61975]: Listen normally on 9 eth1 x.x.x.35 UDP 123
369776 Jan 16 12:32:27 DB ntpd[61975]: peers refreshed

# 主从 VIP 从新主摘除，原因是延迟切换
370354 Jan 16 12:35:07 DB ntpd[61975]: Deleting interface #9 eth1, x.x.x.35#123, interface stats: received=0, sent=0, dropp       ed=0, active_time=160 secs
370355 Jan 16 12:35:07 DB ntpd[61975]: Deleting interface #8 eth1, x.x.x.34#123, interface stats: received=0, sent=0, dropp       ed=0, active_time=303 secs
370356 Jan 16 12:35:07 DB ntpd[61975]: peers refreshed

# 手动挂载主从 VIP
372667 Jan 16 12:47:04 DB vrrpd: vrrpd version 0.4 starting...
372668 Jan 16 12:47:04 DB vrrpd: VRRP ID 35 on eth1: we are now a backup router.
372669 Jan 16 12:47:08 DB vrrpd: VRRP ID 35 on eth1: we are now the master router.
372670 Jan 16 12:47:10 DB ntpd[61975]: Listen normally on 10 eth1 x.x.x.35 UDP 123
372671 Jan 16 12:47:10 DB ntpd[61975]: peers refreshed
372672 Jan 16 12:47:17 DB vrrpd: vrrpd version 0.4 starting...
372673 Jan 16 12:47:17 DB vrrpd: VRRP ID 34 on eth1: we are now a backup router.
372674 Jan 16 12:47:20 DB vrrpd: VRRP ID 34 on eth1: we are now the master router.
372675 Jan 16 12:47:22 DB ntpd[61975]: Listen normally on 11 eth1 x.x.x.34 UDP 123
372676 Jan 16 12:47:22 DB ntpd[61975]: peers refreshed

注意其中显示 12:35:07 主从 VIP 再次从新主摘除，实际上是由于再次触发从库延迟切换。

kill VIP

再次查看老主，发现主从 VIP 已绑定到老主，进程的启动时间为 12:35。

[2023/1/16 12:46:36] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:46:36] root      22732      1  0 12:35 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 34 -p 150 x.x.x.34
[2023/1/16 12:46:36] root      22793      1  0 12:35 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 35 -p 150 x.x.x.35
[2023/1/16 12:46:36] root      24297  20732  0 12:46 pts/1    00:00:00 grep vrrp

查看切换任务，发现又一次出现从库延迟切换，开始时间为 12:35:06。

由于新主已写入，并且未同步至从库，因此不建议回滚切换流程，手动将老主的 VIP 摘除。

[2023/1/16 12:46:43] [root@DB ~]# kill 22732 22793 
[2023/1/16 12:46:44] [root@DB ~]# 
[2023/1/16 12:46:45] [root@DB ~]# ps -ef|grep vrrp
[2023/1/16 12:46:45] root      24299  20732  0 12:46 pts/1    00:00:00 grep vrrp

对比主从 GTID，显示主库并从库多 553374-553999 的事务。

# 主，2023/1/16 12:22:17
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553999:771884-1101141

# 从，2023/1/16 12:22:17
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141

重新搭复制

老主，查看错误日志确认报错原因，显示从库在开启复制后报错。

2023-01-16T12:21:14.095528+08:00 63230531 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3358, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='x.x.x.128', master_port= 3358, master_log_file='', master_log_pos= 4, master_bind=''.
2023-01-16T12:21:14.150054+08:00 63230532 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2023-01-16T12:21:14.150739+08:00 63230532 [Note] Slave I/O thread for channel '': connected to master 'replicater@x.x.x.x:3358',replication started in log 'FIRST' at position 4
2023-01-16T12:21:14.158683+08:00 63230533 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './relay-log.000001' position: 4
2023-01-16T12:21:14.160553+08:00 63230532 [ERROR] Error reading packet from server for channel '': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2023-01-16T12:21:14.160596+08:00 63230532 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2023-01-16T12:21:14.160625+08:00 63230532 [Note] Slave I/O thread exiting for channel '', read up to log 'FIRST', position 4

报错原因是由于主库清理了从库需要的 binlog，导致 IO 线程启动成功，SQL 线程启动失败。

处理方法有两种：

基于 GTID 模式，清空 GTID 后根据主库 Executed_Gtid_Set 设置从库参数；
基于位点模式，清空复制后根据位点重新搭建主从。

基于位点模式搭建主从时，MHA 在 GTID 不一致的场景下也可以切换成功，因此将复制从 GTID 改为位点模式。

老主，停掉 GTID 复制，注意其中stop slave后Seconds_Behind_Master由 0 变为NULL。

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: x.x.x.128
                  Master_User: replicater
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-log.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 154
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 14364128
                  Master_UUID: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 230116 12:21:14
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> reset slave;
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: x.x.x.128
                  Master_User: replicater
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-log.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 177
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 14364128
                  Master_UUID: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

mysql> reset slave all;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW SLAVE STATUS G
Empty set (0.00 sec)

其中清理复制的流程包括：

stop slave;
reset slave;
reset slave all;

然后基于位点重新搭建复制。

切换前老主的最后一个 GTID 为 1a7d571b-d996-11e7-962d-

ecf4bbc4fdf9:7191310859，它对应的位点是多少呢？

实际上，切换日志中会在切换前分别记录主从位点。

Mon Jan 16 12:21:11 2023 - [info] Orig master binlog:pos is mysql-bin.018127:305455668.
Mon Jan 16 12:21:11 2023 - [info]  Waiting to execute all relay logs on x.x.x.128(x.x.x.128:3358)..
Mon Jan 16 12:21:11 2023 - [info]  master_pos_wait(mysql-bin.018127:305455668) completed on x.x.x.128(x.x.x.128:3358). Executed 0 events.
Mon Jan 16 12:21:11 2023 - [info]   done.
Mon Jan 16 12:21:11 2023 - [info] Getting new master's binlog name and position..
Mon Jan 16 12:21:11 2023 - [info]  mysql-bin.017995:131421710

因此，新主的起始位点为 mysql-bin.017995:131421710。然后使用位点搭复制。

change master to
master_host='x.x.x.128',
master_port=3358,
master_user='replicater',
master_password='',
master_log_file='mysql-bin.017995',
master_log_pos=131421710;

启动复制后恢复正常。

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: x.x.x.128
                  Master_User: replicater
                  Master_Port: 3358
                Connect_Retry: 60
              Master_Log_File: mysql-bin.017995
          Read_Master_Log_Pos: 263792154
               Relay_Log_File: relay-log.000002
                Relay_Log_Pos: 9654437
        Relay_Master_Log_File: mysql-bin.017995
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 141075827
              Relay_Log_Space: 132370965
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 17983
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 14364128
                  Master_UUID: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:553774-588683
            Executed_Gtid_Set: 13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:553774-556119:771884-1101141,
1a7d571b-d996-11e7-962d-ecf4bbc4fdf9:1-7191310859
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

下面分析在 GTID 不一致的场景下，MHA 切换失败的原因。

分析

online_switch.log

masterha.cnf

首先自动生成主从切换的配置文件。

2023-01-16 12:21:07,006 - root - INFO - [server default]
manager_log=/export/Logs/mysql_ha/cluster_299-202301161221.log
manager_workdir=/export/Logs/mysql_ha
master_binlog_dir=/export/data/mysql/data
master_ip_failover_script=/export/servers/masterha/mysql_ha/failover_vip.py
master_ip_online_change_script=/export/servers/masterha/mysql_ha/online_change_vip.py
report_script=/export/servers/masterha/mysql_ha/send_report.py
password=
ping_interval=1
remote_workdir=/tmp
repl_password=
repl_user=replicater
shutdown_script=""
ssh_user=root
user=mha_admin
use_gtid_auto_pos=0


[server_x.x.x.127]
hostname=x.x.x.127
port=3358


[server_x.x.x.128]
hostname=x.x.x.128
port=3358


2023-01-16 12:21:07,007 - <module> - INFO - initial mha class ok !
2023-01-16 12:21:07,016 - root - INFO - run_switch_online candidate_host

其中：

manager_log=/export/Logs/mysql_ha/cluster_299-202301161221.log
master_ip_failover_script=failover_vip.py，用于死切时启动与停止 VIP；
master_ip_online_change_script=online_change_vip.py，用于活切时启动与停止 VIP；
use_gtid_auto_pos=0，默认使用位点模式搭建复制。

switchover

开始切换，从日志中可以看到详细的切换流程。

Mon Jan 16 12:21:07 2023 - [info] Starting online master switch..
Mon Jan 16 12:21:07 2023 - [info] 
Mon Jan 16 12:21:07 2023 - [info] * Phase 1: Configuration Check Phase..
Mon Jan 16 12:21:07 2023 - [info] 
Mon Jan 16 12:21:07 2023 - [info] Reading default configuration from /etc/masterha_default.cnf..
Mon Jan 16 12:21:07 2023 - [info] Reading application default configuration from /export/servers/masterha/etc/cluster_299.conf..
Mon Jan 16 12:21:07 2023 - [info] Reading server configuration from /export/servers/masterha/etc/cluster_299.conf..
Mon Jan 16 12:21:09 2023 - [info] GTID failover mode = 1
Mon Jan 16 12:21:09 2023 - [info] Current Alive Master: x.x.x.127(x.x.x.127:3358)
Mon Jan 16 12:21:09 2023 - [info] Alive Slaves:
Mon Jan 16 12:21:09 2023 - [info]   x.x.x.128(x.x.x.128:3358)  Version=5.7.19-log (oldest major version between slaves) log-bin:enabled
Mon Jan 16 12:21:09 2023 - [info]     GTID ON
Mon Jan 16 12:21:09 2023 - [info]     Replicating from x.x.x.127(x.x.x.127:3358)
Mon Jan 16 12:21:09 2023 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Mon Jan 16 12:21:09 2023 - [info]  ok.
Mon Jan 16 12:21:09 2023 - [info] Checking MHA is not monitoring or doing failover..
Mon Jan 16 12:21:09 2023 - [info] Checking replication health on x.x.x.128..
Mon Jan 16 12:21:09 2023 - [info]  ok.
Mon Jan 16 12:21:09 2023 - [info] x.x.x.128 can be new master.
Mon Jan 16 12:21:09 2023 - [info] 
From:
x.x.x.127(x.x.x.127:3358) (current master)
 +--x.x.x.128(x.x.x.128:3358)

To:
x.x.x.128(x.x.x.128:3358) (new master)
 +--x.x.x.127(x.x.x.127:3358)
Mon Jan 16 12:21:09 2023 - [info] Checking whether x.x.x.128(x.x.x.128:3358) is ok for the new master..
Mon Jan 16 12:21:09 2023 - [info]  ok.
Mon Jan 16 12:21:09 2023 - [info] x.x.x.127(x.x.x.127:3358): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Mon Jan 16 12:21:09 2023 - [info] x.x.x.127(x.x.x.127:3358): Resetting slave pointing to the dummy host.
Mon Jan 16 12:21:09 2023 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Jan 16 12:21:09 2023 - [info] 
Mon Jan 16 12:21:09 2023 - [info] * Phase 2: Rejecting updates Phase..
Mon Jan 16 12:21:09 2023 - [info] 
Mon Jan 16 12:21:09 2023 - [info] Executing master ip online change script to disable write on the current master:
Mon Jan 16 12:21:09 2023 - [info]   /export/servers/masterha/mysql_ha/online_change_vip.py --command=stop --orig_master_host=x.x.x.127 --orig_master_ip=x.x.x.127 --orig_master_port=3358 --orig_master_user='mha_admin' --orig_master_password='' --new_master_host=x.x.x.128 --new_master_ip=x.x.x.128 --new_master_port=3358 --new_master_user='mha_admin' --new_master_password='' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave
Mon Jan 16 12:21:11 2023 - [info]  ok.
Mon Jan 16 12:21:11 2023 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Mon Jan 16 12:21:11 2023 - [info] Executing FLUSH TABLES WITH READ LOCK..
Mon Jan 16 12:21:11 2023 - [info]  ok.
Mon Jan 16 12:21:11 2023 - [info] Orig master binlog:pos is mysql-bin.018127:305455668.
Mon Jan 16 12:21:11 2023 - [info]  Waiting to execute all relay logs on x.x.x.128(x.x.x.128:3358)..
Mon Jan 16 12:21:11 2023 - [info]  master_pos_wait(mysql-bin.018127:305455668) completed on x.x.x.128(x.x.x.128:3358). Executed 0 events.
Mon Jan 16 12:21:11 2023 - [info]   done.
Mon Jan 16 12:21:11 2023 - [info] Getting new master's binlog name and position..
Mon Jan 16 12:21:11 2023 - [info]  mysql-bin.017995:131421710
Mon Jan 16 12:21:11 2023 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='x.x.x.128', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='replicater', MASTER_PASSWORD='xxx';
Mon Jan 16 12:21:11 2023 - [info] Executing master ip online change script to allow write on the new master:
Mon Jan 16 12:21:11 2023 - [info]   /export/servers/masterha/mysql_ha/online_change_vip.py --command=start --orig_master_host=x.x.x.127 --orig_master_ip=x.x.x.127 --orig_master_port=3358 --orig_master_user='mha_admin' --orig_master_password='' --new_master_host=x.x.x.128 --new_master_ip=x.x.x.128 --new_master_port=3358 --new_master_user='mha_admin' --new_master_password='' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave
Mon Jan 16 12:21:13 2023 - [info]  ok.
Mon Jan 16 12:21:13 2023 - [info] Setting read_only=0 on x.x.x.128(x.x.x.128:3358)..
Mon Jan 16 12:21:14 2023 - [info]  ok.
Mon Jan 16 12:21:14 2023 - [info] 
Mon Jan 16 12:21:14 2023 - [info] * Switching slaves in parallel..
Mon Jan 16 12:21:14 2023 - [info] 
Mon Jan 16 12:21:14 2023 - [info] Unlocking all tables on the orig master:
Mon Jan 16 12:21:14 2023 - [info] Executing UNLOCK TABLES..
Mon Jan 16 12:21:14 2023 - [info]  ok.
Mon Jan 16 12:21:14 2023 - [info] Starting orig master as a new slave..
Mon Jan 16 12:21:14 2023 - [info]  Resetting slave x.x.x.127(x.x.x.127:3358) and starting replication from the new master x.x.x.128(x.x.x.128:3358)..
Mon Jan 16 12:21:14 2023 - [info]  Executed CHANGE MASTER.

其中：

GTID failover mode = 1，表明老从使用 GTID 复制；
GTID ON，表明老从已开启 GTID；
Executing master ip online change script to disable write on the current master，老主 VIP 已摘除；
CHANGE MASTER TO MASTER_HOST=’x.x.x.128′, MASTER_PORT=3358, MASTER_AUTO_POSITION=1，老主基于 GTID 搭建复制关系；
Executing master ip online change script to allow write on the new master，新主 VIP 已绑定；
Setting read_only=0 on the new master，新主已关闭只读。

fail

主从切换报错。

Mon Jan 16 12:21:14 2023 - [info]  Executed CHANGE MASTER.
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/Server.pm, ln784] Slave could not be started on x.x.x.127(x.x.x.127:3358)! Check slave status.
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/Server.pm, ln862] Starting slave IO/SQL thread on x.x.x.127(x.x.x.127:3358) failed!
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln573]  Failed!
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln602] Switching master to x.x.x.128(x.x.x.128:3358) done, but switching slaves partially failed.

2023-01-16 12:31:41,894 - <module> - INFO - new_master_host:x.x.x.128,mha command exec result-out:
(('x.x.x.34', 'x.x.x.127'),)
(('x.x.x.34', 'x.x.x.127'),)
('eth1', '')
('x.x.x.34', 'x.x.x.127')
/export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1         -n -v 34 -p 150 x.x.x.34 >> /export/Logs/vrrpd.log 2>&1

2023-01-16 12:31:41,894 - <module> - INFO - new_master_host:x.x.x.128,mha command exec result-err:

其中：

Slave could not be started on x.x.x.127(x.x.x.127:3358)! Check slave status.，老主复制启动失败；
starting slave IO/SQL thread on x.x.x.127(x.x.x.127:3358) failed!，老主复制线程启动失败；
Switching master to x.x.x.128(x.x.x.128:3358) done, but switching slaves partially failed.

下面结合 MHA 源码分析复制报错时切换失败的原因。

MHA

wait_until_slave_starts

根据切换日志，Server.pm 代码 784 行报错，定位到 wait_until_slave_starts 方法。

Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/Server.pm, ln784] Slave could not be started on x.x.x.127(x.x.x.127:3358)! Check slave status.

wait_until_slave_starts 源码如下所示。

# It is possible that slave io thread has not started or established yet
# when you execute "START SLAVE". It should start within 0-4 seconds.
# So we wait some time until slave starts.
# Return: 0: OK  1: NG
sub wait_until_slave_starts($$) {
  ...
  for ( my $i = 0 ; $i < $retry_count ; $i++ ) {
    my %status = $dbhelper->check_slave_status();
    if ( $status{Status} ) {

      # on GTID mode, it may take time to have Master_Log_File
      if ( !$self->{has_gtid} || ( $self->{has_gtid} && $status{Errstr} ) ) {
        my $msg = sprintf( "Checking slave status failed on %s.",
          $self->get_hostinfo() );
        $msg .= " err=$status{Errstr}" if ( $status{Errstr} );
        $log->error($msg);
        return 1;
      }
    }
    if ( $type eq "IO" ) {
      return 0 if ( $status{Slave_IO_Running} eq "Yes" );
    }
    elsif ( $type eq "SQL" ) {
      return 0 if ( $status{Slave_SQL_Running} eq "Yes" );
    }
    else {
      return 0
        if ( $status{Slave_IO_Running} eq "Yes"
        && $status{Slave_SQL_Running} eq "Yes" );
    }

    if ( $status{Slave_SQL_Running} eq "No" && $status{Last_Errno} ne '0' ) {
      $log->error(
        sprintf( "SQL Thread could not be started on %s! Check slave status.",
          $self->get_hostinfo() )
      );
      $log->error(
        sprintf(
          " Last Error= %d, Last Error=%s",
          $status{Last_Errno}, $status{Last_Error}
        )
      );
      return 1;
    }
    sleep($interval);
  }
  $log->error(  # ln784
    sprintf( "Slave could not be started on %s! Check slave status.",
      $self->get_hostinfo() )
  );
  return 1;
}

wait_until_slave_starts 方法实际上是在 start_slave 方法中调用，具体是在执行start slave命令启动复制后检查复制状态，包括 IO 线程与 SQL 线程是否是运行中，并支持重试。

sub start_slave {
  ,,,
  $log->debug(
    sprintf( " Starting slave IO/SQL thread on %s..", $self->get_hostinfo() ) );
  $dbhelper->start_slave();
  if ( $self->wait_until_slave_starts( 'ALL', $log ) ) {
    $log->error(
      sprintf( "Starting slave IO/SQL thread on %s failed!",
        $self->get_hostinfo() )
    );
    return 1;
  }
  $log->debug("  done.");
  return 0;
}

因此新从复制停止将导致 MHA 切换失败，注意这里并不检测复制延迟参数 Seconds_Behind_Master。

具体重试频率与重试次数与是否使用 GTID 复制（MASTER_AUTO_POSITION=0）有关。

    my $retry_count = 100;
    my $interval    = 0.1;
    if ( $self->{has_gtid} ) {
        $interval    = 1;
        $retry_count = 600;
    }

重试次数与案例中主从切换的现象一致，12:21:07 开始切换，12:21:14 复制启动失败，12:31:41 切换报错，重试用时 10 分钟，期间未输出日志，导致主从切换报错延迟了 10 分钟。

Mon Jan 16 12:21:07 2023 - [info] Starting online master switch..
...
2023-01-16T12:21:14.160596+08:00 63230532 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log
...
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/Server.pm, ln862] Starting slave IO/SQL thread on x.x.x.127(x.x.x.127:3358) failed!

switch_slaves

根据切换日志，MasterRotate.pm 代码 573 与 602 行报错，定位到 switch_slaves 方法。

Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln573]  Failed!
Mon Jan 16 12:31:41 2023 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln602] Switching master to x.x.x.128(x.x.x.128:3358) done, but switching slaves partially failed.

switch_slaves 方法中调用 change_master_and_start_slave 方法，并判断返回值。

sub switch_slaves($$$$$$) {
  ...
  $log->info("Unlocking all tables on the orig master:");
  $orig_master->unlock_tables();

  if ($g_orig_master_is_new_slave) {
    $log->info("Starting orig master as a new slave..");
    if (
      $_server_manager->change_master_and_start_slave(
        $orig_master, $new_master, $master_log_file, $master_log_pos
      )
      )
    {
      $log->error(" Failed!");  # ln573
      $ret = 1;
    }
    else {
      $log->debug(" ok.");
    }
  }
  if ( $ret eq '0' ) {
    $log->info("All new slave servers switched successfully.");
    $log->info();
    $log->info("* Phase 5: New master cleanup phase..");
    $log->info();
    if ( $new_master->{skip_reset_slave} ) {
      $log->info("Skipping RESET SLAVE on the new master.");
    }
    else {
      $ret = $new_master->reset_slave_on_new_master();
    }
  }
  if ( $ret eq '0' ) {
    my $message = sprintf( "Switching master to %s completed successfully.",
      $new_master->get_hostinfo() );
    $log->info($message);
    return 0;
  }
  else {
    my $message = sprintf(
      "Switching master to %s done, but switching slaves partially failed.",
      $new_master->get_hostinfo() );
    $log->error($message);  # ln602
    return 10;
  }
}

其中：

$new_master->reset_slave_on_new_master，新从复制启动后清理新主的复制关系。

change_master_and_start_slave 方法在 ServerManager.pm 中定义。

sub change_master_and_start_slave {
  ...
  $log->info(
    sprintf(
      " Resetting slave %s and starting replication from the new master %s..",
      $target->get_hostinfo(),
      $master->get_hostinfo()
    )
  );
  $target->stop_slave($log) unless ( $target->{not_slave} );
  $dbhelper->reset_slave()  unless ( $target->{not_slave} );
  ...
  if ( $self->is_gtid_auto_pos_enabled() && !$target->{is_mariadb} ) {
    $dbhelper->change_master_gtid( $addr, $master->{port},
      $master->{repl_user}, $master->{repl_password} );
  }
  else {
    $dbhelper->change_master( $addr,
      $master->{port}, $master_log_file, $master_log_pos, $master->{repl_user},
      $master->{repl_password} );
  }
  $log->info(" Executed CHANGE MASTER.");

  # After executing CHANGE MASTER, relay_log_purge is automatically disabled.
  # If the original value is 0, we should turn to 0 explicitly.
  if ( !$target->{has_gtid} ) {
    unless ( $target->{relay_purge} ) {
      $target->disable_relay_log_purge();
    }
  }
  my $ret = $target->start_slave($log);
  unless ($ret) {
    $log->info(" Slave started.");
  }
  return $ret;
}

其中：

调用 stop_slave 方法停止新主的复制，执行stop slave命令；
根据是否使用 GTID 搭建复制，执行change master命令；
调用 start_slave 方法启动复制，执行start slave命令。调用 start_slave 方法时报错，因此就回到了切换日志的开始报错位置，即 Server.pm 代码 784 行调用 wait_until_slave_starts 方法报错。

is_gtid_auto_pos_enabled

生产环境中主从切换始终基于 MHA 实现，之前切换都是基于位点搭建复制，为什么本次切换基于 GTID 呢？

MHA 如何判断是否采用 GTID 模式进行主从切换？

具体是在 ServerManager.pm 的 is_gtid_auto_pos_enabled 方法中访问 gtid_failover_mode 属性判断，1 表示 GitD，0 表示位点。

sub is_gtid_auto_pos_enabled($) {
  my $self = shift;
  return 1 if ( $self->{gtid_failover_mode} == 1 );
  return 0;
}

gtid_failover_mode 属性等于 get_gtid_status 方法的返回值。

  $self->{gtid_failover_mode} = $self->get_gtid_status();
  $log->info(
    sprintf( "GTID failover mode = %d", $self->{gtid_failover_mode} ) );

get_gtid_status 方法中分别根据 has_gtid、Executed_Gtid_Set、Auto_Position 三个参数判断是否使用 GTID 模式。

sub get_gtid_status($) {
  my $self    = shift;
  my @servers = $self->get_alive_servers();
  my @slaves  = $self->get_alive_slaves();
  return 0 if ( $#servers < 0 );
  foreach (@servers) {
    return 0 unless ( $_->{has_gtid} );
  }
  foreach (@slaves) {
    return 0 unless ( $_->{Executed_Gtid_Set} );
  }
  foreach (@slaves) {
    return 1
      if ( defined( $_->{Auto_Position} )
      && $_->{Auto_Position} == 1 );
    return 1 if ( $_->{use_gtid_auto_pos} );
  }
  return 2;
}

has_gtid 方法中通过查看 gtid_mode 参数判断是否开启 GTID。

use constant Has_Gtid_SQL             => "SELECT @@global.gtid_mode As Value";
sub has_gtid($) {
  my $self  = shift;
  my $value = $self->get_variable(Has_Gtid_SQL);
  if ( defined($value) && $value eq "ON" ) {
    $self->{has_gtid} = 1;
    return 1;
  }
  return 0;
}

因此，使用非 GTID 模式切换的场景有：

集群中任意节点未开启 GTID（gtid_mode=0）或 Executed_Gtid_Set 为空；
集群中所有节点都未使用 GTID 复制（MASTER_AUTO_POSITION=0），且配置文件中指定 use_gtid_auto_pos=0。

使用 GTID 模式切换的场景有：

集群中任意节点使用 GTID 复制（MASTER_AUTO_POSITION=0），与 use_gtid_auto_pos 无关；
集群中所有节点开启 GTID（gtid_mode=0），且 Executed_Gtid_Set 不为空，配置文件未指定 use_gtid_auto_pos（默认等于 1）或指定 use_gtid_auto_pos =1。

本文中的案例符合集群中任意节点使用 GTID 复制的条件，尽管配置文件中指定 use_gtid_auto_pos=0，切换过程中最终使用了 GTID 模式。

Mon Jan 16 12:21:09 2023 - [info] GTID failover mode = 1

那么，什么是 use_gtid_auto_pos？

从上面的描述中可以看到 use_gtid_auto_pos 参数可以影响到是否使用 GTID 模式。比如当集群中所有节点都开启 GTID，但未使用 GTID 复制，而配置文件中指定 use_gtid_auto_pos=1 时，将会使用 GTID 模式。

简单来说，MHA 基于以下两点综合判断是否使用 GTID：

配置文件中指定 use_gtid_auto_pos 参数；
切换时自动判断集群当前使用的复制模式。

本文中的案例默认的配置文件中已指定 use_gtid_auto_pos 参数。

root@DB etc# cat masterha_default.cnf | grep gtid
use_gtid_auto_pos=0

从代码中也可以看到，如果配置文件中未指明参数 use_gtid_auto_pos，则默认 use_gtid_auto_pos=1。

# MHA/Config.pm
  $value{use_gtid_auto_pos} = $param_arg->{use_gtid_auto_pos};
  if ( !defined( $value{use_gtid_auto_pos} ) ) {
    $value{use_gtid_auto_pos} = $default->{use_gtid_auto_pos};
    $value{use_gtid_auto_pos} = 1 if ( !defined( $value{use_gtid_auto_pos} ) );
  }

当前使用的 MHA 版本为 0.56，MySQL 5.6 中引入 GTID，MHA 也是从 0.56 版本开始支持基于 GTID 的复制。

实际上，MHA 针对 GTID 的支持并不是很大，因此可以了解下主从切换的其他工具，比如 Orchestrator、Xenon。

切换流程

主从切换

MHA 主从切换时首先分别检查主从实例的运行状态，并基于 gtid_mode 与 MASTER_AUTO_POSITION 判断是否使用 GTID 模式进行切换。不同的模式对应不同的切换流程，后续将详细介绍。

为解释文中切换异常时的现象，比如：

新主，复制 SQL 线程停止，而且复制关系没有被清理；
新主，切换完成前关闭只读；
新从，复制 IO 线程停止，复制报错导致切换失败；
新从，切换完成前开启只读。

分析 MHA 的切换流程如下图，同样不区分是否使用 GTID 模式。

其中：

切换时首先摘除老主的 VIP，并绑定到新主，防止业务写入老主；
新主在切换完成前关闭只读，因此有可能出现切换失败但新主已写入；
新从复制启动后检测复制线程的运行状态，因此如果复制启动失败，将导致切换失败；
新从启动复制后清理老主的复制关系，因此切换异常时复制关系未清理。

老主（新从）在切换完成前开启只读的原因与 MHA 无关，而是在执行 online_change_vip.py 脚本时开启只读。

Mon Jan 16 12:21:09 2023 - [info]   /export/servers/masterha/mysql_ha/online_change_vip.py --command=stop ... orig_master_is_new_slave

online_change_vip.py 中调用 stop_alive_vips 方法。

    if command == 'stop' or command == 'stopssh':
        stop_alive_vips(orig_master_host)

stop_vips 方法用于死切，stop_alive_vips 方法用于活切。

def stop_vips(host):
    cmd = 'pkill vrrpd'
    proc = subprocess.Popen(
        ['ssh', host, cmd], shell=False,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    return out, err


def stop_alive_vips(host):
    Config.mha_conn['host'] = host
    sql = 'set global read_only=1'
    conn = Myexec(Config.mha_conn)
    conn.execute_write(sql)
    cmd = 'pkill vrrpd'
    proc = subprocess.Popen(
        ['ssh', host, cmd], shell=False,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    return out, err

其中：

两者的共同点是都会在服务器可以登录的前提下 kill vip；
两者的不同点在于活切会给老主开启只读，死切时认为数据库已无法连接，因此无法设置只读；
死切时如果服务器无法登录，也就无法 kill vrrp 了，这种情况下依赖权重参数防止脑裂。

从库延迟切换

相比于主从切换，从库复制延迟自动切换的流程就简单很多了，不过提出以下两个问题：

为什么将从库上绑定的全部 VIP 切换到主库，包括主库
为什么 Seconds_Behind_Master: NULL 的条件下触发器从库延迟切换

切换任务的触发条件很重要，而活切由用户发起，因此不介绍相关代码。下面通过从库延迟切换的代码实现介绍相关流程。

定时任务检测从库复制延迟，然后异步执行切换任务。

    # 从库延迟切换,早7点 - 晚0点，延迟自动切换！
    slave_switch_time = datetime.now().strftime("%H:%M:%S")
    if not (slave_switch_time > '00:00:00' and slave_switch_time < '07:00:00'):
        replicate_host_list = get_replicate_host_list()
        if replicate_host_list is not None:
            cluster_list = get_cluster_list(replicate_host_list,2)
            if cluster_list == []: 
                return
            rep_tasks = []
            for cluster in cluster_list:
                if cluster.get_role(cluster.source_ip) == 'slave':
                    log.warning('begin add slave replicate switch task')
                    rep_tasks.append(gevent.spawn(switch_rep_slave, cluster))
                    log.warning('end add slave replicate switch task')
            gevent.joinall(rep_tasks)
        else:
            return
    else:
        return

replicate_host_list 方法接入 zabbix 监控，获取发生主从延迟的服务器。

def get_replicate_host_list():
    """
    获取从库延迟告警
    :return:
    """
    zabbix_api = Zabbix_api(
        Config.zabbix_user, Config.zabbix_api_url, Config.replicate_trigger_list
    )
    disaster_host_list = zabbix_api.get_disaster_host()
    if len(disaster_host_list) > 0:
        host_list = []
        for host in disaster_host_list:
            message = "found replicate server %s - %s" % (host[0], host[1])
            log.warning(message)
            ip = host[0].split('/')[-1]
            host_list.append((ip, host[0]))
        return host_list
    else:
        return None

具体是调用 zabbix 接口获取 replicate_trigger_list 触发器的数据。

class Config():
    # 死切条件
    trigger_list = ['MySQL已5分钟无响应']
    # 从库延迟切换条件
    replicate_trigger_list = ['主从延迟报警']

查看触发器定义，显示当延迟大于 500 且 IO 线程正常时触发主从延迟告警。

从库延迟切换主要代码如下所示。

host_info = cluster.get_host_info(cluster.source_ip)
inter_out, inter_err = get_interface(cluster.master['instance_ip'])
start_out, start_err = start_local_vip(cluster.master['instance_ip'], host_info, inter_out)
for vip in host_info['vips']:
    Config.mha_conn['host'] = vip 
    vip_status = check_alive(Config.mha_conn, Config.log_file)
    vip_status_list.append(vip_status)
if False not in vip_status_list:
    cluster.update_cluster_vip(host_info['vips'], cluster.source_ip, cluster.master['instance_ip'])
    # 原从库上，停vip！ 否则双vip，回切会失败
    stop_out, stop_err = stop_vips(cluster.source_ip)
    # kill连接！
    kill_sessions(cluster.source_ip)

其中：

get_interface 方法中获取网卡上的全部 VIP，然后将从库上绑定的全部 VIP 切换到主库，包括主库。当然，从库上绑定主库 VIP 属于异常情况；
在将从库 VIP 切换到主库后将从库上的 VIP kill 摘除，并将从库上的业务连接 kill 掉；
上文中提到新主的复制显示 Seconds_Behind_Master: NULL，Slave_SQL_Running: No，Slave_IO_Running: Yes，为什么会触发主从延迟切换？

首先，当时复制有延迟吗？

查看监控，显示两次从库延迟切换 12:21:35 与 12:35:06 时间段内，实际上并没有延迟，但是 SQL 线程中断。

查看 zabbix 告警记录，显示该时间段有收到主从延迟告警与从库复制线程停止告警，而且其中显示延迟大小 Slave behind Master:9999。

mysql> select * from sms where host like '%x.x.x.128%' and time>'2023-01-16' and time<'2023-01-17' limit 5;
+---------+---------------------+-----------------+--------------------------+-----------------------------------+---------+------+----------+-----------+-----------------+------------+
| id      | time                |      host       | TRIGGER_NAME             | DETAIL                            | STATUS  | type |  zone    | send_type | sms_send_status | alarm_type |
+---------+---------------------+-----------------+--------------------------+-----------------------------------+---------+------+----------+-----------+-----------------+------------+
| 2648942 | 2023-01-16 12:21:27 | x.x.x.128       | 主从延迟报警             | 详细:Slave behind Master:9999     | PROBLEM |    0 | DB       |         1 |               1 |          0 |
| 2648943 | 2023-01-16 12:21:30 | x.x.x.128       | 从库复制线程停止         | 详细:Slave SQL thread running:0   | PROBLEM |    0 | DB       |         1 |               1 |          0 |
| 2648950 | 2023-01-16 12:31:51 | x.x.x.128       | 主从延迟报警             | 详细:Slave behind Master:9999     | PROBLEM |    0 | DB       |         1 |               1 |          0 |
| 2648951 | 2023-01-16 12:31:54 | x.x.x.128       | 从库复制线程停止         | 详细:Slave SQL thread running:0   | PROBLEM |    0 | DB       |         1 |               1 |          0 |
| 2648954 | 2023-01-16 12:41:30 | x.x.x.128       | 主从延迟报警             | 详细:Slave behind Master:9999     | PROBLEM |    0 | DB       |         1 |               1 |          0 |
+---------+---------------------+------------------------------------------+--------------------------+-----------------------------------+---------+------+----------------------------+-----------+-----------------+------------+
5 rows in set (2.05 sec)

SQL 线程停止，IO 线程未停止，告警显示延迟等于 9999，因此触发从库延迟切换。

但实际延迟等于 0，那么为什么会显示 Slave behind Master:9999 呢？

查看 zabbix 源码，同样也是 perl 实现。显示在执行show slave status命令获取主从延迟时，如果延迟等于 NULL，会给默认值 9999。

# /usr/local/mpm/lib/FromDualMySQLslave.pm

$$status_ref{'Seconds_Behind_Master'} = defined($ref->{'Seconds_Behind_Master'}) ? $ref->{'Seconds_Behind_Master'} : 9999;

因此，从库延迟切换的条件包括：

主从延迟大于阈值，Slave_IO_Running: Yes，当前阈值 500s，时间段除 2-6 点外；
Seconds_Behind_Master: NULL，Slave_SQL_Running: No，Slave_IO_Running: Yes，表明从库在拉取 binlog 但是停止回放，因此也会产生延迟。

Last_IO_Errno: 1236

实际上，复制报错的原因还没有找到，尽管报错很明确，但是分析时解释不通。

正常情况下，从库 IO 线程报错1236 的原因包括：

从库有写入，之后发生主从切换，但 binlog 已被清理；
主库 binlog 在发送给从库之前被清理。

由于从库 GTID 仅 553374-553999 部分不一致，而且这部分是在主从切换后写入，因此怀疑主库 purge binlog 导致复制报错。

下面展示切换前后查看到的主从 GTID，并在切换后一个月再次查看。

# 从，2023/1/16 12:20:56
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553773:771884-1101141

# 12:21:07 开始切换
# 12:21:14，搭复制报错
# 12:21:17，主 VIP 绑定到新主

# 主，2023/1/16 12:22:17
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553999:771884-1101141

# 从，2023/1/16 12:22:17
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:771884-1101141

# 主，2023/2/25
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-266134790

# 从，2023/2/25
13351c5e-d8dd-11e7-9704-b8ca3a5c1739:1-553373:553774-266131124

实际上，还存在以下三个问题：

主从切换后新从复制为什么报错 master has purged binary logs containing GTIDs that the slave requires？
主从切换后新主 GTID 从 1-553999:771884-1101141 变为 1-266134790，为什么变成连续的了，即 554000-771883 的事务从哪来的？
主从切换后新从 GTID 为 1-553373:553774-266131124，与主库不一致，是不是表示 553374-553773 的事务丢失？

优化

上面提到，从库 IO 线程报错1236 的原因包括：

从库写入后主从切换
主库 purge binlog

其中针对前者，可以通过关闭从库的 super 权限（super_read_only）保证从库完全只读实现主从数据的一致性。

针对 GTID 不一致导致主从切换失败的情况，可以：

MHA 在新从复制启动前新主关闭只读，可以调整为复制启动后新主关闭只读，防止新主写入；
判断是否使用 GTID 模式进行切换时添加检测，如果主从 GTID 不一致，强制指定主从 GTID 一致，风险较高；
判断是否使用 GTID 模式进行切换时添加检测，如果主从 GTID 不一致，使用位点模式。

结论

MHA + GTID 复制报错导致切换失败，因此提出几下几个问题：

为什么复制报错会导致切换失败

主从切换过程中在搭建新从后启动复制，然后检测复制运行状态，如果复制线程启动失败，主从切换失败。

复制为什么报错，1236
MHA 如何判断使用位点模式还是 GTID 模式，具体基于以下两点综合判断：

配置文件中指定 use_gtid_auto_pos 参数；
切换时自动判断集群当前使用的复制模式。

MHA 针对不同的复制模式的处理逻辑有什么区别

切换异常时显示老主新主都有复制关系，而且复制停止；

新主，IO 线程正常，SQL 线程停止，复制未报错，已关闭只读

日志报错：12:21:11 Error reading relay log event for channel ”: slave SQL thread was killed

老主，IO 线程停止，SQL 线程正常，复制报错，已开启只读

日志报错：12:21:14 Slave I/O for channel ”: Got fatal error 1236 from master when reading data from binary log

主从切换过程中触发从库复制延迟切换，因此提出以下几个问题：

从库复制延迟切换的触发条件是什么

Seconds_Behind_Master > 500s，Slave_IO_Running: Yes
Seconds_Behind_Master: NULL，Slave_IO_Running: Yes

主从切换与从库延迟切换同时进行有什么影响

12:21:17，主 VIP 挂载到新主，原因是主从切换
12:21:37，主从 VIP 从新主摘除，原因是从库延迟切换

待办

swtichover VS failover
MHA GTID vs No-GTID

参考教程

MySQL MHA–故障切换模式(GTID模式和非GTID模式)

https://www.cnblogs.com/gaogao67/p/11143604.html

MHA 基于GTID的恢复不会从原Master拉取差异日志且不再需要relay_log_purge=0

https://blog.csdn.net/ashic/article/details/104601315

原文始发于微信公众号（丹柿小院）：MySQL 新从复制停止导致 MHA 切换失败

文章由极客之音整理，本文链接：https://www.bmabk.com/index.php/post/178594.html

MySQL 新从复制停止导致 MHA 切换失败

引言

现象

过程

切换前准备

VIP

复制

切换异常

新主

老主

从库延迟切换

处理

新主开启只读

新主挂载 VIP

messages

kill VIP

重新搭复制

分析

online_switch.log

masterha.cnf

switchover

fail

MHA

wait_until_slave_starts

switch_slaves

is_gtid_auto_pos_enabled

切换流程

主从切换

从库延迟切换

Last_IO_Errno: 1236

优化

结论

待办

参考教程

相关推荐

发表回复