Containerized RDS | “Split-Brain” in a Compute-Storage Separated Architecture

2018-01-30 熊中哲 (Xiong Zhongzhe), 沃趣科技 (Woqutech)

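Architecture selection, like life, is mostly about trade-offs: gaining the benefits of compute-storage separation also means putting up with the thorny problems it brings. Drawing on Kubernetes, Docker, MySQL and a compute-storage separated architecture, this article shares one of the many problems we ran into: “Split-Brain”.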

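On January 19, 2018 I attended the Alibaba Double 11 Database Technology Summit, met many former colleagues (researchers and senior experts), and learned about the latest industry trends in database technology:

Database containerization as the next-generation database infrastructure
Managing containerized databases with an orchestration framework
Adopting a compute-storage separated architecture

This matches the technology choices we made for our private RDS, especially the compute-storage separated architecture.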

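In our view, its biggest advantages are:

Compute resources and storage resources scale independently; the architecture is cleaner and deployment is easier.
Stateful data is pushed down to the storage layer. When scheduling, the Scheduler does not need to know about the storage media of the compute node; it only has to pick a Node that meets the compute resource requirements. When a database instance starts, it only needs to mount the mapped volume from the distributed file system. This can significantly raise database instance deployment density and compute resource utilization.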

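Take Alibaba as an example. Given its scale today, being able to co-locate database services across offline (ODPS) and online clusters would be hugely significant. The key problem is that offline (ODPS) computing and online computing have different latency requirements and different hardware configurations, especially for local storage media:

Offline (ODPS) clusters mostly use spinning disks
Online clusters mostly use SSD / Flash

If local storage is used as the storage medium for database instances, a database instance whose Storage QoS requirement is Flash cannot be scheduled onto the offline computing cluster, even if that cluster has plenty of idle CPU and memory.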

Compute-storage separation makes co-locating offline (ODPS) and online clusters possible.
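The scheduling constraint described above can be sketched as a simple filter. This is only an illustration, not Kubernetes scheduler code: the `Node` and `Instance` types, the `fits` function and all the numbers are hypothetical.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
    Name       string
    FreeCPU    int    // free cores
    FreeMemGiB int    // free memory in GiB
    LocalMedia string // "hdd" or "flash"
}

// Instance is a hypothetical database instance request.
type Instance struct {
    CPU        int
    MemGiB     int
    StorageQoS string // required storage media, e.g. "flash"
}

// fits reports whether inst can be placed on n.
// With local storage, the node's own media must satisfy the storage QoS.
// With compute-storage separation, only CPU and memory are checked,
// because the data lives on a remotely mounted volume.
func fits(n Node, inst Instance, separated bool) bool {
    if n.FreeCPU < inst.CPU || n.FreeMemGiB < inst.MemGiB {
        return false
    }
    if separated {
        return true
    }
    return n.LocalMedia == inst.StorageQoS
}

func main() {
    offline := Node{Name: "offline-1", FreeCPU: 32, FreeMemGiB: 128, LocalMedia: "hdd"}
    db := Instance{CPU: 8, MemGiB: 32, StorageQoS: "flash"}
    fmt.Println("local storage:", fits(offline, db, false))
    fmt.Println("separated storage:", fits(offline, db, true))
}
```

With local storage the idle offline node is rejected because of its disks alone; once storage is separated, the same node becomes a valid target.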

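Combining Kubernetes, Docker and MySQL, the architecture can be refined further, as shown in the figure below:

This architecture also gives us a simpler, more general and more efficient High Availability solution. When a Node in the cluster becomes unavailable, the native Kubernetes components Node Controller and Scheduler, together with the native StatefulSet API, can reschedule the database instance onto another available node, making the database instance highly available.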

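Everything looks wonderful. Can we then draw this conclusion:

With the native Kubernetes components Node Controller and Scheduler and the native StatefulSet API, plus a compute-storage separated architecture and a mature distributed file system integrated into the Kubernetes storage system, we can provide a private RDS service.

We used to think so too, until we ran into the “Split-Brain” problem (the subject of this article).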

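Back to the High Availability solution above.

When a Node in the cluster becomes unavailable, the native Kubernetes components Node Controller and Scheduler, together with the native StatefulSet API, can reschedule the database instance onto another available node, making the database instance highly available.

Deciding that a Node is unavailable is what triggers the subsequent Failover.

So the mechanism for determining node status deserves a closer look: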

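Kubelet periodically (node-status-update-frequency) updates the heartbeat of its node in etcd through the API Server
The Node Controller component in the Controller Manager periodically (node-monitor-period) polls the node heartbeats in etcd
If heartbeat updates are missing for the period (node-monitor-grace-period), the node is marked Unknown (ConditionUnknown)
If heartbeat updates are still missing after the period (pod-eviction-timeout), the Node Controller triggers cluster-level eviction
The Scheduler places all database instances from the Unknown node onto other healthy (Ready) nodes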
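To make the timing concrete, here is a small sketch of how the two deadlines relate. Following the Node Controller code quoted later in this article, both are measured from the last heartbeat the controller observed. The helper names `clock` and `failoverTimeline` are hypothetical. The 32s grace period matches the interval reported in the test below; the 120s eviction timeout is an assumption, chosen only because it is consistent with the timestamps in that test. Real detection is also delayed by up to one node-monitor-period, which this sketch ignores.

```go
package main

import (
    "fmt"
    "time"
)

// clock builds a timestamp on the day of the test (2017-12-01, UTC assumed).
func clock(h, m, s int) time.Time {
    return time.Date(2017, time.December, 1, h, m, s, 0, time.UTC)
}

// failoverTimeline returns when the node would be marked Unknown and when its
// pods would be queued for eviction, both counted from the last heartbeat.
func failoverTimeline(lastHeartbeat time.Time, graceSeconds, evictionSeconds int) (unknownAt, evictAt time.Time) {
    unknownAt = lastHeartbeat.Add(time.Duration(graceSeconds) * time.Second)
    evictAt = lastHeartbeat.Add(time.Duration(evictionSeconds) * time.Second)
    return unknownAt, evictAt
}

func main() {
    // Assumed values: grace period 32s, pod eviction timeout 120s.
    unknownAt, evictAt := failoverTimeline(clock(10, 17, 42), 32, 120)
    fmt.Println("marked Unknown at: ", unknownAt.Format("15:04:05"))
    fmt.Println("eviction queued at:", evictAt.Format("15:04:05"))
}
```

Under these assumptions the node is marked Unknown at 10:18:14 and queued for eviction at 10:19:42, while nothing in the timeline confirms that the workload on the node has actually stopped.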

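The access architecture is shown in the figure below:

As an aside, the high availability and strong consistency of the etcd cluster are what guarantee the consistency of the Kubernetes cluster metadata: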

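etcd is built on the Raft algorithm
Raft is a message-passing consensus algorithm, based on replicated state machines, with a high degree of fault tolerance
Raft was designed as an easier-to-understand alternative to the famous Paxos
If you are interested in how Raft is implemented, take a look at https://github.com/goraft/raft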

Anyone interested in consensus algorithms should invest the effort to study it. Based on goraft/raft, I implemented a Network Partition Failures/Recovery test case and learned a lot.
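As a quick illustration of why a Raft-based store stays consistent under a network partition: a change is committed only when a strict majority of members agrees, so at most one side of a partition can make progress. The sketch below only does the arithmetic; the function names `quorum` and `tolerated` are hypothetical and not part of etcd or goraft/raft.

```go
package main

import "fmt"

// quorum returns how many of n voting members must agree (a strict majority).
func quorum(n int) int {
    return n/2 + 1
}

// tolerated returns how many member failures a cluster of n can survive
// while still being able to form a quorum.
func tolerated(n int) int {
    return n - quorum(n)
}

func main() {
    for _, n := range []int{1, 3, 4, 5} {
        fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
    }
}
```

Note that going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.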

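This seemingly reasonable mechanism brings us two problems.

Problem 1: the true state of the node cannot be determined

Heartbeat updates are the basis for deciding whether a node is available. But a missing heartbeat update cannot tell us the true state of the node (the fact that Kubernetes marks the node as ConditionUnknown says as much).

The heartbeat may be missing merely because of a network problem, a busy CPU, a “hung” node, a Kubelet bug and so on, while the database instance on the node is still running.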

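Problem 2: no effective Fence mechanism

In this situation, the Failover implemented with the native Kubernetes components Node Controller and Scheduler and the native StatefulSet API evicts the database instance from the Unknown node to an available node, but does nothing to the original Unknown node.

This “soft eviction” leaves the old and new database instances accessing the same data files at the same time.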

The resulting “Split-Brain” causes data corruption: data is lost, and the damage cannot be undone.
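One generic way to fence a stale writer is a fencing token: every failover hands the new instance a larger token, and the storage side rejects writes that carry an older one. The sketch below illustrates that general technique only. It is not the mechanism used by any product mentioned in this article, and the `Volume` type, its `Write` method and `ErrFenced` are hypothetical names. Locking is omitted for brevity.

```go
package main

import (
    "errors"
    "fmt"
)

// ErrFenced is returned when a writer presents a stale token.
var ErrFenced = errors.New("write rejected: stale fencing token")

// Volume is a hypothetical shared volume that remembers the highest
// fencing token it has seen so far.
type Volume struct {
    highest uint64
    data    []string
}

// Write accepts a record only from the holder of the newest token.
func (v *Volume) Write(token uint64, record string) error {
    if token < v.highest {
        return ErrFenced
    }
    v.highest = token
    v.data = append(v.data, record)
    return nil
}

func main() {
    vol := &Volume{}
    fmt.Println(vol.Write(1, "old instance on k8s-node3")) // token 1 was issued first
    fmt.Println(vol.Write(2, "new instance on k8s-node1")) // failover issues token 2
    fmt.Println(vol.Write(1, "old instance again"))        // stale writer is refused
    fmt.Println("records stored:", len(vol.data))
}
```

The key design point is that the check happens on the storage side, so it still works when the old node is unreachable and cannot be told to stop.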

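What follows is a dry reproduction of the failure. We use logs and code to analyze how eviction works and to summarize the “Split-Brain” process.

Test procedure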

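Use a StatefulSet to create a single MySQL instance gxr-oracle-statefulset (the name was picked by an Oracle DBA, forgive him)
The Scheduler places the MySQL instance on cluster node “k8s-node3”
Use sysbench to put extremely high load on the instance; the load on “k8s-node3” soars, so the Kubelet on “k8s-node3” can no longer communicate with the API Server and starts reporting errors
The Node Controller starts eviction
The StatefulSet initiates a rebuild
The Scheduler places the MySQL instance on “k8s-node1”
The old and new MySQL instances access the same Volume
The data files are corrupted; both the old and new MySQL instances report errors and cannot start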

Test parameters

kube-controller-manager startup parameters

kubelet startup parameters

Based on the logs, the overall event flow is as follows:

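Time: December 1st 2017, 10:18:05.000 (the last successful update was presumably at 10:17:42.000)

The database stress test starts on node k8s-node3 to simulate the node “hanging”, and heartbeats from kubelet to the API Server start to go missing.

The kubelet log reports errors: it cannot update the status of k8s-node3 through the API Server.

Kubelet details:

Cluster information is updated through the API Server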

if kl.kubeClient != nil {
    // Start syncing node status immediately, this may set up things the runtime needs to run.
    go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
}

The status of the corresponding node is updated periodically (nodeStatusUpdateFrequency)

nodeStatusUpdateFrequency defaults to 10 seconds; the test used 8s

obj.NodeStatusUpdateFrequency = metav1.Duration{Duration: 10 * time.Second}
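The periodic pattern behind this heartbeat can be sketched with the standard library alone. The `until` function below roughly mirrors the call-then-wait behaviour of the `wait.Until` call quoted above, but it is a simplified stand-in written for this example, not the Kubernetes implementation. `runHeartbeats` and its counter are hypothetical and stand in for kubelet's status sync.

```go
package main

import (
    "fmt"
    "time"
)

// until calls f, waits for period, and repeats until stop is closed.
func until(f func(), period time.Duration, stop <-chan struct{}) {
    for {
        select {
        case <-stop:
            return
        default:
        }
        f()
        select {
        case <-stop:
            return
        case <-time.After(period):
        }
    }
}

// runHeartbeats sends n simulated heartbeats, one per period, and returns
// how many were sent. The counter stands in for a real status update.
func runHeartbeats(n, periodMillis int) int {
    if n <= 0 {
        return 0
    }
    sent := 0
    stop := make(chan struct{})
    until(func() {
        sent++
        if sent == n {
            close(stop) // stop the loop after the n-th heartbeat
        }
    }, time.Duration(periodMillis)*time.Millisecond, stop)
    return sent
}

func main() {
    fmt.Println("heartbeats sent:", runHeartbeats(3, 10))
}
```

If the callback stalls, for example on an overloaded node, no heartbeat is sent during that time, which is exactly the situation the test in this article provokes.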

The following information is updated:

func (kl *Kubelet) defaultNodeStatusFuncs() []func(*v1.Node) error {
    // initial set of node status update handlers, can be modified by Option's
    withoutError := func(f func(*v1.Node)) func(*v1.Node) error {
        return func(n *v1.Node) error {
            f(n)
            return nil
        }
    }
    return []func(*v1.Node) error{
        kl.setNodeAddress,
        withoutError(kl.setNodeStatusInfo),
        withoutError(kl.setNodeOODCondition),
        withoutError(kl.setNodeMemoryPressureCondition),
        withoutError(kl.setNodeDiskPressureCondition),
        withoutError(kl.setNodeReadyCondition),
        withoutError(kl.setNodeVolumesInUseStatus),
        withoutError(kl.recordNodeSchedulableEvent),
    }
}

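The node information can be obtained with kubectl:

Time: December 1st 2017, 10:18:14.000

· The Node Controller finds that the status of k8s-node3 has not been updated for 32s.

○ ready / outofdisk / diskpressure / memorypressure condition

The node status is updated to Unknown:

Every NodeMonitorPeriod it keeps checking whether the node status has been updated:

The node status is checked periodically (NodeMonitorPeriod)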

// Incorporate the results of node status pushed from kubelet to master.
go wait.Until(func() {
    if err := nc.monitorNodeStatus(); err != nil {
        glog.Errorf("Error monitoring node status: %v", err)
    }
}, nc.nodeMonitorPeriod, wait.NeverStop)

NodeMonitorPeriod defaults to 5 seconds; the test used 4s

NodeMonitorPeriod: metav1.Duration{Duration: 5 * time.Second},

If the node status has not been updated for longer than NodeMonitorGracePeriod, the node status is set to Unknown

if nc.now().After(savedNodeStatus.probeTimestamp.Add(gracePeriod)) {
    // NodeReady condition was last set longer ago than gracePeriod, so update it to Unknown
    // (regardless of its current value) in the master.
    if currentReadyCondition == nil {
        glog.V(2).Infof("node %v is never updated by kubelet", node.Name)
        node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
            Type:               v1.NodeReady,
            Status:             v1.ConditionUnknown,
            Reason:             "NodeStatusNeverUpdated",
            Message:            fmt.Sprintf("Kubelet never posted node status."),
            LastHeartbeatTime:  node.CreationTimestamp,
            LastTransitionTime: nc.now(),
        })
    } else {
        glog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
            node.Name, nc.now().Time.Sub(savedNodeStatus.probeTimestamp.Time), observedReadyCondition)
        if observedReadyCondition.Status != v1.ConditionUnknown {
            currentReadyCondition.Status = v1.ConditionUnknown
            currentReadyCondition.Reason = "NodeStatusUnknown"
            currentReadyCondition.Message = "Kubelet stopped posting node status."
            // LastProbeTime is the last time we heard from kubelet.
            currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
            currentReadyCondition.LastTransitionTime = nc.now()
        }
    }
    // ... (the excerpt ends here)
}

Time: December 1st 2017, 10:19:42.000

Exactly podEvictionTimeout has elapsed, and the node is added to the eviction queue:

After podEvictionTimeout, the pods on this node have to be evicted

if observedReadyCondition.Status == v1.ConditionUnknown {
    if nc.useTaintBasedEvictions {
        // We want to update the taint straight away if Node is already tainted with the UnreachableTaint
        if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
            taintToAdd := *UnreachableTaintTemplate
            if !util.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
                glog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
            }
        } else if nc.markNodeForTainting(node) {
            glog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
                node.Name,
                decisionTimestamp,
            )
        }
    } else {
        if decisionTimestamp.After(nc.nodeStatusMap[node.Name].probeTimestamp.Add(nc.podEvictionTimeout)) {
            if nc.evictPods(node) {
                glog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v",
                    node.Name,
                    decisionTimestamp,
                    nc.nodeStatusMap[node.Name].readyTransitionTimestamp,
                    nc.podEvictionTimeout-gracePeriod,
                )
            }
        }
    }
}

The node is put into the eviction queue

// evictPods queues an eviction for the provided node name, and returns false if the node is already
// queued for eviction.
func (nc *Controller) evictPods(node *v1.Node) bool {
    nc.evictorLock.Lock()
    defer nc.evictorLock.Unlock()
    return nc.zonePodEvictor[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID))
}
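The “returns false if the node is already queued” behaviour matters because the monitoring loop runs every few seconds and would otherwise enqueue the same node again on each pass. Below is a stripped-down sketch of such a queue. The `evictionQueue` type is hypothetical; unlike the queue behind `zonePodEvictor` in the excerpt above, it has no rate limiting, no per-zone split and no locking.

```go
package main

import "fmt"

// evictionQueue is a hypothetical FIFO queue that holds each node at most once.
type evictionQueue struct {
    order  []string
    queued map[string]bool
}

func newEvictionQueue() *evictionQueue {
    return &evictionQueue{queued: make(map[string]bool)}
}

// Add enqueues the node and returns true, or returns false
// if the node is already waiting in the queue.
func (q *evictionQueue) Add(node string) bool {
    if q.queued[node] {
        return false
    }
    q.queued[node] = true
    q.order = append(q.order, node)
    return true
}

func main() {
    q := newEvictionQueue()
    fmt.Println(q.Add("k8s-node3")) // first monitoring pass: node is queued
    fmt.Println(q.Add("k8s-node3")) // later pass: already queued, nothing added
    fmt.Println("queue length:", len(q.order))
}
```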

Time: December 1st 2017, 10:19:42.000

Eviction begins:

The eviction goroutine

if nc.useTaintBasedEvictions {
    // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated
    // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.
    go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, wait.NeverStop)
} else {
    // Managing eviction of nodes:
    // When we delete pods off a node, if the node was not empty at the time we then
    // queue an eviction watcher. If we hit an error, retry deletion.
    go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, wait.NeverStop)
}

Eviction is carried out by deleting pods

func (nc *Controller) doEvictionPass() {
    nc.evictorLock.Lock()
    defer nc.evictorLock.Unlock()
    for k := range nc.zonePodEvictor {
        // Function should return 'false' and a time after which it should be retried, or 'true' if it shouldn't (it succeeded).
        nc.zonePodEvictor[k].Try(func(value scheduler.TimedValue) (bool, time.Duration) {
            node, err := nc.nodeLister.Get(value.Value)
            if apierrors.IsNotFound(err) {
                glog.Warningf("Node %v no longer present in nodeLister!", value.Value)
            } else if err != nil {
                glog.Warningf("Failed to get Node %v from the nodeLister: %v", value.Value, err)
            } else {
                zone := utilnode.GetZoneKey(node)
                evictionsNumber.WithLabelValues(zone).Inc()
            }
            nodeUID, _ := value.UID.(string)
            remaining, err := util.DeletePods(nc.kubeClient, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
            if err != nil {
                utilruntime.HandleError(fmt.Errorf("unable to evict node %q: %v", value.Value, err))
                return false, 0
            }
            if remaining {
                glog.Infof("Pods awaiting deletion due to Controller eviction")
            }
            return true, 0
        })
    }
}

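Time: December 1st 2017, 10:19:42.000

The StatefulSet controller finds that default/gxr1-oracle-statefulset is in an abnormal state

Time: December 1st 2017, 10:19:42.000

The scheduler places the pod on k8s-node1

So the old MySQL instance is still on k8s-node3 while Kubernetes has scheduled a new instance onto k8s-node1.

Two database instances write to the same data files, which causes data corruption. Neither instance can start.

The old instance fails on startup with the following log: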

2017-12-01 10:19:47 5628 [Note] mysqld (mysqld 5.7.19-log) starting as process 963 ...
2017-12-01 10:19:47 5628 [Note] InnoDB: PUNCH HOLE support available
2017-12-01 10:19:47 5628 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-12-01 10:19:47 5628 [Note] InnoDB: Uses event mutexes
2017-12-01 10:19:47 5628 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-12-01 10:19:47 5628 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-12-01 10:19:47 5628 [Note] InnoDB: Using Linux native AIO
2017-12-01 10:19:47 5628 [Note] InnoDB: Number of pools: 1
2017-12-01 10:19:47 5628 [Note] InnoDB: Using CPU crc32 instructions
2017-12-01 10:19:47 5628 [Note] InnoDB: Initializing buffer pool, total size = 3.25G, instances = 2, chunk size = 128M
2017-12-01 10:19:47 5628 [Note] InnoDB: Completed initialization of buffer pool
2017-12-01 10:19:47 5628 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2017-12-01 10:19:47 5628 [Note] InnoDB: Highest supported file format is Barracuda.
2017-12-01 10:19:47 5628 [Note] InnoDB: Log scan progressed past the checkpoint lsn 406822323
2017-12-01 10:19:47 5628 [Note] InnoDB: Doing recovery: scanned up to log sequence number 406823190
2017-12-01 10:19:47 5628 [Note] InnoDB: Database was not shutdown normally!
2017-12-01 10:19:47 5628 [Note] InnoDB: Starting crash recovery.
2017-12-01 10:19:47 5669 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 89 90 91 92 93 94 95 96 97 98 99
2017-12-01 10:19:47 5669 [Note] InnoDB: Apply batch completed
2017-12-01 10:19:47 5669 [Note] InnoDB: Last MySQL binlog file position 0 428730, file name mysql-bin.000004
2017-12-01 10:19:47 5669 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2017-12-01 10:19:47 5669 [Note] InnoDB: Creating shared tablespace for temporary tables
2017-12-01 10:19:47 5669 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2017-12-01 10:19:47 5669 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2017-12-01 10:19:47 5669 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2017-12-01 10:19:47 5669 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2017-12-01 10:19:47 5669 [Note] InnoDB: Waiting for purge to start
2017-12-01 10:19:47 0x7fcb08928700 InnoDB: Assertion failure in thread 140509998909184 in file trx0purge.cc line 168
InnoDB: Failing assertion: purge_sys->iter.trx_no <= purge_sys->rseg->last_trx_no
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
10:19:47 5669 - mysqld got signal 6 ;

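The problems above have been effectively solved by the Fence mechanism provided by the WOQU RDS Operator.

Kubernetes lets us stand on the shoulders of giants. Judging from the technical direction of the major Internet companies, applying orchestration and container technology to persistent workloads is also one of the clear trends.

However, to borrow the prediction about Kubernetes from Portworx CEO Murli Thirumale: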

Kubernetes is complicated. Kubernetes is often described as elegant by enthusiasts. But its elegance doesn’t make it simple. String theory is elegant, but understanding it with anything except the most imprecise analogies takes a lot of effort. Kubernetes is the same. Using Kubernetes to build and run an application is not a straightforward proposition.

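The revolution has not yet succeeded; comrades, we must keep striving. If you have read this far, are you interested in joining us? Send me an email: orain.xiong@woqutech.com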
