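[Hummer Engine Optimization Series] Tracking Down the Ultimate GC Puzzle
Abstract
This article records how we analyzed and tracked down several bizarre top crashes that appeared after upgrading to Flutter 2.2.3 and that we could not reproduce on demand.
What the crash looks like
The crash shows up in several variants, which together account for several of the top 5 crashes.
What they have in common is that they all crash inside Expando.[]=. From the code we can tell that this is actually Expando._rehash inlined into Expando.[]=.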
  for (var i = 0; i < old_data.length; i++) {
    var entry = old_data[i];
    if (entry != null) {
      // Ensure that the entry.key is not cleared between checking for it and
      // inserting it into the new table.
      var val = entry.value;
      var key = entry.key;
      if (key != null) {
        => this[key] = val;
      }
    }
  }
}
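At the arrow, the compiler inserts an AssertAssignable check to verify that val can be assigned to type T. This is where execution enters functions such as TypeCheck or RuntimeTypeIsSubtypeOf.
Inspecting val shows that it has turned into a FreeListElement.
The val object is at 91c5edd0, and its class id, stored in the upper 16 bits of the header word, is 1. The header file shows that class id 1 is FreeListElement. This means the object had already been swept by the sweeper, that is, the marker never marked it. This is a use-after-free error.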
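As a rough illustration of the check described above, the following C++ sketch extracts the class id from a 32-bit header word and compares it with FreeListElement. The bit positions follow the "upper 16 bits" description in the text. The constant names, function names, and sample header values are assumptions made for this example, not the VM's actual definitions or values taken from the crash dump.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of a 32-bit object header: class id in bits 16..31.
constexpr uint32_t kClassIdTagPos = 16;
constexpr uint32_t kClassIdTagSize = 16;
// Class id 1 is FreeListElement according to the article.
constexpr uint32_t kFreeListElementCid = 1;

// Returns the class id stored in the upper 16 bits of the header word.
constexpr uint32_t ClassIdOf(uint32_t header) {
  return (header >> kClassIdTagPos) & ((1u << kClassIdTagSize) - 1u);
}

// True if the header belongs to a chunk that the sweeper already put back
// on the free list, i.e. the pointer we hold is dangling.
constexpr bool LooksFreed(uint32_t header) {
  return ClassIdOf(header) == kFreeListElementCid;
}

// Hypothetical header values, used only to exercise the helpers.
inline void CheckHeaderExamples() {
  assert(ClassIdOf(0x00010100u) == 1u);
  assert(LooksFreed(0x00010100u));
  assert(!LooksFreed(0x00550104u));
}
```

A helper like this is handy when reading raw memory in a crash dump: any live Dart object should have a class id other than 1.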
Looking across these crashes, they all occur in Expando, and in every case the object is referenced by a WeakProperty. Below is the code of Expando.[]=:
if (_used < _limit) {
  var ephemeron = new _WeakProperty();
  ephemeron.key = object;
  ephemeron.value = value;
  _data[idx] = ephemeron;
  _used++;
  return;
}
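Analysis and investigation
From the source above, the reference chain is Expando instance => _data array => WeakProperty instance => value.
So what caused value to be mis-marked? I read through ProcessWeakProperty in the marker and found nothing obviously wrong. The crash could still be reproduced by running monkey tests, though, so the next step was to add some logging. The idea was to print the current state in both _rehash and []=.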
DEFINE_NATIVE_ENTRY(WeakProperty_validate, 0, 2) {
  GET_NON_NULL_NATIVE_ARGUMENT(WeakProperty, weak_property,
                               arguments->NativeArgAt(0));
  GET_NON_NULL_NATIVE_ARGUMENT(String, tag, arguments->NativeArgAt(1));
  ObjectPtr value = weak_property.value();
  ObjectPtr key = weak_property.key();
  OS::PrintErr(
      "WeakProperty_validate %s: weak: %p, IsOld: %d: value: %p, key: %p, "
      "value cid: %zx, key cid: %zx, old space phase: %d.\n",
      tag.ToCString(), weak_property.ptr()->untag(),
      weak_property.ptr()->IsOldObject(), value->untag(), key->untag(),
      value->GetClassId(), key->GetClassId(),
      thread->heap()->old_space()->phase());
  return Object::null();
}
...
@@ -85,6 +90,8 @@ class Expando<T> {
       var ephemeron = new _WeakProperty();
       ephemeron.key = object;
       ephemeron.value = value;
+      if (_should_validate)
+        ephemeron.validate("from []=");
       _data[idx] = ephemeron;
       _used++;
       return;
...
         // Ensure that the entry.key is not cleared between checking for it and
         // inserting it into the new table.
+        if (_should_validate)
+          entry.validate("from _rehash");
         var val = entry.value;
         var key = entry.key;
         if (key != null) {
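The following log was collected when the crash occurred.
The log shows that, for the crashing object (highlighted in yellow), a _rehash occurred during Concurrent Marking (GC phase 1), and the WeakProperty was allocated in Old Space (Figure 1). Another _rehash then occurred during Parallel Sweeping, and this time the WeakProperty was allocated from New Space (Figure 2). Finally came the crash: with the GC phase at Done, a _rehash occurred, and by then the cid of value was already 1, that is, FreeListElement.
Root cause analysis
At this point the picture is clear. Objects allocated during the Concurrent Marking phase are marked immediately.
They are then pushed onto the Deferred Marking Stack: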
void StubCodeCompiler::GenerateAllocateObjectSlowStub(Assembler* assembler) {
  ...
  __ CallRuntime(kAllocateObjectRuntimeEntry, 2);
  // Load result off the stack into result register.
  __ ldr(kInstanceReg, Address(SP, 2 * target::kWordSize));
  // Write-barrier elimination is enabled for [cls] and we therefore need to
  // ensure that the object is in new-space or has remembered bit set.
  => EnsureIsNewOrRemembered(assembler, /*preserve_registers=*/false);
  ...
static void EnsureIsNewOrRemembered(Assembler* assembler,
                                    bool preserve_registers = true) {
  ...
  => __ CallRuntime(kEnsureRememberedAndMarkingDeferredRuntimeEntry, 2);
  ...
DEFINE_LEAF_RUNTIME_ENTRY(uword /*ObjectPtr*/,
                          EnsureRememberedAndMarkingDeferred,
                          2,
                          uword /*ObjectPtr*/ object_in,
                          Thread* thread) {
  ObjectPtr object = static_cast<ObjectPtr>(object_in);
  ...
  // For incremental write barrier elimination, we need to ensure that the
  // allocation ends up in the new space or else the object needs to added
  // to deferred marking stack so it will be [re]scanned.
  if (thread->is_marking()) {
    => thread->DeferredMarkingStackAddObject(object);
  }
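When the marker gets to it, the WeakProperty is already marked, so it is never put into the weak set for fixed-point processing. As a result, value ends up as a dangling pointer: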
void ProcessDeferredMarking() {
  ObjectPtr raw_obj;
  while ((raw_obj = deferred_work_list_.Pop()) != nullptr) {
    ASSERT(raw_obj->IsHeapObject() && raw_obj->IsOldObject());
    // N.B. We are scanning the object even if it is already marked.
    => bool did_mark = TryAcquireMarkBit(raw_obj);
    // For the pre-marked WeakProperty, did_mark is always false here.
    ...
    size = ProcessWeakProperty(raw_weak, did_mark);
intptr_t ProcessWeakProperty(WeakPropertyPtr raw_weak, bool did_mark) {
  // The fate of the weak property is determined by its key.
  ObjectPtr raw_key = LoadPointerIgnoreRace(&raw_weak->untag()->key_);
  if (raw_key->IsHeapObject() && raw_key->IsOldObject() &&
      !raw_key->untag()->IsMarked()) {
    // Key was white. Enqueue the weak property.
    if (did_mark) {
      => EnqueueWeakProperty(raw_weak);
      // did_mark is false, so the weak property never enters the weak set.
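The effect of the did_mark guard can be reproduced in a toy model. The sketch below is not the VM's marker: Obj, ToyMarker, ValueSurvives, and the enqueue_even_if_marked switch are all hypothetical names introduced for illustration. It only models a weak property that is already marked when it reaches the deferred marking stack, with a key that becomes reachable later in the same marking cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Obj {
  bool marked = false;
  bool is_weak_property = false;
  Obj* key = nullptr;    // only meaningful for weak properties
  Obj* value = nullptr;  // only meaningful for weak properties
};

class ToyMarker {
 public:
  // Returns true only if this call flipped the mark bit.
  static bool TryAcquireMarkBit(Obj* o) {
    if (o->marked) return false;
    o->marked = true;
    return true;
  }

  static void Mark(Obj* o) {
    if (o != nullptr) o->marked = true;
  }

  void AddDeferred(Obj* o) { deferred_.push_back(o); }

  // enqueue_even_if_marked == false mirrors the guard in the listing above.
  void ProcessDeferred(bool enqueue_even_if_marked) {
    while (!deferred_.empty()) {
      Obj* o = deferred_.back();
      deferred_.pop_back();
      const bool did_mark = TryAcquireMarkBit(o);
      if (!o->is_weak_property || o->key == nullptr) continue;
      if (!o->key->marked) {
        // Key is still white: the weak property has to wait in the weak set.
        if (did_mark || enqueue_even_if_marked) weak_.push_back(o);
      } else {
        Mark(o->value);  // key already live, so the value is live too
      }
    }
  }

  // Fixed point: a pending weak property whose key got marked keeps its
  // value alive. (Values are not traced further in this toy model.)
  void ProcessWeakProperties() {
    bool progress = true;
    while (progress) {
      progress = false;
      for (std::size_t i = 0; i < weak_.size();) {
        Obj* wp = weak_[i];
        if (wp->key->marked) {
          Mark(wp->value);
          weak_[i] = weak_.back();
          weak_.pop_back();
          progress = true;
        } else {
          ++i;
        }
      }
    }
  }

 private:
  std::vector<Obj*> deferred_;
  std::vector<Obj*> weak_;
};

// Returns whether the value is marked at the end of the cycle.
// An unmarked value would be swept while the Expando still points at it.
inline bool ValueSurvives(bool enqueue_even_if_marked) {
  Obj key, value, weak;
  weak.is_weak_property = true;
  weak.key = &key;
  weak.value = &value;
  weak.marked = true;  // allocated in old space during marking: pre-marked

  ToyMarker marker;
  marker.AddDeferred(&weak);
  marker.ProcessDeferred(enqueue_even_if_marked);  // key is still white here
  ToyMarker::Mark(&key);  // key is reached later through a strong reference
  marker.ProcessWeakProperties();
  return value.marked;
}

inline void CheckToyMarker() {
  assert(!ValueSurvives(false));
  assert(ValueSurvives(true));
}
```

Under these assumptions, ValueSurvives(false) returns false: the value stays unmarked and would be swept even though its key is live. ValueSurvives(true) returns true. The switch exists only to isolate the effect of the guard; it is not meant to describe how the bug was fixed upstream.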
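In addition, because the allocation of the WeakProperty and the stores to key and value sit right next to each other, the optimizer marks these two stores as needing no write barrier. Normally this is correct: new WeakProperty returns a New Space object, and stores into it do not need to be remembered. During Concurrent Marking, however, new WeakProperty can return an Old Space object, and in that case a write barrier is actually required. With a write barrier, both key and value would be marked on the spot by the write barrier stub and pushed onto the mark stack, and nothing would go wrong. This is also why 1.X did not have the problem.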
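The barrier reasoning above can be sketched as a small decision function. This is a simplified model of a combined generational and incremental-marking write barrier, not the VM's stub: the Obj fields, BarrierAction, and WriteBarrier are hypothetical names chosen for this example.

```cpp
#include <cassert>

struct Obj {
  bool is_old;  // lives in old space
  bool marked;  // mark bit set by the marker
};

enum class BarrierAction {
  kNone,            // nothing to do
  kRememberSource,  // old -> new store: source goes to the remembered set
  kMarkTarget,      // old -> unmarked old store during marking:
                    // mark the target and push it onto the marking stack
};

// Simplified decision for a store `source.field = target`.
inline BarrierAction WriteBarrier(const Obj& source, const Obj& target,
                                  bool is_marking) {
  if (!source.is_old) return BarrierAction::kNone;
  if (!target.is_old) return BarrierAction::kRememberSource;
  if (is_marking && !target.marked) return BarrierAction::kMarkTarget;
  return BarrierAction::kNone;
}

inline void CheckBarrierCases() {
  const Obj value{true, false};    // unmarked old-space value
  const Obj new_weak{false, false};  // what the optimizer assumes it got
  const Obj old_weak{true, true};  // old-space allocation during marking
  // Eliminating the barrier is harmless for a new-space container...
  assert(WriteBarrier(new_weak, value, true) == BarrierAction::kNone);
  // ...but an old-space container would have needed the target marked.
  assert(WriteBarrier(old_weak, value, true) == BarrierAction::kMarkTarget);
}
```

In this model the eliminated barrier is a no-op only when the freshly allocated WeakProperty is in new space. When the allocation lands in old space while marking is in progress, the skipped barrier is exactly the step that would have kept value alive.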
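I reported the findings to Dart engineer 雨果洛夫, who filed a bug on GitHub and thanked me.
Summary
This is probably among the top 5 hardest bugs I have run into. It was caused by several edge conditions colliding, and it really tested my imagination and my familiarity with the GC, code generation, object layout, and related modules.
Even the Dart team itself spent a long time tracking down this kind of problem. Other teams running into it would most likely be at a complete loss.
How do you stay calm when you hit hard problems like these? By choosing the work of the hummer and U4 teams, with their deep technical expertise, of course. A strong team safeguards the stability of your product and empowers hundreds of millions of users.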