As shown in Figure 1, a linear neural network with a single hidden layer is first pretrained on 8 image classes, and the pretrained network then learns a new class (Boots) under two conditions. FoL (focused learning): only images of the new class are presented; FIL (fully interleaved learning): images of all classes are presented. The first two columns show how recall on the new class and on the previously trained old classes changes over training epochs under the two strategies; the third column shows overall accuracy across all classes over epochs; the fourth column shows the cross-entropy loss over epochs. Under FoL, information about the previously learned old classes is forgotten and recall and accuracy drop sharply, whereas interleaving the new-class images with images of all the old classes keeps performance stable.

Figure 2. Training performance of the neural network under five training strategies.

Figure 2 shows how the performance of the same pretrained single-hidden-layer network changes over epochs when it learns the new class (Boots) under five different training strategies. Panel A illustrates the five strategies: FoL; FIL; PIL (partial interleaved learning), in which a subset of the 9 classes is used in each epoch (n = 350 images per epoch, 39 images/class); SWIL (similarity-weighted interleaved learning), in which the number of images of each old class presented while learning the new class is determined by that class's similarity to the new class (Boots), with more similar classes contributing more images; and EqWIL (equally weighted interleaved learning), in which the number of new-class images matches SWIL but the 8 old classes are weighted equally, giving 75 images/class. Panel B shows, for each of the five strategies, how recall evolves over epochs on the new class, on the highly similar old classes (Sneaker, Sandal), and on the other six less similar classes, together with the overall accuracy and the cross-entropy loss during training. The SWIL strategy proposed by the authors clearly mitigates the network's forgetting of old information while also improving training time efficiency.

Figure 3. Similarity computation and clustering results for the image classes.

How the similarity between classes is computed is therefore another key point. In short, the authors use established methods to compute similarity at the feature level: they compute the cosine similarity between the mean activation vectors of existing-class samples and new-class samples at a target hidden layer. Figure 3A shows the resulting similarity matrix between classes; panel B shows the clustering derived from panel A; panel C shows the confusion matrix after learning the "Boots" class with the FIL strategy. Diagonal values are removed to make the scaling clearer.

Recent brain-inspired approaches to mitigating catastrophic interference fall into two families: 1. regularization-based methods and 2. generative-replay-based methods. Regularization-based methods include EWC (Elastic Weight Consolidation), Learning without Forgetting, and Synaptic Intelligence; they typically estimate the importance of each parameter and add a regularization term that penalizes changes to the most relevant parameters or to the mapping function (a minimal sketch of such a penalty appears at the end of this section). These methods usually suffer when many new classes must be learned incrementally. Generative-replay methods typically pair a deep generative network with a task-solving network: during relearning, new-class samples are mixed with generated pseudo-data that captures the representative statistics of previously learned material. Generative replay avoids revisiting old samples, but the difficulty shifts to improving the generator, itself an important and hard problem.

On the other hand, the authors compute the similarity between the new class's mean target-layer activation and those of the old classes. Because they focus on learning dynamics at the class level rather than on individual attributes shared across classes, it would also be interesting to study the learning dynamics of individual feature maps. Similarity could also be computed at different levels: at the pixel level or at the feature level.
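To make this concrete, below is a minimal sketch of how such a similarity-weighted allocation could be computed: cosine similarities between the mean hidden-layer activations of each old class and the new class are turned into per-class image counts for SWIL, with an equal split shown for comparison (EqWIL). This is an illustrative reconstruction, not the authors' exact code; the array shapes, class names, and image budget are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def interleaving_counts(old_activations, new_activations, old_budget):
    """
    Decide how many images of each old class to interleave per epoch.

    old_activations : dict {class_name: (N_c, D) array} of hidden-layer
                      activations for samples of each previously learned class
    new_activations : (N_new, D) array of hidden-layer activations for the
                      new-class samples
    old_budget      : total number of old-class images presented per epoch
    """
    new_mean = new_activations.mean(axis=0)
    # Similarity of each old class to the new class, measured between mean
    # activation vectors at the chosen hidden layer (clipped at 0 in this sketch).
    sims = {c: max(cosine_similarity(acts.mean(axis=0), new_mean), 0.0)
            for c, acts in old_activations.items()}
    total = sum(sims.values())
    # SWIL: per-class counts proportional to similarity with the new class.
    swil = {c: int(round(old_budget * s / total)) for c, s in sims.items()}
    # EqWIL: same total budget, spread equally over all old classes.
    eqwil = {c: old_budget // len(sims) for c in sims}
    return sims, swil, eqwil

# Toy usage: 8 old classes with random prototype directions; the new class is
# built to overlap mostly with class_7, so SWIL draws most old images from it.
rng = np.random.default_rng(0)
protos = rng.normal(size=(8, 64))
old = {f"class_{i}": protos[i] + 0.1 * rng.normal(size=(100, 64)) for i in range(8)}
new = 0.9 * protos[7] + 0.1 * rng.normal(size=(100, 64))
sims, swil, eqwil = interleaving_counts(old, new, old_budget=600)
print(swil)   # most of the 600 old images come from class_7; eqwil gives 75/class
```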
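For the regularization-based family mentioned above, the sketch below illustrates the kind of quadratic penalty used by EWC-style methods: each parameter's deviation from its previously learned value is penalized in proportion to an importance estimate. It assumes a PyTorch model plus precomputed `old_params` and `fisher` dictionaries; these names and the weighting `lam` are illustrative, not taken from the paper.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=100.0):
    """
    EWC-style quadratic penalty: lam/2 * sum_i F_i * (theta_i - theta_i_old)^2.

    old_params : dict {name: tensor} parameters saved after learning old tasks
    fisher     : dict {name: tensor} per-parameter importance estimates
                 (e.g., a diagonal Fisher approximation), same shapes as params
    """
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on new data the penalty is simply added to the task loss:
#   loss = task_loss(outputs, targets) + ewc_penalty(model, old_params, fisher)
```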
Paper Abstract
Understanding how the brain learns throughout a lifetime remains a long-standing challenge. In artificial neural networks (ANNs), incorporating novel information too rapidly results in catastrophic interference, i.e., abrupt loss of previously acquired knowledge. Complementary Learning Systems Theory (CLST) suggests that new memories can be gradually integrated into the neocortex by interleaving new memories with existing knowledge. This approach, however, has been assumed to require interleaving all existing knowledge every time something new is learned, which is implausible because it is time-consuming and requires a large amount of data. We show that deep, nonlinear ANNs can learn new information by interleaving only a subset of old items that share substantial representational similarity with the new information. By using such similarity-weighted interleaved learning (SWIL), ANNs can learn new information rapidly with a similar accuracy level and minimal interference, while using a much smaller number of old items presented per epoch (fast and data-efficient). SWIL is shown to work with various standard classification datasets (Fashion-MNIST, CIFAR10, and CIFAR100), deep neural network architectures, and in sequential learning frameworks. We show that data efficiency and speedup in learning new items are increased roughly proportionally to the number of nonoverlapping classes stored in the network, which implies an enormous possible speedup in human brains, which encode a high number of separate categories. Finally, we propose a theoretical model of how SWIL might be implemented in the brain.