
Chip publishes a review paper from Wu Huaqiang's team at Tsinghua University: trends and challenges of RRAM-based computing-in-memory systems

FUTURE远见  2022-04-13

FUTURE | 远见  Selected and edited by 闵青云

Recently, Wu Huaqiang's team at Tsinghua University published a comprehensive review article in Chip, entitled "Trends and challenges in the circuit and macro of RRAM-based computing-in-memory systems", which surveys the development trends and challenges of RRAM-based computing-in-memory systems. The first author is Songtao Wei, and the corresponding authors are Bin Gao, Dong Wu, and Huaqiang Wu. Chip is the world's only comprehensive international journal focused on chip-related research, and is one of the journals recognized for the "three categories of high-quality papers" under the national high-starting-point new journal program.


RRAM-based computing-in-memory: trends and challenges


In today's era of exploding data, artificial intelligence (AI) has achieved great success in handling increasingly complex and practical problems, and in some specific domains it even surpasses humans. The conventional von Neumann architecture has been studied extensively over the past decades because of its high precision and its excellent performance in logic operations and instruction-intensive tasks. In data-intensive tasks, however, the massive data movement it requires leads to challenges such as the memory wall. Moreover, the separation of memory and processing units in the von Neumann architecture introduces large latency and power overhead during data transfer. By avoiding this massive data movement, computing-in-memory (CIM) systems offer high energy efficiency and large throughput, and their excellent performance has attracted increasing attention in recent years.


Conventional von Neumann architecture with separated computing and memory (top): the computing units and memory units exchange data over a bus and are limited by the bus bandwidth and the power overhead of data transfer. Computing-in-memory (bottom): each processing element (PE) performs computation where the data are stored, so only inter-PE data exchange is needed.


Convolutional neural networks (CNNs) are currently the most popular type of neural network thanks to their excellent image-processing capability. The core operation of a CNN is the matrix-vector multiplication, which can be implemented naturally on a resistive array using nothing more than Ohm's law and Kirchhoff's current law. Moreover, because the resistance of each cell is tunable, a CNN can be mapped onto a physical resistive array, which allows different neural networks to be deployed on the same hardware resources.
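
To make the mapping concrete, the following is a minimal NumPy sketch (not from the paper; the array size, read voltages, and conductance window are illustrative assumptions) of how a matrix-vector product reduces to Ohm's law and Kirchhoff's current law on a crossbar: each weight becomes a conductance, each input becomes a row voltage, and each column current accumulates one dot product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 128 rows (inputs) x 64 columns (outputs).
n_rows, n_cols = 128, 64

# Map normalized weights onto an assumed conductance window, e.g. 1 uS - 100 uS.
g_min, g_max = 1e-6, 100e-6
weights = rng.uniform(0.0, 1.0, size=(n_rows, n_cols))       # normalized positive weights
conductance = g_min + weights * (g_max - g_min)               # Siemens, one RRAM cell per weight

# Input activations encoded as row voltages (e.g. 0 - 0.2 V read voltages).
voltages = rng.uniform(0.0, 0.2, size=n_rows)

# Ohm's law gives each cell current; Kirchhoff's current law sums them along each column (bit line).
column_currents = voltages @ conductance                      # shape (n_cols,), in Amperes

# Identical to a conventional matrix-vector multiplication, up to the scaling of the mapping.
assert np.allclose(column_currents, conductance.T @ voltages)
```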


Resistive random-access memory (RRAM) is a two-terminal device with a sandwich-like structure, whose conductance can be rewritten by changing the voltage applied across its two electrodes. Compared with other devices used to implement computing-in-memory, RRAM is considered highly promising because of its fast programming, good retention and endurance, relatively low programming power, and compatibility with CMOS technology. RRAM-based CIM systems are now attracting substantial research effort across many fields, including devices, circuits, architectures, algorithms, and tool chains.


End-to-end implementation of RRAM-based computing-in-memory. Neural network weights are mapped onto the corresponding arrays; data-conversion circuits outside the arrays apply the inputs, read out the outputs, and pass data between arrays to complete the network computation. Arrays of RRAM devices with tunable resistance naturally perform matrix-vector multiplication (the photo at the top left is from Baidu).


This article mainly reviews innovations at the circuit and macro (array) level in RRAM-based CIM systems. Since the computation is analog, analog circuit design is the most fundamental and important part of realizing an RRAM-based CIM system. In general, the energy efficiency and throughput of the whole system are limited by the analog circuits, especially the input and output interfaces. We therefore focus on recent innovations in the input and output circuits of RRAM-based CIM systems, as well as in the arrays themselves.


First, we introduce innovations at the array level. The operation units of RRAM-based CIM systems generally include passive crossbar arrays (1R), pseudo-crossbar arrays (1T1R), and other array types. In addition, the 2T2R structure uses a pair of transistors and resistors to represent both positive and negative weights. A higher-level consideration is how to arrange the data flow and configuration to achieve high throughput with limited peripheral-circuit resources. The article summarizes several silicon-verified array-level innovations.
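
As an illustration of the differential idea behind 2T2R (a sketch under assumed conductance ranges, not one of the circuits surveyed in the paper), a signed weight can be stored as the difference of two conductances, and the signed dot product is recovered by subtracting the two column currents:

```python
import numpy as np

rng = np.random.default_rng(1)

g_min, g_max = 1e-6, 100e-6          # assumed conductance window of a single cell
n_rows, n_cols = 64, 32

# Signed weights in [-1, 1].
weights = rng.uniform(-1.0, 1.0, size=(n_rows, n_cols))

# Differential mapping: the positive part goes to the "plus" cell, the negative part to the "minus" cell.
scale = g_max - g_min                 # conductance swing representing |weight| = 1
g_plus = g_min + np.clip(weights, 0, None) * scale
g_minus = g_min + np.clip(-weights, 0, None) * scale

voltages = rng.uniform(0.0, 0.2, size=n_rows)

# Each column's signed MAC value is the difference of its two bit-line currents.
i_signed = voltages @ g_plus - voltages @ g_minus

# Equivalent to the ideal signed matrix-vector product, scaled by the conductance swing.
assert np.allclose(i_signed, (voltages @ weights) * scale)
```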


Second, higher input precision is necessary to achieve higher network inference accuracy, but there is an inherent trade-off among latency, power, and area. Multibit inputs represented by voltage amplitude require power-hungry digital-to-analog converters (DACs), which must settle within a short time to maintain the throughput of the whole system. In general, input schemes include single-bit serial input, fully parallel multibit input, and a compromise between the two, which uses fewer cycles than the bit-serial scheme but fewer bits per cycle than the fully parallel scheme. The article also reviews innovations in the input circuits of RRAM-based CIM systems.
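
The following is a minimal sketch of the bit-serial end of this trade-off (the 4-bit inputs and conductance values are assumptions for illustration, not a circuit from the review): each input bit drives the array in its own cycle, and the partial sums are shifted and added digitally, trading DAC precision for extra cycles.

```python
import numpy as np

rng = np.random.default_rng(2)

n_rows, n_cols, n_bits = 64, 32, 4          # assumed 4-bit unsigned inputs
conductance = rng.uniform(1e-6, 100e-6, size=(n_rows, n_cols))
inputs = rng.integers(0, 2 ** n_bits, size=n_rows)

v_read = 0.2                                 # read voltage representing a logic '1'

# Bit-serial scheme: one input bit per cycle, then digital shift-and-add of the partial sums.
acc = np.zeros(n_cols)
for b in range(n_bits):
    bit = (inputs >> b) & 1                  # b-th bit of every input, 0 or 1
    partial = (bit * v_read) @ conductance   # one analog MVM pass with single-bit inputs
    acc += partial * (2 ** b)                # weight the partial sum by the bit significance

# Fully parallel reference: the whole multibit input applied at once (needs a multilevel DAC).
reference = (inputs * v_read) @ conductance
assert np.allclose(acc, reference)
```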


Finally, we introduce output schemes at both the circuit and architecture levels. On the one hand, because of device variation and the shrinking ratio between the high and low resistance states, the margin between adjacent multiply-accumulate (MAC) results keeps decreasing. This makes it harder for output circuits such as ADCs and sense amplifiers (SAs) to distinguish outputs that differ only slightly within a short time, which requires the output circuits to have both small offset and high speed. Speed and offset, however, are usually in conflict, because offset cancellation takes extra time. On the other hand, the system can also be optimized at the architecture level. The article briefly shows how several architecture-level optimizations effectively improve the energy efficiency and throughput of the whole RRAM-based CIM system.
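
To show why the sensing margin matters, here is a toy Monte-Carlo sketch in NumPy (the 64-cell column, 5% device variation, read voltage, and conductance values are illustrative assumptions, not numbers from the paper): it compares the nominal current step between adjacent MAC values with the 3-sigma spread caused by device-to-device variation, and a lower high/low resistance ratio quickly erodes the margin the ADC or SA must resolve.

```python
import numpy as np

rng = np.random.default_rng(3)

n_rows = 64                     # cells per column
v_read = 0.2                    # read voltage for an active row
g_on = 100e-6                   # assumed low-resistance-state conductance
sigma = 0.05                    # assumed 5% device-to-device variation

def column_current(mac_value, r_ratio, trials=10000):
    """Sample the column current when `mac_value` cells are ON and the rest are OFF."""
    g_off = g_on / r_ratio
    g = np.full((trials, n_rows), g_off)
    g[:, :mac_value] = g_on
    g *= rng.normal(1.0, sigma, size=g.shape)      # multiplicative Gaussian variation per cell
    return v_read * g.sum(axis=1)

for r_ratio in (100, 10, 2):
    i_k = column_current(16, r_ratio)
    i_k1 = column_current(17, r_ratio)
    step = i_k1.mean() - i_k.mean()                # nominal current difference of adjacent MAC values
    spread = 3 * max(i_k.std(), i_k1.std())        # 3-sigma spread caused by variation
    print(f"R-ratio {r_ratio:3d}: step = {step * 1e6:5.2f} uA, 3-sigma spread = {spread * 1e6:5.2f} uA")
```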


As a promising way to implement future AI, RRAM-based CIM systems will certainly play an increasingly important role in more complex and practical tasks. Going forward, researchers in circuits, devices, architectures, and other fields will need to work together to make RRAM-based CIM systems more competitive.


Trends and challenges in the circuit and macro of RRAM-based computing-in-memory systems


Nowadays, artificial intelligence (AI) has been successful in dealing with more and more complex and practical problems in such a data-centric era. In some specific classes of problems, AI can even exceed the capability of humans. Conventional von Neumann architecture has been well studied in the past few decades because of its high accuracy and capability of doing logic operations as well as instruction-intensive tasks. However, when dealing with data-intensive tasks that require huge amounts of data to be fed into AI systems, von Neumann architecture faces many challenges such as the memory wall and excessive energy and time consumption due to the separation of memory and processing elements. By avoiding huge amounts of data movement in conventional von Neumann architecture, computing-in-memory (CIM) systems have been gaining more and more attention due to their high energy efficiency and throughput in data-intensive AI tasks.


Convolutional neural networks (CNNs) are among the most popular neural networks because of their capability in image processing and other tasks. The core operation in any CNN is the matrix-vector multiplication. This operation can be naturally implemented on a resistive array according to Ohm's law and Kirchhoff's current law. Besides, the resistance of every cell can be adjusted to map the CNN's weights onto a physical array. As a result, different neural networks can be implemented on the same hardware resources.


Resistive random-access memory (RRAM) is a two-terminal device with a sandwich-like structure. Its conductance can be modulated by the voltage applied across its two electrodes. Compared with its traditional counterparts, RRAM is a promising candidate due to its fast programming, good retention and endurance, relatively low programming power, as well as compatibility with CMOS technology. As a result, more and more research effort has been devoted to pushing the development of RRAM-based CIM from different perspectives, such as devices, circuits, architectures, algorithms, and tool chains.



This paper mainly reviews innovations at the macro and circuit level in RRAM-based CIM. Since RRAM-based CIM is a type of analog computing, analog circuit design is the most basic and important part of its realization. In general, the energy efficiency and throughput are mainly limited by the analog circuits, especially the input and output interfaces. We investigate recent innovations in the input and output interfaces of RRAM-based CIM as well as in the array and macro itself.


We first introduce some innovations concerning the array and macro. Operation units of RRAM-based CIM arrays generally include passive crossbar arrays (1R), pseudo-crossbar arrays (1T1R), and other array types. Apart from these, the 2T2R structure can represent either positive or negative weights using two differential 1T1R cells. Another consideration at a higher level is how to arrange data location and flow to achieve relatively large throughput with a limited number of peripheral circuits. We review several silicon-verified RRAM CIM systems with a focus on the array and macro.


Second, increasing the precision of inputs is necessary to achieve higher inference accuracy. However, there is an inherent trade-off among latency, power, and area. Multibit inputs represented by amplitude differences require power-hungry digital-to-analog converters (DACs), which need to settle in a short time to maintain the throughput of the entire system. Generally, input schemes include single-bit serial input, multibit input in one cycle, and a compromise between them, which uses fewer cycles than single-bit serial input but fewer bits per cycle than full multibit input in one cycle. Some silicon-verified innovations in input interface circuits and schemes are also reviewed.


Third, we introduce recent progress on output schemes in terms of circuits and architecture. On the one hand, because of device variation and the decreased R-ratio, the sensing margin between different MAC values becomes smaller. As a result, output circuits such as ADCs and SAs have difficulty distinguishing the subtle differences between MAC currents and reference currents within a reasonable period, which requires the output circuits to have small offset and high speed. Speed and offset, however, are often contradictory, because offset cancellation needs an extra phase that is time-consuming. On the other hand, further gains can be obtained by optimizing the higher-level architecture. Some innovations are introduced in this review to briefly show how effectively architecture-level modifications can improve the energy efficiency and throughput of the whole RRAM-based CIM system.


As a promising candidate for future AI applications, RRAM-based CIM will certainly play an important role in more practical and complex tasks. Concerted research efforts in circuits, devices, architecture, and all other relevant fields are necessary to make RRAM-based CIM more competitive.


Preprint of the article: https://www.sciencedirect.com/science/article/pii/S2709472322000028?v=s5


About Chip


Chip is the world's only comprehensive international journal focused on chip-related research. It has been selected for the "High-Starting-Point New Journal Project of the China Science and Technology Journal Excellence Action Plan", jointly implemented by the China Association for Science and Technology, the Ministry of Education, the Ministry of Science and Technology, the Chinese Academy of Sciences, and other agencies, and is one of the journals encouraged by the Ministry of Science and Technology for publishing the "three categories of high-quality papers".


Chip is published by Shanghai Jiao Tong University in partnership with Elsevier, and collaborates with many well-known academic organizations at home and abroad to provide a high-quality communication platform for academic conferences.


Chip adheres to its founding philosophy, "All About Chip", aiming to publish cutting-edge breakthroughs in all chip-related research fields and to advance the future of chip technology. To date, Chip has assembled an editorial board of 67 world-renowned experts and scholars from 13 countries, including members of Chinese and foreign academies as well as Fellows of IEEE, ACM, and other prestigious international societies.


The first issue of Chip was published in March 2022 on the Elsevier Chip website in fully open-access form. You are welcome to visit the site and read the articles.


Chip website at Elsevier:

https://www.journals.elsevier.com/chip





-- Contributed by the Chip editorial team


Further reading
01 Chip publishes a review paper from Shi Yanli's team at Yunnan University: progress in near-infrared single-photon detectors
02 New Oxford University study in Nature: even mild COVID-19 can cause brain degeneration
03 Jin Xianmin and Tang Hao's group at Shanghai Jiao Tong University realize Haar-random unitary matrices based on quantum random walks


End
