『译』计算机体系结构发展史(六)
系列文章第六篇(对应M.7)
往期文章
M.7 The History of Multiprocessors and Parallel Processing (Chapter 5 and Appendices F, G, and I)
SIMD Computers: Attractive Idea, Many Attempts, No Lasting Successes
SIMD模型是最早的并行计算模型之一,其历史可以追溯到第一个大型多处理器系统Illiac IV。这些计算机没有像矢量体系结构那样流水地进行数据计算,而是通过功能单元的阵列进行并行计算。因此,它们也可能被视为阵列处理器。
SIMD风格计算机的思想最早来自Unger [1958]和Slotnick,Borck and McReynolds [1962]。Slotnick的Solomon设计了Illiac IV(也许是最“臭名昭著”的超级计算机项目)的基础。尽管成功地推出了在后来的项目中证明有用的多种技术,但它作为计算机还是失败了。即使只建造了计划中的多处理器系统的四分之一,但成本从1966年的800万美元增加到1972年的3,100万美元。(以2011年的美元价值计算,即从5400万美元增加到1.52亿美元。)整个系统的实际性能最多为15 MFLOPS,而最初的预测是1000 MFLOPS [Hord 1982]。该计算机于1972年交付给NASA Ames Research,但还需要经过三年的工程才能投入使用。这些事件减慢了对SIMD的研究,但是Danny Hillis [1985]在所谓的Connection Machine中恢复了这种架构,它具有65,536个1位处理器。
实际的SIMD计算机需要混合使用SISD和SIMD指令。一个SISD主机可以执行不需要并行的操作,例如分支和地址计算。SIMD指令被广播到所有执行单元,每个执行单元(execution units)都有自己的一组寄存器。为了提高灵活性,在SIMD指令执行期间,某个执行单元可以被禁用。此外,大规模并行SIMD多处理器系统依赖于互连或通信网络在处理元件(processing elements)之间交换数据。
SIMD最适合处理for循环中的数组。因此,要想充分利用SIMD架构实现大规模并行处理,就必须有大量数据,或者数据本身有很好的并行性(data parallelism)。SIMD架构对于case语句的处理则效率很低,因为每个执行单元必须根据其数据执行不同的操作。这种情况下,需要将拿到“错误”数据的执行单元禁用,以便让需要的执行单元得以继续运行。因此,系统基本上是以1/n的性能在运行,其中n是case语句的分支数量。
SIMD多处理器设计的最基本的权衡是性能与处理器数量之间的关系。80年代的SIMD超级计算机都强调了高度的并行性而非在单个处理器的性能。例如,Connec- tionMultiprocessor 2提供了65,536个单比特处理器,而Illiac IV则计划使用64个64比特处理器。
在80年代,首先是Thinking Machines,继而由MasPar复活之后,SIMD价格在超级计算机领域逐渐消失,要原因有两个:首先,它太僵化了。许多重要问题本身并不是数据并行的,并且体系结构无法有效的缩减规模。也就是说,与替代产品相比,小规模的SIMD多处理器系统的性价比通常较低。其次,SIMD无法利用80年代SISD(单指令,单数据)微处理器技术的巨大性能和成本优势,该技术每18个月的性能就翻一番。SIMD多处理器系统的设计者并没有利用这种低成本技术,而必须为其多处理器构系统研制定制的处理器。
尽管SIMD计算机并不适合通用场景,但这种架构风格将继续在特殊用途的场景中发挥作用。许多特殊用途的任务是处理高度并行的数据,并且只需要一组功能有限的处理单元。因此,设计人员可以设计对特殊操作的硬件,以及硬连线方式的互联通路来连接这些功能单元。这样的架构通常被称为阵列处理器(array processors),它们对于诸如图像和信号处理之类的任务很有效。
Other Early Experiments
要确定第一个MIMD多处理器是非常困难的。例如,我们很难想到,Eckert-Mauchly Corporation的第一台计算机具有重复的功能单元以提高可用性。Holland(1959)提出了针对多个处理器的早期论证。在记录保留的最充分的多处理器项目中,有两个是70年代在卡内基梅隆大学进行的。其中第一个是C.mmp [Wulf and Bell 1972; Wulf and Harbison 1978],由16个PDP-11组成,这些PDP-11通过交叉连接访问16个存储单元。它是最早的具有多个处理器的多处理器系统之一,并且具有共享内存编程模型。C.mmp项目中的许多研究重点都放在软件上,尤其是在操作系统领域。后来的多处理器系统Cm* [Swan et al. 1977]是一个基于集群的多处理器系统,具有分布式内存,其访问时间是不一致的(nonunitorm)。由于缺少高速缓存,加之远程访问的延迟很大,在这个系统中数据放置的位置是至关重要的。Gehringer,Siewiorek and Segall [1987]对这种多处理器系统和许多应用实验进行了很好的讨论。到了80年代,微处理器的出现使制造多处理器系统的成本大大降低,这些早期多处理器系统中探索的许多想法重新得到应用。
Great Debates in Parallel Processing
The turning away from the conventional organization came in the middle 1960s, when the law of diminishing returns began to take effect in the effort to increase the operational speed of a computer. … Electronic circuits are ultimately limited in their speed of operation by the speed of light … and many of the circuits were already operating in the nanosecond range.
Bouknight et al. [1972]
… sequential computers are approaching a fundamental physical limit on their potential computational power. Such a limit is the speed of light …
Angel L. DeCegama
The Technology of Parallel Processing, Vol. I (1989)
… today’s multiprocessors … are nearing an impasse as technologies approach the speed of light. Even if the components of a sequential processor could be made to work this fast, the best that could be expected is no more than a few million instructions per second.
David Mitchell
The Transputer: The Time Is Now (1989)
上面的几段引文,都是关于并行处理的经典讨论,而Amdahl [1967]对于这个问题也给出了他经典的回答(其结论支持了IBM 360体系结构持续改进)。关于并行处理的争论可以追溯到19世纪[Menabrea 1842]!但是,通过多处理器架构来减少单个重要程序的处理延迟,仍然是在探索中的问题。除了这些关于并行处理的优点和局限性的争论之外,一些热烈的争论集中在如何构建多处理器系统上。预测未来非常困难,但在1989年,戈登·贝尔(Gordon Bell)对1995年做出了两项预测。我们将这些预测包括在本书的第一版中,当时结果还不明朗。我们将在本节中讨论这些预测,同时评估预测的准确性。
首先是,到1995年,能够维持TeraFLOPS处理能力的计算机将会出现,其构建形式可能是具有4K到32K节点的多计算机(multicomputer)系统,或者是具有数百万个处理元素的Connection Multiprocessor[Bell 1989]。为了验证这一预测,每年Gordon Bell Prize会表彰并行处理方面的进步,包括运行最快的实用程序(实现最高的MFLOPS)。1989年,获胜者使用了八处理器的Cray Y-MP以1680 MFLOPS的速度运行。根据这些数据,1995年要达到1 TFLOPS的处理能力,多处理器系统和应用程序必须每年提高3.6倍。1999年,第一名获得者超过了1 TFLOPS的标准。他们使用了专门为Livermore Laboratories设计的5832处理器IBM RS/6000 SST系统,他们在冲击波仿真中实现了1.18 TFLOPS。这个比率表明,每年都有1.93的提高,这仍然是相当可观的。
90年代以来,已经公认的是,尽管已经拥有构建TFLOPS多处理器系统的技术,但除少数非常特殊且至关重要的应用外,还不清楚该机器是否具有成本效益。我们在1990年估计,要实现1 TFLOPS,则需要一台具有约5000个处理器的机器,成本约为1亿美元。Livermore的5832处理器的IBM系统花费了1.1亿美元。可以预料的是,单个微处理器在成本和性能方面的提升直接影响大型多处理器系统的成本和性能。但是,使用5000个处理器的系统的价格,将是使用相同处理器的台式机的5000倍以上。从那时起,我们已经建造了速度更快的多处理器系统,但是主要的改进来自过去五年中处理器性能的提升,而不是并行体系结构本身有根本的突破。
Bell的第二个预测涉及1995年发货的超级计算机中的数据流数量。Danny Hillis相信,虽然数据流数量少的超级计算机可能是最畅销的,但最大的多处理器系统将是具有许多数据流,并且将处理大量的计算。Bell和Hillis打赌,认为到1995年最后一个季度,使用较少数据流(≤100)而不是大量数据流(≥1000)的多处理器产品会实现更高的持续性MFLOPS(sustained MFLOPS)。这个赌注只涉及超级计算机(超级计算机定义为,用于科学应用,成本超过100万美元的多处理器系统)。持续性MFLOPS的定义是每月的浮点运算次数,因此多处理器系统的可用性也会影响其评级。
在1989年下注时,我们还不清楚最终谁会获胜。到1995年,对当时公开的超级计算机进行的一项调查显示,世界上仅存在六个多处理器系统,具有1000个以上的数据流。因此,Bell的预测无疑是准确的。实际上,在1995年,基于微处理器的小型多处理器系统(20个处理器)开始占主导地位。1995年,对使用的500个最高性能的多处理器系统(基于Linpack等级)进行的一项调查(称为TOP500)表明,最多的多处理器系统是基于总线的共享内存多处理器!到2005年,各种集群或多计算机系统发挥了重要作用。例如,在排名前25位的系统中,有11个是自定义集群,例如IBM Blue Gene系统或Cray XT3。10个是共享内存多处理器集群(均使用分布式和集中式内存);其余4个是使用PC和现成的互连技术构建的集群。
More Recent Advancesand Developments
除了并行向量多处理器(请参阅附录G)和IBM Blue Gene的最新版本以外,所有其它最近的MIMD计算机都是使用现成微处理器构建的,它们都使用总线和逻辑上的中央存储器(logically centralmemory),或者是互联网络结合分布式存储。80年代建造的许多实验性多处理器系统进一步完善和增强了这些概念,构成了当今许多多处理器系统的基础。
The Development of Bus-Based Coherent Multiprocessors
尽管在60年代和70年代,人们使用多个处理器构建了规模巨大的大型机,但直到80年代,多处理器系统才取得了巨大的成功。Bell [1985]指出,关键原因在于微处理器的较小尺寸使得用内存总线代替互连网络硬件成为可能,而可移植的操作系统则意味着多处理器项目不再需要开发新的操作系统。Bell在他的论文中定义了multiprocessor和multicomputer这两个术语,并为构建大型多处理器系统的两种不同方法奠定了基础。
第一个基于总线,使用了侦听缓存(snooping caches)的多处理器系统是Synapse N+1 Frank [1984]. Goodman [1983]是第一篇讨论监听缓存的论文。80年代后期,出现了很多商用的基于总线侦听缓存架构的系统,包括Silicon Graphics 4D/240 [Baskett, Jermoluk, and Solomon 1988], Encore Multimax [Wilson1987], 以及Sequent Symmetry[Lovett and Thakkar 1988]。80年代中期,出现大量缓存一致性协议,Archibald and Baer [1986]给出了很好的综述和分析,包括对原始论文的参考。图M.2总结了几种侦听缓存一致性协议,并显示了一些已经使用或正在使用该协议的多处理器系统。
90年代初,这类系统的扩展开始了,它们使用了非常宽的高速总线(SGI Challenge系统使用256位,面向数据包的总线,最多支持8个处理器板和32个处理器)。之后,开始使用多个总线和交叉开关互连,例如Sun SPARC Center和Enterprise系统(Charlesworth [1998]讨论了这些多处理器的互连体系结构)。在2001年,Sun Enterprise服务器代表了广泛使用的大规模(> 16个处理器),对称多处理器(symmet- ric multiprocessors)。如今,大多数基于总线的系统仅提供4个处理和交换机,其他一些设计支持8个或更多的处理器。
Toward Large-Scale Multiprocessors
在构建大型多处理器系统的过程中,人们探索了两个不同的方向:消息传递多计算机(message-passing multicomputers)和可伸缩共享内存多处理器(scalable shared-memory multipro- cessors)。尽管人们开始的尝试包括基于网格(mesh )和超立方体连接(hypercube- connected)的多处理器系统,但成功将所有部件组合在一起的首批多处理器之一是在Caltech建造的Cosmic Cube [Seitz 1985]。它实现了路由和互连技术方面的重要进步,并显著降低了互连成本,这有助于使多计算机成为可能。英特尔iPSC 860是超立方体连接的i860的集合,它就是基于这些思想。诸如Intel Paragon之类的较新的多处理器系统已经使用了较低维数和较高单个链接的网络。尽管许多用户发现最好同时使用i860处理器进行计算和通信,但Paragon在每个节点中还是采用了单独的i860作为通信控制器。Thinking Multiprocessors CM-5使用了现成的微处理器和胖树互连(fat tree)(请参阅附录F)。它支持在用户级访问的通信通道,从而显著改善了通信延迟。在1995年,这两个多处理器系统代表了消息传递多计算机的最新技术。
构建可伸缩共享内存多处理器系统的早期尝试包括IBM RP3 [Pfiste et al. 1985],纽约大学的Ultracomputer[Elder et al. 1985年;Schwartz 1980],伊利诺伊大学Cedar项目[Gajksi et al. 1983],以及BBN Butterfly and Monarch [BBN Laboratories 1986; Rettberg et al. 1990]。这些多处理器系统都是非一致的分布式内存模型上的变体,因此属于分布式共享内存(DSM)多处理器架构,但它们不支持缓存一致性,这使编程非常复杂。RP3和Ultracomputer项目都探索了同步(fetch-and-operate),以及在网络中合并内存访问的新想法。在所有四个项目中,互联网络的成本都比处理节点高,这给较小版本的多处理器带来了问题。Cray T3D/E(Arpaci et al. [1995]对T3D的评估,以及Scott [1996]对T3E增强的描述)基于这些探索,使用了非一致(noncoherent)的共享地址空间,但使用了在多计算机领域开发的互连技术(Scott and Thorson [1996])。
通过结合不同的思想,人们提出了具有可扩展缓存一致性的扩展共享内存模型(shared-memory modelwith scalable cachecoherence)。在侦听缓存技术之前,实际上已经有了基于目录的缓存一致性技术。第一个缓存一致性协议使用的是目录方式(如Tang [1976]介绍,并在IBM 3081中实现)。Censier and Feautrier [1978]描述了一种在存储器中带有标签的目录一致性方案。Agarwal et al. [1988]首先提出了用内存分配目录以获得可扩展的高速缓存一致性的想法,并作为斯坦福DASH多处理器系统的基础(Lenoski et al. [1990,1992]),这是第一个可操作的,缓存一致的DSM多处理器。DASH是一台“ plump”节点cc-NUMA机器,该机器使用四处理器SMP作为其节点,以类似于Wildfire的样式互连,但是互连使用的是可扩展性更好的二维网格而不是交叉开关。
Kendall Square Research的KSR-1 [Burkhardt et al. 1992年]是可伸缩一致共享内存架构的第一个商业实现。它扩展了基本DSM方法,以实现称为cache-only memory architecture(COMA)的概念,该概念使主内存成为缓存。在KSR-1中,可以在每个节点的主内存中复制存储块,由硬件处理这些复制块的其他一致性要求。(KSR-1并不是严格意义上的纯COMA,因为它不会迁移数据项的原始位置,而是始终在原位置保留一个副本。从本质上讲,它仅实现复制。)许多其他研究[Falsafi and Wood 1997; Hagersten,Landin and Haridi,1992年;Saulsbury et al. 1995; Stenstr€om,Joe and Gupta,1992年]提出了各种COMA样式的体系结构和类似方法,这些方法通过迁移(migration)减少了非一致内存访问的负担[Chandra et al. 1994年;Soundararajan et al. 1998年]开发,但都没有进一步的商业实施。
Convex Exemplar使用两级体系结构实现了可伸缩的一致性共享内存:在最低级,使用交叉开关构建了八个处理器模块。在此基础上,一个环最多可以连接32个这样的模块,总共256个处理器(Thekkath et al. [1997])。Laudon and Lenoski [1997]描述了SGI Origin,于1996年首次交付,尽管它在可伸缩性和易于编程方面进行了许多创新,但还是很大程度上基于原始的Stanford DASH机器。Origin为目录结构使用一个比特向量,该向量的长度为16或32比特。每一比特代表一个节点(由两个处理器组成);另一种粗粒度的比特向量表示方法允许每个比特表示最多8个节点,总共1024个处理器。如Galles [1996]所述,它将高性能胖超立方体(fat hypercube)用于全局互连。Hristea,Lenoski and Keen [1997]提供了对Origin存储系统性能的全面评估。
同时,人们开展了一些研究原型工作,以探索有无多线程情况下的可伸缩一致性,包括MIT Alewife 计算机[Agarwal et al. 1995]和斯坦福FLASH多处理器系统[Gibson et al. 2000; Kuskin et al. 1994]。
Clusters
集群可能是计算机的用户在60年代“发明”的,他们可能无法在一台计算机上完成所有工作,或者在主计算机出现故障时需要备份计算机[Pfister 1998]。Tandem于1975年推出了一个16节点的群集。之后,Digital在1984年推出的VAX群集。它们最初是共享I/O设备的独立计算机,需要分布式操作系统来协调工作。很快,计算机之间建立了通信链接,从而可以在不同地理位置上分散部署,以提高单个站点发生故障或遇到灾难时的可用性。用户登录到群集,并且不知道他们的任务在哪台计算机上运行。到1993年,DEC(现在的惠普)销售了超过25,000个集群。其他早期的公司包括Tandem(现在的惠普)和IBM(仍然是IBM)。如今,几乎每个公司都有集群产品。这些产品大多数都是针对可用性,而性能可扩展性是其另一个优势。
集群上的科学计算成为MPP(Massively Parallel Processor)的竞争者。1993年,Beowulf项目开始,它的目标是满足NASA对价格低于50,000美元的1 GFLOPS计算机的需求。1994年,使用现成的PC(80486处理器)构建的16节点群集实现了该目标[Bell and Gray 2001]。这种趋势也催生了各种软件接口,使提交,协调和调试大型程序或大量独立程序更加容易。
这个领域的重要努力包括尽力减少集群中通信的延迟并增加带宽。一些研究项目致力解决这些该问题。(低延迟研究的一项商业成果是VI接口标准,该接口已被Infiniband标准接受,下面将进行讨论。)此外,低延迟被证明在其他应用程序中很有用。例如,在1997年,加州大学伯克利分校的一个由100台Ultra-SPARC台式计算机组成的群集,通过Myrinet交换机,以每个链接以160 MB/秒的速度连接,打破了数据库排序的世界纪录 - 1分钟内完成8.6 GB最初保存在硬盘上的数据排序;还打破了破解加密消息的记录,只需花费3.5个小时即可解密40位DES密钥。
称为Network of Workstations的研究项目[Anderson,Culler and Patterson 1995],还开发了Inktomi搜索引擎,该搜索引擎催生了一家具有相同名称的初创公司。Google遵循Inktomi的经验,利用台式计算机集群而不是大型SMP计算机构建搜索引擎,这是Google取代领先的搜索引擎Alta Vista的策略[Brin and Page 1998]。2011年,几乎所有Internet服务都依靠群集为数百万客户提供服务。
集群在科学家中也很受欢迎,原因之一是成本低廉。因此,单个科学家或小组可以拥有一个专用于运行他们的程序的集群。相对于在超级计算机中心的共享MPP的工作队列中等待数周时间,这样的集群可以更快获得结果。如果有兴趣深入,可以看看Pfister [1998],一本关于集群的很有趣的书。
Recent Trends in Large-Scale Multiprocessors
使用Myrinet或Infiniband等互连技术集成标准台式机主板的群集。 由配置为处理元件并通过自定义互连模块连接的标准微处理器构建的多计算机。其中包括Cray XT3(使用较早版本的Cray互连和简单的集群体系结构)和IBM Blue Gene(后面提供更多信息)。 可能具有矢量支持的小型共享内存计算机集群,其中包括Earth Simulator(可在线访问他自己的期刊)。 大型共享内存多处理器,例如Cray X1 [Dunigan el al. 2005]和SGI Origin和Altix系统。SGI系统也已配置为群集,以提供512个以上的处理器,但在群集之间仅支持消息传递。
IBM Blue Gene是这些设计中最有趣的,因为其基本思想和单处理器体系结构中朝多核发展的根本原因是相悖的。Blue Gene是IBM内部的一个研究项目,旨在解决蛋白质测序和折叠问题。Blue Gene的设计人员发现,能耗在大型多处理器系统中正变得越来越需要关注,而嵌入式系统中的处理器的性能/瓦数要比高端单处理器好得多。如果并行性是实现高性能的途径,那么为什么不从最高效的处理器开始,仅仅需要更大数量?
因此,Blue Gene是使用定制芯片构建的,该芯片包括嵌入式PowerPC微处理器,该微处理器提供高端PowerPC的一半性能,但功耗却小得多。这样可以将更多的系统功能(包括全局互连)集成到同一芯片上。通过这种方式可以获得高度可复制和高效的基本模块,使Blue Gene可以更有效地实现更大的处理器数量。Blue Gene并未使用独立的微处理器或标准台式机板作为基本模块,而是使用了处理器内核。毫无疑问,这种方法可以提供更高的效率。但是,市场是否可以支撑定制设计和专用软件的成本仍然是一个悬而未决的问题。
2006年,Lawrence Livermore的Blue Gene处理器有32K个处理器(原计划于2005年底达到65K),在Linpack性能方面领先于排名第三的由20个SGI Altix 512处理器构成的Infiniband集群2.6倍。
Blue Gene的前身是一台实验性计算机QCDOD,它利用了低功耗嵌入式微处理器和紧密集成的互连技术来降低节点的成本和功耗。
Developments in Synchronization and Consistency Models
针对共享内存多处理器系统,人们提出了各种各样的同步原语(synchronization primitives)。Mellor-Crummey and Scott [1991]提供了对这些问题的概述,以及重要原语(例如locks and barriers)的有效实现。大量书籍包含了对其他重要贡献的参考,包括spin locks, queuing locks, and barriers的发展。Lamport [1979]引入了顺序一致性(sequential consistency)的概念以及并行程序正确执行的含义。Dubois,Scheurich and Briggs [1988]引入了弱序(weak ordering)的思想(最初是1986年)。1990年,Adve and Hill提供了更好的weak ordering定义,并定义了data-race-free的概念。在同一次会议上,Gharachorloo和他的同事(1990年)介绍了release consistency,并提供了有关relaxed consistency models性能的第一批数据。更宽松的一致性模型已在包括Sun SPARC,Alpha和IA-64在内的微处理器体系结构中广泛采用。Adve and Gharachorloo [1996]提供了关于内存一致性以及这些模型之间差异的出色教程。
Other References
这些参考文献除了记录了目前在实践中使用的概念是如何发现,还提供了对许多已经探索的但发现不够理想,或者应用机会尚未到来的想法。考虑到高性能计算机体系结构的未来将朝着多核和多处理器的方向发展,我们预计在未来的几年中将探索许多新方法。一些人将设法解决在过去40年中使用multiprocessing的遇到的关键硬件和软件问题!
- 第六部分完 -
参考资料
Adve, S. V., and K. Gharachorloo [1996]. “Shared memory consistency models: A tutorial,” IEEE Computer 29:12 (December), 66–76.
Adve, S. V., and M. D. Hill [1990]. “Weak ordering—a new definition,” Proc. 17th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–31, 1990, Seattle, Wash., 2–14.
Agarwal, A., R. Bianchini, D. Chaiken, K. Johnson, and D. Kranz [1995]. “The MIT Alewife machine: Architecture and performance,” 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 2–13.
Agarwal, A., J. L. Hennessy, R. Simoni, and M. A. Horowitz [1988]. “An evaluation of directory schemes for cache coherence,” Proc. 15th Annual Int’l. Symposium on Computer Architecture, May 30–June 2, 1988, Honolulu, Hawaii, 280–289.
Agarwal, A., J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D’Souza, and M. Parkin [1993]. “Sparcle: An evolutionary processor design for large-scale multiprocessors,” IEEE Micro 13 (June), 48–61.
Alles, A. [1995]. “ATM Internetworking,” White Paper (May), Cisco Systems, Inc., San Jose, Calif. (www.cisco.com/warp/public/614/12.html).
Almasi, G. S., and A. Gottlieb [1989]. Highly Parallel Computing, Benjamin/ Cummings, Redwood City, Calif.
Alverson, G., R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith [1992]. “Exploiting heterogeneous parallelism on a multithreaded multiprocessor,” Proc. ACM/IEEE Conf. on Supercomputing, November 16–20, 1992, Minneapolis, Minn., 188–197.
Amdahl, G. M. [1967]. “Validity of the single processor approach to achieving large scale computing capabilities,” Proc. AFIPS Spring Joint Computer Conf., April 18–20, 1967, Atlantic City, N.J., 483–485.
Amza C., A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel [1996]. “Treadmarks: Shared memory computing on networks of workstations,” IEEE Computer 29:2 (February), 18–28.
Anderson, T. E., D. E. Culler, and D. Patterson [1995]. “A case for NOW (networks of workstations),” IEEE Micro 15:1 (February), 54–64.
Ang, B., D. Chiou, D. Rosenband, M. Ehrlich, L. Rudolph, and Arvind [1998]. “StarT-Voyager: A flexible platform for exploring scalable SMP issues,” Proc. ACM/IEEE Conf. on Supercomputing, November 7–13, 1998, Orlando, FL.
Archibald, J., and J.-L. Baer [1986]. “Cache coherence protocols: Evaluation using a multiprocessor simulation model,” ACM Trans. on Computer Systems 4:4 (November), 273–298.
Arpaci, R. H., D. E. Culler, A. Krishnamurthy, S. G. Steinberg, and K. Yelick [1995]. “Empirical evaluation of the CRAY-T3D: A compiler perspective,” Proc. 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy.
Baer, J.-L., and W.-H. Wang [1988]. “On the inclusion properties for multi-level cache hierarchies,” Proc. 15th Annual Int’l. Symposium on Computer Architecture, May 30–June 2, 1988, Honolulu, Hawaii, 73–80.
Balakrishnan, H. V., N. Padmanabhan, S. Seshan, and R. H. Katz [1997]. “A comparison of mechanisms for improving TCP performance over wireless links,” IEEE/ACM Trans. on Networking 5:6 (December), 756–769.
Barroso, L. A., K. Gharachorloo, and E. Bugnion [1998]. “Memory system characterization of commercial workloads,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 3–14.
Baskett, F., T. Jermoluk, and D. Solomon [1988]. “The 4D-MP graphics super- workstation: Computing+graphics 40 MIPS + 40 MFLOPS and 10,000 lighted polygons per second,” Proc. IEEE COMPCON, February 29–March 4, 1988, San Francisco, 468–471.
BBN Laboratories. [1986]. Butterfly Parallel Processor Overview, Tech. Rep. 6148, BBN Laboratories, Cambridge, Mass.
Bell, C. G. [1985]. “Multis: A new class of multiprocessor computers,” Science 228 (April 26), 462–467.
Bell, C. G. [1989]. “The future of high performance computers in science and engineering,” Communications of the ACM 32:9 (September), 1091–1101.
Bell, C. G., and J. Gray [2001]. Crays, Clusters and Centers, Tech. Rep. MSR-TR- 2001-76, Microsoft Research, Redmond, Wash.
Bell, C. G., and J. Gray [2002]. “What’s next in high performance computing,” CACM, 45:2 (February), 91–95.
Bouknight, W. J., S. A. Deneberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick [1972]. “The Illiac IV system,” Proc. IEEE 60:4, 369–379. Also appears in D. P. Siewiorek, C. G. Bell, and A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, 1982, 306–316.
Brain, M. [2000]. Inside a Digital Cell Phone, www.howstuffworks.com/inside- cell-phone.htm.
Brewer, E. A., and B. C. Kuszmaul [1994]. “How to get good performance from the CM-5 data network,” Proc. Eighth Int’l. Parallel Processing Symposium (IPPS), April 26–29, 1994, Cancun, Mexico.
Brin, S., and L. Page [1998]. “The anatomy of a large-scale hypertextual Web search engine,” Proc. 7th Int’l. World Wide Web Conf., April 14–18, 1998, Brisbane, Queensland, Australia, 107–117.
Burkhardt III, H., S. Frank, B. Knobe, and J. Rothnie [1992]. Overview of the KSR1 Computer System, Tech. Rep. KSR-TR-9202001, Kendall Square Research, Boston.
Censier, L., and P. Feautrier [1978]. “A new solution to coherence problems in multicache systems,” IEEE Trans. on Computers C-27:12 (December), 1112–1118.
Chandra, R., S. Devine, B. Verghese, A. Gupta, and M. Rosenblum [1994]. “Scheduling and page migration for multiprocessor compute servers,” Proc. Sixth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, 1994, San Jose, Calif., 12–24.
Charlesworth, A. [1998]. “Starfire: Extending the SMP envelope,” IEEE Micro 18:1 (January/February), 39–49.
Clark, W. A. [1957]. “The Lincoln TX-2 computer development,” Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 143–145.
Comer, D. [1993]. Internetworking with TCP/IP, 2nd ed., Prentice Hall, Englewood Cliffs, N.J.
Culler, D. E., J. P. Singh, and A. Gupta [1999]. Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco.
Dally, W. J., and C. I. Seitz [1986]. “The torus routing chip,” Distributed Computing 1:4, 187–196.
Davie, B. S., L. L. Peterson, and D. Clark [1999]. Computer Networks: A Systems Approach, 2nd ed., Morgan Kaufmann, San Francisco.
Desurvire, E. [1992]. “Lightwave communications: The fifth generation,” Scientific American (International Edition) 266:1 (January), 96–103.
Dongarra, J., T. Sterling, H. Simon, and E. Strohmaier [2005]. “High-performance computing: Clusters, constellations, MPPs, and future directions,” Computing in Science & Engineering, 7:2 (March/April), 51–59.
Dubois, M., C. Scheurich, and F. Briggs [1988]. “Synchronization, coherence, and event ordering,” IEEE Computer 21:2 (February), 9–21.
Dunigan, W., K. Vetter, K. White, and P. Worley [2005]. “Performance evaluation of the Cray X1 distributed shared memory architecture,” IEEE Micro, January/ February, 30–40.
Eggers, S. [1989]. “Simulation Analysis of Data Sharing in Shared Memory Multiprocessors,” Ph.D. thesis, Computer Science Division, University of California, Berkeley.
Elder, J., A. Gottlieb, C. K. Kruskal, K. P. McAuliffe, L. Randolph, M. Snir, P. Teller, and J. Wilson [1985]. “Issues related to MIMD shared-memory computers: The NYU Ultracomputer approach,” Proc. 12th Annual Int’l. Symposium on Computer Architecture (ISCA), June 17–19, 1985, Boston, Mass., 126–135.
Erlichson, A., N. Nuckolls, G. Chesson, and J. L. Hennessy [1996]. “SoftFLASH: Analyzing the performance of clustered distributed virtual shared memory,” Proc. Seventh Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1–5, 1996, Cambridge, Mass., 210–220.
Falsafi, B., and D. A. Wood [1997]. “Reactive NUMA: A design for unifying S- COMA and CC-NUMA,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo., 229–240.
Flynn, M. J. [1966]. “Very high-speed computing systems,” Proc. IEEE 54:12 (December), 1901–1909.
Forgie, J. W. [1957]. “The Lincoln TX-2 input-output system,” Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 156–160.
Frank, S. J. [1984]. “Tightly coupled multiprocessor systems speed memory access time,” Electronics 57:1 (January), 164–169.
Gajski, D., D. Kuck, D. Lawrie, and A. Sameh [1983]. “CEDAR—a large scale multiprocessor,” Proc. Int’l. Conf. on Parallel Processing (ICPP), August, Columbus, Ohio, 524–529.
Galles, M. [1996]. “Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip,” Proc. IEEE HOT Interconnects ’96, August 15– 17, 1996, Stanford University, Palo Alto, Calif.
Gehringer, E. F., D. P. Siewiorek, and Z. Segall [1987]. Parallel Processing: The Cm* Experience, Digital Press, Bedford, Mass.
Gharachorloo, K., A. Gupta, and J. L. Hennessy [1992]. “Hiding memory latency using dynamic scheduling in shared-memory multiprocessors,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia.
Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hen- nessy [1990]. “Memory consistency and event ordering in scalable shared-memory multiprocessors,” Proc. 17th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–31, 1990, Seattle, Wash., 15–26.
Gibson, J., R. Kunz, D. Ofelt, M. Horowitz, J. Hennessy, and M. Heinrich [2000]. “FLASH vs. (simulated) FLASH: Closing the simulation loop,” Proc. Ninth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), November 12–15, Cambridge, Mass., 49–58.
Goodman, J. R. [1983]. “Using cache memory to reduce processor memory traffic,” Proc. 10th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 124–131.
Goralski, W. [1997]. SONET: A Guide to Synchronous Optical Network, McGraw- Hill, New York.
Grice, C., and M. Kanellos [2000]. “Cell phone industry at crossroads: Go high or low?” CNET News (August 31), technews.netscape.com/news/0-1004-201- 2518386-0.html?tag st.ne.1002.tgif.sf.
Groe, J. B., and L. E. Larson [2000]. CDMA Mobile Radio Design, Artech House, Boston.
Hagersten E., and M. Koster [1998]. “WildFire: A scalable path for SMPs,” Proc. Fifth Int’l. Symposium on High-Performance Computer Architecture, January 9–12, 1999, Orlando, Fla.
Hagersten, E., A. Landin, and S. Haridi [1992]. “DDM—a cache-only memory architecture,” IEEE Computer 25:9 (September), 44–54.
Hill, M. D. [1998]. “Multiprocessors should support simple memory consistency models,” IEEE Computer 31:8 (August), 28–34.
Hillis, W. D. [1985]. The Connection Multiprocessor, MIT Press, Cambridge, Mass.
Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and
T. Nishizawa [1992]. “An elementary processor architecture with simultaneous instruction issuing from multiple threads,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 136–145.
Hockney, R. W., and C. R. Jesshope [1988]. Parallel Computers 2: Architectures, Programming and Algorithms, Adam Hilger, Ltd., Bristol, England.
Holland, J. H. [1959]. “A universal computer capable of executing an arbitrary number of subprograms simultaneously,” Proc. East Joint Computer Conf. 16, 108–113.
Hord, R. M. [1982]. The Illiac-IV, The First Supercomputer, Computer Science Press, Rockville, Md.
Hristea, C., D. Lenoski, and J. Keen [1997]. “Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks,” Proc. ACM/IEEE Conf. on Supercomputing, November 15–21, 1997, San Jose, Calif.
Hwang, K. [1993]. Advanced Computer Architecture and Parallel Programming, McGraw-Hill, New York.
IBM. [2005]. “Blue Gene,” IBM J. of Research and Development, 49:2/3 (special issue).
Infiniband Trade Association. [2001]. InfiniBand Architecture Specifications Release 1.0.a, www.infinibandta.org.
Jordan, H. F. [1983]. “Performance measurements on HEP—a pipelined MIMD computer,” Proc. 10th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 207–212.
Kahn, R. E. [1972]. “Resource-sharing computer communication networks,” Proc. IEEE 60:11 (November), 1397–1407.
Keckler, S. W., and W. J. Dally [1992]. “Processor coupling: Integrating compile time and runtime scheduling for parallelism,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 202–213.
Kontothanassis, L., G. Hunt, R. Stets, N. Hardavellas, M. Cierniak, S. Parthasarathy, W. Meira, S. Dwarkadas, and M. Scott [1997]. “VM-based shared memory on low-latency, remote memory-access networks,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo.
Kurose, J. F., and K. W. Ross [2001]. Computer Networking: A Top-Down Approach Featuring the Internet, Addison-Wesley, Boston.
Kuskin, J., D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Cha- pin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. L. Hen- nessy [1994]. “The Stanford FLASH multiprocessor,” Proc. 21st Annual Int’l. Symposium on Computer Architecture (ISCA), April 18–21, 1994, Chicago.
Lamport, L. [1979]. “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Trans. on Computers C-28:9 (September), 241–248.
Laudon, J., A. Gupta, and M. Horowitz [1994]. “Interleaving: A multithreading technique targeting multiprocessors and workstations,” Proc. Sixth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, 1994, San Jose, Calif., 308–318.
Laudon, J., and D. Lenoski [1997]. “The SGI Origin: A ccNUMA highly scalable server,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo., 241–251.
Lenoski, D., J. Laudon, K. Gharachorloo, A. Gupta, and J. L. Hennessy [1990]. “The Stanford DASH multiprocessor,” Proc. 17th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–31, 1990, Seattle, Wash., 148–159.
Lenoski, D., J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. A. Horowitz, and M. Lam [1992]. “The Stanford DASH multiprocessor,” IEEE Computer 25:3 (March), 63–79.
Li, K. [1988]. “IVY: A shared virtual memory system for parallel computing,” Proc. Int’l. Conf. on Parallel Processing (ICCP), August, The Pennsylvania State University, University Park, Penn.
Lo, J., L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh [1998]. “An analysis of database workload performance on simultaneous multithreaded processors,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 39–50.
Lo, J., S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen [1997]. “Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading,” ACM Trans. on Computer Systems 15:2 (August), 322–354.
Lovett, T., and S. Thakkar [1988]. “The Symmetry multiprocessor system,” Proc. Int’l. Conf. on Parallel Processing (ICCP), August, The Pennsylvania State University, University Park, Penn., 303–310.
Mellor-Crummey, J. M., and M. L. Scott [1991]. “Algorithms for scalable synchronization on shared-memory multiprocessors,” ACM Trans. on Computer Systems 9:1 (February), 21–65.
Menabrea, L. F. [1842]. “Sketch of the analytical engine invented by Charles Babbage,” Bibliothèque Universelle de Genève, 82 (October).
Metcalfe, R. M. [1993]. “Computer/network interface design: Lessons from Arpa- net and Ethernet.” IEEE J. on Selected Areas in Communications 11:2 (February), 173–180.
Metcalfe, R. M., and D. R. Boggs [1976]. “Ethernet: Distributed packet switching for local computer networks,” Communications of the ACM 19:7 (July), 395–404.
Mitchell, D. [1989]. “The Transputer: The time is now,” Computer Design (RISC suppl.), 40–41.
Miya, E. N. [1985]. “Multiprocessor/distributed processing bibliography,” Computer Architecture News 13:1, 27–29.
National Research Council. [1997]. The Evolution of Untethered Communications, Computer Science and Telecommunications Board, National Academy Press, Washington, D.C.
Nikhil, R. S., G. M. Papadopoulos, and Arvind [1992]. “*T: A multithreaded massively parallel architecture,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 156–167.
Noordergraaf, L., and R. van der Pas [1999]. “Performance experiences on Sun’s WildFire prototype,” Proc. ACM/IEEE Conf. on Supercomputing, November 13–19, 1999, Portland, Ore.
Partridge, C. [1994]. Gigabit Networking, Addison-Wesley, Reading, Mass. Pfister, G. F. [1998]. In Search of Clusters, 2nd ed., Prentice Hall, Upper Saddle River, N.J.
Pfister, G. F., W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfekder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss [1985]. “The IBM research parallel processor prototype (RP3): Introduction and architecture,” Proc. 12th Annual Int’l. Symposium on Computer Architecture (ISCA), June 17–19, 1985, Boston, Mass., 764–771.
Reinhardt, S. K., J. R. Larus, and D. A. Wood [1994]. “Tempest and Typhoon: User-level shared memory,” Proc. 21st Annual Int’l. Symposium on Computer Architecture (ISCA), April 18–21, 1994, Chicago, 325–336.
Rettberg, R. D., W. R. Crowther, P. P. Carvey, and R. S. Towlinson [1990]. “The Monarch parallel processor hardware design,” IEEE Computer 23:4 (April), 18–30.
Rosenblum, M., S. A. Herrod, E. Witchel, and A. Gupta [1995]. “Complete computer simulation: The SimOS approach,” in IEEE Parallel and Distributed Technology (now called Concurrency) 4:3, 34–43.
Saltzer, J. H., D. P. Reed, and D. D. Clark [1984]. “End-to-end arguments in sys- tem design,” ACM Trans. on Computer Systems 2:4 (November), 277–288. Satran, J., D. Smith, K. Meth, C. Sapuntzakis, M. Wakeley, P. Von Stamwitz, R.
Haagens, E. Zeidner, L. Dalle Ore, and Y. Klein [2001]. “iSCSI,” IPS Working Group of IETF, Internet draft www.ietf.org/internet-drafts/draft-ietf-ips-iscsi- 07.txt.
Saulsbury, A., T. Wilkinson, J. Carter, and A. Landin [1995]. “An argument for Simple COMA,” Proc. First IEEE Symposium on High-Performance Computer Architectures, January 22–25, 1995, Raleigh, N.C., 276–285.
Schwartz, J. T. [1980]. “Ultracomputers,” ACM Trans. on Programming Languages and Systems 4:2, 484–521.
Scott, S. L. [1996]. “Synchronization and communication in the T3E multiprocessor,” Seventh Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1–5, 1996, Cambridge, Mass., 26– 36.
Scott, S. L., and G. M. Thorson [1996]. “The Cray T3E network: Adaptive routing in a high-performance 3D torus,” Proc. IEEE HOT Interconnects ’96, August 15–17, 1996, Stanford University, Palo Alto, Calif., 14–156.
Seitz, C. L. [1985]. “The Cosmic Cube (concurrent computing),” Communications of the ACM 28:1 (January), 22–33.
Singh, J. P., J. L. Hennessy, and A. Gupta [1993]. “Scaling parallel programs for multiprocessors: Methodology and examples,” Computer 26:7 (July), 22–33.
Slotnick, D. L., W. C. Borck, and R. C. McReynolds [1962]. “The Solomon computer,” Proc. AFIPS Fall Joint Computer Conf., December 4–6, 1962, Philadelphia, Penn., 97–107.
Smith, B. J. [1978]. “A pipelined, shared resource MIMD computer,” Proc. Int’l. Conf. on Parallel Processing (ICPP), August, Bellaire, Mich., 6–8.
Soundararajan, V., M. Heinrich, B. Verghese, K. Gharachorloo, A. Gupta, and J. L. Hennessy [1998]. “Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 342–355.
Spurgeon, C. [2001]. “Charles Spurgeon’s Ethernet Web site,” www.host.ots.utexas.edu/ethernet/ethernet-home.html.
Stenstr€om, P., T. Joe, and A. Gupta [1992]. “Comparative performance evaluation of cache coherent NUMA and COMA architectures,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 80–91.
Sterling, T. [2001]. Beowulf PC Cluster Computing with Windows and Beowulf PC Cluster Computing with Linux, MIT Press, Cambridge, Mass.
Stevens, W. R. [1994–1996]. TCP/IP Illustrated (three volumes), Addison- Wesley, Reading, Mass.
Stone, H. [1991]. High Performance Computers, Addison-Wesley, New York.
Swan, R. J., A. Bechtolsheim, K. W. Lai, and J. K. Ousterhout [1977]. “The implementation of the Cm* multi-microprocessor,” Proc. AFIPS National Computing Conf., June 13–16, 1977, Dallas, Tex., 645–654.
Swan, R. J., S. H. Fuller, and D. P. Siewiorek [1977]. “Cm*—a modular, multi- microprocessor,” Proc. AFIPS National Computing Conf., June 13–16, 1977, Dallas, Tex., 637–644.
Tanenbaum, A. S. [1988]. Computer Networks, 2nd ed., Prentice Hall, Englewood Cliffs, N.J.
Tang, C. K. [1976]. “Cache design in the tightly coupled multiprocessor system,” Proc. AFIPS National Computer Conf., June 7–10, 1976, New York, 749–753.
Thacker, C. P., E. M. McCreight, B. W. Lampson, R. F. Sproull, and D. R. Boggs [1982]. “Alto: A personal computer,” in D. P. Siewiorek, C. G. Bell, and A. Newell, eds., Computer Structures: Principles and Examples, McGraw-Hill, New York, 549–572.
Thekkath, R., A. P. Singh, J. P. Singh, S. John, and J. L. Hennessy [1997]. “An evaluation of a commercial CC-NUMA architecture—the CONVEX Exemplar SPP1200,” Proc. 11th Int’l. Parallel Processing Symposium (IPPS), April 1–7, 1997, Geneva, Switzerland.
Tullsen, D. M., S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm [1996]. “Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor,” Proc. 23rd Annual Int’l. Symposium on Computer Architecture (ISCA), May 22–24, 1996, Philadelphia, Penn., 191–202.
Tullsen, D. M., S. J. Eggers, and H. M. Levy [1995]. “Simultaneous multithreading: Maximizing on-chip parallelism,” Proc. 22nd Annual Int’l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 392– 403.
Unger, S. H. [1958]. “A computer oriented towards spatial problems,” Proc. Insti- tute of Radio Engineers 46:10 (October), 1744–1750.
Walrand, J. [1991]. Communication Networks: A First Course, Aksen Associates: Irwin, Homewood, Ill.
Wilson, A. W., Jr. [1987]. “Hierarchical cache/bus architecture for shared-memory multiprocessors,” Proc. 14th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–5, 1987, Pittsburgh, Penn., 244–252.
Wolfe, A., and J. P. Shen [1991]. “A variable instruction stream extension to the VLIW architecture.” Proc. Fourth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 2–14.
Wood, D. A., and M. D. Hill [1995]. “Cost-effective parallel computing,” IEEE Computer 28:2 (February), 69–72.
Wulf, W., and C. G. Bell [1972]. “C.mmp—A multi-mini-processor,” Proc. AFIPS Fall Joint Computer Conf., December 5–7, 1972, Anaheim, Calif., 765–777.
Wulf, W., and S. P. Harbison [1978]. “Reflections in a pool of processors—an experience report on C.mmp/Hydra,” Proc. AFIPS National Computing Conf. June 5–8, 1978, Anaheim, Calif., 939–951.
Yamamoto, W., M. J. Serrano, A. R. Talcott, R. C. Wood, and M. Nemirosky [1994]. “Performance estimation of multistreamed, superscalar processors,” Proc. 27th Hawaii Int’l. Conf. on System Sciences, January 4–7, 1994, Wailea, 195–204.
公众号专题:
人工智能芯片技术进步
人工智能芯片产业发展
人工智能芯片初创公司
人工智能芯片评测对比
科技巨头的芯片尝试
从学术会议看人工智能芯片
基础芯片技术