Fourth article in the series (covering M.5)
Previous articles:
[Translated] The History of Computer Architecture, Part 1
[Translated] The History of Computer Architecture, Part 2
[Translated] The History of Computer Architecture, Part 3
M.5 The Development of Pipelining and Instruction-Level Parallelism (Chapter 3 and Appendices C and H)
The first general-purpose pipelined processor is considered to be Stretch, the IBM 7030. Stretch followed the IBM 704 and had a goal of being 100 times faster than the 704; the nickname "Stretch" came from this ambitious goal. The plan was to use a four-stage pipeline, overlapping instruction fetch, decode, and execution to obtain a 1.6-fold speedup. Bloch [1959] and Bucholtz [1962] describe the design and the engineering trade-offs, including the use of ALU bypasses.
A series of general pipelining papers that appeared in the late 1970s and early 1980s coined most of the terminology and described most of the basic techniques used in simple pipelines. These surveys include Keller [1975], Ramamoorthy and Li [1977], Chen [1980], and Kogge [1981], whose book is devoted entirely to pipelining. Davidson and his colleagues [1971, 1975] developed the concept of pipeline reservation tables as a design methodology for multicycle pipelines with feedback (also described in Kogge [1981]). Many designers later used variations of these concepts, either in designing pipelines or in creating software to schedule pipelines.
RISC processors were designed from the outset with ease of implementation and pipelining in mind. Several early RISC papers, published in the early 1980s, attempted to quantify the performance advantages of simplifying the instruction set. The best analysis, however, is the comparison of a VAX and a MIPS implementation published by Bhandarkar and Clark in 1991, 10 years after the first published RISC papers (see Figure M.1 in Part 3 of this series). After a decade of arguments about the implementation benefits of RISC, this paper convinced even the most skeptical designers.
J. E. Smith and his colleagues wrote a number of papers examining instruction issue, exception handling, and pipeline depth for high-speed scalar CPUs. Kunkel and Smith [1986] evaluated the impact of pipeline overhead and the dependences involved in choosing an optimal pipeline depth; they also provided an excellent discussion of latch design and its impact on pipelining. Smith and Pleszkun [1988] evaluated a variety of techniques for preserving precise exceptions. Weiss and Smith [1984] evaluated a variety of hardware pipeline scheduling and instruction issue techniques.
The MIPS R4000 was one of the first deeply pipelined microprocessors and is discussed by Killian [1991] and Heinrich [1993]. The initial Alpha implementation (the 21064) had a similar instruction set and a similar integer pipeline structure, with more extensive pipelining in the floating-point unit.
The Introduction of Dynamic Scheduling
In 1964, CDC delivered the first CDC 6600, which was unique in many ways. Besides introducing scoreboarding, the CDC 6600 was the first processor to make extensive use of multiple functional units; it also had peripheral processors that used multithreading. The interaction between pipelining and instruction set design was well understood, and a simple load-store instruction set was used to promote pipelining; the machine also used advanced packaging. Thornton [1964] describes the pipeline and I/O processor architecture of the CDC 6600, including the concept of out-of-order instruction execution. Thornton's book [1970] provides an excellent description of the entire CDC 6600 processor, from technology to architecture, and includes a foreword by Cray. Unfortunately, this book is currently out of print. The CDC 6600 also had an instruction scheduler for its FORTRAN compilers, described by Thorlin [1967].
The IBM 360 Model 91: A Landmark Computer
The IBM 360/91 introduced many new concepts, including tagging of data, register renaming, dynamic detection of memory hazards, and generalized forwarding. Tomasulo's algorithm is described in his 1967 paper. Anderson, Sparacio, and Tomasulo [1967] describe other aspects of the 360/91 processor, including the use of branch prediction. Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly resurrected in the 1990s. Unfortunately, the 360/91 was not commercially successful, and only a small number were sold. The complexity of the design made it late to market and allowed its performance to fall behind that of the Model 85, which was the first IBM processor with a cache.
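The tagging and renaming ideas from the 360/91 can be illustrated with a minimal sketch. This is a deliberately simplified, hypothetical model (the `Renamer` class and its structure are illustrative, not the 360/91's actual hardware): each issued operation claims a fresh tag, the destination register is renamed to that tag, and source operands read either a committed value or the tag of the pending producer, which removes WAW and WAR hazards on the architectural register names.

```python
# A minimal sketch of Tomasulo-style register renaming via tags.
# Illustrative only; the names and structure are hypothetical and
# do not reflect any real design's implementation details.

class Renamer:
    def __init__(self):
        self.status = {}    # arch register -> tag of pending producer
        self.next_tag = 0

    def issue(self, dst, srcs):
        # Read sources first: a pending tag means "wait for that producer";
        # otherwise the committed register value is read directly.
        operands = [self.status.get(s, f"val({s})") for s in srcs]
        tag = f"T{self.next_tag}"
        self.next_tag += 1
        # Renaming the destination removes WAW/WAR hazards on dst.
        self.status[dst] = tag
        return tag, operands

r = Renamer()
print(r.issue("F0", ["F2", "F4"]))  # ('T0', ['val(F2)', 'val(F4)'])
print(r.issue("F0", ["F0", "F6"]))  # ('T1', ['T0', 'val(F6)']): reads T0, not stale F0
```

The second issue writes F0 again (a WAW hazard in program order), but because it receives a new tag the two writes can complete in any order without corrupting later readers.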
Branch-Prediction Schemes
The 2-bit dynamic hardware branch-prediction scheme was described by J. E. Smith [1981]. Ditzel and McLellan [1987] describe a novel branch-target buffer for CRISP, which implements branch folding. The correlating predictor we discussed was described by Pan, So, and Rameh [1992]. Yeh and Patt [1992, 1993] generalized the correlation idea and described multilevel predictors that use a branch history for each branch, similar to the local history predictor used in the 21264. McFarling's tournament prediction scheme, which he refers to as a combined predictor, is described in his 1993 technical report. There are a variety of more recent papers on branch prediction, mostly variations on the multilevel and correlating predictor ideas. Kaeli and Emma [1991] discuss return address prediction, and Evers et al. [1998] provide an in-depth analysis of multilevel predictors. The data shown in Chapter 3 of the main text come from Skadron et al. [1999]. There are several schemes that may offer additional benefits beyond combined predictors, described by Eden and Mudge [1998] and by Jimenez and Lin [2002].
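Smith's 2-bit scheme is simple enough to sketch directly. The following is a generic textbook rendition (not any specific processor's implementation): a saturating counter in the range 0 to 3 predicts taken when it is 2 or above, so a single anomalous outcome, such as a loop-exit branch, does not flip the prediction.

```python
# A sketch of a 2-bit saturating-counter branch predictor in the style
# described by J. E. Smith [1981]. Counter values 0-1 predict not-taken,
# 2-3 predict taken; the predictor must miss twice in a row to flip.

def predict(counter):
    return counter >= 2             # True = predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)  # saturate at 3 (strongly taken)
    return max(counter - 1, 0)      # saturate at 0 (strongly not taken)

# A loop branch: taken four times, then not taken at loop exit.
c = 0
for taken in [True, True, True, True, False]:
    c = update(c, taken)
print(c, predict(c))  # 2 True: one exit miss does not flip the prediction
```

This hysteresis is why the 2-bit counter outperforms a 1-bit scheme on loop branches: the 1-bit version mispredicts twice per loop execution (once at exit, once at re-entry), while the 2-bit version mispredicts only once.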
The Development of Multiple-Issue Processors
IBM did pioneering work on multiple issue. In the 1960s, a project called ACS was under way in California. It included multiple-issue concepts, a proposal for dynamic scheduling (although with a simpler mechanism than Tomasulo's scheme, which used backup registers), and fetching along both branch paths. The project originally started as a new architecture to follow Stretch and surpass the CDC 6600/6800. ACS started in New York but was later moved to California, was changed to be S/360 compatible, and was eventually canceled. John Cocke was one of the intellectual forces behind the team, which included a number of IBM veterans and younger contributors, many of whom went on to other important roles in IBM and elsewhere, including Jack Bertram, Ed Sussenguth, Gene Amdahl, Herb Schorr, Fran Allen, Lynn Conway, and Phil Dauber. While the compiler team published many of their ideas and had great influence outside IBM, the architecture ideas were not widely disseminated at the time. The most complete accessible documentation of this important project can be found at www.cs.clemson.edu/mark/acs.html, which includes interviews with the ACS veterans and pointers to other sources. Sussenguth [1999] is also an excellent overview of ACS.
In practice, most of the early multiple-issue processors to reach the market followed an LIW or VLIW design approach. Charlesworth [1981] reports on the Floating Point Systems AP-120B, one of the first wide-instruction processors containing multiple operations per instruction. Floating Point Systems applied the concept of software pipelining both in its compiler and in hand-written assembly language libraries to use the processor efficiently. Because the processor was an attached processor, many of the difficulties of implementing multiple issue in a general-purpose processor (for example, virtual memory and exception handling) could be ignored.
One interesting approach used in early VLIW processors, such as the AP-120B and the i860, was a pipeline organization in which operations were "pushed through" the functional units and the results were caught at the end of the pipeline. In such processors, an operation advances through the pipeline only when another operation (in sequence) pushes it from behind. Furthermore, if an instruction specifies the destination of an earlier-issued instruction, the earlier instruction is pushed out of the pipeline as the new operation executes. The advantage of this approach is that it does not specify a result destination when an operation first issues, but only when the result register actually needs to be written. This separation eliminates the need to detect write-after-write (WAW) and write-after-read (WAR) hazards in hardware. The disadvantage is that it increases code size, since a no-op must be added to push a result out whenever there is a dependence on an operation still in the pipeline and no other operation of that type is otherwise needed. Instead of the "push-and-catch" approach used in these two processors, almost all designers have chosen self-draining pipelines, which specify the destination at issue time and in which issued instructions complete without further action. The advantages of the latter in code density and simplified code generation seem to outweigh the advantages of the more unusual structure.
In the mid-1980s, many research projects introduced some form of multiple issue. For example, the Stanford MIPS processor could place two operations in a single instruction, although this capability was dropped in the commercial version, primarily for performance reasons. Along with his colleagues at Yale, Fisher [1983] proposed a processor with a very wide instruction (512 bits) and named this type of processor a VLIW. Code was generated for the processor using trace scheduling, which Fisher [1981] had originally developed for compacting horizontal microcode (as opposed to conventional microcode). The implementation of trace scheduling for the Yale processor is described by Fisher et al. [1984] and by Ellis [1986].
Although IBM canceled ACS, active research in the area continued through the 1980s. More than 10 years later, John Cocke made a new proposal for a superscalar processor that made instruction issue decisions dynamically. He and Tilak Agerwala described the key ideas in several talks in the mid-1980s and coined the term "superscalar." He called the design "America" (Agerwala and Cocke [1987]). The IBM Power1 architecture (the RS/6000 line) was based on these ideas (Bakoglu et al. [1989]).
J. E. Smith [1984] and his colleagues proposed a decoupled approach that included multiple issue with limited dynamic pipeline scheduling. A key feature of this processor is the use of queues to maintain order among one class of instructions (such as memory references) while allowing that class to slip behind or ahead of another class. Smith et al. [1987] describe how the Astronautics ZS-1 implemented this approach, using queues to connect the load/store unit and the functional units. The Power2 design used queues in a similar fashion. J. E. Smith [1989] also describes the advantages of dynamic scheduling and compares it with static scheduling.
The concept of speculation has its roots in the original IBM 360/91, which performed a very limited form of speculation. The approach used in recent processors combines the dynamic scheduling techniques of the 360/91 with a buffer to allow in-order commit. Smith and Pleszkun [1988] explored the use of buffering to maintain precise interrupts and described the concept of a reorder buffer. Sohi [1990] describes adding renaming and dynamic scheduling, making it possible to use the mechanism for speculation. Patt and his colleagues were early proponents of aggressive reordering and speculation. They focused on checkpoint and restart mechanisms and pioneered an approach called HPSm, which is also an extension of Tomasulo's algorithm [Hwu and Patt 1986].
Smith, Johnson, and Horowitz [1989] evaluated the use of speculation in multiple-issue processors using the reorder buffer technique. Their goal was to combine speculation with multiple issue to study the ILP available in nonscientific code. In a subsequent book, Johnson [1990] describes the design of a speculative superscalar processor. Johnson later led the AMD K-5 design, one of the first speculative superscalars.
In parallel with the superscalar developments, commercial interest in VLIW approaches also increased. The Multiflow processor (Colwell et al. [1987]) was based on the concepts developed at Yale, although many important refinements were made to increase practicality. Among these was a controllable store buffer that supported a form of speculation. Although more than 100 Multiflow processors were sold, a variety of problems, including the difficulty of introducing a new instruction set from a small company and competition from commercial RISC microprocessors (which changed the minicomputer market), led to the failure of Multiflow as a company.
Around the same time as Multiflow, Cydrome was founded to build VLIW-style processors (Rau et al. [1989]), but it was also commercially unsuccessful. Dehnert, Hsu, and Bratt [1989] analyze the architecture and performance of the Cydrome Cydra 5, a processor with a wide instruction word that provided dynamic register renaming and additional support for software pipelining. The Cydra 5 was a unique blend of hardware and software, including conditional instructions and register rotation, aimed at extracting higher ILP. Cydrome relied on more hardware than the Multiflow processor and achieved competitive performance primarily on vector-style code. In the end, Cydrome suffered from problems similar to Multiflow's and was not a commercial success. Although Multiflow and Cydrome failed as commercial entities, they trained a number of people with extensive experience in exploiting ILP and in advanced compiler technology, and many of those people incorporated their experience and techniques into newer processors. Fisher and Rau [1993] edited a comprehensive collection of papers covering the hardware and software of these two important processors.
Rau also developed a technique called polycyclic scheduling, which is the basis for most software pipelining schemes (Rau, Glaeser, and Picard [1982]). Rau's work built on earlier work by Davidson and his colleagues on the design of optimal hardware schedulers for pipelined processors. Other historical LIW processors include the Apollo DN 10000 and the Intel i860, both of which could dual-issue floating-point and integer operations.
Compiler Technology and Hardware Support for Scheduling
Loop-level parallelism and dependence analysis were developed primarily by D. Kuck and his colleagues at the University of Illinois in the 1970s. They also coined the commonly used terms "antidependence" and "output dependence" and developed several standard dependence tests, including the GCD and Banerjee tests. The latter test, named after Utpal Banerjee, comes in a variety of flavors. Recent work on dependence analysis has focused on the use of a variety of exact tests, the most sophisticated being a linear programming algorithm called Fourier-Motzkin. D. Maydan and W. Pugh both showed that sequences of exact tests are a practical solution.
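The GCD test mentioned above can be stated concretely. For a loop that writes `A[a*i + b]` and reads `A[c*i + d]`, the two references can refer to the same element only if gcd(a, c) divides d - b; if it does not, the references are provably independent. A minimal sketch (the function name and example indices are illustrative):

```python
# A sketch of the GCD dependence test for array accesses A[a*i + b]
# (write) and A[c*i + d] (read) inside a loop. If gcd(a, c) does not
# divide d - b, the two references can never touch the same element.
# The test is conservative: divisibility alone does not prove a
# dependence, since it ignores the loop bounds.
from math import gcd

def gcd_test(a, b, c, d):
    """Return True if a dependence is *possible* between the accesses."""
    return (d - b) % gcd(a, c) == 0

print(gcd_test(2, 0, 2, 1))  # A[2i] vs A[2i+1]: False, provably independent
print(gcd_test(4, 0, 6, 2))  # gcd(4,6)=2 divides 2: dependence possible
```

The first example shows the test's value: writes to even elements and reads of odd elements can never conflict, so the loop can be parallelized or reordered freely with respect to this pair of references.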
In the area of uncovering and scheduling ILP, much of the early work was connected to the development of the VLIW processors described earlier. Lam [1988] developed algorithms for software pipelining and evaluated their use on Warp, a wide-instruction-word processor designed for special-purpose applications. Weiss and Smith [1987] compared software pipelining and loop unrolling as techniques for scheduling code on a pipelined processor. Rau [1994] developed modulo scheduling to deal with the issues of software-pipelining loops while simultaneously handling register allocation.
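In modulo scheduling, the central quantity is the initiation interval (II): the number of cycles between starting successive loop iterations. A textbook formulation (details vary across real implementations) bounds II below by resource usage (ResMII) and by dependence recurrences (RecMII); the scheduler tries II = max(ResMII, RecMII) and increases it until a legal schedule is found. A hedged sketch of those two lower bounds:

```python
# A sketch of the two lower bounds used to seed modulo scheduling
# (a generic textbook formulation, not Rau's exact algorithm).
from math import ceil

def res_mii(op_counts, unit_counts):
    # Resource bound: for each resource class, ops needed per iteration
    # divided by the number of units available, rounded up.
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    # Recurrence bound: for each dependence cycle, total latency around
    # the cycle divided by the number of iterations it spans.
    return max(ceil(lat / dist) for lat, dist in cycles)

# Example: 4 memory ops on 2 load/store units, 3 adds on 1 adder;
# one recurrence with latency 3 carried across 1 iteration.
ops = {"mem": 4, "add": 3}
units = {"mem": 2, "add": 1}
print(max(res_mii(ops, units), rec_mii([(3, 1)])))  # II lower bound = 3
```

Here the adder is the bottleneck resource (3 adds on 1 unit) and the recurrence also requires 3 cycles per iteration, so no schedule can start iterations more often than every 3 cycles.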
Support for speculative code scheduling was explored in a variety of contexts; several processors provided a mode in which exceptions were ignored, allowing more aggressive scheduling of loads (for example, the MIPS TFP processor [Hsu 1994]). Several groups explored ideas for more aggressive hardware support for speculative code scheduling. For example, Smith, Horowitz, and Lam [1992] created a concept called boosting that included hardware support for speculation but provided a checking and recovery mechanism similar to those in IA-64 and Crusoe. The sentinel scheduling idea, developed jointly by researchers at the University of Illinois and HP Laboratories, is also similar to the speculate-and-check approach used in both Crusoe and IA-64 (Mahlke et al. [1992]).
In the early 1990s, Wen-Mei Hwu and his colleagues at the University of Illinois developed a compiler framework called IMPACT (Chang et al. [1991]) to explore the interaction between multiple-issue architectures and compiler technology. This project led to several important ideas, including superblock scheduling (Hwu et al. [1993]), extensive use of profiling to guide a variety of optimizations (such as procedure inlining), and the use of a special buffer (similar to the ALAT or a program-controlled store buffer) for compiler-aided memory conflict detection (Gallagher et al. [1994]). Mahlke et al. [1995] explored the performance trade-offs between partial and full support for predication. The early RISC processors all had delayed branches, a scheme inspired by microprogramming, and several studies of compile-time branch prediction were inspired by delayed branch mechanisms. McFarling and Hennessy [1986] did a quantitative comparison of a variety of compile-time and runtime branch-prediction schemes. Fisher and Freudenberger [1992] evaluated a range of compile-time branch-prediction schemes using the metric of distance between mispredictions. Ball and Larus [1993] and Calder et al. [1997] describe static prediction schemes based on collected program behavior.
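Profile-based static prediction of the kind Fisher and Freudenberger evaluated can be sketched simply: each branch receives one fixed prediction, the direction it took most often during a profiling run, which is then applied to every dynamic instance. The sketch below is illustrative (the function name and the profile data are made up, not from the paper):

```python
# A sketch of profile-based static branch prediction in the spirit of
# Fisher and Freudenberger [1992]. Each branch gets a single compile-time
# prediction: its majority direction in a profiling run.
from collections import Counter

def train(profile):
    # profile: list of (branch_pc, taken) pairs from a profiling run.
    counts = {}
    for pc, taken in profile:
        counts.setdefault(pc, Counter())[taken] += 1
    # Fix each branch to its most frequent direction.
    return {pc: c.most_common(1)[0][0] for pc, c in counts.items()}

profile = [(0x40, True), (0x40, True), (0x40, False), (0x80, False)]
static = train(profile)
print(static[0x40], static[0x80])  # True False
```

Unlike the dynamic 2-bit scheme, this prediction cannot adapt at runtime; its accuracy depends on how well the profiling inputs match real inputs, which is exactly what Fisher and Freudenberger measured.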
EPIC and the IA-64 Development
EPIC (Explicitly Parallel Instruction Computing) grew out of the early attempts to build LIW and VLIW machines (particularly those at Cydrome and Multiflow) and out of a long history of compiler research that continued after those companies failed, at HP, the University of Illinois, and elsewhere. Insights gained from that work led designers at HP to propose a VLIW-style, 64-bit architecture to succeed the HP PA RISC architecture. Meanwhile, Intel was looking for a new architecture to replace x86 (now called IA-32) and to provide 64-bit capability. In 1995, the two companies formed a partnership to design a new architecture, IA-64 (Huck et al. [2000]), and to build processors based on it. Itanium (Sharangpani and Arora [2000]) was the first such processor. In 2002, Intel introduced the second-generation IA-64 design, the Itanium 2 (McNairy and Soltis [2003] and McCormick and Knies [2002]).
Studies of ILP and Ideas to Increase ILP
A series of early papers on ILP, including Tjaden and Flynn [1970] and Riseman and Foster [1972], concluded that only small amounts of parallelism could be exploited at the instruction level without an enormous investment in hardware. These papers dampened the appeal of multiple instruction issue for more than 10 years. Nicolau and Fisher [1984] published a paper based on their trace scheduling work and asserted the presence of large amounts of potential ILP in scientific programs.
Since then there have been many studies of the available ILP. Such studies have been criticized because they assume some level of both hardware support and compiler technology. Nonetheless, they are useful for setting expectations and for understanding the sources of the limitations on exploiting parallelism. Wall participated in several such studies, including Jouppi and Wall [1989] and Wall [1991, 1993]. Although the early studies were criticized as conservative (for example, they did not include speculation), the later studies are by far the most ambitious studies of ILP to date and are the basis for the data in Section 3.10 of the main text. Sohi and Vajapeyam [1989] give measurements of available parallelism for wide-instruction-word processors. Smith, Johnson, and Horowitz [1989] also used a speculative superscalar processor to study ILP limits. At the time of their study, they assumed the processor they specified was an upper bound on reasonable designs, but recent and upcoming processors are likely to be at least as ambitious. Skadron et al. [1999] examined the performance trade-offs and limitations of a processor comparable to the most aggressive processors of 2005, concluding that larger issue windows would be pointless without significant improvements in branch prediction for integer programs.
Lam and Wilson [1992] looked at the limitations imposed by speculation and showed that additional gains are possible by allowing processors to speculate in multiple directions, which requires more than one PC. (Such schemes cannot exceed what perfect speculation accomplishes, but they help close the gap between realistic prediction schemes and perfect prediction.) Wall's 1993 study includes a limited evaluation of this approach (up to eight branches are explored).
Going Beyond the Data Flow Limit
Another approach that has been explored in the literature is the use of value prediction, which can allow speculation based on data values. There have been a number of studies of its use. Lipasti and Shen published two papers in 1996 evaluating the concept of value prediction and its potential impact on ILP exploitation. Calder, Reinman, and Tullsen [1999] explored the idea of selective value prediction. Sodani and Sohi [1997] approached the same problem from the viewpoint of reusing the values produced by instructions. Moshovos et al. [1997] showed that deciding when to speculate on values, by tracking whether past predictions have been accurate, is important to achieving performance gains with value prediction. Moshovos and Sohi [1997] and Chrysos and Emer [1998] focused on predicting dependences between stores and loads and using this information to eliminate the dependence through memory. González and González [1998], Babbay and Mendelson [1998], and Calder, Reinman, and Tullsen [1999] are more recent studies of value prediction. This area is currently highly active, with new results published at every conference.
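The combination of Lipasti and Shen's value prediction with Moshovos et al.'s "speculate only when past predictions were accurate" can be sketched as a last-value predictor gated by a saturating confidence counter. This is a deliberately simplified, hypothetical model (the class and thresholds are illustrative, not taken from any of the cited designs):

```python
# A sketch of a last-value predictor gated by a confidence counter.
# A prediction is offered only when confidence >= 2; a mispredicted
# value resets confidence so the predictor stops speculating on
# unstable values (in the spirit of Moshovos et al. [1997]).

class LastValuePredictor:
    def __init__(self):
        self.table = {}   # pc -> (last_value, confidence 0..3)

    def predict(self, pc):
        v, conf = self.table.get(pc, (None, 0))
        return v if conf >= 2 else None   # None = do not speculate

    def update(self, pc, actual):
        v, conf = self.table.get(pc, (None, 0))
        if v == actual:
            conf = min(conf + 1, 3)       # value repeated: gain confidence
        else:
            conf = 0                      # mismatch: stop speculating
        self.table[pc] = (actual, conf)

p = LastValuePredictor()
for v in [7, 7, 7]:
    p.update(0x100, v)
print(p.predict(0x100))  # 7: the value repeated, so confidence is high
```

A load that returns the same value on most executions (a common case Lipasti and Shen observed) quickly builds confidence and is speculated; a load whose value varies stays below the threshold and is never speculated, avoiding costly misspeculation recovery.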
Recent Advanced Microprocessors
In 1994 and 1995, every major processor vendor announced wide superscalar processors (three or more issues per clock): the Intel Pentium Pro and Pentium II (these processors share the same core pipeline architecture; Colwell and Steck [1995]); the AMD K-5, K-6, and Athlon; the Sun UltraSPARC (Lauterbach and Horel [1999]); the Alpha 21164 (Edmondson et al. [1995]) and 21264 (Kessler [1999]); the MIPS R10000 and R12000 (Yeager [1996]); the PowerPC 603, 604, and 620 (Diep, Nelson, and Shen [1995]); and the HP 8000 (Kumar [1997]). Later in the decade (1996 to 2000), second generations of many of these processors (the Pentium III, AMD Athlon, and Alpha 21264, among others) were released. The second-generation designs, while similar in issue rate, could sustain a lower CPI and provide higher clock rates. All included dynamic scheduling, and they almost universally supported speculation. In practice, many factors, including the implementation technology, the memory hierarchy, the skill of the designers, and the type of applications benchmarked, play a role in determining which approach performs best.
From 2000 to 2005, three trends dominated superscalar processors: higher clock rates achieved through deeper pipelines (for example, in the Pentium 4; Hinton et al. [2001]); the introduction of multithreading, by IBM in the Power4 and by Intel in the Pentium 4 Extreme; and the movement to multicore, by IBM in the Power4, by AMD in the Opteron (Keltcher et al. [2003]), and later by Intel (Douglas [2005]).
Multithreading and Simultaneous Multithreading
The concept of multithreading dates back to one of the earliest transistorized computers, the TX-2. The TX-2 is also famous as the computer on which Ivan Sutherland created Sketchpad, the first computer graphics system. The TX-2 was built at MIT's Lincoln Laboratory and became operational in 1959. It used multiple threads to support fast context switching for handling I/O operations. Clark [1957] describes the basic architecture, and Forgie [1957] describes the I/O architecture. Multithreading was also used in the CDC 6600, where a fine-grained multithreading scheme with interleaved scheduling among threads was used in the architecture of the I/O processors. The HEP processor, a pipelined multiprocessor designed by Denelcor and shipped in 1982, used fine-grained multithreading to hide pipeline latency as well as the latency of a large memory shared among all the processors. Because the HEP had no caches, this hiding of memory latency was critical. Burton Smith, one of the primary architects, describes the HEP architecture in a 1978 paper, and Jordan [1983] published a performance evaluation. The TERA processor extends the multithreading ideas and is described by Alverson et al. in a 1992 paper. Niagara's multithreading approach is similar to those of the HEP and TERA systems, although Niagara uses caches, reducing its reliance on multithreading to hide latency.
In the late 1980s and early 1990s, researchers explored the concept of coarse-grained multithreading (also called block multithreading) as a way to tolerate latency, especially in multiprocessor environments. The SPARCLE processor in the Alewife system used such a scheme, switching threads whenever a high-latency exceptional event, such as a long cache miss, occurred (Agarwal et al. [1993]). The IBM Pulsar processor used a similar idea.
In the early 1990s, several research groups arrived at two key insights. First, they realized that fine-grained multithreading was needed to obtain the maximum performance benefit, because in a coarse-grained approach the overhead of thread switching and thread start-up (e.g., filling the pipeline from the new thread) negates the performance advantage (Laudon, Gupta, and Horowitz [1994]). Second, effective use of large numbers of functional units requires both ILP and thread-level parallelism (TLP). These insights led to several architectures combining multithreading and multiple issue. Wolfe and Shen [1991] describe an architecture called XIMD that statically interleaves multiple threads scheduled for a VLIW processor. Hirata et al. [1992] propose a processor for multimedia use that combines a static superscalar pipeline with support for multithreading, and demonstrate the performance gains from combining both forms of parallelism. Keckler and Dally [1992] combine static scheduling of ILP with dynamic scheduling of multiple threads for a processor with multiple functional units. The questions of how to balance the allocation of functional units between ILP and TLP, and how to schedule the two forms of parallelism, remained open.
In the mid-1990s, when it became clear that superscalars with dynamic scheduling would soon arrive, several research groups proposed using the dynamic scheduling capability to mix instructions from multiple threads. Yamamoto et al. [1994] appear to have been the first to publish such a proposal, although the simulation results for their multithreaded superscalar architecture used simplistic assumptions. They were quickly followed by Tullsen, Eggers, and Levy [1995], who provided the first realistic simulation assessment and coined the term "simultaneous multithreading." Subsequent work by the same group, together with industrial collaborators, addressed many of the open questions about SMT. For example, Tullsen et al. [1996] addressed the challenges of scheduling ILP versus TLP; Lo et al. [1997] provided an extensive discussion of the SMT concept and an evaluation of its performance potential; Lo et al. [1998] evaluated database performance on an SMT processor; and Tuck and Tullsen [2003] reviewed the performance of SMT on the Pentium 4.
The IBM Power4 introduced multithreading (Tendler et al. [2002]), while the Power5 used simultaneous multithreading. Mathis et al. [2005] explore the performance of SMT in the Power5, and Sinharoy et al. [2005] describe the system architecture.
References
Agarwal, A., J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D’Souza, and M. Parkin [1993]. “Sparcle: An evolutionary processor design for large-scale multiprocessors,” IEEE Micro 13 (June), 48–61.
Agerwala, T., and J. Cocke [1987]. High Performance Reduced Instruction Set Processors, Tech. Rep. RC12434, IBM Thomas Watson Research Center, Yorktown Heights, N.Y.
Alverson, G., R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith [1992]. "Exploiting heterogeneous parallelism on a multithreaded multiprocessor," Proc. ACM/IEEE Conf. on Supercomputing, November 16–20, 1992, Minneapolis, Minn., 188–197.
Anderson, D. W., F. J. Sparacio, and R. M. Tomasulo [1967]. "The IBM 360 Model 91: Processor philosophy and instruction handling," IBM J. Research and Development 11:1 (January), 8–24.
Austin, T. M., and G. Sohi [1992]. "Dynamic dependency analysis of ordinary programs," Proc. 19th Annual Int'l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 342–351.
Babbay, F., and A. Mendelson [1998]. "Using value prediction to increase the power of speculative execution hardware," ACM Trans. on Computer Systems 16:3 (August), 234–270.
Bakoglu, H. B., G. F. Grohoski, L. E. Thatcher, J. A. Kaeli, C. R. Moore, D. P. Tattle, W. E. Male, W. R. Hardell, D. A. Hicks, M. Nguyen Phu, R. K. Montoye, W. T. Glover, and S. Dhawan [1989]. "IBM second-generation RISC processor organization," Proc. IEEE Int'l. Conf. on Computer Design, October, Rye Brook, N.Y., 138–142.
Ball, T., and J. Larus [1993]. "Branch prediction for free," Proc. ACM SIGPLAN'93 Conference on Programming Language Design and Implementation (PLDI), June 23–25, 1993, Albuquerque, N.M., 300–313.
Bhandarkar, D., and D. W. Clark [1991]. "Performance from architecture: Comparing a RISC and a CISC with similar hardware organizations," Proc. Fourth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 310–319.
Bhandarkar, D., and J. Ding [1997]. "Performance characterization of the Pentium Pro processor," Proc. Third Int'l. Symposium on High Performance Computer Architecture, February 1–5, 1997, San Antonio, Tex., 288–297.
Bloch, E. [1959]. "The engineering design of the Stretch computer," Proc. Eastern Joint Computer Conf., December 1–3, 1959, Boston, Mass., 48–59.
Bucholtz, W. [1962]. Planning a Computer System: Project Stretch, McGraw-Hill, New York.
Calder, B., D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn [1997]. "Evidence-based static branch prediction using machine learning," ACM Trans. Program. Lang. Syst. 19:1, 188–222.
Calder, B., G. Reinman, and D. M. Tullsen [1999]. "Selective value prediction," Proc. 26th Annual Int'l. Symposium on Computer Architecture (ISCA), May 2–4, 1999, Atlanta, Ga.
Chang, P. P., S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu [1991]. "IMPACT: An architectural framework for multiple-instruction-issue processors," Proc. 18th Annual Int'l. Symposium on Computer Architecture (ISCA), May 27–30, 1991, Toronto, Canada, 266–275.
Charlesworth, A. E. [1981]. "An approach to scientific array processing: The architecture design of the AP-120B/FPS-164 family," Computer 14:9 (September), 18–27.
Chen, T. C. [1980]. "Overlap and parallel processing," in Introduction to Computer Architecture, H. Stone, ed., Science Research Associates, Chicago, 427–486.
Chrysos, G. Z., and J. S. Emer [1998]. "Memory dependence prediction using store sets," Proc. 25th Annual Int'l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 142–153.
Clark, D. W. [1987]. "Pipelining and performance in the VAX 8800 processor," Proc. Second Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 173–177.
Clark, W. A. [1957]. "The Lincoln TX-2 computer development," Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 143–145.
Colwell, R. P., and R. Steck [1995]. "A 0.6 μm BiCMOS processor with dynamic execution," Proc. of IEEE Int'l. Symposium on Solid State Circuits (ISSCC), February 15–17, 1995, San Francisco, 176–177.
Colwell, R. P., R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman [1987]. "A VLIW architecture for a trace scheduling compiler," Proc. Second Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 180–192.
Cvetanovic, Z., and R. E. Kessler [2000]. "Performance analysis of the Alpha 21264-based Compaq ES40 system," 27th Annual Int'l. Symposium on Computer Architecture (ISCA), June 10–14, 2000, Vancouver, Canada, 192–202.
Davidson, E. S. [1971]. "The design and control of pipelined function generators," Proc. IEEE Conf. on Systems, Networks, and Computers, January 19–21, 1971, Oaxtepec, Mexico, 19–21.
Davidson, E. S., A. T. Thomas, L. E. Shar, and J. H. Patel [1975]. "Effective control for pipelined processors," Proc. IEEE COMPCON, February 25–27, 1975, San Francisco, 181–184.
Dehnert, J. C., P. Y.-T. Hsu, and J. P. Bratt [1989]. "Overlapped loop support on the Cydra 5," Proc. Third Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, Mass., 26–39.
Diep, T. A., C. Nelson, and J. P. Shen [1995]. "Performance evaluation of the PowerPC 620 microarchitecture," Proc. 22nd Annual Int'l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy.
Ditzel, D. R., and H. R. McLellan [1987]. "Branch folding in the CRISP microprocessor: Reducing the branch delay to zero," Proc. 14th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–5, 1987, Pittsburgh, Penn., 2–7.
Douglas, J. [2005]. "Intel 8xx series and Paxville Xeon-MP microprocessors," paper presented at Hot Chips 17, August 14–16, 2005, Stanford University, Palo Alto, Calif.
Eden, A., and T. Mudge [1998]. "The YAGS branch prediction scheme," Proc. of the 31st Annual ACM/IEEE Int'l. Symposium on Microarchitecture, November 30–December 2, 1998, Dallas, Tex., 69–80.
Edmondson, J. H., P. I. Rubinfield, R. Preston, and V. Rajagopalan [1995]. "Superscalar instruction execution in the 21164 Alpha microprocessor," IEEE Micro 15:2, 33–43.
Ellis, J. R. [1986]. Bulldog: A Compiler for VLIW Architectures, MIT Press, Cambridge, Mass.
Emer, J. S., and D. W. Clark [1984]. "A characterization of processor performance in the VAX-11/780," Proc. 11th Annual Int'l. Symposium on Computer Architecture (ISCA), June 5–7, 1984, Ann Arbor, Mich., 301–310.
Evers, M., S. J. Patel, R. S. Chappell, and Y. N. Patt [1998]. "An analysis of correlation and predictability: What makes two-level branch predictors work," Proc. 25th Annual Int'l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 52–61.
Fisher, J. A. [1981]. "Trace scheduling: A technique for global microcode compaction," IEEE Trans. on Computers 30:7 (July), 478–490.
Fisher, J. A. [1983]. "Very long instruction word architectures and ELI-512," 10th Annual Int'l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 140–150.
Fisher, J. A., and S. M. Freudenberger [1992]. "Predicting conditional branches from previous runs of a program," Proc. Fifth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 85–95.
Fisher, J. A., and B. R. Rau [1993]. Journal of Supercomputing, January (special issue).
Fisher, J. A., J. R. Ellis, J. C. Ruttenberg, and A. Nicolau [1984]. "Parallel processing: A smart compiler and a dumb processor," Proc. SIGPLAN Conf. on Compiler Construction, June 17–22, 1984, Montreal, Canada, 11–16.
Forgie, J. W. [1957]. "The Lincoln TX-2 input-output system," Proc. Western Joint Computer Conference, February 26–28, 1957, Los Angeles, 156–160.
Foster, C. C., and E. M. Riseman [1972]. "Percolation of code to enhance parallel dispatching and execution," IEEE Trans. on Computers C-21:12 (December), 1411–1415.
Gallagher, D. M., W. Y. Chen, S. A. Mahlke, J. C. Gyllenhaal, and W. W. Hwu [1994]. "Dynamic memory disambiguation using the memory conflict buffer," Proc. Sixth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, San Jose, Calif., 183–193.
González, J., and A. González [1998]. "Limits of instruction level parallelism with data speculation," Proc. Vector and Parallel Processing (VECPAR) Conf., June 21–23, 1998, Porto, Portugal, 585–598.
Heinrich, J. [1993]. MIPS R4000 User's Manual, Prentice Hall, Englewood Cliffs, N.J.
Hinton, G., D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel [2001]. "The microarchitecture of the Pentium 4 processor," Intel Technology Journal, February.
Hirata, H., K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa [1992]. "An elementary processor architecture with simultaneous instruction issuing from multiple threads," Proc. 19th Annual Int'l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 136–145.
Hopkins, M. [2000]. "A critical look at IA-64: Massive resources, massive ILP, but can it deliver?" Microprocessor Report, February.
Hsu, P. [1994]. "Designing the TFP microprocessor," IEEE Micro 18:2 (April), 23–33.
Huck, J., et al. [2000]. "Introducing the IA-64 architecture," IEEE Micro 20:5 (September–October), 12–23.
Hwu, W.-M., and Y. Patt [1986]. "HPSm, a high performance restricted data flow architecture having minimum functionality," 13th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 297–307.
Hwu, W. W., S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. O. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery [1993]. "The superblock: An effective technique for VLIW and superscalar compilation," J. Supercomputing 7:1, 2 (March), 229–248.
IBM [1990]. "The IBM RISC System/6000 processor" (collection of papers), IBM J. Research and Development 34:1 (January).
Jimenez, D. A., and C. Lin [2002]. "Neural methods for dynamic branch prediction," ACM Trans. Computer Sys. 20:4 (November), 369–397.
Johnson, M. [1990]. Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J.
Jordan, H. F. [1983]. "Performance measurements on HEP—a pipelined MIMD computer," Proc. 10th Annual Int'l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 207–212.
Jouppi, N. P., and D. W. Wall [1989]. "Available instruction-level parallelism for superscalar and superpipelined processors," Proc. Third Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 272–282.
Kaeli, D. R., and P. G. Emma [1991]. "Branch history table prediction of moving target branches due to subroutine returns," Proc. 18th Annual Int'l. Symposium on Computer Architecture (ISCA), May 27–30, 1991, Toronto, Canada, 34–42.
Keckler, S. W., and W. J. Dally [1992]. "Processor coupling: Integrating compile time and runtime scheduling for parallelism," Proc. 19th Annual Int'l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 202–213.
Keller, R. M. [1975]. "Look-ahead processors," ACM Computing Surveys 7:4 (December), 177–195.
Keltcher, C. N., K. J. McGrath, A. Ahmed, and P. Conway [2003]. "The AMD Opteron processor for multiprocessor servers," IEEE Micro 23:2 (March–April), 66–76.
Kessler, R. [1999]. "The Alpha 21264 microprocessor," IEEE Micro 19:2 (March/April), 24–36.
Killian, E. [1991]. "MIPS R4000 technical overview–64 bits/100 MHz or bust," Hot Chips III Symposium Record, August 26–27, 1991, Stanford University, Palo Alto, Calif., 1.6–1.19.
Kogge, P. M. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.
Kumar, A. [1997]. "The HP PA-8000 RISC CPU," IEEE Micro 17:2 (March/April).
Kunkel, S. R., and J. E. Smith [1986]. "Optimal pipelining in supercomputers," Proc. 13th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 404–414.
Lam, M. [1988]. "Software pipelining: An effective scheduling technique for VLIW processors," SIGPLAN Conf. on Programming Language Design and Implementation, June 22–24, 1988, Atlanta, Ga., 318–328.
Lam, M. S., and R. P. Wilson [1992]. "Limits of control flow on parallelism," Proc. 19th Annual Int'l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 46–57.
Laudon, J., A. Gupta, and M. Horowitz [1994]. "Interleaving: A multithreading technique targeting multiprocessors and workstations," Proc. Sixth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 4–7, San Jose, Calif., 308–318.
Lauterbach, G., and T. Horel [1999]. "UltraSPARC-III: Designing third generation 64-bit performance," IEEE Micro 19:3 (May/June).
Lipasti, M. H., and J. P. Shen [1996]. "Exceeding the dataflow limit via value prediction," Proc. 29th Int'l. Symposium on Microarchitecture, December 2–4, 1996, Paris, France.
Lipasti, M. H., C. B. Wilkerson, and J. P. Shen [1996]. "Value locality and load value prediction," Proc. Seventh Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1–5, 1996, Cambridge, Mass., 138–147.
Lo, J., L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh [1998]. "An analysis of database workload performance on simultaneous multithreaded processors," Proc. 25th Annual Int'l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 39–50.
Lo, J., S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen [1997]. "Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading," ACM Trans. on Computer Systems 15:2 (August), 322–354.
Mahlke, S. A., W. Y. Chen, W.-M. Hwu, B. R. Rau, and M. S. Schlansker [1992]. "Sentinel scheduling for VLIW and superscalar processors," Proc. Fifth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 238–247.
Mahlke, S. A., R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu [1995]. "A comparison of full and partial predicated execution support for ILP processors," Proc. 22nd Annual Int'l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 138–149.
Mathis, H. M., A. E. Mercias, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel [2005]. "Characterization of the multithreading (SMT) efficiency in Power5," IBM J. of Research and Development 49:4/5 (July/September), 555–564.
McCormick, J., and A. Knies [2002]. "A brief analysis of the SPEC CPU2000 benchmarks on the Intel Itanium 2 processor," paper presented at Hot Chips 14, August 18–20, 2002, Stanford University, Palo Alto, Calif.
McFarling, S. [1993]. Combining Branch Predictors, WRL Technical Note TN-36, Digital Western Research Laboratory, Palo Alto, Calif.
McFarling, S., and J. Hennessy [1986]. "Reducing the cost of branches," Proc. 13th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–5, 1986, Tokyo, 396–403.
McNairy, C., and D. Soltis [2003]. "Itanium 2 processor microarchitecture," IEEE Micro 23:2 (March–April), 44–55.
Moshovos, A., and G. S. Sohi [1997]. "Streamlining inter-operation memory communication via data dependence prediction," Proc. 30th Annual Int'l. Symposium on Microarchitecture, December 1–3, Research Triangle Park, N.C., 235–245.
Moshovos, A., S. Breach, T. N. Vijaykumar, and G. S. Sohi [1997]. "Dynamic speculation and synchronization of data dependences," Proc. 24th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo.
Nicolau, A., and J. A. Fisher [1984]. "Measuring the parallelism available for very long instruction word architectures," IEEE Trans. on Computers C-33:11 (November), 968–976.
Pan, S.-T., K. So, and J. T. Rameh [1992]. "Improving the accuracy of dynamic branch prediction using branch correlation," Proc. Fifth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 76–84.
Postiff, M. A., D. A. Greene, G. S. Tyson, and T. N. Mudge [1999]. "The limits of instruction level parallelism in SPEC95 applications," Computer Architecture News 27:1 (March), 31–40.
Ramamoorthy, C. V., and H. F. Li [1977]. "Pipeline architecture," ACM Computing Surveys 9:1 (March), 61–102.
Rau, B. R. [1994]. "Iterative modulo scheduling: An algorithm for software pipelining loops," Proc. 27th Annual Int'l. Symposium on Microarchitecture, November 30–December 2, 1994, San Jose, Calif., 63–74.
Rau, B. R., C. D. Glaeser, and R. L. Picard [1982]. "Efficient code generation for horizontal architectures: Compiler techniques and architectural support," Proc. Ninth Annual Int'l. Symposium on Computer Architecture (ISCA), April 26–29, 1982, Austin, Tex., 131–139.
Rau, B. R., D. W. L. Yen, W. Yen, and R. A. Towle [1989]. "The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs," IEEE Computers 22:1 (January), 12–34.
Riseman, E. M., and C. C. Foster [1972]. "Percolation of code to enhance paralleled dispatching and execution," IEEE Trans. on Computers C-21:12 (December), 1411–1415.
Rymarczyk, J. [1982]. "Coding guidelines for pipelined processors," Proc. Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 1–3, 1982, Palo Alto, Calif., 12–19.
Sharangpani, H., and K. Arora [2000]. "Itanium processor microarchitecture," IEEE Micro 20:5 (September–October), 24–43.
Sinharoy, B., R. N. Koala, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner [2005]. "POWER5 system microarchitecture," IBM J. of Research and Development 49:4–5, 505–521.
Sites, R. [1979]. Instruction Ordering for the CRAY-1 Computer, Tech. Rep. 78-CS-023, Dept. of Computer Science, University of California, San Diego.
Skadron, K., P. S. Ahuja, M. Martonosi, and D. W. Clark [1999]. "Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques," IEEE Trans. on Computers 48:11 (November).
Smith, A., and J. Lee [1984]. "Branch prediction strategies and branch-target buffer design," Computer 17:1 (January), 6–22.
Smith, B. J. [1978]. "A pipelined, shared resource MIMD computer," Proc. Int'l. Conf. on Parallel Processing (ICPP), August, Bellaire, Mich., 6–8.
Smith, J. E. [1981]. "A study of branch prediction strategies," Proc. Eighth Annual Int'l. Symposium on Computer Architecture (ISCA), May 12–14, 1981, Minneapolis, Minn., 135–148.
Smith, J. E. [1984]. "Decoupled access/execute computer architectures," ACM Trans. on Computer Systems 2:4 (November), 289–308.
Smith, J. E. [1989]. "Dynamic instruction scheduling and the Astronautics ZS-1," Computer 22:7 (July), 21–35.
Smith, J. E., and A. R. Pleszkun [1988]. "Implementing precise interrupts in pipelined processors," IEEE Trans. on Computers 37:5 (May), 562–573. (This paper is based on an earlier paper that appeared in Proc. 12th Annual Int'l. Symposium on Computer Architecture (ISCA), June 17–19, 1985, Boston, Mass.)
Smith, J. E., G. E. Dermer, B. D. Vanderwarn, S. D. Klinger, C. M. Rozewski, D. L. Fowler, K. R. Scidmore, and J. P. Laudon [1987]. "The ZS-1 central processor," Proc. Second Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 199–204.
Smith, M. D., M. Horowitz, and M. S. Lam [1992]. "Efficient superscalar performance through boosting," Proc. Fifth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston, 248–259.
Smith, M. D., M. Johnson, and M. A. Horowitz [1989]. "Limits on multiple instruction issue," Proc. Third Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 290–302.
Sodani, A., and G. Sohi [1997]. "Dynamic instruction reuse," Proc. 24th Annual Int'l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo.
Sohi, G. S. [1990]. "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. on Computers 39:3 (March), 349–359.
Sohi, G. S., and S. Vajapeyam [1989]. "Tradeoffs in instruction format design for horizontal architectures," Proc. Third Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 15–25.
Sussenguth, E. [1999]. "IBM's ACS-1 machine," IEEE Computer 22:11 (November).
Tendler, J. M., J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy [2002]. "Power4 system microarchitecture," IBM J. of Research and Development 46:1, 5–26.
Thorlin, J. F. [1967]. "Code generation for PIE (parallel instruction execution) computers," Proc. Spring Joint Computer Conf., April 18–20, 1967, Atlantic City, N.J., 27.
Thornton, J. E. [1964]. "Parallel operation in the Control Data 6600," Proc. AFIPS Fall Joint Computer Conf., Part II, October 27–29, 1964, San Francisco, 26, 33–40.
Thornton, J. E. [1970]. Design of a Computer, the Control Data 6600, Scott, Foresman, Glenview, Ill.
Tjaden, G. S., and M. J. Flynn [1970]. "Detection and parallel execution of independent instructions," IEEE Trans. on Computers C-19:10 (October), 889–895.
Tomasulo, R. M. [1967]. "An efficient algorithm for exploiting multiple arithmetic units," IBM J. Research and Development 11:1 (January), 25–33.
Tuck, N., and D. Tullsen [2003]. "Initial observations of the simultaneous multithreading Pentium 4 processor," Proc. 12th Int'l. Conf. on Parallel Architectures and Compilation Techniques (PACT '03), September 27–October 1, New Orleans, La., 26–34.
Tullsen, D. M., S. J. Eggers, and H. M. Levy [1995]. "Simultaneous multithreading: Maximizing on-chip parallelism," Proc. 22nd Annual Int'l. Symposium on Computer Architecture (ISCA), June 22–24, 1995, Santa Margherita, Italy, 392–403.
Tullsen, D. M., S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm [1996]. "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor," Proc. 23rd Annual Int'l. Symposium on Computer Architecture (ISCA), May 22–24, 1996, Philadelphia, Penn., 191–202.
Wall, D. W. [1991]. "Limits of instruction-level parallelism," Proc. Fourth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 248–259.
Wall, D. W. [1993]. Limits of Instruction-Level Parallelism, Research Rep. 93/6, Western Research Laboratory, Digital Equipment Corp., Palo Alto, Calif.
Weiss, S., and J. E. Smith [1984]. "Instruction issue logic for pipelined supercomputers," Proc. 11th Annual Int'l. Symposium on Computer Architecture (ISCA), June 5–7, 1984, Ann Arbor, Mich., 110–118.
Weiss, S., and J. E. Smith [1987]. "A study of scalar compilation techniques for pipelined supercomputers," Proc. Second Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 5–8, 1987, Palo Alto, Calif., 105–109.
Wilson, R. P., and M. S. Lam [1995]. "Efficient context-sensitive pointer analysis for C programs," Proc. ACM SIGPLAN '95 Conf. on Programming Language Design and Implementation, June 18–21, 1995, La Jolla, Calif., 1–12.
Wolfe, A., and J. P. Shen [1991]. "A variable instruction stream extension to the VLIW architecture," Proc. Fourth Int'l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Palo Alto, Calif., 2–14.
Yamamoto, W., M. J. Serrano, A. R. Talcott, R. C. Wood, and M. Nemirosky [1994]. "Performance estimation of multistreamed, superscalar processors," Proc. 27th Annual Hawaii Int'l. Conf.
on System Sciences, January 4–7, 1994, Maui, 195–204.Yeager, K. [1996]. “The MIPS R10000 superscalar microprocessor,” IEEE Micro 16:2 (April), 28–40.Yeh, T., and Y. N. Patt [1992]. “Alternative implementations of two-level adaptive branch prediction,” Proc. 19th Annual Int’l. Symposium on Computer Architecture (ISCA), May 19–21, 1992, Gold Coast, Australia, 124–134.Yeh, T., and Y. N. Patt [1993]. “A comparison of dynamic branch predictors that use two levels of branch history,” Proc. 20th Annual Int’l. Symposium on Computer Architecture (ISCA), May 16–19, 1993, San Diego, Calif., 257–266.
This article is a personal-interest piece; it represents only my own views and not those of my employer.