『译』计算机体系结构发展史(二)
系列文章第二篇(对应M.3)
往期文章
M.3 The Development of Memory Hierarchy and Protection (Chapter 2 and Appendix B)
Atlas和IBM 360都提供了内存页保护功能,而GE 645是第一个提供页式分段(paged segmentation)的系统。较早的Burroughs计算机使用分段方式来支持虚拟内存,这和Intel 8086的分段地址方案类似。80286是第一个具有附录C中所述保护机制的80x86处理器,它受到运行在GE 645上的Multics保护软件的启发。随着时间的推移,一些更为复杂的机制相继出现。最精细的机制是“capabilities”,它在70年代末和80年代初引起了极大的兴趣[Fabry 1974; Wulf, Levin, and Harbison 1981]。Wilkes [1982]是最早从事相关研究的工作者之一,他曾这样说:
“Anyone who has been concerned with an implementation of the type just described [capability system], or has tried to explain one to others, is likely to feel that complexity has got out of hand. It is particularly disappointing that the attractive idea of capabilities being tickets that can be freely handed around has become lost ….
Compared with a conventional computer system, there will inevitably be a cost to be met in providing a system in which the domains of protection are small and frequently changed. This cost will manifest itself in terms of additional hardware, decreased runtime speed, and increased memory occupancy. It is at present an open question whether, by adoption of the capability approach, the cost can be reduced to reasonable proportions. [p. 112]”
如今,尽管对内存保护和安全性的关注度越来越高,操作系统或计算机体系结构领域对于这项技术已经几乎没什么兴趣了。
Bell and Strecker [1976]对PDP-11的设计进行了反思,并确定了“地址空间较小”是唯一难以修复的体系结构错误。在设计PDP-11的时期,核心内存的容量增长非常缓慢。此外,当时有100多家小型计算机公司,竞争非常激烈。如果每个地址必须两次通过16位的数据通路,那么DEC的产品可能就失去成本竞争力了。因此,架构师决定相对其前身只增加4个地址位。
IBM 360的架构师意识到地址空间大小的重要性,因此计划将体系结构扩展到32位地址。但是,由于在1964年使用更大的地址时,低端360机型甚至会变得更慢,因此在IBM 360中仅使用了24位地址。不幸的是,当时架构师没有向软件开发人员说明他们的计划,一些程序员在高8位“未使用”地址位中存储额外信息导致了问题。(20年后,Apple在Motorola 68000中使用24位地址犯了类似的错误,当稍后的68000使用完整的32位虚拟地址时,需要一个额外的程序来为Macintosh计算机检查哪些是“ 32位干净”的程序。)此后几乎每台计算机都会对地址访问进行检查,确保未使用的地址位保持未使用状态,并在这些位的值出现错误时触发“陷阱”(trap)。
据记载,系统虚拟机( system virtual machines)由IBM率先提出,是其虚拟内存研究的一部分。IBM的第一台带有虚拟内存的计算机是1967年推出的IBM 360/67。IBM研究人员为其编写了CP-67程序,该程序可以虚拟多台独立的360台计算机。之后,他们编写了一个在可在这些虚拟机上运行的名为CMS的交互式单用户操作系统。CP-67为后续产品VM/370打下了基础。今天,IBM为其大型机销售z/VM虚拟机操作系统 [Meyer and Seawright 1970; Van Vleck 2005]。
在Atlas论文发表几年后,Wilkes发表了第一篇描述高速缓存(Cache)概念的论文[1965]:
“The use is discussed of a fast core memory of, say, 32,000 words as slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory. [p. 270]”
这篇两页纸的论文介绍了直接映射缓存(direct-mapped cache)的概念。尽管这是有关高速缓存的第一份出版物,但第一个高速缓存的实现可能是在剑桥大学建造的直接映射指令缓存(direct-mapped instruction cache)。它基于隧道二极管存储器(tunnel diode memory),这是当时最快的存储器形式。Wilkes 曾指出,G.Scarott提出了高速缓存的想法。
在这篇论文之后,IBM启动了一个相关项目,推出了第一台带有高速缓存的商用计算机IBM 360/85 [Liptay 1968]。Gibson [1967]描述了如何使用访存流量(memory traffic)和未命中率(miss rate)来测量程序行为,并展示了未命中率在不同程序之间的变化情况。Gibson使用包括20个程序(每个有300万个引用!)的样本,依据平均内存访问时间来对比具有和不具有高速缓存的系统的性能。这个尝试已有40多年的历史了,直到90年代初,未命中率仍然在架构研究和评测中得到广泛应用。
Conti,Gibson和Pitkowsky [1968]描述了360/85的最终性能。尽管360/85相较360/91的时钟较慢(时钟周期80ns对60ns),内存交织更少(4对16),主内存更慢(1.04us与0.75us),但在这篇论文使用的11个程序中,360/91只在3个程序上优于360/85。这篇论文也是第一个使用“Cache”一词的。
此后高速缓存的研究快速发展。Strecker [1976]发表了第一篇比较高速缓存设计的论文,研究了PDP-11的高速缓存。Smith(1982)后来发表了一份详尽的综述论文,其中使用了术语“spatial locality”(空间局部性)和“temporal locality”(时间局部性)。这篇文章为许多计算机设计师提供了参考。
尽管大多数研究都依赖于仿真,但是Clark [1983]在分析中使用硬件监视器(hardware monitor)记录了几天内VAX-11/780的高速缓存未命中的情况。Clark and Emer [1985]随后比较了对地址转译功能的仿真和硬件测量。
Hill(1987)提出了附录B中描述的用三个C(Compulsory, Capacity, Conflict)来解释高速缓存未命中的情况。Jouppi [1998]回顾说,Hill的三C模型直接导致了他发明victim cache(牺牲块缓冲区),以利用更快的直接映射缓存的优势,同时又避免了大多数冲突导致的未命中(conflict misses)的代价。Sugumar和Abraham [1993]认为,三C模型的基线缓存应使用最佳替换(optimal replacement)策略。这可以消除基于“最近最少使用”(LRU)的未命中分类所导致的异常。同时,这种方法还可以将冲突导致的未命中分解为两种:由映射引起的未命中,以及由非最佳替换算法引起的未命中。
关于非阻塞式缓存(nonblocking caches)的最早的论文之一是Kroft [1981]。之后,Kroft [1998]说明,他是第一个在Control Data Corporation(CDC,a mainframe and supercomputer firm)设计带有高速缓存的计算机的人。在尝试把旧的概念用于新机制时,他想到了使用双端口的高速缓存可以在发生未命中的情况时继续为其他访问提供服务。
Baer and Wang [1988] 对多级包含性(multilevel inclusion)进行了首次分析。Wang, Baer, and Levy [1989]随后发表了一篇有关多级高速缓存性能评估的早期论文。后来,Jouppi和Wilton [1994]提出了片上多级高速缓存的多级互斥性(multilevel exclusion)。
除了victim cache外,Jouppi [1990]还研究了通过流缓冲区(streaming buffers)进行的预取。Farkas, Jouppi, and Chow [1995]将他的工作进行了扩展,使流缓冲区可以很好的配合无阻塞负载和针对顺序处理器的推测执行(speculative execution)。之后,Farkas et al. [1997]表明,尽管乱序处理器对于不可预测的延迟有更好的容忍性,仍然可以通过该技术受益。他们还优化了流缓冲区对内存带宽的要求。
90年代的Symposium on Architectural Support for Compilers and Operating Systems(ASPLOS)和International Computer Architecture Symposium(ISCA)会议中充斥大量关于高速缓存的论文。(有些说法称,ISCA应该称为“International Cache Architecture Symposium”)
本书正文第2章使用Cantin and Hill [2001]收集的SPEC2000基准测试。第二章中一些图表使用了下面一些论文中的数据:Agarwal and Pudar [1993]; Barroso, Gharachorloo, and Bugnion [1998]; Farkas and Jouppi [1994]; Jouppi [1990]; Lam, Rothberg, and Wolf [1991]; Lebeck and Wood [1994]; McCalpin [2005]; Mowry, Lam, and Gupta [1992]; and Torrellas, Gupta, and Hennessy[1992]。
- 第二部分完 -
参考资料
Agarwal, A. [1987]. “Analysis of Cache Performance for Operating Systems and Multiprogramming,” Ph.D. thesis, Tech. Rep. No. CSL-TR-87-332, Stanford University, Palo Alto, Calif.
Agarwal, A., and S. D. Pudar [1993]. “Column-associative caches: A technique for reducing the miss rate of direct-mapped caches,” 20th Annual Int’l. Symposium on Computer Architecture (ISCA), May 16–19, 1993, San Diego, Calif. (Computer Architecture News 21:2 (May), 179–190).
Baer, J.-L., and W.-H. Wang [1988]. “On the inclusion property for multi-level cache hierarchies,” Proc. 15th Annual Int’l. Symposium on Computer Architecture (ISCA), May 30–June 2, 1988, Honolulu, Hawaii, 73–80.
Barham, P., B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, and R. Neugebauer [2003]. “Xen and the art of virtualization,” Proc. of the 19th ACM Symposium on Operating Systems Principles, October 19–22, 2003, Bolton Landing, N.Y.
Barroso, L. A., K. Gharachorloo, and E. Bugnion [1998]. “Memory system characterization of commercial workloads,” Proc. 25th Annual Int’l. Symposium on Computer Architecture (ISCA), July 3–14, 1998, Barcelona, Spain, 3–14.
Bell, C. G., and W. D. Strecker [1976]. “Computer structures: What have we learned from the PDP-11?” Proc. Third Annual Int’l. Symposium on Computer Architecture (ISCA), January 19–21, 1976, Tampa, Fla., 1–14.
Bhandarkar, D. P. [1995]. Alpha Architecture Implementations, Digital Press, Newton, Mass.
Borg, A., R. E. Kessler, and D. W. Wall [1990]. “Generation and analysis of very long address traces,” Proc. 17th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–31, 1990, Seattle, Wash., 270–279.
Cantin, J. F., and M. D. Hill [2001]. “Cache performance for selected SPEC CPU2000 benchmarks,” http://www.cs.wisc.edu/multifacet/misc/ spec2000cache-data/.
Cantin, J., and M. Hill [2003]. “Cache performance for SPEC CPU2000 benchmarks, version 3.0,” http://www.cs.wisc.edu/multifacet/misc/spec2000cache- data/index.html.
Case, R. P., and A. Padegs [1978]. “The architecture of the IBM System/370,” Communications of the ACM 21:1, 73–96. Also appears in D. P. Siewiorek, C. G. Bell, and A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, 1982, 830–855.
Clark, B., T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. Neefe Matthews [2004]. “Xen and the art of repeated research,” Proc. USENIX Annual Technical Conf., June 27–July 2, 2004, Boston, 1135–1144.
Clark, D. W. [1983]. “Cache performance of the VAX-11/780,” ACM Trans. on Computer Systems 1:1, 24–37.
Clark, D. W., and J. S. Emer [1985]. “Performance of the VAX-11/780 translation buffer: Simulation and measurement,” ACM Trans. on Computer Systems 3:1 (February), 31–62.
Compaq Computer Corporation. [1999]. Compiler Writer’s Guide for the Alpha 21264, Order Number EC-RJ66A-TE, June.
Conti, C., D. H. Gibson, and S. H. Pitkowsky [1968]. “Structural aspects of the System/360 Model 85. Part I. General organization,” IBM Systems J. 7:1, 2–14. Crawford, J., and P. Gelsinger [1988]. Programming the 80386, Sybex, Alameda, Calif.
Cvetanovic, Z., and R. E. Kessler [2000]. “Performance analysis of the Alpha 21264-based Compaq ES40 system,” Proc. 27th Annual Int’l. Symposium on Computer Architecture (ISCA), June 10–14, 2000, Vancouver, Canada, 192–202.
Fabry, R. S. [1974]. “Capability based addressing,” Communications of the ACM 17:7 (July), 403–412.
Farkas, K. I., P. Chow, N. P. Jouppi, and Z. Vranesic [1997]. “Memory-system design considerations for dynamically-scheduled processors,” Proc. 24th Annual Int’l. Symposium on Computer Architecture (ISCA), June 2–4, 1997, Denver, Colo., 133–143.
Farkas, K. I., and N. P. Jouppi [1994]. “Complexity/performance trade-offs with non-blocking loads,” Proc. 21st Annual Int’l. Symposium on Computer Architecture (ISCA), April 18–21, 1994, Chicago.
Farkas, K. I., N. P. Jouppi, and P. Chow [1995]. “How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors?” Proc. First IEEE Symposium on High-Performance Computer Architecture, January 22–25, 1995, Raleigh, N.C., 78–89.
Gao, Q. S. [1993]. “The Chinese remainder theorem and the prime memory system,” 20th Annual Int’l. Symposium on Computer Architecture (ISCA), May 16–19, 1993, San Diego, Calif. (Computer Architecture News 21:2 (May), 337–340).
Gee, J. D., M. D. Hill, D. N. Pnevmatikatos, and A. J. Smith [1993]. “Cache performance of the SPEC92 benchmark suite,” IEEE Micro 13:4 (August), 17–27.
Gibson, D. H. [1967]. “Considerations in block-oriented systems design,” AFIPS Conf. Proc. 30, 75–80.
Handy, J. [1993]. The Cache Memory Book, Academic Press, Boston.
Heald, R., K. Aingaran, C. Amir, M. Ang, M. Boland, A. Das, P. Dixit, G. Gouldsberry, J. Hart, T. Horel, W.-J. Hsu, J. Kaku, C. Kim, S. Kim, F. Klass, H. Kwan, R. Lo, H. McIntyre, A. Mehta, D. Murata, S. Nguyen, Y.-P. Pai, S. Patel, K. Shin, K. Tam, S. Vishwanthaiah, J. Wu, G. Yee, and H. You [2000]. “Implementation of third-generation SPARC V9 64-b microprocessor,” ISSCC Digest of Technical Papers, 412–413 and slide supplement.
Hill, M. D. [1987]. “Aspects of Cache Memory and Instruction Buffer Performance,” Ph.D. thesis, Tech. Rep. UCB/CSD 87/381, Computer Science Division, University of California, Berkeley.
Hill, M. D. [1988]. “A case for direct mapped caches,” Computer 21:12 (Decem- ber), 25–40.
Horel, T., and G. Lauterbach [1999]. “UltraSPARC-III: Designing third-generation 64-bit performance,” IEEE Micro 19:3 (May–June), 73–85.
Hughes, C. J., P. Kaul, S. V. Adve, R. Jain, C. Park, and J. Srinivasan [2001]. “Variability in the execution of multimedia applications and implications for architecture,” Proc. 28th Annual Int’l. Symposium on Computer Architecture (ISCA), June 30–July 4, 2001, Goteborg, Sweden, 254–265.
IEEE. [2005]. “Intel virtualization technology, computer,” IEEE Computer Society 38:5 (May), 48–56.
Jouppi, N. P. [1990]. “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” Proc. 17th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–31, 1990, Seattle, Wash., 364–373.
Jouppi, N. P. [1998]. “Retrospective: Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” in G. S. Sohi, ed., 25 Years of the International Symposia on Computer Architecture (Selected Papers), ACM, New York, 71–73.
Jouppi, N. P., and S. J. E. Wilton [1994]. “Trade-offs in two-level on-chip caching,” Proc. 21st Annual Int’l. Symposium on Computer Architecture (ISCA), April 18–21, 1994, Chicago, 34–45.
Kessler, R. E. [1999]. “The Alpha 21264 microprocessor,” IEEE Micro 19:2 (March/April), 24–36.
Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner [1962]. “One-level storage system,” IRE Trans. on Electronic Computers EC-11 (April) 223–235. Also appears in D. P. Siewiorek, C. G. Bell, and A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, 1982, 135–148.
Kroft, D. [1981]. “Lockup-free instruction fetch/prefetch cache organization,” Proc. Eighth Annual Int’l. Symposium on Computer Architecture (ISCA), May 12–14, 1981, Minneapolis, Minn., 81–87.
Kroft, D. [1998]. “Retrospective: Lockup-free instruction fetch/prefetch cache organization,” in G. S. Sohi, ed., 25 Years of the International Symposia on Computer Architecture (Selected Papers), ACM, New York, 20–21.
Kunimatsu, A., N. Ide, T. Sato, Y. Endo, H. Murakami, T. Kamei, M. Hirano, F. Ishihara, H. Tago, M. Oka, A. Ohba, T. Yutaka, T. Okada, and M. Suzuoki [2000]. “Vector unit architecture for emotion synthesis,” IEEE Micro 20:2 (March–April), 40–47.
Lam, M. S., E. E. Rothberg, and M. E. Wolf [1991]. “The cache performance and optimizations of blocked algorithms,” Proc. Fourth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 8–11, 1991, Santa Clara, Calif. (SIGPLAN Notices 26:4 (April), 63–74).
Lebeck, A. R., and D. A. Wood [1994]. “Cache profiling and the SPEC benchmarks: A case study,” Computer 27:10 (October), 15–26.
Liptay, J. S. [1968]. “Structural aspects of the System/360 Model 85. Part II. The cache,” IBM Systems J. 7:1, 15–21.
Luk, C.-K., and T. C Mowry [1999]. “Automatic compiler-inserted prefetching for pointer-based applications,” IEEE Trans. on Computers, 48:2 (February), 134–141.
McCalpin, J. D. [2005]. “STREAM: Sustainable Memory Bandwidth in High Performance Computers,” www.cs.virginia.edu/stream/.
McFarling, S. [1989]. “Program optimization for instruction caches,” Proc. Third Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 3–6, 1989, Boston, 183–191.
Menon, A., J. Renato Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel [2005]. “Diagnosing performance overheads in the xen virtual machine environment,” Proc. First ACM/USENIX Int’l. Conf. on Virtual Execution Environments, June 11–12, 2005, Chicago, 13–23.
Meyer, R. A., and L. H. Seawright [1970]. “A virtual machine time sharing system,” IBM Systems J. 9:3, 199–218.
Mowry, T. C., S. Lam, and A. Gupta [1992]. “Design and evaluation of a compiler algorithm for prefetching,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston (SIGPLAN Notices 27:9 (September), 62–73).
Oka, M., and M. Suzuoki [1999]. “Designing and programming the emotion engine,” IEEE Micro 19:6 (November–December), 20–28.
Pabst, T. [2000]. “Performance Showdown at 133 MHz FSB—The Best Platform for Coppermine,” www6.tomshardware.com/mainboard/00q1/000302/.
Palacharla, S., and R. E. Kessler [1994]. “Evaluating stream buffers as a secondary cache replacement,” Proc. 21st Annual Int’l. Symposium on Computer Architecture (ISCA), April 18–21, 1994, Chicago, 24–33.
Przybylski, S. A. [1990]. Cache Design: A Performance-Directed Approach, Morgan Kaufmann, San Francisco.
Przybylski, S. A., M. Horowitz, and J. L. Hennessy [1988]. “Performance trade-offs in cache design,” Proc. 15th Annual Int’l. Symposium on Computer Architecture (ISCA), May 30–June 2, 1988, Honolulu, Hawaii, 290–298.
Reinman, G., and N. P. Jouppi. [1999]. “Extensions to CACTI.”
Robin, J., and C. Irvine [2000]. “Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor,” Proc. USENIX Security Symposium, August 14–17, 2000, Denver, Colo.
Saavedra-Barrera, R. H. [1992]. “CPU Performance Evaluation and Execution Time Prediction Using Narrow Spectrum Benchmarking,” Ph.D. dissertation, University of California, Berkeley.
Samples, A. D., and P. N. Hilfinger [1988]. Code Reorganization for Instruction Caches, Tech. Rep. UCB/CSD 88/447, University of California, Berkeley.
Sites, R. L. (ed.) [1992]. Alpha Architecture Reference Manual, Digital Press, Burlington, Mass.
Skadron, K., and D. W. Clark [1997]. “Design issues and tradeoffs for write buffers,” Proc. Third Int’l. Symposium on High-Performance Computer Architecture, February 1–5, 1997, San Antonio, Tex., 144–155.
Smith, A. J. [1982]. “Cache memories,” Computing Surveys 14:3 (September), 473–530.
Smith, J. E., and J. R. Goodman [1983]. “A study of instruction cache organizations and replacement policies,” Proc. 10th Annual Int’l. Symposium on Computer Architecture (ISCA), June 5–7, 1982, Stockholm, Sweden, 132–137.
Stokes, J. [2000]. “Sound and Vision: A Technical Overview of the Emotion Engine,” http://arstechnica.com/hardware/reviews/2000/02/ee.ars.
Strecker, W. D. [1976]. “Cache memories for the PDP-11?” Proc. Third Annual Int’l. Symposium on Computer Architecture (ISCA), January 19–21, 1976, Tampa, Fla., 155–158.
Sugumar, R. A., and S. G. Abraham [1993]. “Efficient simulation of caches under optimal replacement with applications to miss characterization,” Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, May 17–21, 1993, Santa Clara, Calif., 24–35.
Tarjan, D., S. Thoziyoor, and N. Jouppi [2006]. CACTI 4.0. Technical Report HPL-2006-86, HP Laboratories.
Torrellas, J., A. Gupta, and J. Hennessy [1992]. “Characterizing the caching and synchronization performance of a multiprocessor operating system,” Proc. Fifth Int’l. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 12–15, 1992, Boston (SIGPLAN Notices 27:9 (September), 162–174).
Van Vleck, T. [2005]. “The IBM 360/67 and CP/CMS,” http://www.multicians. org/thvv/360-67.html.
Wang, W.-H., J.-L. Baer, and H. M. Levy [1989]. “Organization and performance of a two-level virtual-real cache hierarchy,” Proc. 16th Annual Int’l. Symposium on Computer Architecture (ISCA), May 28–June 1, 1989, Jerusalem, 140–148.
Wilkes, M. [1965]. “Slave memories and dynamic storage allocation,” IEEE Trans. Electronic Computers EC-14:2 (April), 270–271.
Wilkes, M. V. [1982]. “Hardware support for memory protection: Capability implementations,” Proc. Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 1–3, 1982, Palo Alto, Calif., 107–116.
Wulf, W. A., R. Levin, and S. P. Harbison [1981]. Hydra/C.mmp: An Experimental Computer System, McGraw-Hill, New York.
公众号专题:
人工智能芯片技术进步
人工智能芯片产业发展
人工智能芯片初创公司
人工智能芯片评测对比
科技巨头的芯片尝试
从学术会议看人工智能芯片
基础芯片技术