Why Large-Model Distributed Training Needs Better Efficiency (a 10,000-character long read, worth bookmarking)
The following article is from 小石头的码疯窝 Author burness
01
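Model accuracy keeps improving
As an early attempt, ELMo\cite{sarzynska2021detecting} was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM network and then fine-tuning it for specific downstream tasks. Later, BERT was proposed: built on the highly parallelizable Transformer architecture with self-attention\cite{vaswani2017attention}, it pre-trains a bidirectional language model on large-scale unlabeled corpora using specially designed pre-training tasks. These pre-trained context-aware word representations are very effective as general-purpose semantic features and substantially raised the accuracy bar for natural language processing tasks. This line of work inspired a large body of follow-up research and established the "pre-train and fine-tune" learning paradigm. Following this paradigm, many studies on pre-trained language models have introduced different architectures (for example GPT-2\cite{radford2019language} and BART\cite{lewis2019bart}) or improved pre-training strategies\cite{liu2019roberta}\cite{sanh2021multitask}\cite{wang2022language}. Under this paradigm, a pre-trained language model (PLM) usually has to be fine-tuned to adapt to each downstream task.
Researchers found that scaling up pre-training (for example, increasing the model parameter count or the pre-training corpus size) usually improves model capability on downstream tasks, that is, it follows a scaling law\cite{kaplan2020scaling}. Many researchers have explored the accuracy ceiling by training ever larger pre-trained models (for example, the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although the underlying architecture is unchanged, these larger models show surprising capabilities compared with smaller pre-trained models (for example, the 330M-parameter BERT and the 1.5B-parameter GPT-2), and are reported to exhibit emergent abilities\cite{wei2022emergent} that solve a range of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do so well. Researchers therefore coined the term "Large Language Model" (LLM) for these large pre-trained models\cite{shanahan2022talking}.
A typical application of large language models is ChatGPT\cite{chatgpt}, an intelligent dialogue system built on the GPT series of large language models. After its launch in November 2022 it gained 1 million users in just 5 days and passed 100 million users 2 months later.
ChatGPT also made researchers realize that large language models are still far from their limit. Although they show remarkable capability across a wide range of natural language tasks, they sometimes behave in unexpected ways, such as fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, or biased output\cite{ouyang2022training}\cite{kenton2021alignment}. The language modeling objective pre-trains model parameters through word prediction and does not take human values or preferences into account. To avoid these unexpected behaviors, human alignment\cite{sanh2021multitask} has been proposed so that the behavior of large language models matches human expectations. Unlike the original pre-training and fine-tuning (for example, instruction tuning), however, alignment has to consider very different criteria (for example, helpfulness, honesty, and harmlessness). Studies show that alignment can hurt the general capability of LLMs to some extent, which the literature calls the "alignment tax"\cite{askell2021general}; given social factors such as safety and ethics, paying this tax is nonetheless the more suitable choice for real-world deployment. OpenAI trained ChatGPT with reinforcement learning from human feedback (RLHF), using the same method as InstructGPT\cite{ouyang2022training}, which effectively aligns the model with real human behavior and makes interaction with ChatGPT feel more "human". The training process mainly consists of the following steps: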
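Pre-train a large language model on a massive corpus;
Perform supervised fine-tuning (SFT) of the large language model obtained in step 1 on a labeled SFT dataset;
Collect human-labeled comparison data and train a reward model (RM);
Use the RM as the optimization target for reinforcement learning and fine-tune the SFT model with the PPO algorithm;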
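The reward-model step above can be illustrated with the pairwise ranking loss described in the InstructGPT paper: for each human comparison, the loss is -log(sigmoid(r_chosen - r_rejected)), so the RM is pushed to score the preferred answer higher. The sketch below is a minimal illustration in plain Python; the function names and the sample scores are hypothetical, and a real RM would compute the scores with a neural network and backpropagate through this loss.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) for one comparison pair."""
    diff = score_chosen - score_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch on the sign to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def batch_rm_loss(pairs):
    """Average the pairwise loss over (chosen_score, rejected_score) tuples."""
    return sum(pairwise_rm_loss(c, r) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical RM scores for three human-labeled comparisons.
    example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(f"mean pairwise loss: {batch_rm_loss(example_pairs):.4f}")
```

The loss is small when the RM already ranks the human-preferred answer higher and grows roughly linearly when the ranking is inverted, which is what makes it usable as a training signal for the RM before the PPO step.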
02
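Higher model accuracy drives higher compute demand
Using ultra-large pre-trained models and aligning them with human needs in practice consumes enormous compute. As early as 2012, AlexNet\cite{krizhevsky2012imagenet} ran into the performance limit of a single GPU and used two GTX 580 3GB GPUs to cope with the excessive parameters and computation of its FC layers. Today, most influential AI models are trained on multiple GPUs, especially the recently popular large language models, whose parameter counts range from billions to more than a hundred billion\cite{sevilla2022compute}, with steadily improving results on the relevant benchmarks.
Another factor behind higher accuracy is the use of larger datasets in training. Datasets have grown from image classification and object detection sets such as ImageNet\cite{deng2009imagenet} and MSCOCO\cite{lin2014microsoft}, which led the deep learning wave and especially computer vision, to today's billion-scale image-text pair datasets such as Laion-5B\cite{schuhmann2022laion} for cross-modal text-to-image generation. In natural language processing, pre-training corpora such as Common Crawl\cite{common_crawl}, WebText2\cite{gao2020pile}, BookCorpus\cite{Zhu_2015_ICCV}, and Wikipedia\cite{wikidump} reach trillions of tokens. As larger models are trained on larger corpora, distributed training demands more and more compute. The figure below shows the compute consumed by milestone machine learning systems from 1952 to 2022\cite{sevilla2022compute}:
Especially after 2010, thanks to the development of the Transformer\cite{vaswani2017attention}, distributed machine learning systems, and self-supervised learning\cite{liu2021self}, natural language processing gained access to massive low-cost corpora of up to trillions of tokens\cite{touvron2023llama}, obtained by randomly masking large amounts of unlabeled text. Combined with the computational efficiency of the Transformer and the scalability of Transformer-based architectures such as BERT\cite{devlin2018bert}, GPT\cite{radford2018improving}, LLaMA\cite{touvron2023llama}, and GLM\cite{du2022glm}, this has made the compute demand of milestone training systems grow very rapidly, as the figure above shows.
Meta's open-source LLAMA 2\cite{touvron2023llama} was pre-trained on Meta's Research Super Cluster (RSC) and on internal production clusters. Both use NVIDIA A100 GPUs; the key difference is that the former uses NVIDIA Quantum InfiniBand, while the latter uses a RoCE (RDMA over Converged Ethernet) solution based on commodity Ethernet switches. The GPU hours and carbon emissions of a single training run are listed in the table below: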
03
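Utilization of compute accelerators is generally low
As the table below shows, across model configurations of different parameter scales, Deepak Narayanan et al.\cite{narayanan2021efficient} achieve an effective per-GPU throughput of 135 to 163 TFLOPS, a utilization of 43% to 52%. Intuitively, relative to the theoretical 312 TFLOPS FP16 peak of the NVIDIA A100, there is still considerable headroom. Similarly, in LLAMA 2\cite{touvron2023llama} the 7B to 70B models are trained on 2000B pre-training tokens, and their theoretical compute requirement (that is, excluding the extra compute introduced by tricks such as recomputation) can be estimated with the following formula:
The actual GPU hours used in training are given in the table above, from which the per-GPU effective throughput and utilization of LLAMA 2 at each parameter scale can easily be computed, as in the table below: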
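The estimate described above can be sketched as follows. The article's own formula is not reproduced in this text, so the sketch assumes the widely used approximation from the scaling-law literature, training FLOPs ≈ 6 × parameters × tokens, which ignores recomputation. The 70B GPU-hour figure (1720320), the 2000B token count, and the 312 TFLOPS A100 FP16 peak are the values quoted in this article; the function names are illustrative.

```python
A100_FP16_PEAK_FLOPS = 312e12  # theoretical FP16 peak quoted in the text


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs as 6 * N * D (assumed formula, no recompute)."""
    return 6.0 * num_params * num_tokens


def effective_flops_per_gpu(num_params: float, num_tokens: float, gpu_hours: float) -> float:
    """Average useful FLOPs per second delivered by each GPU over the whole run."""
    return training_flops(num_params, num_tokens) / (gpu_hours * 3600.0)


def utilization(num_params: float, num_tokens: float, gpu_hours: float,
                peak_flops: float = A100_FP16_PEAK_FLOPS) -> float:
    """Fraction of the theoretical peak that the run actually achieved."""
    return effective_flops_per_gpu(num_params, num_tokens, gpu_hours) / peak_flops


if __name__ == "__main__":
    # LLAMA 2 70B: 70e9 parameters, 2000e9 tokens, 1720320 GPU hours (from the article).
    eff = effective_flops_per_gpu(70e9, 2000e9, 1720320)
    util = utilization(70e9, 2000e9, 1720320)
    print(f"effective throughput: {eff / 1e12:.1f} TFLOPS per GPU")
    print(f"utilization vs A100 FP16 peak: {util:.1%}")
```

Under these assumptions the 70B run lands at roughly 136 TFLOPS per GPU, or a little over 43% of peak, which is in the same range as the Megatron-LM figures quoted above.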
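The main causes of low per-GPU utilization include the following:
Parallel model computation still contains many serial parts; by Amdahl's law\cite{Amdahl_law}, the overall degree of parallelism is inherently lower than that of typical tasks without data dependencies;
The memory and compute of a single GPU are limited, so the model cannot be loaded entirely onto one card or one GPU host. Parallelism has to be introduced, splitting the model across nodes and GPUs, which adds communication; this communication is usually synchronous, introduces considerable waiting time, and is hard to hide;
At the compute level: too many small kernels, the compute-to-memory-access ratio of each kernel, different memory layouts, and how memory-bound and compute-bound operators are scheduled together;
A single pre-training run is so expensive that it is hard to run conventional hyperparameter-tuning-style experiments on a full run to thoroughly explore distributed strategies, kernel fusion strategies, or even hardware configurations in order to reach peak performance;
Assuming a monthly failure rate of 8% for a single A100, the daily per-GPU failure rate is about 0.27%. For a 512-GPU training job, the probability that training is interrupted by a GPU failure on any given day is 75%, and for a 1024-GPU job it is 93.7%. Such a high job failure rate severely limits the effective per-GPU throughput;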
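Two items in the list above are easy to quantify. The sketch below reproduces the failure-probability arithmetic, assuming GPU failures are independent and using the 0.27% daily rate from the text, and adds the textbook form of Amdahl's law. The 95% parallel fraction in the demo is a made-up value for illustration only, and the function names are hypothetical.

```python
def daily_job_failure_probability(num_gpus: int, daily_gpu_failure_rate: float) -> float:
    """Probability that at least one of num_gpus fails in a day (independence assumed)."""
    return 1.0 - (1.0 - daily_gpu_failure_rate) ** num_gpus


def amdahl_speedup(parallel_fraction: float, num_workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_workers)


if __name__ == "__main__":
    daily_rate = 0.0027  # about 8% per month divided by 30 days, as in the text
    for gpus in (512, 1024):
        p = daily_job_failure_probability(gpus, daily_rate)
        print(f"{gpus} GPUs: daily interruption probability {p:.1%}")

    # Hypothetical workload in which 95% of the work parallelizes perfectly.
    for workers in (8, 512, 1024):
        print(f"{workers} workers: speedup {amdahl_speedup(0.95, workers):.1f}x")
```

The second loop shows why the serial fraction matters: with a fixed serial share, adding workers yields rapidly diminishing returns, so the speedup saturates near 1 / (1 - p).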
04
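Efficiency gains bring many benefits
Improving the efficiency of large-scale distributed machine learning matters for cost, for carbon emissions, and for model innovation and standardization:
Efficiency gains reduce cost
Improving the efficiency of large-scale distributed machine learning saves time and resource costs. Take a single training run of Meta LLAMA 2 70B as an example: it uses 1720320 GPU hours. Since building a data center is too expensive, assume cloud GPU machines are rented at 20 yuan per hour per A100 (an average, because prices have fluctuated recently); the total cost is then about 34.4 million yuan. From an efficiency standpoint, each 1 percentage point of improvement cuts the cost of a single run by roughly 340,000 yuan.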
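The cost figures above follow from simple arithmetic, sketched here with the values quoted in the text (1720320 GPU hours, 20 yuan per A100 hour). Treating a 1% efficiency gain as a 1% cost reduction is the article's simplification; the function names are illustrative.

```python
def training_cost_yuan(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Total rental cost of one training run."""
    return gpu_hours * price_per_gpu_hour


def saving_per_percent_yuan(total_cost: float) -> float:
    """Cost saved for each 1% reduction in GPU hours."""
    return total_cost * 0.01


if __name__ == "__main__":
    cost = training_cost_yuan(1720320, 20.0)
    print(f"single-run cost: {cost:,.0f} yuan")
    print(f"saving per 1%: {saving_per_percent_yuan(cost):,.0f} yuan")
```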
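Efficiency gains reduce carbon emissions
Take a single training run of Meta LLAMA 2 70B as an example: its carbon emissions are about 291 tons. At annual per-capita emissions of 16.1 tons in the United States and 6.8 tons in China, that equals the annual emissions of about 18.1 Americans or 42.8 Chinese residents. At 12 kg of carbon fixed per tree per year, about 24,000 additional trees would have to be planted to absorb the emissions of a single Meta LLAMA 2 70B training run within one year.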
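The equivalences above can be checked with a few lines of arithmetic. All inputs (291 tons, 16.1 and 6.8 tons per person per year, 12 kg per tree per year) are the values quoted in the text; the function names are illustrative.

```python
def person_years(emission_tons: float, per_capita_tons_per_year: float) -> float:
    """How many people's annual emissions the run is equivalent to."""
    return emission_tons / per_capita_tons_per_year


def trees_needed(emission_tons: float, kg_fixed_per_tree_per_year: float) -> float:
    """Trees required to absorb the emissions within one year."""
    return emission_tons * 1000.0 / kg_fixed_per_tree_per_year


if __name__ == "__main__":
    run_emission_tons = 291.0
    print(f"US residents: {person_years(run_emission_tons, 16.1):.1f}")
    print(f"China residents: {person_years(run_emission_tons, 6.8):.1f}")
    print(f"trees: {trees_needed(run_emission_tons, 12.0):,.0f}")
```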
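Efficiency gains support model innovation and standardization
More efficient large-scale distributed training gives researchers and developers more powerful and more efficient tools and platforms, and speeds up model innovation and progress in related research fields. A low-cost, high-efficiency technical architecture can accelerate the iteration of AI products and strengthen a country's voice in the field of artificial intelligence.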
05
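Efficiency improvement spans multiple technical disciplines
For researchers, improving the efficiency of large-scale distributed machine learning is a direction that spans multiple technical disciplines and leaves plenty of room for exploration:
Efficiency drives the evolution of AI algorithms
The Transformer architecture\cite{vaswani2017attention} has become the most widely used architecture in applications such as natural language processing and image classification. As Transformer-based models grow larger and deeper, their compute and memory requirements become increasingly challenging, because the time and space complexity of self-attention is quadratic in sequence length.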
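The quadratic growth can be made concrete with a rough cost model for standard (non-fused) self-attention: the score matrix has seq_len × seq_len entries per head, and both the QK^T product and the scores-times-V product cost about 2 × seq_len² × head_dim FLOPs per head. The sketch below is an estimate under those assumptions; the head count and head dimension in the demo are hypothetical, and softmax and projection costs are ignored.

```python
def attention_cost(seq_len: int, head_dim: int, num_heads: int = 1):
    """Return (score_matrix_elements, matmul_flops) for one self-attention layer."""
    # One seq_len x seq_len score matrix per head.
    score_elements = num_heads * seq_len * seq_len
    # QK^T: seq_len * seq_len dot products of length head_dim, 2 FLOPs per multiply-add.
    qk_flops = 2 * num_heads * seq_len * seq_len * head_dim
    # scores @ V: the same amount of work again.
    av_flops = 2 * num_heads * seq_len * seq_len * head_dim
    return score_elements, qk_flops + av_flops


if __name__ == "__main__":
    # Hypothetical configuration: 32 heads of dimension 128.
    for n in (2048, 4096, 8192):
        elements, flops = attention_cost(n, head_dim=128, num_heads=32)
        print(f"seq_len={n}: score elements={elements:.3e}, matmul FLOPs={flops:.3e}")
```

Each doubling of the sequence length multiplies both numbers by four, which is the pressure that motivates the efficient-attention work listed in the references.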
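Efficiency improvement is the ultimate goal of distributed training
AI compilers
The distributed training problem is abstracted as constructing execution plans for intra-operator and inter-operator parallelism; with well-designed evaluation criteria and tractable optimization algorithms, a near-optimal execution plan is derived. As the figure below shows, Alpa is a compiler system that implements distributed deep learning on GPU clusters on top of a unified IR. Alpa has the following features: 1. compilation passes that use a hierarchical optimization algorithm to generate execution plans; 2. a new runtime architecture that orchestrates inter-operator parallelism between compute devices such as GPUs and network communication; 3. a set of system optimizations that improve performance;
Distributed system monitoring and automated operations
Fault tolerance: single-node failover without restarting the whole job;
Auto-scaling: automatically scale resources up or down at the node level and at the CPU/memory level;
Dynamic data sharding: dispatch training data to each worker dynamically rather than splitting it evenly, giving workers with different processing speeds different training data plans;
Automatic resource optimization: automatically tune job resources to improve overall training performance and resource utilization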
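The dynamic data sharding idea above can be illustrated with a small simulation. This is not the API of any real system; it is a hypothetical sketch in which each worker pulls the next shard as soon as it becomes free, compared against an even static split. Worker speeds and shard counts are made-up values.

```python
import heapq


def dynamic_shard_assignment(num_shards, seconds_per_shard):
    """Simulate pull-based sharding: the earliest-free worker takes the next shard."""
    # Heap entries are (time_when_free, worker_id); ties go to the lower worker id.
    free_heap = [(0.0, w) for w in range(len(seconds_per_shard))]
    heapq.heapify(free_heap)
    assigned = {w: [] for w in range(len(seconds_per_shard))}
    finish_time = 0.0
    for shard in range(num_shards):
        free_at, worker = heapq.heappop(free_heap)
        done_at = free_at + seconds_per_shard[worker]
        assigned[worker].append(shard)
        finish_time = max(finish_time, done_at)
        heapq.heappush(free_heap, (done_at, worker))
    return assigned, finish_time


def static_finish_time(num_shards, seconds_per_shard):
    """Finish time when shards are split evenly up front, regardless of speed."""
    n = len(seconds_per_shard)
    counts = [num_shards // n + (1 if w < num_shards % n else 0) for w in range(n)]
    return max(c * s for c, s in zip(counts, seconds_per_shard))


if __name__ == "__main__":
    speeds = [1.0, 2.0]  # hypothetical: worker 1 is twice as slow as worker 0
    assigned, dynamic_time = dynamic_shard_assignment(12, speeds)
    print({w: len(s) for w, s in assigned.items()}, "dynamic finish:", dynamic_time)
    print("static finish:", static_finish_time(12, speeds))
```

In this toy setup the faster worker ends up with twice as many shards and the job finishes sooner than with an even split, because no worker sits idle waiting for a straggler.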
06
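AI ethics and safety
Enhance human well-being. Put people first, follow the common values of humanity, respect human rights and fundamental human interests, and observe national or regional ethics. Give priority to the public interest, promote harmony and friendliness between humans and machines, improve people's livelihoods, strengthen the sense of gain and happiness, advance sustainable economic, social, and ecological development, and jointly build a community with a shared future for mankind;
Promote fairness and justice. Uphold inclusiveness and universal benefit, protect the legitimate rights and interests of all stakeholders, ensure that the benefits of AI are shared fairly across society, and promote social fairness, justice, and equal opportunity. When providing AI products and services, fully respect and help vulnerable and special groups, and offer alternative solutions where needed;
Protect privacy and security. Fully respect the rights to be informed about and to consent to the use of personal information, handle personal information according to the principles of lawfulness, legitimacy, necessity, and good faith, safeguard personal privacy and data security, do not harm individuals' lawful data rights, do not illegally collect or use personal information through theft, tampering, or leakage, and do not infringe on personal privacy;
Ensure controllability and trustworthiness. Guarantee that humans retain full autonomy in decision-making, with the right to choose whether to accept AI services, the right to exit an interaction with AI at any time, and the right to suspend an AI system at any time, so that AI always remains under human control;
Strengthen accountability. Insist that humans are the ultimately responsible party, clarify the responsibilities of stakeholders, raise awareness of responsibility across the board, exercise self-examination and self-discipline at every stage of the AI life cycle, establish AI accountability mechanisms, and neither evade responsibility reviews nor shirk due responsibility;
Improve ethical literacy. Actively learn and spread knowledge of AI ethics, view ethical issues objectively, and neither underestimate nor exaggerate ethical risks. Take the initiative to start or join discussions of AI ethics, advance AI ethics governance in practice, and improve the ability to respond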
07
References
ant team. Dlrover project.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
National New Generation Artificial Intelligence Governance Professional Committee. ai specification.
common crawl group. common crawl dataset.
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. Dapple: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, and Yang You. Parallel training of pre-trained models via chunk-based dynamic memory management. IEEE Transactions on Parallel and Distributed Systems, 34(1):304–315, 2022.
Wikimedia Foundation. Wikimedia downloads.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Can Karakus, Rahul Huilgol, Fei Wu, Anirudh Subramanian, Cade Daniel, Derya Cavdar, Teng Xu, Haohan Chen, Arash Rahnama, and Luis Quintela. Amazon sagemaker model parallelism: A general and flexible framework for large model training. arXiv preprint arXiv:2111.05972, 2021.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021.
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, and Yuxiong He. 1-bit lamb: communication efficient large-scale large-batch training with lamb’s convergence speed. In 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 272–281. IEEE, 2022.
Shigang Li and Torsten Hoefler. Chimera: efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2021.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering, 35(1):857–876, 2021.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
openai team. chatgpt blog.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Jay H Park, Gyeongchan Yun, M Yi Chang, Nguyen T Nguyen, Seungmin Lee, Jaesik Choi, Sam H Noh, and Young-ri Choi. HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 307–321, 2020.
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14, 2021.
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
Justyna Sarzynska-Wawer, Aleksander Wawer, Aleksandra Pawlak, Julia Szymanowska, Izabela Stefaniak, Michal Jarkiewicz, and Lukasz Okruszek. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research, 304:114135, 2021.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2022.
Murray Shanahan. Talking about large language models. arXiv preprint arXiv:2212.03551, 2022.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, and Yuxiong He. 1-bit adam: Communication efficient large-scale training with adam’s convergence speed. In International Conference on Machine Learning, pages 10118–10129. PMLR, 2021.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pre-training objective works best for zero-shot generalization? In International Conference on Machine Learning, pages 22964–22984. PMLR, 2022.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
wiki. Amdahl’s law wiki page.
Qifan Xu and Yang You. An efficient 2d method for training super-large deep learning models. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 222–232. IEEE, 2023.
Xiangyu Ye, Zhiquan Lai, Shengwei Li, Lei Cai, Ding Sun, Linbo Qiao, and Dongsheng Li. Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large dnn training. In Proceedings of the 50th International Conference on Parallel Processing, pages 1–10, 2021.
Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter-and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559–578, 2022.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015.