
ML + System = ?

唐杉 StarryHeavensAbove 2023-01-28

I previously posted a figure showing the virtuous cycle between chips and machine learning. Recently, with the first SysML Conference being held and Google's attempts to apply machine learning to computer system design, we can see the interaction between ML and systems becoming ever closer. In the future, ML should form an even broader virtuous cycle with the entire computer system, not just chips but every aspect of hardware and software.

•••

The SysML conference was initiated by Google, Amazon, and Facebook, and this year is its first edition. Jeff Dean is one of the main organizers, and the core of the keynote he gave at the conference was also the combination of ML and systems. Most of the early material overlaps with his previous talks, so here we focus on what is new; let us start with his conclusions.

This is a bit like the virtuous cycle between chips and ML described above. First, specialized ML hardware is still in its "infancy"; as faster systems keep appearing and are deployed more widely, we can expect breakthroughs in more fields. At the same time, bringing learning mechanisms into the core of computing systems allows those systems to be better optimized. The first part has already been discussed at length, so here we focus on the second part: introducing "Learning" into the optimization of computing systems.

On this point, Jeff Dean is decidedly optimistic: he believes that "anywhere we use heuristic techniques to make a decision is a good candidate for applying machine learning."

A heuristic technique (/hjʊəˈrɪstɪk/; Ancient Greek: εὑρίσκω, "find" or "discover"), often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical method not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. Heuristics can be mental shortcuts that ease the cognitive load of making a decision. - Wikipedia

There are many practical examples, touching virtually every part of a computer system. The examples he gave include compilers, network optimization, operating system design, job scheduling systems, and even ASIC design. For these applications to succeed, two things are key:

1. a metric to optimize that can be expressed numerically;
2. a way to obtain the environment, or the training and test data, cheaply and accurately.
The first point stresses finding a metric that can be expressed numerically; for reinforcement learning this means a clear, accurate reward. The second point, for reinforcement learning, is whether an accurate environment can be obtained, and for supervised learning, whether training and test data can be obtained conveniently. If it is not obvious why these two points decide whether RL is even feasible, see the widely shared essay "推特爆款:谷歌大脑工程师的深度强化学习劝退文" (a Google Brain engineer's cautionary piece on deep reinforcement learning). The good news is that for optimizing computing systems, both requirements seem relatively easy to satisfy. To optimize device placement, for instance, runtime is a very clear reward, and that runtime can be obtained simply by running the computation on a real system. This is also why that essay singles out Google's device placement work as a relative success: "I know there's some neat work optimizing device placement for large Tensorflow graphs (Mirhoseini et al, ICML 2017)."
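To make "runtime as reward" concrete, here is a minimal Python sketch of the measurement side. It is my own illustration, not code from the keynote or the papers, and run_workload and placement are hypothetical placeholders for whatever interface the target system actually exposes.

```python
import time

def measure_runtime(run_workload, placement, n_trials=5):
    """Time the workload under a candidate placement several times and
    return the median wall-clock time, which is robust to noise."""
    times = []
    for _ in range(n_trials):
        start = time.perf_counter()
        run_workload(placement)  # execute the computation with this placement
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

def reward(run_workload, placement):
    # Shorter runtime means larger reward; this sign flip is all an
    # RL agent needs in order to prefer faster placements.
    return -measure_runtime(run_workload, placement)
```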

So far, Google has made several concrete attempts in this direction, with results reported in the papers below. Jeff Dean's keynote covered the first two of them.

"The Case for Learned Index Structures" (https://arxiv.org/abs/1712.01208)

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
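The predict-then-search idea behind a learned index fits in a few lines. The paper uses neural nets organized into a recursive model index; the sketch below is only an illustration under simplified assumptions, using a plain linear fit from key to position (an approximation of the key distribution's CDF) plus a bounded local search to correct the prediction. A crude model merely widens the search window; the paper's point is that learned models can make that window small.

```python
import numpy as np

# Toy "learned index": fit a model mapping key -> position in a sorted array,
# then correct the prediction with a bounded local search.
keys = np.sort(np.random.lognormal(mean=10.0, sigma=1.0, size=100_000))
positions = np.arange(len(keys))

# A linear fit stands in for the paper's neural models.
slope, intercept = np.polyfit(keys, positions, deg=1)

# The worst-case prediction error over the data defines the search window.
pred = np.clip(slope * keys + intercept, 0, len(keys) - 1)
max_err = int(np.ceil(np.max(np.abs(pred - positions))))

def lookup(key):
    guess = int(np.clip(slope * key + intercept, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    idx = lo + np.searchsorted(keys[lo:hi], key)  # binary search inside the window
    return idx if idx < len(keys) and keys[idx] == key else None

print(lookup(keys[1234]) == 1234)  # True: the record is found via the model
```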


"Device Placement Optimization with Reinforcement Learning" (https://arxiv.org/abs/1706.04972)

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.
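As a rough sketch of the training loop only: the actual work uses a sequence-to-sequence RNN over the operation list and measures real execution times on real devices, whereas the illustration below, written under simplified assumptions, replaces the policy with a per-op table of logits and the measurement with a made-up runtime function, just to show how negative runtime drives a REINFORCE update.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ops, n_devices = 12, 2  # hypothetical graph size and device count

# Per-op logits over devices stand in for the paper's seq2seq policy.
logits = np.zeros((n_ops, n_devices))

def sample_placement(logits):
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(n_devices, p=p) for p in probs])
    return placement, probs

def runtime(placement):
    # Stand-in for executing the graph and timing it; here balanced
    # placements are pretended to run faster, plus a little noise.
    return 1.0 + abs(np.mean(placement) - 0.5) + 0.01 * rng.standard_normal()

baseline, lr = None, 0.5
for step in range(200):
    placement, probs = sample_placement(logits)
    r = -runtime(placement)  # reward = negative execution time
    baseline = r if baseline is None else 0.9 * baseline + 0.1 * r
    advantage = r - baseline
    # REINFORCE: raise the log-probability of the sampled device for each op.
    grad = -probs
    grad[np.arange(n_ops), placement] += 1.0
    logits += lr * advantage * grad

print("final placement:", sample_placement(logits)[0])
```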


The third one is very recent work; treating address prediction in prefetching as the "next-word or character prediction" problem from natural language processing is quite inspiring.

"Learning Memory Access Patterns" (https://arxiv.org/abs/1803.02329)

The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
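The language-modeling analogy is easy to see in code. The sketch below is illustrative only: the vocabulary of address deltas, the model width, and the synthetic trace are all assumptions rather than the paper's setup. It simply treats successive address deltas as tokens and trains an LSTM to predict the next one, exactly like character-level next-token prediction.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 64, 128  # illustrative sizes, not the paper's

class DeltaPrefetcher(nn.Module):
    """LSTM "language model" over memory-access deltas."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, deltas):            # deltas: (batch, seq_len) token ids
        h, _ = self.lstm(self.embed(deltas))
        return self.head(h)               # next-delta logits at every step

model = DeltaPrefetcher()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A random token sequence stands in for a real trace of address deltas.
trace = torch.randint(0, vocab_size, (8, 65))
inputs, targets = trace[:, :-1], trace[:, 1:]

for _ in range(10):
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# At prefetch time: feed the recent delta history, take the top-k predicted
# deltas, and issue prefetches for current_address + delta.
top_deltas = logits[:, -1].topk(2, dim=-1).indices
```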

•••

Beyond Google's work, this SysML conference featured many other interesting topics. Since it is a conference combining ML and systems, the topics cover a very wide range.

Hardware accelerators:

There was a talk by Vivienne Sze of the Eyeriss team, "Understanding the Limitations of Current Energy-Efficient Design Approaches for Deep Neural Networks". Their "Tutorial on Hardware Architectures for Deep Neural Networks" remains the best survey of deep neural network hardware to date. She also shared a few details about Eyeriss v2.

Other accelerator talks included "Efficient Deep Learning Inference on Edge Devices", "Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks", and "Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective".

Optimizing models for particular systems:

"Towards Optimal Winograd Convolution on Manycores", "Blink: A fast NVLink-based collective communication library", "On Scale-out Deep Learning Training for Cloud and HPC"

System optimization for ML workloads, and ML for system optimization:

"Learning Graph-based Cluster Scheduling Algorithms", "Representation Learning for Resource Usage Prediction", "Better Caching with Machine Learned Advice", "Towards Interactive Curation & Automatic Tuning of ML Pipelines", "SLAQ: Quality-Driven Scheduling for Distributed Machine Learning", "Distributed Shared Memory for Machine Learning", "Learning Network Size While Training with ShrinkNets"

Benchmarks:

"DAWNBench: An End-to-End Deep Learning Benchmark and Competition", "DNN-Train: Benchmarking and Analyzing DNN Training"

•••

Finally, there was an interesting paper called "In-network Neural Networks". Its basic idea is to use the programmable compute resources already present in network devices to run neural network applications. This is similar to the idea of accelerating AI applications directly inside network equipment that I mentioned in my earlier article "AI芯片开年". In addition, at the recent MWC, Nokia announced its 5G base-station chipset "ReefShark", emphasizing its AI compute capability and claiming it would turn operators' networks into the largest AI computing platform. With ever more data needing to be processed locally, nodes of all kinds along the path from edge devices to the cloud are likely to gain increasing AI processing capability, so that data can be handled as close as possible to where it is produced.


From the conference video, you can see that as Jeff Dean was saying "anywhere we use heuristics is a good place to apply machine learning: compilers, networking, operating systems, even chip design...", the possibilities must have been flashing through his mind. As he put it, "There are many opportunities for this".


- END -


The title image is from the internet; copyright belongs to its original owner.
