Exclusive | Jia Yangqing responds on "Training ImageNet in 1 Hour": watch these 10 technical details, and there is no need for infighting

2017-06-10, the ever-exploring AI100

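Yesterday, a single paper dominated nearly every feed in the deep learning community.

The paper shows that a new method can greatly speed up model training: a large-scale dataset such as ImageNet can be trained in just 1 hour.

AI100 immediately interviewed leading engineers to gauge the industry's reaction. These AI engineers applauded, saying this is something everyone wanted to do but nobody had pulled off, and that the day has finally come. Training that used to take 1 week can now be completed in 1 day. (AI100 also pushed this news to readers yesterday: "Breaking | Facebook makes a big move: training time drops from 1 week to 1 day, and AI engineers cheer that the day has finally come".)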

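Tech companies working on deep learning are not short of money. They are short of time.

With this technique in hand, spending more on machines and equipment is no obstacle. Time, however, matters enormously. Whoever learns to use this technique faster and better will hold the high ground in the future.

So a new round of competition has begun.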

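As soon as yesterday's reports came out, they set off waves of heated discussion.

Today, the media went straight to Zhihu with the question: how should we evaluate Facebook's paper "Training ImageNet in 1 Hour"?

Jia Yangqing, one of the paper's authors, quickly posted his own answer.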

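Responding to media questions about the paper not citing Mu Li's PhD thesis, Jia Yangqing said it was an unintentional oversight and apologized. He wrote: "Let's stay focused on the technology. It has not been easy for Chinese researchers to reach today's level of influence in machine learning, and there is no need for infighting."

AI100 publishes his answer in full for readers eager to learn more.

Below is Jia Yangqing's answer to the question on Zhihu. AI100 has kept the original form and made no edits: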

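On the machine learning side, Kaiming's technical answer on fb is already quite complete (see the English version at the end of this article). On the systems design side, I will try to answer some questions people may be interested in.

The systems analysis by @张昊 (Hao Zhang) is very good, and below I have tried to expand on some of the techniques his post mentions. The theoretical explanation by @廉相如 (Xiangru Lian) is very interesting, and I have forwarded it to our co-authors.

Also, I know some media and commenters have pointed out that we did not cite the PhD thesis of @李沐 (Mu Li). This was an unintentional oversight, and the citation will of course be added in the new arXiv version. On behalf of the other authors, I apologize for the mistake. I hope we can stay focused on the technology. It has not been easy for Chinese researchers to reach today's level of influence in machine learning, and there is no need for infighting. I personally admire the work of both Mu Li and Kaiming. For the similarities and differences between the papers, see Kaiming's reply.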

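(1) fp16. On Pascal today, pure fp16 computation (16-bit IO, 16-bit accumulation) rarely converges unless the model is tuned, because 16-bit accumulation has too few mantissa bits and loses a lot of precision. fp16 computation on Volta can use 32-bit accumulation. So in the paper we kept the most commonly used fp32 computation. NVIDIA will publish a follow-up on fp16 training on Volta; the basic conclusion is that it is simpler than fp16 on Pascal.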
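The accumulation problem in point (1) can be illustrated with a toy sketch. This is not GPU code and not taken from the paper: it emulates half and single precision in plain Python through the `struct` format codes `"e"` and `"f"`, and the helper names, the input value 0.01 and the count of 10000 are all assumptions chosen for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, round_acc):
    """Sum `values`, rounding the running total with `round_acc` after each add."""
    acc = 0.0
    for v in values:
        # The add itself happens in double precision; rounding afterwards
        # emulates an accumulator that is only 16 or 32 bits wide.
        acc = round_acc(acc + v)
    return acc

# Inputs are stored in fp16 in both cases ("16-bit IO"); only the accumulator differs.
values = [to_fp16(0.01)] * 10000

sum_fp16_acc = accumulate(values, to_fp16)  # 16-bit accumulation
sum_fp32_acc = accumulate(values, to_fp32)  # 32-bit accumulation

print("fp16 accumulator:", sum_fp16_acc)
print("fp32 accumulator:", sum_fp32_acc)
```

The exact sum is about 100. With the 16-bit accumulator the running total stops growing once a single increment falls below half the spacing between neighbouring fp16 values, so it ends far short of 100, while the 32-bit accumulator stays close to it.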

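(2) 50G Ethernet. This is indeed worth discussing. Ethernet latency is not as low as Infiniband's, but Ethernet is also much cheaper, so calling it "commodity ethernet" is reasonable from a data-center point of view. Of course, many real networks, for example in labs or on AWS, have far less bandwidth. I discussed this with Pieter Noordhuis this morning: if anyone wants to reproduce the results on a slower network, or to explore how network speed affects convergence, please get in touch with us. We are very interested in this too, but network designs vary so much across companies and labs that a joint analysis would be more complete.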

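(3) ResNet50 versus other CNNs. For the vast majority of CNNs, especially the compute-heavy ones, communication time is fairly easy to hide behind computation. AlexNet is an exception: its FC layers are very large, so communication time is hard to hide, which is why Alex Krizhevsky came up with the "one weird trick". VGG's FC layers cause some trouble as well, while Inception has essentially no communication problem.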

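(4) Async vs sync. Following on from the previous point, the vast majority of CNN training in current products does not need async algorithms. From a systems perspective, although some Zhihu users have pointed out that sync is a special case of async, implementing sync on top of async costs some efficiency. For example, if sync SGD is implemented with a general parameter server, the star topology of the PS means that network congestion grows linearly with the number of workers. Sharding can address this, but it wastes resources. Another solution is to have every worker double as a sharded parameter server, but that turns out to be exactly a scatter-gather allreduce. In addition, sync SGD is often mathematically more stable than async; see Google's "Rethinking synchronized sgd" paper.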
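The observation that "every worker doubles as a sharded parameter server" amounts to a scatter-gather allreduce can be sketched as a single-process simulation. This is an illustrative model only, not Gloo or the paper's implementation; the function names and the simple traffic counts (which assume the gradient length divides evenly among workers) are assumptions.

```python
def scatter_gather_allreduce(worker_grads):
    """Simulate allreduce(sum) where worker w acts as the server for shard w.

    worker_grads: list of equal-length gradient lists, one per worker.
    Returns the list of summed gradients that each worker ends up holding.
    """
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Worker w owns indices [bounds[w], bounds[w + 1]).
    bounds = [dim * w // n for w in range(n + 1)]

    # Scatter-reduce: every worker sends each shard to its owner, who sums it.
    reduced_shards = []
    for owner in range(n):
        lo, hi = bounds[owner], bounds[owner + 1]
        reduced_shards.append([sum(g[i] for g in worker_grads) for i in range(lo, hi)])

    # All-gather: every owner sends its reduced shard back to all workers.
    full = [v for shard in reduced_shards for v in shard]
    return [list(full) for _ in range(n)]

def central_ps_inbound(n, dim):
    """Values a single unsharded parameter server receives per step."""
    return n * dim

def sharded_worker_inbound(n, dim):
    """Values each worker receives per step in the scatter-gather scheme."""
    # (n - 1) peers each send dim / n values in both the scatter and gather phase.
    return 2 * (n - 1) * dim / n

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = scatter_gather_allreduce(grads)
print(result[0])
print(central_ps_inbound(8, 1000), sharded_worker_inbound(8, 1000))
```

In this model the inbound traffic at a central server grows linearly with the number of workers, while the per-worker traffic in the sharded scheme stays below twice the gradient size regardless of the worker count.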

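(5) MPI. Readers familiar with HPC may have noticed that the paper mentions traditional MPI algorithms such as double buffer ring reduction. Indeed, in the context of sync SGD, an MPI-like API definition works very well, and traditional HPC has a lot of research on these algorithms. As an aside, Facebook does not use MPI directly because most MPI implementations are too heavyweight and we need almost nothing beyond Broadcast and Allreduce, so we designed the lighter-weight Gloo.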

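(6) More on parameter servers. In the async vs sync discussion above I said that a lot of CNN training does not need a ps. That does not mean a ps is useless. For example, when we train joint sparse + dense models, we mix ps and sync SGD, and sometimes even layer hogwild on top. So the communication method can be chosen according to each network's ratio of computation to communication.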

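(7) Requirements on the framework. The basic requirements are: (a) sufficiently optimized data IO and prefetching, (b) a computation-graph-based execution engine that guarantees concurrent communication and computation, and (c) sufficiently low framework overhead. The algorithm can be reproduced on the vast majority of modern DL frameworks. Early frameworks such as Caffe or Torch would find it harder.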
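Requirement (a) can be pictured with a minimal prefetching sketch: a background thread loads batches ahead of the consumer through a bounded queue. This is a generic illustration, not how Caffe2 or any particular framework implements its data pipeline; `prefetch`, `depth` and the sentinel object are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` items ahead in a background thread.

    Sketch only: if the producer raises, the consumer would block forever,
    so real code would also need to forward exceptions.
    """
    buf = queue.Queue(maxsize=depth)

    def producer():
        for batch in batches:
            buf.put(batch)   # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()     # blocks until the next batch is ready
        if item is _DONE:
            return
        yield item

# Usage: the loop body (training) overlaps with loading of the following batches.
loaded = [batch for batch in prefetch(iter(range(5)), depth=2)]
print(loaded)
```

The bounded queue keeps memory use fixed while letting data loading run concurrently with whatever the consumer does with each batch.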

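(8) Caffe2. One reason we rebuilt Caffe into Caffe2 was to support research projects like this at the systems level. Another was to let models move seamlessly to different application platforms. So using Caffe2 was a natural choice. However, we state explicitly in the paper that our algorithm should be reproducible in any framework that meets the conditions in (7).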

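(9) CPU. During training, the CPU mainly handles data preprocessing and acts as the GPU scheduler. A CPU with enough cores is indeed important for the computation. If a machine has plenty of GPUs but not enough CPU compute, other approaches are needed to balance the load between CPU and GPU, such as preprocessing the data on other machines or moving preprocessing entirely onto the GPU.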

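(10) Cost. Non-public figures cannot be estimated here, and there is also the cost of research iteration. But once the algorithm exists, training with it is not more expensive than with traditional algorithms, and results arrive sooner, which is one reason strong scaling matters so much. At aws p2.8xlarge prices, one training run costs about 230 US dollars, an application cost that a typical startup can accept.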
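One way to arrive at a figure near 230 US dollars is the back-of-the-envelope calculation below. The inputs are assumptions rather than numbers given in the answer: 256 GPUs and a 1-hour run as reported for the paper's headline result, 8 GPUs per p2.8xlarge instance, and an on-demand price of roughly 7.20 US dollars per instance-hour at the time. It also ignores that the GPUs in a p2 instance are slower than those used in the paper.

```python
import math

def training_cost(num_gpus, gpus_per_instance, hours, price_per_instance_hour):
    """Rough cloud cost: whole instances needed, times hours, times hourly price."""
    instances = math.ceil(num_gpus / gpus_per_instance)
    return instances * hours * price_per_instance_hour

# Assumed inputs (not stated in the answer): 256 GPUs, 8 GPUs per instance,
# a 1-hour run, and about 7.20 USD per instance-hour.
cost = training_cost(num_gpus=256, gpus_per_instance=8, hours=1.0,
                     price_per_instance_hour=7.20)
print(round(cost, 2))
```

Under these assumptions the run needs 32 instances for one hour, which lands close to the quoted figure.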

That is all I can think of for now. I will add more if needed.

Below is Kaiming's comment on Facebook about the paper's impact:

-- We had an internal debate on whether we should publish a paper describing how we can achieve the results. I agree there is not so much new, because these have been what I and my colleagues had been doing in the past few years, including how we developed ResNets and Faster R-CNN. After discussing with many people including current/former scientists/engineers from Microsoft, Facebook, Google, Baidu, and universities, we realized that not all the details are widely known by practitioners, engineers, or researchers, and in general there had been limited success at this scale. So finally I was convinced that we should write this white paper: we hope it may be a helpful manual for people who might miss something in their systems.

In my experience, “linear scaling lr” is so surprisingly effective that it helped a lot for us to prototype and develop computer vision algorithm in the past few years, including ResNets, Faster R-CNN, and Mask R-CNN, on the old days when we didn’t have enough 8-GPU or even 4-GPU machines or when we need to migrate baselines. By “surprisingly effective” I mean that we don’t need to re-select any hyper-parameters (in contrast to picking individual lr and schedules as people usually do). This linear scaling lr is not a new thing: in our paper (Sec. 2.1 “Discussion”, p3) we cited Leon Bottou et al.’s survey paper[4] which gave theories behind linear scaling lr (and also warmup). Through personal communications with Leon we realized that this theory is so ancient and so natural that we are even not able to trace back who did it first. I hope to recommend this linear scaling lr “rule” (or theory) to broader audience as I benefited a lot from it in the past few years.

On the contrary, I had little successful experience using the “sqrt” rule: one experimental results can be found in Table 2(a) in our paper. There are discussions on the theoretical correctness of the “linear scaling” rule vs. the “sqrt” rule; but in this paper we share our rich empirical results (covering ImageNet/COCO, pre-training/fine-tuning, classification/detection/segmentation) to the readers and give strong support to the linear rule, as I have experienced in the past few years.

You mentioned “not stable” results using the linear scaling lr rule. This is consistent with our motivation of presenting the warmup techniques, which may find its theoretical support from [4].

I also benefited a lot from the warmup strategy in my research experience in the past few years, which helped me to scale out simpler and made my life much easier. We hope this can help some (if not all) researchers and engineers.
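The linear scaling rule, the "sqrt" alternative, and warmup discussed in the comment can be sketched as follows. The function names and the concrete numbers (a base learning rate of 0.1 for a batch of 256, a batch of 8192, and 100 warmup steps) are illustrative assumptions, and the linear ramp is one simple form of warmup rather than necessarily the exact schedule the authors used.

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Scale the learning rate when the minibatch grows from base_batch to batch."""
    k = batch / base_batch
    if rule == "linear":
        return base_lr * k          # linear scaling rule
    if rule == "sqrt":
        return base_lr * k ** 0.5   # the "sqrt" rule, for comparison
    raise ValueError(f"unknown rule: {rule}")

def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# Illustrative values only.
base_lr, base_batch, batch = 0.1, 256, 8192
target = scaled_lr(base_lr, base_batch, batch)            # linear rule
sqrt_target = scaled_lr(base_lr, base_batch, batch, "sqrt")

schedule = [warmup_lr(s, 100, base_lr, target) for s in (0, 50, 100, 200)]
print(target, sqrt_target, schedule)
```

Starting from the small-batch learning rate and ramping up to the scaled one is meant to avoid the instability that a large learning rate can cause in the first iterations.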

Original post: https://www.zhihu.com/question/60874090/answer/181672076?from=groupmessage
