LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  RO - Robotics  IR - Information Retrieval
1、[IR] Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification
2、[CV] Multi-Realism Image Compression with a Conditional Generator
3、[LG] On Implicit Bias in Overparameterized Bilevel Optimization
4、[CL] Cramming: Training a Language Model on a Single GPU in One Day
5、[LG] LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
[CL] Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
[RO] A System-Level View on Out-of-Distribution Data in Robotics
[CV] Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
[CV] A Generalization of ViT/MLP-Mixer to Graphs
Summary: dense feature memory augmented Transformers for COVID-19 vaccination search classification; multi-realism image compression with a conditional generator; implicit bias in overparameterized bilevel optimization; cramming a language model onto a single GPU in one day; backward chaining for automated reasoning in natural language; composing retrieval and language models for knowledge-intensive NLP; a system-level view on out-of-distribution data in robotics; noise-aware learning from web-crawled image-text data for image captioning; a generalization of ViT/MLP-Mixer to graphs.
J Gupta, Y Tay, C Kamath, V Q. Tran, D Metzler, S Bavadekar, M Sun, E Gabrilovich
[Google Research]
Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification
Key points:
Abstract:
With the devastating outbreak of COVID-19, vaccines are one of the crucial lines of defense against mass infection in this global pandemic. Given the protection they provide, vaccines are becoming mandatory in certain social and professional settings. This paper presents a classification model for detecting COVID-19 vaccination related search queries, a machine learning model that is used to generate search insights for COVID-19 vaccinations. The proposed method combines and leverages advancements from modern state-of-the-art (SOTA) natural language understanding (NLU) techniques, such as pretrained Transformers, with traditional dense features. We propose a novel approach of treating dense features as memory tokens that the model can attend to. We show that this new modeling approach yields a significant improvement on the Vaccine Search Insights (VSI) task, improving on a strong, well-established gradient-boosting baseline by a relative +15% in F1 score and +14% in precision.
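The abstract does not spell out the mechanics of the memory tokens; the following is a minimal shape-level sketch of the idea, in which the per-dimension projection parameters `W` and `b` are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

def add_dense_memory_tokens(token_embs, dense, W, b):
    """Project each scalar dense feature into a d_model-dim "memory
    token" and prepend the tokens to the text embeddings, so that
    self-attention can attend to dense features like ordinary tokens.

    token_embs: (batch, seq, d_model) pretrained-Transformer embeddings
    dense:      (batch, n_dense) traditional dense features
    W, b:       (d_model,) learned projection parameters (assumed form)
    """
    mem = dense[..., None] * W + b                    # (batch, n_dense, d_model)
    return np.concatenate([mem, token_embs], axis=1)  # (batch, n_dense + seq, d_model)
```

In the real model the concatenated sequence would then pass through the Transformer encoder as usual; the scalar-to-vector projection here is simply the most direct shape-compatible choice.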
https://arxiv.org/abs/2212.13898
E Agustsson, D Minnen, G Toderici, F Mentzer
[Google Research]
Multi-Realism Image Compression with a Conditional Generator
Key points:
Abstract:
By optimizing the rate-distortion-realism trade-off, generative compression approaches produce detailed, realistic images, even at low bit rates, instead of the blurry reconstructions produced by rate-distortion optimized models. However, previous methods do not explicitly control how much detail is synthesized, which results in a common criticism of these methods: users might be worried that a misleading reconstruction, far from the input image, is generated. In this work, we alleviate these concerns by training a decoder that can bridge the two regimes and navigate the distortion-realism trade-off. From a single compressed representation, the receiver can decide to reconstruct either a low-mean-squared-error image that is close to the input, a realistic image with high perceptual quality, or anything in between. With our method, we set a new state of the art in distortion-realism, pushing the frontier of achievable distortion-realism pairs: our method achieves better distortion at high realism and better realism at low distortion than ever before.
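As a rough illustration only (the paper's actual objective and conditioning mechanism are more involved), the interface can be pictured as a single weight `beta` that is sampled during training and chosen freely by the receiver at decode time:

```python
import numpy as np

def multi_realism_objective(rate, mse, gan_loss, beta):
    """Hypothetical stand-in for the training objective: beta trades the
    distortion term (MSE) against the realism (adversarial) term. At
    beta = 0 the decoder is pushed toward low-MSE reconstructions; at
    larger beta, toward realistic ones."""
    return rate + mse + beta * gan_loss

# beta is sampled during training so that one conditional decoder covers
# the whole trade-off; at decode time the receiver picks any beta it
# likes from the same compressed representation.
rng = np.random.default_rng(0)
beta = rng.uniform(0.0, 1.0)
loss = multi_realism_objective(rate=0.3, mse=0.02, gan_loss=0.5, beta=beta)
```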
https://arxiv.org/abs/2212.13824
P Vicol, J Lorraine, F Pedregosa, D Duvenaud, R Grosse
[University of Toronto & Google Brain]
On Implicit Bias in Overparameterized Bilevel Optimization
Key points:
Abstract:
Many problems in machine learning involve bilevel optimization (BLO), including hyperparameter optimization, meta-learning, and dataset distillation. Bilevel problems consist of two nested sub-problems, called the outer and inner problems, respectively. In practice, often at least one of these sub-problems is overparameterized. In this case, there are many ways to choose among optima that achieve equivalent objective values. Inspired by recent studies of the implicit bias induced by optimization algorithms in single-level optimization, we investigate the implicit bias of gradient-based algorithms for bilevel optimization. We delineate two standard BLO methods -- cold-start and warm-start -- and show that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation. We also show that the inner solutions obtained by warm-start BLO can encode a surprising amount of information about the outer objective, even when the outer parameters are low-dimensional. We believe that implicit bias deserves as central a role in the study of bilevel optimization as it has attained in the study of single-level neural net optimization.
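A toy numerical example (not from the paper) makes the cold-start vs. warm-start distinction concrete: with an overparameterized inner problem, both strategies reach an inner optimum, but at different points of the solution set, and the warm-started point still reflects the outer-parameter history:

```python
import numpy as np

def inner_gd(w, lam, steps=500, lr=0.05):
    """Gradient descent on the inner loss f(w) = (lam*w[0] + w[1] - 1)^2.
    For any fixed lam this is overparameterized: a whole line of w values
    attains the optimum f = 0."""
    for _ in range(steps):
        r = lam * w[0] + w[1] - 1.0
        w = w - lr * r * np.array([lam, 1.0])  # gradient direction [lam, 1]
    return w

w0 = np.zeros(2)

# Cold-start: solve the inner problem from scratch at the current outer
# parameter lam = 1.
w_cold = inner_gd(w0.copy(), 1.0)                   # -> [0.5, 0.5]

# Warm-start: reuse the inner solution from an earlier outer iterate
# (lam = 3), then refine it at lam = 1.
w_warm = inner_gd(inner_gd(w0.copy(), 3.0), 1.0)    # -> [0.6, 0.4]

# Both satisfy the inner optimality condition w[0] + w[1] = 1, yet they
# are different points on the solution set: the optimization path, not
# the objective, selects the solution.
```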
https://arxiv.org/abs/2212.14032
J Geiping, T Goldstein
[University of Maryland]
Cramming: Training a Language Model on a Single GPU in One Day
Key points:
Abstract:
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
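The claim that constrained runs still follow large-compute scaling laws can be checked by fitting a power law, loss ≈ a·C^k, in log-log space; a sketch with synthetic data standing in for measured validation losses:

```python
import numpy as np

# Synthetic (compute, loss) pairs lying exactly on a power law; in
# practice these would be measured validation losses at several budgets.
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = 5.0 * compute ** -0.05

# A power law loss = a * compute**k is linear in log-log space, so an
# ordinary least-squares line fit recovers the exponent and prefactor.
k, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

# A constrained (single-GPU, one-day) run "follows the scaling law" if
# its own (compute, loss) point falls on this fitted line.
```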
https://arxiv.org/abs/2212.14034
S M Kazemi, N Kim, D Bhatia, X Xu, D Ramachandran
[Google Research]
LAMBADA: Backward Chaining for Automated Reasoning in Natural Language
Key points:
Abstract:
Remarkable progress has been made on automated reasoning with knowledge specified as unstructured, natural text, by using the power of large language models (LMs) coupled with methods such as Chain-of-Thought prompting and Selection-Inference. These techniques search for proofs in the forward direction from axioms to the conclusion, which suffers from a combinatorial explosion of the search space and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e. from the intended conclusion to the set of axioms that support it) is significantly more efficient at proof finding. We import this intuition into the LM setting and develop a Backward Chaining algorithm, which we call LAMBADA, that decomposes reasoning into four sub-modules, each of which can be simply implemented by few-shot prompted LM inference. We show that LAMBADA achieves massive accuracy boosts over state-of-the-art forward reasoning methods on two challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.
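LAMBADA implements its sub-modules with few-shot LM prompts; the surrounding control flow, however, is classical backward chaining. A symbolic skeleton (the facts and rules below are illustrative, not from the paper's datasets) shows where LM-based fact checking, rule selection, and goal decomposition would slot in:

```python
# Illustrative knowledge base (not from the paper's datasets).
FACTS = {"wolf(arthur)", "furry(arthur)"}
RULES = [
    # (premises, conclusion), with X a variable over entities
    (["wolf(X)"], "carnivore(X)"),
    (["carnivore(X)", "furry(X)"], "predator(X)"),
]

def prove(goal, depth=5):
    """Backward chaining: check the goal against known facts first;
    otherwise select rules whose conclusion matches the goal and
    recursively prove each instantiated premise."""
    if depth == 0:
        return False
    if goal in FACTS:                        # fact check
        return True
    arg = goal.split("(")[1].rstrip(")")     # entity the goal is about
    for premises, conclusion in RULES:       # rule selection
        if conclusion.replace("X", arg) == goal:
            # goal decomposition: every premise becomes a subgoal
            if all(prove(p.replace("X", arg), depth - 1) for p in premises):
                return True
    return False
```

`prove("predator(arthur)")` succeeds via the chain wolf → carnivore → predator, while `prove("predator(bob)")` fails because no rule chain grounds it in the facts; starting from the conclusion keeps the search focused on relevant rules instead of enumerating all forward derivations.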
https://arxiv.org/abs/2212.13894
Other papers worth noting:
O Khattab, K Santhanam, X L Li, D Hall, P Liang, C Potts, M Zaharia
[Stanford University]
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
Key points:
https://arxiv.org/abs/2212.14024
R Sinha, A Sharma, S Banerjee, T Lew, R Luo, S M. Richards, Y Sun, E Schmerling, M Pavone
[Stanford University]
A System-Level View on Out-of-Distribution Data in Robotics
Key points:
https://arxiv.org/abs/2212.14020
W Kang, J Mun, S Lee, B Roh
[Kakao Brain]
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Key points:
https://arxiv.org/abs/2212.13563
X He, B Hooi, T Laurent, A Perold, Y LeCun, X Bresson
[National University of Singapore & Loyola Marymount University & Element, Inc & New York University]
A Generalization of ViT/MLP-Mixer to Graphs
Key points:
https://arxiv.org/abs/2212.13350