LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics GR - Graphics
1. [LG] The alignment problem from a deep learning perspective
2. [CL] ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
3. [CV] HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving
4. [CV] LADIS: Language Disentanglement for 3D Shape Editing
5. [LG] Controlling Commercial Cooling Systems Using Reinforcement Learning
[CV] MAViL: Masked Audio-Video Learners
[AS] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
[LG] SchNetPack 2.0: A neural network toolbox for atomistic machine learning
[LG] Reproducible scaling laws for contrastive language-image learning
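Summary: the alignment problem from a deep learning perspective; a suite of metrics for scoring step-by-step reasoning; semi-supervised multi-modal 3D human pose estimation for autonomous driving; language disentanglement for 3D shape editing; controlling commercial cooling systems with reinforcement learning; masked audio-video learners; learning representations for new sound classes with continual self-supervised learning; a neural network toolbox for atomistic machine learning; reproducible scaling laws for contrastive language-image learning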
R Ngo, L Chan, S Mindermann
[OpenAI & UC Berkeley & University of Oxford]
The alignment problem from a deep learning perspective
Abstract:
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals which are very undesirable (in other words, misaligned) from a human perspective. We argue that AGIs trained in similar ways as today's most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing these problems.
https://arxiv.org/abs/2209.00626
O Golovneva, M Chen, S Poff, M Corredor, L Zettlemoyer, M Fazel-Zarandi, A Celikyilmaz
[Meta AI Research]
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
Abstract:
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
https://arxiv.org/abs/2212.07919
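The abstract does not spell out how the individual scores are computed. As a toy illustration only of what scoring a reasoning chain step by step, independently of the final answer, can look like, the sketch below rates each step by its best bag-of-words cosine match against the source context. The function name `step_support_scores` and the bag-of-words similarity are assumptions made for this example; they are not ROSCOE's actual metrics, which are defined in the paper.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    # Lowercased alphanumeric tokens with their counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def step_support_scores(context_sentences, reasoning_steps):
    """Hypothetical helper: for each reasoning step, return the highest
    cosine similarity to any sentence of the source context."""
    context_vectors = [_bag_of_words(s) for s in context_sentences]
    scores = []
    for step in reasoning_steps:
        step_vector = _bag_of_words(step)
        best = max((_cosine(step_vector, c) for c in context_vectors), default=0.0)
        scores.append(best)
    return scores
```

A step that restates the context scores near 1, while a step that shares no words with it scores 0. A learned sentence embedding could stand in for the bag-of-words vectors without changing the structure of the loop.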
A Zanfir, M Zanfir, A Gorban, J Ji, Y Zhou, D Anguelov, C Sminchisescu
[Google Research & Waymo Research]
HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving
Abstract:
Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades -- with cars potentially boasting complex LiDAR and vision systems and with a growing expansion of the available body of dedicated datasets for this newly available information -- not much work has been done to harness these novel signals for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), efficiently makes use of these complementary signals, in a semi-supervised fashion and outperforms existing methods with a large margin. It is a fast and compact model for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims, where we achieve state-of-the-art results on the task of 3D pose estimation.
https://arxiv.org/abs/2212.07729
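The phrase "pixel-aligned multi-modal features" can be made concrete with a minimal sketch: project each LiDAR point into the image and attach the image feature found at that pixel. This is a generic illustration under stated assumptions (a pinhole camera, points already in the camera frame, nearest-pixel lookup); the function name `pixel_aligned_point_features` is hypothetical and the code is not taken from the HUM3DIL implementation.

```python
import numpy as np


def pixel_aligned_point_features(points_cam, intrinsics, feature_map):
    """Attach an image feature vector to each LiDAR point (illustrative only).

    points_cam:  (N, 3) points in the camera frame, z pointing forward.
    intrinsics:  (3, 3) pinhole camera matrix K (assumed last row [0, 0, 1]).
    feature_map: (H, W, C) per-pixel image features.
    Returns a (M, 3 + C) array for the M points that land inside the image,
    and the (N,) boolean mask of the points that were kept.
    """
    height, width, _ = feature_map.shape
    in_front = points_cam[:, 2] > 1e-6

    # Perspective projection: [u*z, v*z, z]^T = K @ [x, y, z]^T
    uvw = points_cam @ intrinsics.T
    depth = np.where(in_front, uvw[:, 2], 1.0)  # avoid dividing by zero
    u = np.floor(uvw[:, 0] / depth).astype(int)
    v = np.floor(uvw[:, 1] / depth).astype(int)

    visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)

    # Nearest-pixel lookup, then concatenate geometry with image features.
    image_features = feature_map[v[visible], u[visible]]
    fused = np.concatenate([points_cam[visible], image_features], axis=1)
    return fused, visible
```

In a full model the fused per-point vectors would be the tokens passed to later refinement stages; bilinear sampling could replace the nearest-pixel lookup.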
I Huang, P Achlioptas, T Zhang, S Tulyakov, M Sung, L Guibas
[Stanford University & Snap Research & KAIST]
LADIS: Language Disentanglement for 3D Shape Editing
Abstract:
Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network architecture, a disentanglement loss, and a new editing procedure. Additionally, to measure edit locality, we define a new metric that we call part-wise edit precision. We show that our method outperforms existing SOTA methods by 20% in terms of edit locality, and up to 6.6% in terms of language reference resolution accuracy. Our work suggests that by solely disentangling language representations, downstream 3D shape editing can become more local to relevant parts, even if the model was never given explicit part-based supervision.
https://arxiv.org/abs/2212.05011
J Luo, C Paduraru, O Voicu, Y Chervonyi...
[DeepMind & Google & Trane]
Controlling Commercial Cooling Systems Using Reinforcement Learning
Abstract:
This paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.
https://arxiv.org/abs/2211.07357
Other papers worth noting:
P Huang, V Sharma, H Xu, C Ryali, H Fan, Y Li, S Li, G Ghosh, J Malik, C Feichtenhofer
[Meta AI]
MAViL: Masked Audio-Video Learners
https://arxiv.org/abs/2212.08071
Z Wang, C Subakan, X Jiang...
[University of Illinois at Urbana-Champaign & Concordia University & Columbia University]
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
https://arxiv.org/abs/2205.07390
K T. Schütt, S S. P. Hessmann, N W. A. Gebauer, J Lederer, M Gastegger
[Technische Universität Berlin]
SchNetPack 2.0: A neural network toolbox for atomistic machine learning
https://arxiv.org/abs/2212.05517
M Cherti, R Beaumont, R Wightman, M Wortsman, G Ilharco, C Gordon, C Schuhmann, L Schmidt, J Jitsev
[LAION & UC Berkeley & University of Washington]
Reproducible scaling laws for contrastive language-image learning
https://arxiv.org/abs/2212.07143