爱可可 AI Frontier Picks (12.18)
LG - Machine Learning, CV - Computer Vision, CL - Computation and Language, AS - Audio and Speech, RO - Robotics, GR - Graphics
1. [LG] The alignment problem from a deep learning perspective
2. [CL] ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
3. [CV] HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving
4. [CV] LADIS: Language Disentanglement for 3D Shape Editing
5. [LG] Controlling Commercial Cooling Systems Using Reinforcement Learning
[CV] MAViL: Masked Audio-Video Learners
[AS] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
[LG] SchNetPack 2.0: A neural network toolbox for atomistic machine learning
[LG] Reproducible scaling laws for contrastive language-image learning
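Summary: the alignment problem from a deep learning perspective; a suite of metrics for scoring step-by-step reasoning; semi-supervised multi-modal 3D human pose estimation for autonomous driving; language disentanglement for 3D shape editing; controlling commercial cooling systems with reinforcement learning; masked audio-video learners; learning representations for new sound classes with continual self-supervised learning; a neural network toolbox for atomistic machine learning; reproducible scaling laws for contrastive language-image learning
1. [LG] The alignment problem from a deep learning perspective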
R Ngo, L Chan, S Mindermann
[OpenAI & UC Berkeley & University of Oxford]
Key points:
Without substantial effort to prevent it, AGI may learn to pursue goals that are undesirable from a human perspective; AGIs trained like today's most capable models could learn to act deceptively, learn internally represented goals, and pursue those goals with power-seeking strategies, which might undermine human control over the world.
Abstract:
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals which are very undesirable (in other words, misaligned) from a human perspective. We argue that AGIs trained in similar ways as today's most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing these problems.
https://arxiv.org/abs/2209.00626
2. [CL] ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
O Golovneva, M Chen, S Poff, M Corredor, L Zettlemoyer, M Fazel-Zarandi, A Celikyilmaz
[Meta AI Research]
Key points:
Proposes ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics; proposes a typology of reasoning errors used to generate and evaluate the proposed metrics; in experiments, ROSCOE consistently outperforms baseline text generation metrics based on semantic and lexical similarity.
Abstract:
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality - among other traits - by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human annotated and six programmatically perturbed diagnostics datasets - covering a diverse set of tasks that require reasoning skills and show that ROSCOE can consistently outperform baseline metrics.
https://arxiv.org/abs/2212.07919
3. [CV] HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving
A Zanfir, M Zanfir, A Gorban, J Ji, Y Zhou, D Anguelov, C Sminchisescu
[Google Research & Waymo Research]
Key points:
Proposes HUM3DIL, a lightweight network that predicts 3D human joints from RGB information and LiDAR points; trains the model in a semi-supervised fashion to make the most of 2D annotations and 3D labels; quantitative results on the Waymo Open Dataset show state-of-the-art performance.
Abstract:
Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades -- with cars potentially boasting complex LiDAR and vision systems and with a growing expansion of the available body of dedicated datasets for this newly available information -- not much work has been done to harness these novel signals for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), efficiently makes use of these complementary signals, in a semi-supervised fashion and outperforms existing methods with a large margin. It is a fast and compact model for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims, where we achieve state-of-the-art results on the task of 3D pose estimation.
https://arxiv.org/abs/2212.07729
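The abstract mentions embedding LiDAR points into pixel-aligned multi-modal features. As a rough illustration only, the sketch below projects 3D points into an image with a pinhole camera model and attaches the image feature found at each projected pixel. The function name `pixel_aligned_features`, the intrinsics `fx, fy, cx, cy`, and the nearest-pixel lookup are all assumptions made for this example; the paper's actual embedding and its Transformer refinement stages are not reproduced here.

```python
import numpy as np


def pixel_aligned_features(points_cam, feature_map, fx, fy, cx, cy):
    """Attach image features to 3D points (hypothetical helper).

    points_cam:  (N, 3) points in camera coordinates (x right, y down, z forward).
    feature_map: (H, W, C) per-pixel image features.
    Returns an (M, 3 + C) array for the M points that project inside the image.
    """
    h, w, _ = feature_map.shape
    # Keep only points in front of the camera so the projection is defined.
    pts = points_cam[points_cam[:, 2] > 0]
    # Pinhole projection to integer pixel coordinates (nearest pixel).
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    # Drop points that fall outside the image bounds.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pts, u, v = pts[inside], u[inside], v[inside]
    # Concatenate each point's coordinates with the feature at its pixel.
    return np.concatenate([pts, feature_map[v, u]], axis=1)


# Toy inputs: a 2x3 feature map with 2 channels and two points,
# one of which lies behind the camera and is discarded.
feature_map = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
fused = pixel_aligned_features(points, feature_map, fx=1.0, fy=1.0, cx=2.0, cy=1.0)
```

A real pipeline would more likely use bilinear sampling and learned point encoders; nearest-pixel lookup is used here only to keep the example short.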
4. [CV] LADIS: Language Disentanglement for 3D Shape Editing
I Huang, P Achlioptas, T Zhang, S Tulyakov, M Sung, L Guibas
[Stanford University & Snap Research & KAIST]
Key points:
Proposes a complementary tool set comprising a novel network architecture, a disentanglement loss, and a new editing procedure; defines a new metric, part-wise edit precision, to measure edit locality.
Abstract:
Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network architecture, a disentanglement loss, and a new editing procedure. Additionally, to measure edit locality, we define a new metric that we call part-wise edit precision. We show that our method outperforms existing SOTA methods by 20% in terms of edit locality, and up to 6.6% in terms of language reference resolution accuracy. Our work suggests that by solely disentangling language representations, downstream 3D shape editing can become more local to relevant parts, even if the model was never given explicit part-based supervision.
https://arxiv.org/abs/2212.05011
5. [LG] Controlling Commercial Cooling Systems Using Reinforcement Learning
J Luo, C Paduraru, O Voicu, Y Chervonyi...
[DeepMind & Google & Trane]
Key points:
Reinforcement learning is used to control the cooling systems of two real-world commercial facilities, yielding energy savings of approximately 9% and 13% respectively; real-world industrial control problems pose challenges that most simulated and virtual environments lack, including evaluation in live experiments, learning from offline data, and constraint satisfaction; addressing these challenges required significant modifications to the RL algorithms and substantial domain understanding.
Abstract:
This paper is a technical overview of DeepMind and Google's recent work on reinforcement learning for controlling commercial cooling systems. Building on expertise that began with cooling Google's data centers more efficiently, we recently conducted live experiments on two real-world facilities in partnership with Trane Technologies, a building management system provider. These live experiments had a variety of challenges in areas such as evaluation, learning from offline data, and constraint satisfaction. Our paper describes these challenges in the hope that awareness of them will benefit future applied RL work. We also describe the way we adapted our RL system to deal with these challenges, resulting in energy savings of approximately 9% and 13% respectively at the two live experiment sites.
https://arxiv.org/abs/2211.07357
Other papers worth noting:
[CV] MAViL: Masked Audio-Video Learners
P Huang, V Sharma, H Xu, C Ryali, H Fan, Y Li, S Li, G Ghosh, J Malik, C Feichtenhofer
[Meta AI]
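Key points:
MAViL trains audio-visual representations with three forms of self-supervision: reconstruction of masked audio and video inputs, masked intra-modal and inter-modal contrastive learning, and self-training; it sets a new state of the art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy); it outperforms the previous best results on seven audio and audio-video classification and retrieval tasks.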
https://arxiv.org/abs/2212.08071
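To make the inter-modal contrastive component concrete, here is a minimal sketch of a generic symmetric InfoNCE loss between paired audio and video embeddings. It is a textbook formulation and not MAViL's exact objective: the function name `info_nce`, the temperature of 0.1, and the random embeddings are assumptions for illustration, and the masking, reconstruction, and self-training stages are left out.

```python
import numpy as np


def info_nce(audio, video, temperature=0.1):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    audio, video: (batch, dim) arrays; row i of each is a positive pair,
    and all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Hypothetical embeddings: correctly paired rows versus deliberately misaligned rows.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
loss_aligned = info_nce(embeddings, embeddings)
loss_misaligned = info_nce(embeddings, np.roll(embeddings, 1, axis=0))
```

The loss is low when each audio embedding is closest to its own video embedding and rises when the pairing is broken, which is the signal a contrastive objective of this kind trains on.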
[AS] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Z Wang, C Subakan, X Jiang...
[University of Illinois at Urbana-Champaign & Concordia University & Columbia University]
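Key points:
Proposes a label-free continual representation learning framework for sound classes, suited to practical settings where only a small number of labels are available to fine-tune the output classifier; within this framework, similarity-based self-supervised learning outperforms continual supervised representation learning and performs on par with distillation-based continual learning methods; continual self-supervised learning achieves competitive performance even without any mechanism to prevent forgetting.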
https://arxiv.org/abs/2205.07390
[LG] SchNetPack 2.0: A neural network toolbox for atomistic machine learning
K T. Schütt, S S. P. Hessmann, N W. A. Gebauer, J Lederer, M Gastegger
[Technische Universität Berlin]
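Key points:
SchNetPack 2.0 is a versatile neural network toolbox that addresses both method development and application; it features an improved data pipeline, modules for equivariant neural networks, and a PyTorch implementation of molecular dynamics; its flexible command-line interface, built on PyTorch Lightning and the Hydra configuration framework, supports custom code and complex training tasks such as the generation of 3D molecular structures.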
https://arxiv.org/abs/2212.05517
[LG] Reproducible scaling laws for contrastive language-image learning
M Cherti, R Beaumont, R Wightman, M Wortsman, G Ilharco, C Gordon, C Schuhmann, L Schmidt, J Jitsev
[LAION & UC Berkeley & University of Washington]
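Key points:
Scaling up neural networks has brought remarkable performance gains across a wide range of tasks; the paper runs large-scale experiments with models trained on up to two billion image-text pairs to identify power-law scaling for multiple downstream tasks; the evaluation workflow and models are open-sourced to ensure reproducibility and make scaling-law research more accessible.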
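As a small illustration of what identifying a power-law scaling relationship involves, the sketch below fits error ≈ a · scale^b by ordinary least squares in log-log space. The data points are synthetic and generated from a known law purely for demonstration; the function name `fit_power_law`, the coefficients, and the choice of fitting procedure are assumptions, not the paper's measurements or method.

```python
import numpy as np


def fit_power_law(scale, error):
    """Fit error ≈ a * scale**b with a straight line in log-log space.

    Returns the coefficient a and the exponent b.
    """
    slope, intercept = np.polyfit(np.log(scale), np.log(error), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, noise-free points drawn from error = 2.0 * scale**-0.25 (hypothetical values).
scale = np.array([1e6, 1e7, 1e8, 1e9, 2e9])
error = 2.0 * scale ** -0.25

a, b = fit_power_law(scale, error)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the points would be noisy, so the fitted exponent should be reported together with some estimate of its uncertainty.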