爱可可 AI Frontier Picks (12.13)

爱可可爱生活 2022-12-15

LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   GR - Graphics

Summary: co-training 2^L submodels for visual recognition; confidence-conditioned value functions for offline reinforcement learning; learning object intrinsics from a single image; training Mixture-of-Experts from dense checkpoints; memories as one-to-many mapping alleviators in talking face generation; high-fidelity neural radiance fields at ultra-high resolutions; masked lip-sync prediction with audio-visual Transformer context; text- and shape-guided object inpainting with diffusion models; training-free structured diffusion guidance for compositional text-to-image synthesis.


1. [CV] Co-training 2^L Submodels for Visual Recognition

H Touvron, M Cord, M Oquab, P Bojanowski, J Verbeek, H Jégou
[Meta AI & Sorbonne University]

Highlights:
1. Proposes cosub, a regularization method that co-trains submodels using a single set of weights, without any pre-trained external model or temporal averaging;
2. Validated on a wide range of architectures for image classification and semantic segmentation (e.g. ViT, ResNet, RegNet, PiT, XCiT, Swin, ConvNext), significantly improving training for most models;
3. Provides an efficient implementation for subsampling models on the fly, and shows that the submodels are effective models in their own right, even with substantial pruning.

Abstract:

We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, "submodels", with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the regular loss provided by the one-hot label. Our approach, dubbed cosub, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNext. Our training strategy improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 acc. @448 on ImageNet-val.
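To make the recipe concrete, below is a minimal sketch of the co-training loss, assuming a backbone that exposes stochastic depth through a hypothetical `layer_mask` keyword (not the paper's actual API); the distillation term is written as a symmetric KL divergence for illustration and may differ from the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def cosub_loss(model, images, labels, num_layers, keep_prob=0.8, kd_weight=1.0):
    """Sketch of one co-training step: two random "submodels" of the same
    network are evaluated; each gets the usual cross-entropy loss plus a
    distillation loss from the other's (detached) predictions."""
    # Sample two independent layer subsets (stochastic depth).
    mask_a = torch.rand(num_layers) < keep_prob
    mask_b = torch.rand(num_layers) < keep_prob

    # `layer_mask` is a hypothetical keyword standing in for however the
    # backbone implements stochastic depth; the weights are shared.
    logits_a = model(images, layer_mask=mask_a)
    logits_b = model(images, layer_mask=mask_b)

    # Regular supervised loss from the one-hot labels.
    ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)

    # Each submodel acts as a soft teacher for the other (teacher detached).
    kd = (F.kl_div(F.log_softmax(logits_a, dim=-1),
                   F.softmax(logits_b.detach(), dim=-1), reduction="batchmean")
          + F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a.detach(), dim=-1), reduction="batchmean"))

    return ce + kd_weight * kd
```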

https://arxiv.org/abs/2212.04884

2. [LG] Confidence-Conditioned Value Functions for Offline Reinforcement Learning

J Hong, A Kumar, S Levine
 [UC Berkeley]

Highlights:
1. Proposes confidence-conditioned value learning (CCVL), an adaptive offline reinforcement learning algorithm;
2. CCVL learns value estimates that lower-bound the true Q-values at any desired confidence level;
3. CCVL outperforms existing state-of-the-art methods in discrete-action settings (e.g. Atari).

Abstract:

Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. However, this can be alleviated if we instead are able to learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
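As a rough illustration of conditioning the Q-function on a confidence level (a sketch, not the paper's implementation), the confidence can simply be appended to the state input so that one set of weights represents a whole family of more or less conservative estimates:

```python
import torch
import torch.nn as nn

class ConfidenceConditionedQ(nn.Module):
    """Illustrative Q-network that takes a confidence level in (0, 1) as an
    extra input, yielding a different (more or less conservative) value
    estimate for each confidence from a single set of weights."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor, confidence: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim); confidence: (batch, 1)
        return self.net(torch.cat([state, confidence], dim=-1))
```

At evaluation time the agent can then act greedily under a confidence level chosen adaptively from the history of observations, e.g. adjusted according to how well past value predictions matched the returns observed so far; the concrete selection rule is left abstract here.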

https://arxiv.org/abs/2212.04607

3. [CV] Seeing a Rose in Five Thousand Ways

Y Zhang, S Wu, N Snavely, J Wu
 [Stanford University & University of Oxford & Cornell Tech]

Highlights:
1. Poses the problem of recovering object intrinsics from a single image containing multiple instances of the same object type with instance masks (e.g. the roses in a photo of a bouquet);
2. Designs a generative framework for learning these object intrinsics;
3. Extensive evaluations show the model achieves superior results on shape reconstruction and generation, novel view synthesis, and relighting.

Abstract:

What is a rose, visually? A rose comprises its intrinsics, including the distribution of geometry, texture, and material specific to its object category. With knowledge of these intrinsic properties, we may render roses of different sizes and shapes, in different poses, and under different lighting conditions. In this work, we build a generative model that learns to capture such object intrinsics from a single image, such as a photo of a bouquet. Such an image includes multiple instances of an object type. These instances all share the same intrinsics, but appear different due to a combination of variance within these intrinsics and differences in extrinsic factors, such as pose and illumination. Experiments show that our model successfully learns object intrinsics (distribution of geometry, texture, and material) for a wide range of objects, each from a single Internet image. Our method achieves superior results on multiple downstream tasks, including intrinsic image decomposition, shape and image generation, view synthesis, and relighting.
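As a toy illustration of the shared-intrinsics / per-instance-extrinsics factorization described above (a sketch only; the actual model uses a neural-field renderer, not the placeholder MLP below), a single learnable intrinsic code can be shared by all instances while pose and lighting vary per instance:

```python
import torch
import torch.nn as nn

class ToyInstanceRenderer(nn.Module):
    """Toy stand-in for the factorization: one intrinsic code is shared by
    every instance and decoded together with per-instance pose and lighting.
    The decoder here is a placeholder, not the paper's renderer."""

    def __init__(self, code_dim: int = 64, pose_dim: int = 6,
                 light_dim: int = 3, out_pixels: int = 3 * 32 * 32):
        super().__init__()
        self.intrinsics = nn.Parameter(torch.randn(code_dim))  # shared across instances
        self.decoder = nn.Sequential(
            nn.Linear(code_dim + pose_dim + light_dim, 256), nn.ReLU(),
            nn.Linear(256, out_pixels),
        )

    def forward(self, pose: torch.Tensor, lighting: torch.Tensor) -> torch.Tensor:
        # pose: (batch, pose_dim); lighting: (batch, light_dim)
        z = self.intrinsics.expand(pose.shape[0], -1)
        return self.decoder(torch.cat([z, pose, lighting], dim=-1))
```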

https://arxiv.org/abs/2212.04965

4. [LG] Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

A Komatsuzaki, J Puigcerver, J Lee-Thorp...
[Google Research & Georgia Institute of Technology]

Highlights:
1. Sparsely activated models are emerging as an attractive, more compute-efficient alternative to dense models;
2. Sparse upcycling reuses sunk (prior) training cost by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint;
3. For both vision and language models, upcycling consistently pays off, delivering significant gains for less compute than training the dense model required.

Abstract:

Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
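A minimal sketch of the upcycling initialization for one Transformer block, under stated assumptions (the dense feed-forward module and hidden size are passed in; the paper's exact module layout and router initialization may differ): every expert starts as a copy of the dense checkpoint's feed-forward weights, and only the router is new.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, hidden_dim: int, num_experts: int):
    """Turn one dense feed-forward layer into a Mixture-of-Experts layer whose
    experts are all initialized from the dense checkpoint, so previously sunk
    pretraining compute is reused rather than thrown away."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    router = nn.Linear(hidden_dim, num_experts)  # the only newly added weights
    nn.init.zeros_(router.weight)                # one simple choice: start near-uniform routing
    nn.init.zeros_(router.bias)
    return experts, router
```

Attention and embedding layers would be copied unchanged from the dense checkpoint, and at training time each token is dispatched to its top-k experts by the router.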

https://arxiv.org/abs/2212.05055

5. [CV] Memories are One-to-Many Mapping Alleviators in Talking Face Generation

A Tang, T He, X Tan, J Ling, R Li, S Zhao, L Song, J Bian
[Microsoft Research Asia & Shanghai Jiao Tong University]

Highlights:
1. Proposes MemFace, which uses an implicit memory in the audio-to-expression model to capture high-level semantics in the shared audio-expression space and an explicit memory in the neural-rendering model to help synthesize pixel-level details, alleviating the one-to-many mapping challenge in talking face generation;
2. Experimental results show that MemFace surpasses all state-of-the-art results across multiple scenarios.

Abstract:

Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio. Due to its nature of one-to-many mapping from the input audio to the output video (e.g., one speech content may have multiple feasible visual appearances), learning a deterministic mapping like previous works brings ambiguity during training, and thus causes inferior visual results. Although this one-to-many mapping could be alleviated in part by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it is still insufficient since the prediction is produced without enough information (e.g., emotions, wrinkles, etc.). In this paper, we propose MemFace to complement the missing information with an implicit memory and an explicit memory that follow the sense of the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that our proposed MemFace surpasses all the state-of-the-art results across multiple scenarios consistently and significantly.
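To illustrate the generic learned key-value memory mechanism the method builds on (illustrative only; MemFace's actual implicit and explicit memory designs differ in detail), a query feature can attend over a fixed set of learnable slots to retrieve the complementary information:

```python
import torch
import torch.nn as nn

class LearnedMemory(nn.Module):
    """Generic key-value memory: queries attend over learnable slots and read
    back a weighted sum of values, supplying information (e.g. expression or
    pixel-level detail) that the input alone does not fully determine."""

    def __init__(self, dim: int, num_slots: int = 1024):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) -> attention weights over slots -> read-out (batch, dim)
        attn = torch.softmax(query @ self.keys.t() / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values
```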

https://arxiv.org/abs/2212.05005


A few more papers worth noting:

[CV] 4K-NeRF: High Fidelity Neural Radiance Fields at Ultra High Resolutions

Z Wang, L Li, Z Shen, L Shen, L Bo
 [Alibaba Group]

Highlights:
1. Proposes a new framework that enhances NeRF's ability to model fine details;
2. Two encoder-decoder modules effectively model geometric properties and enable view-consistent enhancement;
3. Patch-based ray sampling during training, incorporating supervision from perception-oriented regularization.

https://arxiv.org/abs/2212.04701

[CV] Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Y Sun, H Zhou, K Wang, Q Wu...
 [Tokyo Institute of Technology & Baidu Inc & The University of Sydney & Monash University & ...]

Highlights:
1. Proposes an Audio-Visual Context-Aware Transformer (AV-CAT) framework that produces realistic lip sync and inpaints the masked facial region from audio and reference frames;
2. Carefully designed Transformer modules fully exploit the audio and visual information;
3. Extensive experiments show the model produces high-fidelity lip-sync results.

https://arxiv.org/abs/2212.04970

[CV] SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

S Xie, Z Zhang, Z Lin, T Hinz, K Zhang
[CMU & Adobe Research]

Highlights:
1. Proposes SmartBrush, a new diffusion-based model that fills in missing regions under text and shape guidance;
2. To better preserve the background, introduces a new training and sampling strategy that augments the diffusion U-Net to predict the object mask;
3. Proposes a training scheme that exploits text and shape guidance from segmentation datasets to address text misalignment.

https://arxiv.org/abs/2212.05034

[CV] Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

W Feng, X He, T Fu, V Jampani, A Akula, P Narayana, S Basu, X E Wang, W Y Wang
[UC Santa Barbara & UC Santa Cruz & Google]

Highlights:
1. Improves the compositional ability of text-to-image (T2I) models by incorporating linguistic structure into the diffusion guidance process, so that compositional semantics are better preserved in generated images;
2. Proposes a training-free method that correctly binds objects to their attributes while maintaining overall image quality and diversity;
3. Analyzes the frozen language encoder and attention maps to identify the causes of incorrect attribute binding.

https://arxiv.org/abs/2212.05032


