爱可可 AI Frontier Picks (12.16)

爱可可爱生活 2022-12-17

LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics   GR - Graphics

1. [CL] Talking About Large Language Models

2. [CV] PhoMoH: Implicit Photorealistic 3D Models of Human Heads

3. [LG] Hybrid Multi-agent Deep Reinforcement Learning for Autonomous Mobility on Demand Systems

4. [CV] 3DHumanGAN: Towards Photo-Realistic 3D-Aware Human Image Generation

5. [CV] You Only Need a Good Embeddings Extractor to Fix Spurious Correlations

[LG] Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models

[CV] NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

[CV] Structured 3D Features for Reconstructing Relightable and Animatable Avatars

[CV] REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Summary: talking about large language models; implicit photorealistic 3D models of human heads; hybrid multi-agent deep reinforcement learning for autonomous mobility-on-demand systems; photo-realistic 3D-aware human image generation; a good embeddings extractor is all you need to fix spurious correlations; policy adaptation with vision-language foundation models; optimising neural radiance fields with no pose prior; structured 3D features for reconstructing relightable and animatable avatars; retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory.


1. [CL] Talking About Large Language Models

M Shanahan
[Imperial College London]

Key points:

  1. Rapid progress in AI has brought an era in which technology and philosophy collide in interesting ways;
  2. To avoid misusing philosophically loaded terms when describing language models, it helps to step back and look at how they actually work;
  3. It is important to avoid anthropomorphizing language models and to use precise language when talking about them.

Abstract:

Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.

https://arxiv.org/abs/2212.03551

2. [CV] PhoMoH: Implicit Photorealistic 3D Models of Human Heads

M Zanfir, T Alldieck, C Sminchisescu
[Google Research]

Key points:

  1. PhoMoH is a neural-network methodology for building photorealistic generative models of 3D head geometry and appearance;
  2. Heads are modeled with neural fields, which supports complex topology;
  3. The proposed layered geometry network can learn photorealistic head models from relatively little data.

Abstract:

We present PhoMoH, a neural network methodology to construct generative models of photorealistic 3D geometry and appearance of human heads including hair, beards, clothing and accessories. In contrast to prior work, PhoMoH models the human head using neural fields, thus supporting complex topology. Instead of learning a head model from scratch, we propose to augment an existing expressive head model with new features. Concretely, we learn a highly detailed geometry network layered on top of a mid-resolution head model together with a detailed, local geometry-aware, and disentangled color field. Our proposed architecture allows us to learn photorealistic human head models from relatively little data. The learned generative geometry and appearance networks can be sampled individually and allow the creation of diverse and realistic human heads. Extensive experiments validate our method qualitatively and across different metrics.

https://arxiv.org/abs/2212.07275
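The layering idea in the abstract, a detailed geometry network on top of a mid-resolution base model plus a geometry-aware color field, can be sketched structurally. This is a minimal numpy illustration only: the MLPs are random-weight stand-ins, and PhoMoH's actual architecture, conditioning, and training are different.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims):
    """Tiny random-weight MLP; stands in for a trained network in this sketch."""
    Ws = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return f

base_geometry = make_mlp([3, 32, 1])    # mid-resolution head model (signed distance)
detail_residual = make_mlp([3, 32, 1])  # high-detail geometry layered on top
color_field = make_mlp([4, 32, 3])      # color conditioned on local geometry

def signed_distance(x):
    # layered geometry: coarse base shape plus a learned high-frequency residual
    return base_geometry(x) + 0.1 * detail_residual(x)

def color(x):
    # geometry-aware color: condition the color field on the local SDF value,
    # one simple way to realize a "local geometry-aware" appearance model
    return color_field(np.concatenate([x, signed_distance(x)], axis=-1))

pts = rng.normal(size=(4, 3))  # query points in head space
print(signed_distance(pts).shape, color(pts).shape)  # (4, 1) (4, 3)
```

Because geometry and appearance are separate fields, they can in principle be sampled independently, which is what enables the diverse head generation described above.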



3. [LG] Hybrid Multi-agent Deep Reinforcement Learning for Autonomous Mobility on Demand Systems

T Enders, J Harrison, M Pavone, M Schiffer
[Technical University of Munich & Google Research & Stanford University]

Key points:

  1. A new approach that factorizes the AMoD operator's otherwise intractable action space while still obtaining globally coordinated decisions;
  2. Experiments on real-world taxi data show the method outperforms state-of-the-art benchmarks in performance, stability, and computational tractability.

Abstract:

We consider the sequential decision-making problem of making proactive request assignment and rejection decisions for a profit-maximizing operator of an autonomous mobility on demand system. We formalize this problem as a Markov decision process and propose a novel combination of multi-agent Soft Actor-Critic and weighted bipartite matching to obtain an anticipative control policy. Thereby, we factorize the operator's otherwise intractable action space, but still obtain a globally coordinated decision. Experiments based on real-world taxi data show that our method outperforms state of the art benchmarks with respect to performance, stability, and computational tractability.

https://arxiv.org/abs/2212.07313
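The matching step that turns per-agent scores into one global decision can be illustrated in a few lines. A hedged toy sketch: `scores[i][j]` plays the role of agent i's learned value for serving request j (in the paper these come from the multi-agent Soft Actor-Critic), and the exhaustive search below stands in for weighted bipartite matching, which real systems would solve with e.g. the Hungarian algorithm.

```python
from itertools import permutations

def coordinate(scores):
    """Pick the vehicle-to-request assignment with the highest total score.
    Exhaustive weighted bipartite matching; fine for toy sizes only.
    Assumes len(scores) (vehicles) <= len(scores[0]) (requests)."""
    n_veh, n_req = len(scores), len(scores[0])
    best_total, best = float("-inf"), None
    # try every injective mapping of vehicles to distinct requests
    for perm in permutations(range(n_req), n_veh):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best = total, list(enumerate(perm))
    return best, best_total

# two vehicles, three open requests
scores = [[5.0, 1.0, 2.0],
          [4.0, 0.5, 3.0]]
assignment, total = coordinate(scores)
print(assignment, total)  # [(0, 0), (1, 2)] 8.0
```

Note that vehicle 1 does not take its individually best request (request 0, score 4.0); the matching sacrifices it so that vehicle 0 can, which is exactly the "globally coordinated decision" the factorized action space still permits.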



4. [CV] 3DHumanGAN: Towards Photo-Realistic 3D-Aware Human Image Generation

Z Yang, S Li, W Wu, B Dai
[Shanghai AI Lab & SenseTime Research]

Key points:

  1. 3DHumanGAN is a 3D-aware generative adversarial network (GAN) that synthesizes photorealistic full-body human images;
  2. A novel hybrid 2D-3D generator architecture that is both efficient and expressive;
  3. A segmentation-based GAN loss supervises the mapping between 3D coordinates and 2D human-body semantics.

Abstract:

We present 3DHumanGAN, a 3D-aware generative adversarial network (GAN) that synthesizes images of full-body humans with consistent appearances under different view-angles and body-poses. To tackle the representational and computational challenges in synthesizing the articulated structure of human bodies, we propose a novel generator architecture in which a 2D convolutional backbone is modulated by a 3D pose mapping network. The 3D pose mapping network is formulated as a renderable implicit function conditioned on a posed 3D human mesh. This design has several merits: i) it allows us to harness the power of 2D GANs to generate photo-realistic images; ii) it generates consistent images under varying view-angles and specifiable poses; iii) the model can benefit from the 3D human prior. Our model is adversarially learned from a collection of web images needless of manual annotation.

https://arxiv.org/abs/2212.07378
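The core mechanism, a 2D convolutional backbone modulated by a 3D pose mapping network, can be sketched with generic FiLM-style conditioning. Assumptions flagged: the random `W_map` stands in for the trained mapper, and the paper's actual modulation details differ; this only shows the shape of the interface between the 3D and 2D halves.

```python
import numpy as np

rng = np.random.default_rng(0)

C = 16                                    # backbone channels
W_map = rng.normal(0.0, 0.1, (4, 2 * C))  # random stand-in for the trained mapper

def pose_modulation(pose_feat):
    """Stand-in for the 3D pose mapping network: per-pixel pose features ->
    per-channel scale (gamma) and shift (beta)."""
    out = pose_feat @ W_map
    return out[..., :C], out[..., C:]

def modulate(feat, pose_feat):
    """FiLM-style conditioning: scale and shift the 2D backbone's feature map
    with signals derived from the rendered 3D pose."""
    gamma, beta = pose_modulation(pose_feat)
    return (1.0 + gamma) * feat + beta

feat = rng.normal(size=(8, 8, C))       # H x W x C features from the 2D backbone
pose_feat = rng.normal(size=(8, 8, 4))  # per-pixel features from the posed mesh
out = modulate(feat, pose_feat)
print(out.shape)  # (8, 8, 16)
```

This split is what lets the method keep the image quality of a 2D GAN while the pose-dependent modulation carries the 3D consistency.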



5. [CV] You Only Need a Good Embeddings Extractor to Fix Spurious Correlations

R Mehta, V Albiero, L Chen, I Evtimov, T Glaser, Z Li, T Hassner
[Meta AI]

Key points:

  1. Spurious correlations in training data can cause robustness problems;
  2. Simply extracting embeddings from a large pre-trained vision model and training a linear classifier on top reaches up to 90% worst-group accuracy on Waterbirds, without using any subgroup information from the training set;
  3. High-capacity vision transformers outperform high-capacity convolutional networks, and larger pre-training datasets yield better worst-group accuracy on the spurious-correlation dataset.

Abstract:

Spurious correlations in training data often lead to robustness issues since models learn to use them as shortcuts. For example, when predicting whether an object is a cow, a model might learn to rely on its green background, so it would do poorly on a cow on a sandy background. A standard dataset for measuring state-of-the-art on methods mitigating this problem is Waterbirds. The best method (Group Distributionally Robust Optimization - GroupDRO) currently achieves 89% worst group accuracy and standard training from scratch on raw images only gets 72%. GroupDRO requires training a model in an end-to-end manner with subgroup labels. In this paper, we show that we can achieve up to 90% accuracy without using any sub-group information in the training set by simply using embeddings from a large pre-trained vision model extractor and training a linear classifier on top of it. With experiments on a wide range of pre-trained models and pre-training datasets, we show that the capacity of the pre-training model and the size of the pre-training dataset matters. Our experiments reveal that high capacity vision transformers perform better compared to high capacity convolutional neural networks, and larger pre-training dataset leads to better worst-group accuracy on the spurious correlation dataset.

https://arxiv.org/abs/2212.06254
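A toy numpy illustration of the shortcut problem the paper targets, mirroring the cow-on-sand example: a linear probe is trained on synthetic "embeddings" in which a spurious dimension tracks the label only at training time, then evaluated on a worst-group split where that cue is flipped. Everything here is synthetic; the paper's point is that embeddings from a strong pre-trained extractor largely avoid this failure, whereas the hand-built shortcut feature below invites it.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_probe(X, y, lr=0.5, steps=2000):
    """Plain-numpy logistic regression on frozen embeddings; stands in for
    'train a linear classifier on top of a pre-trained extractor'."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic stand-in for extracted embeddings: dimension 0 is the true class
# signal ("is it a cow"), dimension 1 a spurious cue ("green background")
# that tracks the label in the training data only.
n = 400
y = rng.integers(0, 2, n).astype(float)
X = np.stack([y - 0.5 + 0.4 * rng.normal(size=n),      # core feature, noisier
              y - 0.5 + 0.1 * rng.normal(size=n)], 1)  # spurious, cleaner
w, b = train_linear_probe(X, y)

# Worst-group test set: the spurious cue is anti-correlated (cow on sand).
m = 200
y_t = rng.integers(0, 2, m).astype(float)
X_t = np.stack([y_t - 0.5 + 0.4 * rng.normal(size=m),
                0.5 - y_t + 0.1 * rng.normal(size=m)], 1)
acc_worst = float((((X_t @ w + b) > 0) == (y_t > 0.5)).mean())
print("worst-group accuracy:", round(acc_worst, 2))  # low: the probe leaned on the shortcut
```

Because the spurious dimension is less noisy, the probe weights it heavily and collapses when the correlation flips, which is the failure mode that worst-group accuracy measures.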



Other papers worth noting:

[LG] Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models

Y Ge, A Macaluso, L E Li, P Luo, X Wang
[University of Hong Kong & University of California, San Diego & AWS AI]

Key points:

  1. SPLAYD uses pre-trained vision-language foundation models to automatically produce new demonstration-instruction data pairs for policy fine-tuning;
  2. SPLAYD achieves large gains in extensive experiments covering compositional generalization, out-of-distribution generalization, and sim-to-real transfer.

https://arxiv.org/abs/2212.07398



[CV] NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

W Bian, Z Wang, K Li, J Bian, V A Prisacariu
[University of Oxford]

Key points:

  1. A joint training method for neural radiance fields (NeRF) that does not require pre-computed camera poses;
  2. Incorporates undistorted monocular depth priors, used with novel loss functions to constrain the relative poses between consecutive frames;
  3. Outperforms existing methods on challenging camera trajectories, in both novel-view rendering quality and camera-trajectory accuracy.

https://arxiv.org/abs/2212.07388
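The inter-frame pose constraint in point 2 can be sketched abstractly: penalize disagreement between the relative pose implied by the current absolute pose estimates and an independently estimated frame-to-frame pose. A hedged toy version with translation-only 4x4 transforms; in the paper the relative-pose constraints are built with the help of monocular depth, which is not modeled here.

```python
import numpy as np

def translation(t):
    """4x4 homogeneous transform with translation t (identity rotation)."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

def rel_pose(T_i, T_j):
    """Relative camera pose taking frame i to frame j."""
    return np.linalg.inv(T_i) @ T_j

def pose_consistency_loss(abs_poses, rel_estimates):
    """Sum of squared differences between the relative poses implied by the
    current absolute pose estimates and independently estimated
    inter-frame poses for consecutive frames."""
    loss = 0.0
    for i in range(len(abs_poses) - 1):
        diff = rel_pose(abs_poses[i], abs_poses[i + 1]) - rel_estimates[i]
        loss += float(np.sum(diff ** 2))
    return loss

# camera moving one unit along x per frame, with matching relative estimates
poses = [translation([0, 0, 0]), translation([1, 0, 0]), translation([2, 0, 0])]
rels = [translation([1, 0, 0]), translation([1, 0, 0])]
print(pose_consistency_loss(poses, rels))  # 0.0
```

During joint optimization such a term keeps neighboring pose estimates mutually consistent even though no pose prior is given up front.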



[CV] Structured 3D Features for Reconstructing Relightable and Animatable Avatars

E Corona, M Zanfir, T Alldieck, E G Bazavan, A Zanfir, C Sminchisescu
[Google Research & UPC]

Key points:

  1. Structured 3D Features, a model built on a novel implicit 3D representation;
  2. A complete 3D transformer-based attention framework;
  3. The S3F model surpasses the previous state of the art on multiple tasks, including monocular 3D reconstruction and albedo and shading estimation.

https://arxiv.org/abs/2212.06820



[CV] REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Z Hu, A Iscen, C Sun, Z Wang, K Chang...
[Google Research & University of California, Los Angeles]

Key points:

  1. An end-to-end retrieval-augmented visual-language model (REVEAL), built on a knowledge retriever that can be pre-trained on large-scale image-text and knowledge corpora;
  2. REVEAL draws on diverse multimodal knowledge sources, with substantial gains;
  3. REVEAL achieves state-of-the-art performance on visual question answering and image captioning.

https://arxiv.org/abs/2212.05221
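The retrieval step at the heart of such models is simple to sketch: score a query embedding against a memory of knowledge entries and return the top-k, which would then be fused into the generator's input. A toy cosine-similarity version; REVEAL's actual retriever, memory encoding, and fusion are learned end-to-end and far richer.

```python
import numpy as np

def retrieve(query, memory_keys, memory_values, k=2):
    """Score a query embedding against memory keys by cosine similarity and
    return the k best-matching memory values with their scores."""
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    scores = keys @ q
    top = np.argsort(-scores)[:k]
    return [memory_values[i] for i in top], scores[top]

# a tiny multi-source memory: entries from different modalities share one key space
keys = np.array([[1.0, 0.0],    # e.g. a wiki passage embedding
                 [0.0, 1.0],    # e.g. an image caption embedding
                 [0.7, 0.7]])   # e.g. a knowledge-graph triple embedding
values = ["wiki passage", "image caption", "knowledge-graph triple"]
hits, scores = retrieve(np.array([1.0, 0.1]), keys, values, k=2)
print(hits)  # ['wiki passage', 'knowledge-graph triple']
```

Keeping heterogeneous sources in one shared embedding space is what lets a single retriever serve the "multi-source multimodal" memory the title refers to.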



