爱可可 AI Frontier Picks (12.31)
LG - Machine Learning, CV - Computer Vision, CL - Computation and Language
1、[CL] Repository-Level Prompt Generation for Large Language Models of Code
2、[CV] SegNeRF: 3D Part Segmentation with Neural Radiance Fields
3、[CV] LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
4、[CV] Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
5、[CL] BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Summary: repository-level prompt generation for large language models of code; 3D part segmentation with neural radiance fields; detection Transformer as a capable multimodal layout designer; text-to-image diffusion models as zero-shot segmentors; adding language support to BLOOM for zero-shot prompting.
1、[CL] Repository-Level Prompt Generation for Large Language Models of Code
D Shrivastava, H Larochelle, D Tarlow
[Google Research]
Key points:
Proposes the Repo-Level Prompt Generator (RLPG), which learns to generate example-specific prompts without requiring access to the LLM's weights; RLPG uses a set of repository-level prompt proposals to incorporate domain knowledge into the prompt-design process, capturing both the structure of the repository and relevant context from all files in it; on single-line code autocompletion, an oracle built from the proposed prompt proposals achieves up to a 36% relative improvement over Codex.
Abstract:
With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using prompt proposals. The prompt proposals take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn't require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our prompt proposals gives a remarkably high relative improvement of 36% over Codex, showing the quality of these proposals. Further, we show that when we train a model to predict a prompt proposal, we can achieve significant performance gains over Codex and other baselines.
https://arxiv.org/abs/2206.12839
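As a rough illustration of what one repo-level prompt proposal might look like, here is a minimal Python sketch that prepends context from imported files to the in-file prefix before querying a black-box code LLM. The helper names and the Java import heuristic are assumptions for illustration, not the paper's implementation; RLPG additionally learns a classifier to choose among many such proposals per example.

```python
# Minimal sketch of a repo-level prompt proposal, loosely following the RLPG idea.
# All helper names (gather_import_context, build_prompt) are hypothetical.
import os
import re

def gather_import_context(target_file: str, repo_root: str, max_chars: int = 2000) -> str:
    """Collect source text from files imported by the target file (Java-style imports)."""
    with open(target_file) as f:
        source = f.read()
    context = []
    for match in re.finditer(r"^import\s+([\w.]+);", source, re.MULTILINE):
        # Map a dotted import path to a file inside the repository, if it exists.
        rel_path = match.group(1).replace(".", os.sep) + ".java"
        path = os.path.join(repo_root, rel_path)
        if os.path.exists(path):
            with open(path) as g:
                context.append(g.read()[:max_chars])
    return "\n".join(context)

def build_prompt(target_file: str, repo_root: str, hole_line: int, budget: int = 6000) -> str:
    """Prepend repo-level context to the in-file prefix preceding the line to complete."""
    with open(target_file) as f:
        lines = f.readlines()
    prefix = "".join(lines[:hole_line])          # code before the target ("hole") line
    proposal_context = gather_import_context(target_file, repo_root)
    # Truncate the repo context so the in-file prefix always fits in the budget.
    proposal_context = proposal_context[: max(0, budget - len(prefix))]
    return proposal_context + "\n" + prefix      # send this to a black-box code LLM
```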
2、[CV] SegNeRF: 3D Part Segmentation with Neural Radiance Fields
J Zarzar, S Rojas, S Giancola, B Ghanem
[KAUST]
Key points:
Proposes SegNeRF, a flexible 3D implicit representation that simultaneously learns appearance, geometry, and semantic information from posed RGB images; extensive experiments validate SegNeRF's ability to perform 3D part segmentation despite relying solely on image supervision during training; SegNeRF is the first multi-purpose implicit representation that can jointly reconstruct and segment novel objects without expensive test-time optimization.
Abstract:
Recent advances in Neural Radiance Fields (NeRF) boast impressive performances for generative tasks such as novel view synthesis and 3D reconstruction. Methods based on neural radiance fields are able to represent the 3D world implicitly by relying exclusively on posed images. Yet, they have seldom been explored in the realm of discriminative tasks such as 3D part segmentation. In this work, we attempt to bridge that gap by proposing SegNeRF: a neural field representation that integrates a semantic field along with the usual radiance field. SegNeRF inherits from previous works the ability to perform novel view synthesis and 3D reconstruction, and enables 3D part segmentation from a few images. Our extensive experiments on PartNet show that SegNeRF is capable of simultaneously predicting geometry, appearance, and semantic information from posed images, even for unseen objects. The predicted semantic fields allow SegNeRF to achieve an average mIoU of 30.30% for 2D novel view segmentation, and 37.46% for 3D part segmentation, boasting competitive performance against point-based methods by using only a few posed images. Additionally, SegNeRF is able to generate an explicit 3D model from a single image of an object taken in the wild, with its corresponding part segmentation.
https://arxiv.org/abs/2211.11215
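The core addition over a vanilla NeRF is a semantic head whose per-point logits are volume-rendered with the same transmittance weights as color. A minimal NumPy sketch of that rendering step, with shapes and names as illustrative assumptions rather than the authors' code:

```python
# Volume-render per-point semantic logits alongside color, as in the SegNeRF idea.
import numpy as np

def render_ray(sigma, rgb, sem_logits, deltas):
    """sigma: (N,) densities; rgb: (N,3); sem_logits: (N,C) class logits;
    deltas: (N,) distances between consecutive samples along the ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)                            # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]    # transmittance T_i
    weights = trans * alpha                                          # standard NeRF weights
    color = (weights[:, None] * rgb).sum(axis=0)                     # rendered pixel color
    sem = (weights[:, None] * sem_logits).sum(axis=0)                # rendered class logits
    return color, sem.argmax()                                       # color and class label

# Toy usage with 8 samples along one ray and 4 semantic classes.
color, label = render_ray(np.ones(8), np.random.rand(8, 3),
                          np.random.rand(8, 4), np.full(8, 0.1))
```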
3、[CV] LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
N Yu, C Chen, Z Chen, R Meng, G Wu, P Josel, J C Niebles, C Xiong, R Xu
[Salesforce Research]
Key points:
Proposes LayoutDETR, which bridges two research areas, layout generation and visual detection, in a single framework; builds a large-scale ads banner dataset and benchmarks graphic layout generation on it; the approach sets a new state of the art for graphic layout generation.
Abstract:
Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs are skill-demanding, time-consuming, and non-scalable to batch production. Although generative models emerge to make design automation no longer utopian, it remains non-trivial to customize designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground contents. In this study, we propose LayoutDETR that inherits the high quality and realism from generative modeling, in the meanwhile reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal elements in a layout. Experiments validate that our solution yields new state-of-the-art performance for layout generation on public benchmarks and on our newly-curated ads banner dataset. For practical usage, we build our solution into a graphical system that facilitates user studies. We demonstrate that our designs attract more subjective preference than baselines by significant margins. Our code, models, dataset, graphical system, and demos are available at this https URL.
https://arxiv.org/abs/2212.09877
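To make the detection formulation concrete, here is a minimal PyTorch sketch in which learned queries cross-attend to background-image features and are decoded into normalized layout boxes, DETR-style. Dimensions, module choices, and names are assumptions; the actual LayoutDETR model also conditions on foreground text/image contents and uses generative training.

```python
# DETR-style layout decoder sketch: queries attend to background features,
# each query predicts one layout element's box and type.
import torch
import torch.nn as nn

class LayoutDecoderSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=16, num_classes=5):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)   # one query per element slot
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(d_model, 4)               # (cx, cy, w, h), normalized
        self.cls_head = nn.Linear(d_model, num_classes + 1) # element type + "no element"

    def forward(self, bg_features):
        """bg_features: (B, HW, d_model) flattened background-image feature map."""
        q = self.queries.weight.unsqueeze(0).expand(bg_features.size(0), -1, -1)
        h = self.decoder(q, bg_features)                    # cross-attend to background
        return self.box_head(h).sigmoid(), self.cls_head(h)

boxes, logits = LayoutDecoderSketch()(torch.randn(2, 64, 256))  # toy backbone output
```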
4、[CV] Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
R Burgert, K Ranasinghe, X Li, M S Ryoo
[Stony Brook University]
Key points:
Proposes a new unsupervised segmentation mechanism that applies to both semantic segmentation and referring segmentation settings; identifies pixel-level localization information present in pre-trained text-to-image diffusion models; provides a mechanism for leveraging Stable Diffusion as an off-the-shelf foundation model for downstream segmentation tasks.
Abstract:
Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts. While these models are trained on large internet-scale datasets, such pre-trained models are not directly introduced to any semantic localization or grounding. Most current approaches for localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that utilize architectures or loss functions geared towards localization, but they need to be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, are capable of grounding various semantic phrases with no segmentation-specific re-training. An inference time optimization process is introduced, that is capable of generating segmentation masks conditioned on natural language. We evaluate our proposal Peekaboo for unsupervised semantic segmentation on the Pascal VOC dataset. In addition, we evaluate for referring segmentation on the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information), semantic grounding technique leveraging diffusion-based generative models with no re-training. Our code will be released publicly.
https://arxiv.org/abs/2211.13224
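The inference-time optimization can be pictured as learning an alpha mask so that the masked composite of the input image scores well under a frozen text-conditioned diffusion model's denoising loss. A minimal sketch under that assumption; `denoising_loss` is a dummy stand-in for the real score-based objective, and everything here is illustrative rather than the paper's exact procedure:

```python
# Peekaboo-style sketch: optimize a soft mask against a (stubbed) diffusion loss.
import torch

def denoising_loss(image, prompt):
    """Placeholder: in practice, add noise, run the frozen diffusion U-Net conditioned
    on `prompt`, and return the noise-prediction error on `image`."""
    return (image ** 2).mean()  # dummy stand-in so the sketch runs end-to-end

def segment(image, prompt, steps=100, lr=0.1):
    """image: (3, H, W) input; returns a soft mask (H, W) highlighting the prompt."""
    mask_logits = torch.zeros(image.shape[1:], requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)
        composite = image * mask      # keep only the masked region of the input
        loss = denoising_loss(composite, prompt) + 1e-3 * mask.mean()  # sparsity prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()

mask = segment(torch.rand(3, 64, 64), "a photo of a dog")
```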
5、[CL] BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Z Yong, H Schoelkopf, N Muennighoff, A F Aji, D I Adelani...
[Brown University & EleutherAI & Hugging Face]
Key points:
Studies the effect of language adaptation on zero-shot prompting and instruction tuning; performs parameter-efficient adaptation of BLOOM models at different scales, characterizing the trade-off between required compute and zero-shot prompting performance; quantifies how the amount of language adaptation data affects adaptation quality.
Abstract:
The BLOOM model is a large open-source multilingual language model capable of zero-shot learning, but its pretraining was limited to 46 languages. To improve its zero-shot performance on unseen languages, it is desirable to adapt BLOOM, but previous works have only explored adapting small language models. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages.
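As a concrete picture of adapter-based language adaptation, here is a minimal sketch using LoRA via the Hugging Face `peft` library on a small BLOOM checkpoint. The paper compares several parameter-efficient methods; LoRA is one illustrative choice here, and the hyperparameters are assumptions, not the paper's settings.

```python
# Adapter-based language adaptation sketch for BLOOM using LoRA (peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bigscience/bloom-560m"   # a small BLOOM variant for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

config = LoraConfig(r=8, lora_alpha=16, target_modules=["query_key_value"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)  # freezes base weights, adds small trainable adapters
model.print_trainable_parameters()     # only a tiny fraction of parameters is trained

# Continue causal-LM training on monolingual text in the new language,
# then evaluate zero-shot prompting in that language.
```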