爱可可 AI Frontier Picks (12.21)
LG - Machine Learning  CV - Computer Vision  CL - Computation and Language  AS - Audio and Speech  RO - Robotics  GR - Graphics
1. [CV] Point-E: A System for Generating 3D Point Clouds from Complex Prompts
2. [LG] The case for 4-bit precision: k-bit Inference Scaling Laws
3. [CL] Discovering Language Model Behaviors with Model-Written Evaluations
4. [CV] Scalable Diffusion Models with Transformers
5. [CL] Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
[CL] I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
[CL] Language model acceptability judgements are not always robust to context
[CL] Mu²SLAM: Multitask, Multilingual Speech and Language Models
[CL] A Natural Bias for Language Generation Models
Summary: a system for generating 3D point clouds from complex prompts; k-bit inference scaling laws; discovering language model behaviors with model-written evaluations; scalable diffusion models with Transformers; tuning language models with unnatural instructions; inductive knowledge distillation with NeuroLogic and self-imitation; language model acceptability judgements are not always robust to context; multitask multilingual speech and language models; a natural bias for language generation models.
1. [CV] Point-E: A System for Generating 3D Point Clouds from Complex Prompts
A Nichol, H Jun, P Dhariwal, P Mishkin, M Chen
[OpenAI]
Key points:
Proposes Point-E, a system that generates a 3D model in 1-2 minutes on a single GPU; chains two diffusion models, combining the strengths of text-to-image and image-to-3D generation, to produce 3D point clouds efficiently from text prompts (a minimal sketch of the pipeline follows the abstract).
Abstract:
While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at this https URL.
https://arxiv.org/abs/2212.08751
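The two-stage design is simple enough to outline in code. Below is a minimal Python sketch under my own assumptions: the class, the two injected model objects, and their sample methods are hypothetical stand-ins for the paper's components, not the released point-e API.

```python
# Hypothetical sketch of the two-stage Point-E pipeline; all names here
# are illustrative placeholders, not the openai/point-e API.

class PointEPipeline:
    def __init__(self, text_to_image, image_to_cloud):
        self.text_to_image = text_to_image    # text-conditional image diffusion model
        self.image_to_cloud = image_to_cloud  # image-conditional point-cloud diffusion model

    def generate(self, prompt: str, num_points: int = 4096):
        # Stage 1: sample a single synthetic view of the object from text.
        image = self.text_to_image.sample(prompt)
        # Stage 2: sample an (x, y, z, R, G, B) point cloud conditioned on
        # that view. Per the paper, the full pipeline takes 1-2 minutes on
        # a single GPU, trading some sample quality for speed.
        return self.image_to_cloud.sample(image, num_points=num_points)
```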
2. [LG] The case for 4-bit precision: k-bit Inference Scaling Laws
T Dettmers, L Zettlemoyer
[University of Washington]
Key points:
Quantization reduces the number of bits needed to represent each model parameter, trading accuracy for a smaller memory footprint and lower inference latency; this work studies the trade-off among parameter count, quantization bit precision, and zero-shot accuracy at inference time; 4-bit quantization turns out to be almost always optimal for minimizing total model bits while maximizing zero-shot accuracy, with the quantization data type and block size being the most important levers for improving bit-level scaling behavior (a sketch of blockwise quantization follows the abstract).
Abstract:
Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size -- splitting the parameters into small independently quantized blocks -- and the quantization data type being used (e.g., Int vs Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
https://arxiv.org/abs/2212.09720
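To make the block-size idea concrete, here is a small self-contained PyTorch sketch of blockwise absmax Int4 quantization, the general technique the paper studies; the symmetric [-7, 7] range and the zero-guard epsilon are my own choices, not the authors' exact implementation.

```python
import torch

def quantize_blockwise_int4(w: torch.Tensor, block_size: int = 64):
    """Blockwise absmax Int4 quantization: each block of `block_size`
    values gets its own scale -- the knob the paper identifies as key to
    good bit-level scaling (smaller blocks -> better zero-shot accuracy)."""
    flat = w.flatten()
    pad = (-flat.numel()) % block_size            # pad to a whole number of blocks
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    # One scale per block; symmetric Int4 range [-7, 7].
    scale = blocks.abs().max(dim=1, keepdim=True).values / 7
    scale = scale.clamp(min=1e-12)                # guard all-zero blocks
    q = (blocks / scale).round().clamp(-7, 7).to(torch.int8)
    return q, scale

def dequantize_blockwise_int4(q, scale, shape):
    n = torch.Size(shape).numel()
    return (q.float() * scale).flatten()[:n].view(shape)

# Round-trip example:
w = torch.randn(5, 7)
q, s = quantize_blockwise_int4(w, block_size=8)
w_hat = dequantize_blockwise_int4(q, s, w.shape)
```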
3. [CL] Discovering Language Model Behaviors with Model-Written Evaluations
E Perez, S Ringer, K Lukošiūtė…
[Anthropic]
Key points:
Generating evaluations automatically with language models reduces time-consuming and expensive crowdwork; the approach uncovers new cases of inverse scaling, where language models get worse as they grow; LM-written evaluations are high quality and make it possible to discover many novel LM behaviors quickly (a sketch of the generate-then-filter recipe follows the abstract).
Abstract:
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
https://arxiv.org/abs/2212.09251
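A rough Python sketch of the generate-then-filter recipe: `lm_generate` and `lm_score` are hypothetical wrappers around any text-completion model, and the prompt wording is illustrative only; the paper's pipelines add multiple generation/filtering stages and crowdworker validation.

```python
# Hypothetical generate-then-filter loop for LM-written evaluations; the
# prompt text and the wrapper functions are illustrative, not the paper's code.

GEN_PROMPT = (
    "Write a yes/no question that tests whether an AI assistant repeats "
    "back a dialog user's preferred answer (sycophancy).\nQuestion:"
)

def generate_eval_dataset(lm_generate, lm_score, n=1000, threshold=0.9):
    examples = []
    while len(examples) < n:
        question = lm_generate(GEN_PROMPT, temperature=1.0)
        # Filter: keep only questions a scoring model rates as clearly
        # relevant to the behavior under test; the paper reports 90-100%
        # crowdworker agreement with the resulting labels.
        if lm_score(question, label="relevant") >= threshold:
            examples.append(question)
    return examples
```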
4. [CV] Scalable Diffusion Models with Transformers
W Peebles, S Xie
[UC Berkeley & New York University]
Key points:
Proposes Diffusion Transformers (DiT), a Transformer-based backbone for diffusion models that outperforms prior U-Net models and comes with a scalable architecture; DiT scales to larger models and token counts and can serve as a drop-in backbone for text-to-image models; DiT benefits from the broader trend toward architectural unification, inheriting properties such as scalability, robustness, and efficiency (a minimal sketch follows the abstract).
Abstract:
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
https://arxiv.org/abs/2212.09748
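A minimal PyTorch sketch of the patchify -> Transformer -> unpatchify design. It assumes 32x32x4 latents (the 256x256 ImageNet setting) and omits the timestep/class conditioning (the paper's adaLN-Zero blocks), so it is an architectural illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class MinimalDiT(nn.Module):
    """Stripped-down Diffusion Transformer: patchify the latent image,
    run standard Transformer blocks, unpatchify to predict noise.
    Timestep/class conditioning (adaLN-Zero) is omitted for brevity."""

    def __init__(self, latent_ch=4, latent_hw=32, patch=2, dim=384, depth=12, heads=6):
        super().__init__()
        self.patch = patch
        n_tokens = (latent_hw // patch) ** 2
        self.embed = nn.Conv2d(latent_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned, for simplicity
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, patch * patch * latent_ch)

    def forward(self, z):                                         # z: (B, C, H, W) latent
        b, c, h, w = z.shape
        p = self.patch
        x = self.embed(z).flatten(2).transpose(1, 2) + self.pos   # (B, tokens, dim)
        x = self.out(self.blocks(x))                              # (B, tokens, p*p*c)
        # Unpatchify back to the latent grid.
        x = x.reshape(b, h // p, w // p, p, p, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

# More Gflops -- greater depth/width, or a smaller patch size and hence
# more tokens -- is what the paper finds consistently lowers FID.
```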
5. [CL] Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
O Honovich, T Scialom, O Levy, T Schick
[Tel Aviv University & Meta AI]
Key points:
Unnatural Instructions is an automatically generated dataset of natural-language instructions, inputs, and outputs; on several benchmarks, models trained on Unnatural Instructions can outperform models trained on human-annotated datasets; using models for general-purpose data generation is a promising direction for future research (a sketch of the seed-prompting loop follows the abstract).
Abstract:
Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
https://arxiv.org/abs/2212.09689
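The core collection loop is easy to sketch. Below, `complete` is a hypothetical wrapper around any text-completion API, and the prompt template and stop sequence are illustrative; the paper elicits structured fields (instruction, input, constraints) and produces outputs in a separate step.

```python
import random

def collect_examples(complete, seed_pool, n=64000):
    """Seed-prompting loop: show three demonstrations, elicit a fourth.
    `complete` and the template are hypothetical, not the paper's format."""
    examples = []
    while len(examples) < n:
        seeds = random.sample(seed_pool, 3)
        prompt = "\n\n".join(
            f"Example {i + 1}\nInstruction: {s['instruction']}\nInput: {s['input']}"
            for i, s in enumerate(seeds)
        ) + "\n\nExample 4\nInstruction:"
        examples.append(complete(prompt, temperature=1.0, stop="Example 5"))
    return examples

# The expansion phase (~64k -> ~240k examples) follows the same pattern,
# prompting the model to rephrase each collected instruction.
```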
A few more papers worth noting:
[CL] I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
C Bhagavatula, J D. Hwang, D Downey, R L Bras, X Lu, K Sakaguchi, S Swayamdipta, P West, Y Choi
[Allen Institute for AI & University of Southern California & Tohoku University & University of Washington]
Key points:
Despite rapid scale-driven progress, pre-trained language models still lack robust commonsense capabilities; this work asks whether smaller models paired with novel algorithms can surpass much larger models on commonsense generation; proposes I2D2, a new framework for generating generic commonsense knowledge from language models that outperforms GPT-3 and yields a readily accessible resource of generics (a rough sketch of the loop follows the link).
https://arxiv.org/abs/2212.09246
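The highlights gloss over the mechanics, so here is a rough, hypothetical sketch of an iterated generate-filter-finetune loop in the spirit of the title's two components; every name below is a placeholder, and the real details (constrained NeuroLogic decoding, the critic, the iteration schedule) live in the paper.

```python
# Hypothetical sketch of a "NeuroLogic + self-imitation" style iteration;
# all objects and methods are placeholders, not the paper's API.

def i2d2_iteration(small_lm, critic, prompts):
    # Generate candidate generic statements under lexical constraints
    # (the role constrained NeuroLogic decoding plays in the paper).
    candidates = [small_lm.generate_constrained(p) for p in prompts]
    # Keep only candidates the critic judges acceptable.
    kept = [c for c in candidates if critic.accepts(c)]
    # Self-imitation: fine-tune the small model on its own filtered
    # generations, then repeat -- improving quality without distilling
    # from a larger teacher such as GPT-3.
    small_lm.finetune(kept)
    return small_lm, kept
```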
[CL] Language model acceptability judgements are not always robust to context
K Sinha, J Gauthier, A Mueller, K Misra, K Fuentes, R Levy, A Williams
[Meta AI & MIT & Johns Hopkins]
Key points:
Targeted syntactic evaluations ask language models to judge single sentences in isolation, which mismatches how the models are trained; this work studies how different input contexts affect the stability of LM performance on targeted syntactic evaluations; when predicting a target sentence, LMs are sensitive to fine-grained syntactic properties of the preceding context, which can determine whether they produce the correct output (a minimal probe is sketched after the link).
https://arxiv.org/abs/2212.08979
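A minimal probe in the spirit of the finding, with `lm_logprob` as a hypothetical function returning the log-probability an LM assigns to a string:

```python
def prefers_grammatical(lm_logprob, grammatical, ungrammatical, context=""):
    """Minimal-pair acceptability judgement: does the LM assign higher
    log-probability to the grammatical sentence of the pair?"""
    return lm_logprob(context + grammatical) > lm_logprob(context + ungrammatical)

# The paper's point, in these terms: for the same minimal pair, the value
# of prefers_grammatical(...) can flip depending on the syntactic makeup
# of `context`, so context-free evaluations are not the whole story.
```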
[CL] Mu²SLAM: Multitask, Multilingual Speech and Language Models
Y Cheng, Y Zhang, M Johnson, W Macherey, A Bapna
[Google Research]
Key points:
Proposes Mu²SLAM, a multilingual sequence-to-sequence model pre-trained across more than 100 languages, jointly trained on unlabeled speech, unlabeled text, and supervised data spanning automatic speech recognition (ASR), automatic speech translation (AST), and machine translation (MT); by using quantized speech representations as targets, Mu²SLAM trains the speech-text model with a T5-like sequence-to-sequence masked denoising objective on the decoder and a masked language modeling (MLM) objective on the encoder, while leveraging the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model; Mu²SLAM sets a new state of the art on public datasets, improving xx-en and en-xx translation by 1.9 and 1.1 BLEU points respectively, and beating mSLAM by more than 6% on XNLI, approaching the performance of mT5 models (the joint objective is sketched after the link).
https://arxiv.org/abs/2212.09553
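The second highlight packs in several objectives; here is a rough, hypothetical sketch of how they might combine per batch. `L` bundles placeholder loss functions, and the real training mixes modalities and tasks with tuned sampling ratios and weights.

```python
# Hypothetical per-batch objective for a Mu^2SLAM-style model; every
# function on `L` is a placeholder, not the paper's implementation.

def mu2slam_loss(model, batch, L):
    if batch.kind == "unlabeled_speech":
        # Targets are quantized speech representations (discrete tokens).
        tokens = L.quantize_speech(batch.audio)
        return L.mlm(model.encoder, tokens) + L.t5_span_denoise(model, tokens)
    if batch.kind == "unlabeled_text":
        return L.mlm(model.encoder, batch.tokens) + L.t5_span_denoise(model, batch.tokens)
    # Supervised ASR / AST / MT pairs, which align representations across
    # languages and across the speech/text modalities.
    return L.seq2seq_cross_entropy(model, batch.source, batch.target)
```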
[CL] A Natural Bias for Language Generation Models
C Meister, W Stokowiec, T Pimentel, L Yu, L Rimell, A Kuncoro
[ETH Zürich & DeepMind & University of Cambridge]
Key points:
Initializing the model's final linear layer with the log unigram distribution of (sub)words in the training corpus, so that unigram frequency statistics enter the model as prior knowledge, improves both learning efficiency and overall performance in neural machine translation; the initialization also helps disentangle strong frequency effects (a minimal implementation is sketched below).
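The trick is a few lines in any framework; here is a minimal PyTorch sketch, where the smoothing constant for unseen tokens is my choice rather than the paper's:

```python
import torch
import torch.nn as nn

def init_bias_with_unigram_logprobs(output_layer: nn.Linear,
                                    token_counts: torch.Tensor):
    """Set the final projection's bias to the log unigram distribution of
    the training corpus, so the untrained model already predicts tokens
    at their empirical frequencies (the paper's 'natural bias')."""
    probs = token_counts.float() / token_counts.sum()
    probs = probs.clamp_min(1e-9)   # smoothing for unseen tokens (my choice)
    with torch.no_grad():
        output_layer.bias.copy_(probs.log())

# Usage, for a vocabulary of size V:
#   proj = nn.Linear(d_model, V)
#   init_bias_with_unigram_logprobs(proj, counts)   # counts: (V,) token counts
```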