
Research Frontiers | Word Learning: The Cornerstone of Multimodal Understanding and Reasoning

Guangyuan Jiang, Institute for Artificial Intelligence, Peking University
2024-09-16

Overview


Recently, Guangyuan Jiang, an undergraduate in the Class of 2020 of the general AI experimental class ("Tong Class") at Yuanpei College, Peking University, published the paper "MEWL: Few-shot multimodal word learning with referential uncertainty" at ICML 2023 as first author, advised by Assistant Professor Yixin Zhu of the PKU Institute for Artificial Intelligence and researcher Chi Zhang of the Beijing Institute for General Artificial Intelligence (BIGAI).


Guangyuan Jiang is the first author of the paper. The corresponding authors are Guangyuan Jiang, Chi Zhang, and Yixin Zhu; the other co-authors are Manjie Xu, Shiji Xin, Wei Liang, and Yujia Peng.


Word learning is regarded as the most fundamental building block of multimodal understanding and reasoning. Inspired by children's ability to learn words from only a few examples (few-shot), we built MEWL (MachinE Word Learning) to evaluate how machines learn words and concepts grounded in visual scenes. MEWL's nine tasks cover the core cognitive theories of human word learning: cross-situational reasoning, bootstrapping, and pragmatic learning.


Paper link:

https://sites.google.com/view/mewl


01

Word Learning in Children

Figure 1: Children can learn a new word from cross-situational information after only a few exposures, even under some referential uncertainty. In this example, without any explicit instruction, a child who has seen a daxy tufa (a green cylinder) and a hally tufa (a magenta cylinder) infers that daxy refers to green and hally refers to magenta.
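The cross-situational inference illustrated in Figure 1 can be sketched in a few lines of code. The sketch below combines simple co-occurrence counting with a mutual-exclusivity tie-break; the scenes, vocabulary, and greedy assignment rule are illustrative assumptions, not a model or dataset from the paper.

```python
# A minimal sketch of cross-situational word learning on the Figure 1 example:
# count word-attribute co-occurrences across scenes, then greedily assign each
# word to its most frequent attribute under mutual exclusivity. The scenes and
# the assignment rule are illustrative assumptions, not the MEWL data or code.
from collections import Counter
from itertools import product

scenes = [
    (["daxy", "tufa"], ["green", "cylinder"]),    # "a daxy tufa": green cylinder
    (["hally", "tufa"], ["magenta", "cylinder"]), # "a hally tufa": magenta cylinder
]

counts = Counter()
for words, attrs in scenes:
    counts.update(product(words, attrs))

# Pair the most frequently co-occurring (word, attribute) first; each word and
# each attribute may be used only once (mutual exclusivity).
lexicon, used_words, used_attrs = {}, set(), set()
for (w, a), _ in counts.most_common():
    if w not in used_words and a not in used_attrs:
        lexicon[w] = a
        used_words.add(w)
        used_attrs.add(a)

print(lexicon)  # tufa -> cylinder, daxy -> green, hally -> magenta
```

Because tufa co-occurs with cylinder in both scenes, it is paired off first, which leaves green as the only remaining candidate for daxy and magenta for hally, exactly the inference described in the caption.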


Learning words and language is one of the most fundamental stages of human cognitive development; it lays the groundwork for other key abilities that follow, such as learning new object categories, forming abstractions over conceptual structure, generalizing, and developing the capacity to communicate (Lake & Murphy, 2021; Murphy, 2004; Smith & Gasser, 2005; Tenenbaum et al., 2011). Remarkably, we grasp word meanings quickly and effortlessly, even without explicit feedback (Bloom, 2001). A striking observation is that young children can understand a new word from only a few examples, a phenomenon known as fast mapping (Carey & Bartlett, 1978; Heibeck & Markman, 1987); by the age of eight, a child learns roughly 12 words per day (Bloom, 2002). These rapidly learned words shape our understanding of the world and form the basis of symbolic conceptual representation.


Human learning is inherently few-shot and open-ended. Even without explicit instruction (Landau et al., 1988; Lake et al., 2015), children face considerable referential uncertainty when learning new words, yet they still work out the mapping between words and their referents; see Figure 1 for an illustration. How can we learn so many words from so little information? Prior developmental research suggests that this ability has several components:


• We learn words from co-occurrence across multiple contexts (Scott & Fisher, 2012). Children are little statisticians (Gopnik et al., 1999; Abdulla, 2001); they exploit cross-situational statistics (Smith & Yu, 2008) and Bayesian-like inference (Tenenbaum, 1998) to work out word meanings across multiple scenes (Xu & Tenenbaum, 2007).


• We bootstrap new word learning from semantic and syntactic cues (Quine, 1960; Pinker, 2009). For example, we can use words in familiar relations to infer the meaning of an unknown word: hearing "beef and dax," we can infer that dax is a noun and very likely edible; it probably denotes a food similar to beef.


• We understand word meanings through pragmatics, i.e., socially grounded word learning aided by other speakers. The basic premise is that speakers produce informative descriptions of the referent (Frank & Goodman, 2014; Horowitz & Frank, 2016; Stacy et al., 2022). For example, given a blue cube, a blue ball, and a green cube placed in a row, a speaker will use the word "ball" to refer to the middle object, since it is the most informative word for distinguishing it from the others (Frank & Goodman, 2012); see the sketch following this list.
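To make this informativeness argument concrete, below is a minimal computational sketch in the spirit of the rational speech act framework (Frank & Goodman, 2012), applied to the three-object example above. The object and word inventories, the uniform prior, and the rationality parameter are illustrative assumptions, not the implementation used in MEWL.

```python
# A minimal sketch of pragmatic word interpretation in the spirit of the
# rational speech act framework (Frank & Goodman, 2012). The object/word
# inventories, uniform prior, and rationality parameter alpha are
# illustrative assumptions, not the MEWL implementation.
import numpy as np

objects = ["blue cube", "blue ball", "green cube"]
words = ["blue", "green", "cube", "ball"]

# Literal semantics: truth[w, o] = 1 if word w is true of object o.
truth = np.array([
    [1, 1, 0],  # "blue"
    [0, 0, 1],  # "green"
    [1, 0, 1],  # "cube"
    [0, 1, 0],  # "ball"
], dtype=float)

# Literal listener L0: P(o | w) with a uniform prior over the three objects.
L0 = truth / truth.sum(axis=1, keepdims=True)

# Pragmatic speaker S1: P(w | o) proportional to L0(o | w) ** alpha.
alpha = 1.0
S1 = L0 ** alpha
S1 = S1 / S1.sum(axis=0, keepdims=True)

# Pragmatic listener L1: P(o | w) proportional to S1(w | o) times the prior.
L1 = S1 / S1.sum(axis=1, keepdims=True)

for w, row in zip(words, L1):
    print(w, {o: round(p, 2) for o, p in zip(objects, row)})
```

In this toy setup, the speaker's most probable word for the middle object is "ball" (about 0.67 versus 0.33 for "blue"), and a listener who hears "ball" identifies the blue ball with certainty, mirroring the informativeness argument above.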


02

Building MEWL

Human-like word learning is crucial for building machines that learn and think like people (Lake et al., 2017; Zhu et al., 2020; Fan et al., 2022). Despite recent progress in language-only and vision-language pretraining, we still do not know whether these models acquire word meanings in a way that resembles humans (Lake & Murphy, 2021; Bender & Koller, 2020; Mitchell & Krakauer, 2023). The pretraining paradigm has been questioned for failing to capture core components of human language and conceptual structure, such as compositionality (Thrush et al., 2022), concept association (Yamada et al., 2022), relational understanding (Conwell & Ullman, 2022), and conceptual meaning (Piantadosi & Hill, 2022). These concerns can be linked to differences in how humans and machines acquire lexical primitives (Fodor et al., 1988; Tenenbaum et al., 2011). To the best of our knowledge, a systematic and rigorous evaluation of human-like word learning in machines is still missing.


To fill this gap, we designed the MachinE Word Learning (MEWL) benchmark to evaluate machine word learning grounded in visual scenes, covering the core cognitive toolkit humans use for word learning. MEWL serves as a testbed for few-shot vision-language reasoning under referential uncertainty. It comprises nine tasks of four types: basic attribute naming, relational word learning, number word learning, and pragmatic word learning.

Figure 2: Comparison between MEWL and prior work. We compare MEWL with related tasks along six dimensions: multimodality, few-shot learning, referential uncertainty, relational reasoning, pragmatic reasoning, and the availability of a human baseline.


In creating MEWL, we drew inspiration from these mechanisms of human word learning and emphasize them accordingly: cross-situational learning, bootstrapping, and pragmatic word learning. We designed nine distinct tasks in MEWL to comprehensively evaluate the alignment between humans and machines: shape, color, material, object, composite, relation, bootstrap, number, and pragmatic.


These tasks cover the following aspects:


• Learning novel words or phrases that denote basic object attributes (i.e., shape, color, and material), objects themselves (i.e., object), and compositions of basic attributes (i.e., composite).


• Using familiar words to bootstrap the learning of new (spatial) relational words (i.e., relation), or vice versa (i.e., bootstrap).


• Learning to count and the number words from one to six (i.e., number).


• Learning new words from pragmatic cues, assuming the speaker is maximally informative (i.e., pragmatic).

Figures 3-6: Overview of the four task types in MEWL. Each problem consists of six context images and their corresponding scene utterances. The model must choose, from five given options, the correct description matching the query (test) image. The ground-truth word-to-concept mappings are annotated below the images.


These tasks are designed to align with the core building blocks of human word learning and to echo theories in the developmental psychology literature (Carey & Bartlett, 1978; Pinker, 2009; Bloom, 2002; Scott & Fisher, 2012; Smith et al., 2011; Horowitz & Frank, 2016; Frank & Goodman, 2014). MEWL constitutes a comprehensive suite for exploring how machines learn word meanings in a variety of few-shot scenarios, all of which involve referential uncertainty. All nine tasks in MEWL involve different degrees of referential uncertainty and must be resolved through cross-situational disambiguation. We use the same notion of referential uncertainty defined in the earlier word learning literature: "for any heard name, there are many candidate referents with different perceptual properties" (Yu et al., 2021).
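To make the episode format concrete, here is a minimal sketch in Python of what a single MEWL-style problem looks like and how multiple-choice accuracy is scored. The field names and layout are illustrative assumptions for exposition, not the released data format.

```python
# A minimal sketch of a MEWL-style episode, assuming the format described
# above: six captioned context scenes, one query scene, and five candidate
# answers. Field names are illustrative, not the released data format.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    context_images: List[str]      # paths to the six context scenes
    context_utterances: List[str]  # novel-word utterance paired with each scene
    query_image: str               # the test (query) scene
    choices: List[str]             # five candidate descriptions of the query
    answer: int                    # index of the correct choice
    task: str                      # shape, color, material, object, composite,
                                   # relation, bootstrap, number, or pragmatic

def accuracy(predictions: List[int], episodes: List["Episode"]) -> float:
    """Fraction of episodes whose predicted choice index matches the answer."""
    correct = sum(int(p == e.answer) for p, e in zip(predictions, episodes))
    return correct / len(episodes)

# With five choices per episode, random guessing gives an expected accuracy
# of 20%, the chance level referred to in the results below.
```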


03

Testing MEWL

To probe how human-like word learning manifests in AI models, we evaluated contemporary models on MEWL. Framing MEWL as a few-shot vision-language learning problem, we selected two broad classes of models: multimodal (vision-language) and unimodal (language-only) models. We also evaluated human participants to provide an average human baseline for comparison.


For multimodal models, we chose CLIP (Radford et al., 2021), Flamingo-1.1B (Alayrac et al., 2022), and Aloe (Ding et al., 2020). For language-only models, we chose GPT-3.5 (Brown et al., 2020) and BERT (Devlin et al., 2018). The language-only models are tested under a "caption first, then classify" paradigm. First, we use a task-specific oracle annotator to parse the input visual scene into a scene caption. Next, we use a language model to classify the result as a multiple-choice question. Notably, these captions inject exactly the inductive biases needed to solve the tasks and carry far less uncertainty and ambiguity than the raw images fed to the multimodal models. This design greatly simplifies the tasks, because it is easier for unimodal models to map the syntactic patterns in the captions to the answers. Concretely, inspired by Yang et al. (2021), we prompt GPT-3.5 with a zero-shot multiple-choice template built from the full captions generated for the context scenes and the query. We also fine-tune BERT on the ground-truth captions to learn a mapping from captions to answers.
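As an illustration of the "caption first, then classify" protocol, the sketch below assembles a zero-shot multiple-choice prompt from oracle scene captions and their paired utterances. The prompt wording and the build_prompt helper are assumptions made for illustration; the exact template used in the paper may differ.

```python
# A hedged sketch of the "caption first, then classify" protocol for the
# language-only baselines. The prompt wording and helper are illustrative
# assumptions; the template used in the paper may differ.
from typing import List

def build_prompt(scene_captions: List[str],
                 utterances: List[str],
                 query_caption: str,
                 choices: List[str]) -> str:
    lines = ["Learn the meanings of the novel words from the examples below."]
    for i, (scene, utt) in enumerate(zip(scene_captions, utterances), start=1):
        lines.append(f"Scene {i}: {scene}")
        lines.append(f"Utterance {i}: {utt}")
    lines.append(f"Query scene: {query_caption}")
    lines.append("Which utterance describes the query scene?")
    for label, choice in zip("ABCDE", choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# The resulting string is sent to a language model (e.g., GPT-3.5) as a
# zero-shot prompt; the predicted letter is then mapped back to a choice index.
```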

Figure 7: Performance of baseline models and humans on MEWL.




Figure 8: Performance of baseline models and humans on MEWL.




04

Discussion

Performance of Multimodal Models


Overall, the best vision-language model is Flamingo-1.1B (41.0%), reaching only about half of human performance (73.2%). Meanwhile, a vanilla Transformer with CLIP features performs only at chance level (below 20%) across all tasks. Aloe's object-centric representation helps raise performance to 26.8%, but it is likely held back by limited model capacity and the lack of pretraining.


Looking deeper into the task-specific results, we find that vision-language pretrained models perform relatively well on the basic attribute naming tasks (i.e., shape, color, material) but fail to generalize to object relations or to reasoning with pragmatic cues. An interesting observation is that the Flamingo model can solve a small fraction of the bootstrap tasks and some number tasks. This result may be attributed to Flamingo being built on a language model, which captures syntactic cues and understands familiar words to bootstrap word learning.


Performance of Unimodal Models


Among unimodal language models, the fine-tuned BERT has the best overall performance, averaging 68.3%. Both BERT and GPT-3.5 excel at object-level tasks (i.e., shape, color, material, object, composite, bootstrap) but fail on tasks requiring an understanding of more complex relations beyond one-to-one mappings (i.e., relation, number). After fine-tuning on the training set, BERT also performs well on the pragmatic task, whereas GPT-3.5 (without fine-tuning) fails, suggesting that some abilities can indeed be learned through task-specific fine-tuning. However, we would like to point out that detailed captions with strong human priors were used: we provide object-centric captions for the basic attribute naming tasks, relative spatial relations for the relation task, and ground-truth pointing for the pragmatic task. In this sense, the problem is reduced to something like translation, bypassing the challenge of concept abstraction in human word learning.


Comparing the multimodal models (i.e., CLIP, Flamingo, and Aloe) with the unimodal models (i.e., GPT-3.5 and fine-tuned BERT), we observe that text-based models given ground-truth captions generally outperform models operating on visual input. This observation in machines appears to contrast with empirical and computational studies of human multimodal learning, which argue that multimodality facilitates the acquisition of words and concepts (Clark, 2006; Smith & Gasser, 2005). Why and how do modern unimodal models outperform multimodal models in few-shot word learning? We offer some preliminary discussion of this phenomenon below.


First, we argue that some, but not all, concepts in unimodal language models may be acquired in ways that differ from humans. Recent work has shown that large language models can encode human-like conceptual structure, even perceptual structure, from unimodal training data (Piantadosi & Hill, 2022; Abdou et al., 2021), a finding corroborated by human neuroscience experiments (Bi, 2021). In our experiments, GPT-3.5 achieves comparable performance on some basic attribute naming tasks (i.e., color, material, shape, object, and composite) but fails to learn complex relational words (i.e., number, relation), suggesting that it has acquired some conceptual knowledge about shape, color, and material from unimodal training. However, GPT-3.5 fails to learn from pragmatic cues, supporting the claim that text-based models cannot infer communicative intent without perceptual grounding (Lake & Murphy, 2021). This motivates the pursuit of perceptually grounded word learning in machines, to which our MEWL contributes.


Second, the unimodal version of MEWL resembles "Quine's gavagai problem" (Quine, 1960). Because we use ground-truth captions designed specifically for each task, unimodal language models do not need to perform genuine word learning through concept induction the way humans do. Instead, they acquire the meanings of new words via few-shot translation from familiar English words, which greatly reduces the difficulty and ambiguity of multimodal word learning. In other words, the unimodal setting is not comparable to the multimodal one. The experiments with the fine-tuned BERT model show that tasks not requiring complex cross-situational reasoning can be solved with satisfactory performance. By reducing the problem to unimodal translation, fine-tuned unimodal models turn it into a pattern recognition problem, discovering hidden statistical patterns in the training data without acquiring genuine human-like few-shot word learning ability. We therefore recommend that future work should not fine-tune specifically on the unimodal, captioned version of MEWL to boost performance, but should instead use it to compare unimodal and multimodal models.


Human Results


Based on 217 valid responses, our human study shows that MEWL is well designed and reflects the core cognitive skills humans use for word learning. For instance, we observe decent human performance on the basic naming tasks, ranked shape ≈ color > material > composite, echoing the earlier psychological findings of the shape bias (Landau et al., 1988) and fast mapping (Heibeck & Markman, 1987). Humans also count effortlessly. The relational and pragmatic word learning tasks are more challenging than the others; relational words typically lack object referents to point to and are also believed to be acquired at a later developmental stage (McCune-Nicolich, 1981; Gentner, 2005). Our human study provides a key reference for the human-level word learning that should be demonstrated on MEWL.


Why Should Machines Learn Words Like Humans?


Few-shot word learning is one of the most fundamental multimodal reasoning abilities of humans; it is the first step of language acquisition and facilitates concept learning (Clark, 2006). Although recent large-scale vision-language contrastive pretraining (Radford et al., 2021) can be viewed as an approximate form of learning from referential uncertainty, it is still far from how humans learn: for example, failures in social-pragmatic word learning (the pragmatic tasks in MEWL and in Lake & Murphy (2021)), difficulty acquiring numerical and relational words (the number and relation tasks in MEWL, Radford et al. (2021), and Conwell & Ullman (2022)), the inability to understand compositionality (Thrush et al., 2022), and concept association bias (Yamada et al., 2022). These problems indicate that current learning paradigms do not understand word meanings the way humans do, leading to alignment and efficiency issues. Whether human-like word learning should be a path toward multimodal AI remains under debate, but it is a fundamental capability for human-AI alignment (Yuan et al., 2022).


Word learning represents a universal form of human learning. We learn under referential uncertainty, whereas today's machines do not. We use cross-situational information to support few-shot learning of words and concepts, whereas current models struggle with it. We learn from teaching and from social-pragmatic cues, which AI currently cannot understand. Before closing this gap, how can we evaluate machines' ability to learn words under the same conditions as humans? We take a first step by designing these word learning tasks for machines; MEWL is simple and intuitive, and it supports these fundamental elements of word learning and of human learning more broadly.


Guangyuan Jiang at the ICML 2023 conference


References



1. Abdou, M., Kulmizev, A., Hershcovich, D., Frank, S., Pavlick, E., and Søgaard, A. Can language models encode perceptual structure without grounding? a case study in color. In Computational Natural Language Learning, 2021.

2. Abdulla, S. Statistics starts young. Nature, 2001.

3. Abend, O., Kwiatkowski, T., Smith, N. J., Goldwater, S., and Steedman, M. Bootstrapping language acquisition. Cognition, 164:116–143, 2017.

4. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

5. Barrett, D., Hill, F., Santoro, A., Morcos, A., and Lillicrap, T. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning (ICML), 2018.

6. Bender, E. M. and Koller, A. Climbing towards nlu: On meaning, form, and understanding in the age of data. In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

7. Berger, U., Stanovsky, G., Abend, O., and Frermann, L. A computational acquisition model for multimodal word categorization. arXiv preprint arXiv:2205.05974, 2022.

8. Bi, Y. Dual coding of knowledge in the human brain. Trends in Cognitive Sciences, 25(10):883–895, 2021.

9. Bloom, P. Précis of how children learn the meanings of words. Behavioral and Brain Sciences, 24(6):1095–1103, 2001.

10. Bloom, P. How children learn the meanings of words. MIT press, 2002.

11. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

12. Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.

13. Carey, S. and Bartlett, E. Acquiring a single new word. Papers and Reports on Child Language Development, 15:17–29, 1978.

14. Chen, Y., Li, Q., Kong, D., Kei, Y. L., Zhu, S.-C., Gao, T., Zhu, Y., and Huang, S. Yourefit: Embodied reference understanding with language and gesture. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

15. Chollet, F. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.

16. Clark, A. Language, embodiment, and the cognitive niche. Trends in Cognitive Sciences, 10(8):370–374, 2006.

17. Community, B. O. Blender–a 3d modelling and rendering package, 2016.

18. Conwell, C. and Ullman, T. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022.

19. Depeweg, S., Rothkopf, C. A., and Jäkel, F. Solving bongard problems with a visual language and pragmatic reasoning. arXiv preprint arXiv:1804.04452, 2018.

20. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

21. Ding, D., Hill, F., Santoro, A., Reynolds, M., and Botvinick, M. M. Attention over learned object embeddings enables complex visual reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

22. Edmonds, M., Kubricht, J. F., Summers, C., Zhu, Y., Rothrock, B., Zhu, S.-C., and Lu, H. Human causal transfer: Challenges for deep reinforcement learning. In Annual Meeting of the Cognitive Science Society (CogSci), 2018.

23. Edmonds, M., Qi, S., Zhu, Y., Kubricht, J., Zhu, S.-C., and Lu, H. Decomposing human causal learning: Bottom-up associative learning and top-down schema reasoning. In Annual Meeting of the Cognitive Science Society (CogSci), 2019.

24. Edmonds, M., Ma, X., Qi, S., Zhu, Y., Lu, H., and Zhu, S.-C. Theory-based causal transfer: Integrating instance-level induction and abstract-level structure learning. In AAAI Conference on Artificial Intelligence (AAAI), 2020.

25. Engelcke, M., Kosiorek, A. R., Parker Jones, O., and Posner, I. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations. In International Conference on Learning Representations (ICLR), 2020.

26. Fan, L., Xu, M., Cao, Z., Zhu, Y., and Zhu, S.-C. Artificial social intelligence: A comparative and holistic view. CAAI Artificial Intelligence Research, 1(2):144–160, 2022.

27. Fay, N., Garrod, S., Roberts, L., and Swoboda, N. The interactive evolution of human communication systems. Cognitive Science, 34(3):351–386, 2010.

28. Fay, N., Ellison, M., and Garrod, S. Iconicity: From sign to system in human communication and language. Pragmatics & Cognition, 22(2):244–263, 2014.

29. Fay, N., Walker, B., Swoboda, N., and Garrod, S. How to create shared symbols. Cognitive Science, 42:241–269, 2018.

30. Fodor, J. A., Pylyshyn, Z. W., et al. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

31. Frank, M. C. and Goodman, N. D. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.

32. Frank, M. C. and Goodman, N. D. Inferring word meanings by assuming that speakers are informative. Cognitive Psychology, 75:80–96, 2014.

33. Frank, M. C., Braginsky, M., Yurovsky, D., and Marchman, V. A. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694, 2017.

34. Friedman, W. J. and Seely, P. B. The child’s acquisition of spatial and temporal word meanings. Child Development, pp. 1103–1108, 1976.

35. Fuson, K. C. Children’s counting and concepts of number. Springer Science & Business Media, 2012.

36. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

37. Gentner, D. The development of relational category knowledge. In Building object categories in developmental time, pp. 263–294. Psychology Press, 2005.

38. Girdhar, R. and Ramanan, D. Cater: A diagnostic dataset for compositional actions & temporal reasoning. In International Conference on Learning Representations (ICLR), 2019.

39. Gleitman, L. The structural sources of verb meanings. Language Acquisition, 1(1):3–55, 1990.

40. Goodman, N., Tenenbaum, J., and Black, M. A bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems (NeurIPS), 2007.

41. Gopnik, A., Meltzoff, A. N., and Kuhl, P. K. The scientist in the crib: Minds, brains, and how children learn. William Morrow & Co, 1999.

42. Heibeck, T. H. and Markman, E. M. Word learning in children: An examination of fast mapping. Child Development, pp. 1021–1034, 1987.

43. Horowitz, A. C. and Frank, M. C. Children’s pragmatic inferences as a route for learning about the world. Child Development, 87 (3):807–819, 2016.

44. Horst, J. S. and Hout, M. C. The novel object and unusual name (noun) database: A collection of novel images for use in experimental research. Behavior Research Methods, 48(4):1393–1409, 2016.

45. Hsu, J., Wu, J., and Goodman, N. Geoclidean: Few-shot generalization in euclidean geometry. In Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2022.

46. Ji, A., Kojima, N., Rush, N., Suhr, A., Vong, W. K., Hawkins, R. D., and Artzi, Y. Abstract visual reasoning with tangram shapes. In Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

47. Jiang, J. and Ahn, S. Generative neurosymbolic machines. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

48. Jiang, K., Stacy, S., Wei, C., Chan, A., Rossano, F., Zhu, Y., and Gao, T. Individual vs. joint perception: a pragmatic model of pointing as communicative smithian helping. arXiv preprint arXiv:2106.02003, 2021.

49. Jiang, K., Dahmani, A., Stacy, S., Jiang, B., Rossano, F., Zhu, Y., and Gao, T. What is the point? a theory of mind model of relevance. In Annual Meeting of the Cognitive Science Society (CogSci), 2022.

50. Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

51. Krishnamohan, V., Soman, A., Gupta, A., and Ganapathy, S. Audiovisual correspondence learning in humans and machines. In INTERSPEECH, 2020.

52. Kuhnle, A. and Copestake, A. Shapeworld-a new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517, 2017.

53. Lake, B. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML), 2018.

54. Lake, B. M. and Murphy, G. L. Word meaning in minds and machines. Psychological Review, 2021.

55. Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

56. Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

57. Lake, B. M., Linzen, T., and Baroni, M. Human few-shot learning of compositional instructions. arXiv preprint arXiv:1901.04587, 2019a.

58. Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. The omniglot challenge: a 3-year progress report. Current Opinion in Behavioral Sciences, 29:97–104, 2019b.

59. Landau, B., Smith, L. B., and Jones, S. S. The importance of shape in early lexical learning. Cognitive Development, 3(3):299–321, 1988.

60. Li, Q., Zhu, Y., Liang, Y., Wu, Y. N., Zhu, S.-C., and Huang, S. Neural-symbolic recursive machine for systematic generalization. arXiv preprint arXiv:2210.01603, 2022a.

61. Li, Q., Huang, S., Hong, Y., Zhu, Y., Wu, Y. N., and Zhu, S.-C. A minimalist dataset for systematic generalization of perception, syntax, and semantics. In International Conference on Learning Representations (ICLR), 2023.

62. Li, S., Wu, K., Zhang, C., and Zhu, Y. On the learning mechanisms in physical reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2022b.

63. McCune-Nicolich, L. The cognitive bases of relational words in the single word period. Journal of Child Language, 8(1):15–34, 1981.

64. Merriman, W. E., Bowman, L. L., and MacWhinney, B. The mutual exclusivity bias in children’s word learning. In Monographs of the society for research in child development, pp. i–129. JSTOR, 1989.

65. Miller, G. A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.

66. Mitchell, M. and Krakauer, D. C. The debate over understanding in ai’s large language models. Proceedings of the National Academy of Sciences (PNAS), 120(13):e2215907120, 2023.

67. Murphy, G. The big book of concepts. MIT press, 2004.

68. Nie, W., Yu, Z., Mao, L., Patel, A. B., Zhu, Y., and Anandkumar, A. Bongard-logo: A new benchmark for human-level concept learning and reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

69. Orhan, E., Gupta, V., and Lake, B. M. Self-supervised learning through the eyes of a child. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

70. Piantadosi, S. T., Tenenbaum, J. B., and Goodman, N. D. Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123(2):199–217, 2012.

71. Piantadosi, S. T. and Hill, F. Meaning without reference in large language models. arXiv preprint arXiv:2208.02957, 2022.

72. Pinker, S. Language learnability and language development: with new commentary by the author, volume 7. Harvard University Press, 2009.

73. Qiu, S., Xie, S., Fan, L., Gao, T., Joo, J., Zhu, S.-C., and Zhu, Y. Emergent graphical conventions in a visual communication game. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

74. Quine, W. V. O. Word and object. MIT press, 1960.

75. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

76. Rane, S., Nencheva, M. L., Wang, Z., Lew-Williams, C., Russakovsky, O., and Griffiths, T. L. Predicting word learning in children from the performance of computer vision systems. arXiv preprint arXiv:2207.09847, 2022.

77. Sandhofer, C. M. and Smith, L. B. Learning color words involves learning a system of mappings. Developmental Psychology, 35 (3):668, 1999.

78. Scott, R. M. and Fisher, C. 2.5-year-olds use cross-situational consistency to learn verbs under referential uncertainty. Cognition, 122(2):163–180, 2012.

79. Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2018.

80. Smith, K., Smith, A. D., and Blythe, R. A. Cross-situational learning: An experimental study of word-learning mechanisms. Cognitive Science, 35(3):480–498, 2011.

81. Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.

82. Smith, L. and Yu, C. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.

83. Stacy, S., Parab, A., Kleiman-Weiner, M., and Gao, T. Overloaded communication as paternalistic helping. In Annual Meeting of the Cognitive Science Society (CogSci), 2022.

84. Suhr, A., Lewis, M., Yeh, J., and Artzi, Y. A corpus of natural language for visual reasoning. In Annual Meeting of the Association for Computational Linguistics (ACL), 2017.

85. Sullivan, J., Mei, M., Perfors, A., Wojcik, E., and Frank, M. C. Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5:20–29, 2022.

86. Tartaglini, A. R., Vong, W. K., and Lake, B. A developmentally-inspired examination of shape versus texture bias in machines. In Annual Meeting of the Cognitive Science Society (CogSci), 2022.

87. Tenenbaum, J. Bayesian modeling of human concept learning. In Advances in Neural Information Processing Systems (NeurIPS), 1998.

88. Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

89. Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

90. Tomasello, M. The social-pragmatic theory of word learning. Pragmatics, 10(4):401–413, 2000.

91. Tsimpoukelli, M., Menick, J. L., Cabi, S., Eslami, S., Vinyals, O., and Hill, F. Multimodal few-shot learning with frozen language models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

92. Vedantam, R., Szlam, A., Nickel, M., Morcos, A., and Lake, B. M. Curi: a benchmark for productive concept learning under uncertainty. In International Conference on Machine Learning (ICML), 2021.

93. Vong, W. K. and Lake, B. M. Learning word-referent mappings and concepts from raw inputs. In Annual Meeting of the Cognitive Science Society (CogSci), 2020.

94. Vong, W. K. and Lake, B. M. Cross-situational word learning with multimodal neural networks. Cognitive Science, 46(4):e13122, 2022.

95. Vong, W. K., Orhan, E., and Lake, B. Cross-situational word learning from naturalistic headcam data. In CUNY Conference on Human Sentence Processing, 2021.

96. Wang, W., Vong, W. K., Kim, N., and Lake, B. M. Finding structure in one child’s linguistic experience, Dec 2022. URL psyarxiv.com/85k3y.

97. Wynn, K. Children’s understanding of counting. Cognition, 36(2): 155–193, 1990.

98. Xie, S., Ma, X., Yu, P., Zhu, Y., Wu, Y. N., and Zhu, S.-C. Halma: Humanlike abstraction learning meets affordance in rapid problem solving. arXiv preprint arXiv:2102.11344, 2021.

99. Xu, F. and Tenenbaum, J. B. Word learning as bayesian inference. Psychological Review, 114(2):245, 2007.

100.  Xu, M., Jiang, G., Zhang, C., Zhu, S.-C., and Zhu, Y. Est: Evaluating scientific thinking in artificial agents. arXiv preprint arXiv:2206.09203, 2022.

101.  Yamada, Y., Tang, Y., and Yildirim, I. When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.

102.  Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., and Wang, L. An empirical study of gpt-3 for few-shot knowledge-based vqa. In AAAI Conference on Artificial Intelligence (AAAI), 2021.

103.  Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., and Tenenbaum, J. B. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations (ICLR), 2019.

104.  Yu, C., Zhang, Y., Slone, L. K., and Smith, L. B. The infant’s view redefines the problem of referential uncertainty in early word learning. Proceedings of the National Academy of Sciences (PNAS), 118(52):e2107019118, 2021.

105.  Yuan, L., Gao, X., Zheng, Z., Edmonds, M., Wu, Y. N., Rossano, F., Lu, H., Zhu, Y., and Zhu, S.-C. In situ bidirectional human-robot value alignment. Science Robotics, 7(68), 2022.

106.  Zhang, C., Gao, F., Jia, B., Zhu, Y., and Zhu, S.-C. Raven: A dataset for relational and analogical visual reasoning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019a.

107.  Zhang, C., Jia, B., Gao, F., Zhu, Y., Lu, H., and Zhu, S.-C. Learning perceptual inference by contrasting. In Advances in Neural Information Processing Systems (NeurIPS), 2019b.

108.  Zhang, C., Jia, B., Edmonds, M., Zhu, S.-C., and Zhu, Y. Acre: Abstract causal reasoning beyond covariation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021a.

109.  Zhang, C., Jia, B., Zhu, S.-C., and Zhu, Y. Abstract spatial-temporal reasoning via probabilistic abduction and execution. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021b.

110.  Zhang, C., Xie, S., Jia, B., Wu, Y. N., Zhu, S.-C., and Zhu, Y. Learning algebraic representation for systematic generalization in abstract reasoning. In European Conference on Computer Vision (ECCV), 2022a.

111.  Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022b.

112.  Zhang, W., Zhang, C., Zhu, Y., and Zhu, S.-C. Machine number sense: A dataset of visual arithmetic problems for abstract and relational reasoning. In AAAI Conference on Artificial Intelligence (AAAI), 2020.

113.  Zhu, Y., Gao, T., Fan, L., Huang, S., Edmonds, M., Liu, H., Gao, F., Zhang, C., Qi, S., Wu, Y. N., Tenenbaum, J. B., and Zhu, S.-C. Dark, beyond deep: A paradigm shift to cognitive ai with humanlike common sense. Engineering, 6(3):310–345, 2020.

114.  Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C., DiCarlo, J. J., and Yamins, D. L. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences (PNAS), 118(3):e2014196118, 2021.





