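Since the start of this year, multimodal large language models (Multi-modal Large Language Models) such as GPT-4(V) [1], LLaVA [2], and PaLM-E [3] have achieved notable success on tasks in natural language processing, visual understanding, and robotics. However, these models are all trained on 2D image-text data, and they remain limited in their ability to understand the 3D world and interact with it.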
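https://github.com/embodied-generalist/embodied-generalist
The generalist agent LEO is built on a large language model and can carry out perception, grounding, reasoning, planning, and acting. LEO's 3D vision-language understanding, embodied reasoning, and action execution capabilities have broad application scenarios and substantial value in the real world. As a future home assistant, LEO could interact with people and answer scene-related questions, for example adjusting the home layout to a user's preferences, helping the user find specific items, and offering suggestions on the user's questions. LEO's navigation ability could be used for intelligent guidance in shopping malls and office buildings. Its manipulation ability could be used for home automation tasks such as cleaning, tidying, or simple kitchen tasks, as well as for sorting and moving items in warehouses and logistics centers.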
Research Overview
▲ Figure 1. Overview of LEO's capabilities
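The generalist agent LEO is built on an LLM, uses a shared architecture and shared weights across different tasks, and is trained in the following two stages: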
[1] https://cdn.openai.com/papers/gpt-4-system-card.pdf
[2] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[3] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In International Conference on Machine Learning (ICML), 2023.
[4] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D world into large language models. arXiv preprint arXiv:2307.12981, 2023.
[5] Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3D-VisTA: Pre-trained transformer for 3D vision and text alignment. In International Conference on Computer Vision (ICCV), 2023.
[6] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning (CoRL), 2021.
[7] Ram Ramrakhya, Eric Undersander, Dhruv Batra, and Abhishek Das. Habitat-Web: Learning embodied object-search strategies from human demonstrations at scale. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[8] Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, et al. Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023.
[9] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.