
AIRS in the AIR Preview | Multi-Agent Reinforcement Learning

Session 4

Multi-Agent Reinforcement Learning

— Human-AI Coordination and Cognition

Over the past decade, we have seen many cases of artificial intelligence defeating humans in competitive settings such as Go and video games, sparking wave after wave of public attention and debate.

Is surpassing humans really all that artificial intelligence is good for? Can we build AI that cooperates with humans? How can we endow AI systems with social cognition and social attributes, and build a society in which AI systems and humans work together?

In the fourth session of AIRS in the AIR, Joel Z. Leibo, a research scientist at DeepMind (Google's AI company), and Jakob Foerster, an associate professor in the Department of Engineering Science at the University of Oxford, will present their latest research on multi-agent reinforcement learning.


01

Executive Chair

Hongyuan Zha (查宏远)

Deputy Director of AIRS

Professor at The Chinese University of Hong Kong, Shenzhen; Executive Dean of the School of Data Science


02

Speakers

Joel Z. Leibo

Research Scientist at DeepMind


Jakob Foerster

Associate Professor, Department of Engineering Science, University of Oxford


03

About the Talks

Talk: Reverse engineering the social-cognitive capacities, representations, and motivations that underpin human cooperation to help build cooperative artificial general intelligence

Speaker: Joel Z. Leibo

As a route to building cooperative artificial general intelligence, I propose we try to reverse engineer human cooperation. As humans, we employ a set of social-cognitive capacities, representations, and motivations which underlie our critical ability to cooperate with one another.

Here I will argue that we need to figure out how human cooperation works so that we can build general artificial intelligence that cooperates like humans do. Specifically, in this talk I will describe how to use Melting Pot, an evaluation methodology and suite of test scenarios for multi-agent reinforcement learning, to further this goal of reverse engineering human cooperation in order to build cooperative artificial general intelligence.
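
The sketch below is a minimal illustration of the evaluation idea behind Melting Pot, not its actual API: a "focal" population of learned agents is placed in a test scenario together with held-out "background" bots, and the score is the focal population's per-capita return. All interfaces here (scenario_env, act, step) are hypothetical and stand in for whatever multi-agent environment is used.

```python
import numpy as np

def evaluate_scenario(scenario_env, focal_agents, background_bots, episodes=10):
    """Mean per-capita return of the focal population when it plays a test
    scenario together with held-out background bots (hypothetical interfaces)."""
    per_capita = []
    for _ in range(episodes):
        observations = scenario_env.reset()          # one observation per player
        agents = focal_agents + background_bots      # focal players listed first
        totals = np.zeros(len(focal_agents))
        done = False
        while not done:
            actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
            observations, rewards, done = scenario_env.step(actions)
            totals += np.asarray(rewards[: len(focal_agents)])  # score focal only
        per_capita.append(totals.mean())
    return float(np.mean(per_capita))
```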

Talk: Zero-shot coordination and off-belief learning

Speaker: Jakob Foerster

There has been a large body of work studying how agents can learn communication protocols in decentralized settings, using their actions to communicate information. Surprisingly little work has studied how this can be prevented, yet this is a crucial prerequisite from a human-AI coordination and AI-safety point of view.

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy pi_1 that is optimized assuming past actions were taken by a given, fixed policy, pi_0, but assuming that future actions will be taken by pi_1. When pi_0 is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior.
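
As a rough illustration, here is a minimal Monte-Carlo sketch of the OBL value target in a tiny tabular setting, not the authors' implementation: the hidden state is resampled from the belief induced by assuming the past actions came from pi_0, and the rollout is then continued with pi_1. The environment and policy interfaces (sample_state_from_belief, step, sample) are assumptions made for illustration only.

```python
import numpy as np

def obl_q_target(env, aoh, action, pi0, pi1, gamma=0.99, n_samples=100):
    """Monte-Carlo estimate of Q(aoh, action) under the OBL counterfactual:
    the hidden state is drawn from the belief induced by assuming the past
    actions in `aoh` came from the fixed policy pi0, while all future
    actions are taken by pi1."""
    returns = []
    for _ in range(n_samples):
        # Hypothetical env call: sample a hidden state consistent with the
        # action-observation history, weighting past actions by pi0.
        state = env.sample_state_from_belief(aoh, past_policy=pi0)
        total, discount = 0.0, 1.0
        s, a, done = state, action, False
        while not done:
            # Hypothetical env call: returns next state, observation,
            # reward, and a terminal flag.
            s, obs, reward, done = env.step(s, a)
            total += discount * reward
            discount *= gamma
            if not done:
                a = pi1.sample(obs)  # future actions follow pi1
        returns.append(total)
    return float(np.mean(returns))
```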

OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC).
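
A hypothetical sketch of that iteration, assuming a train_obl_level routine that returns the OBL-optimal policy against a given fixed past policy:

```python
def iterated_obl(train_obl_level, pi0, num_levels=3):
    """Iterate OBL: the policy that converges at one level becomes the fixed
    'past' policy for the next level. `train_obl_level` is a hypothetical
    routine returning the OBL-optimal policy against the given past policy;
    pi0 is the level-0 past policy (e.g. uniform random)."""
    pi = pi0
    for _ in range(num_levels):
        pi = train_obl_level(past_policy=pi)
    return pi
```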

OBL can be scaled to high-dimensional settings with a fictitious transition mechanism, and it shows strong performance both in a toy setting and on Hanabi, the benchmark human-AI and ZSC problem.



Time

March 22, 2022, 16:00 - 18:00


How to Participate

Please register for free via the QR code below to watch the session online.




AIRS in the AIR is a flagship event series launched by AIRS. Join us online every Tuesday to explore cutting-edge technologies, industrial applications, and development trends in artificial intelligence and robotics.


