WSDM 2022 | A Cooperative-Competitive Multi-Agent Framework for Auto-bidding in Online Advertising
▐ Abstract
Paper download: https://arxiv.org/pdf/2106.06224.pdf
▐ Background
The auto-bidding agents here are designed by the advertising platform; each agent's goal is to optimize its bidding strategy toward the objective set by the advertiser, under the constraints the advertiser specifies. On Alimama's 超级推荐 and 引力魔方 platforms, advertisers' demands roughly fall into three categories: optimizing clicks, optimizing conversions, and optimizing add-to-favorites/add-to-cart. These auto-bidding agents compete with one another. The most natural way to learn their bidding strategies is to solve an independent optimization problem for each agent, implicitly modeling the influence of the other agents' bids as part of the environment. This, however, ignores the fact that the auction mechanism is inherently a multi-agent system: the final auction outcome depends on the bids of all agents, and a change in any one agent's strategy affects the strategies of all the others. Without any coordination, the agents end up in an unconstrained state, degrading the overall performance of the system. We therefore want to build a multi-agent framework that, through a carefully designed cooperation mechanism, guides the agents toward an equilibrium with good system-level performance. This raises several challenges:
The complex competitive and cooperative relationships among agents make it difficult to jointly optimize individual performance and overall system performance. On the one hand, in a fully competitive setting each advertiser's utility can be pushed to the extreme, but advertisers with ample budgets or higher acceptable costs will bid more aggressively to win more impressions, so traffic can no longer be allocated according to demand, which in turn hurts social welfare. On the other hand, in a fully cooperative paradigm, although all advertisers bid toward maximizing overall social welfare, individual advertisers' performance may be sacrificed, and advertisers may even learn to collude on low bids, hurting the platform. To balance individual performance against overall social welfare, a possible solution is a mixed cooperative-competitive (MCC) framework that lets the platform make a flexible trade-off between social welfare and platform revenue. Existing approaches to MCC typically modify the reward function by hand or change environment-related parameters; however, the former has no well-defined reward form in the auction setting, and the latter is feasible only in simulators.
The cooperative relationships in MCC can hurt platform revenue; for example, cooperating bidding agents may collude on low bids. Although reserve prices are an effective way to protect platform revenue, how to optimize them within an MCC framework while limiting the impact on social welfare remains an open problem.
Implementing the MCC framework in industry is also a major challenge. Ideally, each advertiser corresponds to one agent, but at that scale the number of agents is enormous and each agent's reward is too sparse to learn a good bidding policy.
To balance competition and cooperation among bidding agents, we propose a temperature-regularized credit assignment mechanism: the reward from one auction is distributed among the participating agents with weights produced by a softmax function. The temperature parameter introduced into the softmax effectively regulates the degree of competition versus cooperation among the agents.
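As a concrete illustration, here is a minimal Python sketch of this allocation. Feeding each agent's raw per-auction reward into the softmax is an assumption for illustration; the exact softmax input is not spelled out in this summary.

```python
import numpy as np

def trca_rewards(raw_rewards, tau):
    """Allocate one auction's total reward via temperature-softmax weights.

    Small tau -> near winner-take-all allocation (competitive);
    large tau -> near-uniform sharing (cooperative).
    """
    raw = np.asarray(raw_rewards, dtype=float)
    total = raw.sum()                       # social welfare of this auction
    logits = raw / tau
    logits -= logits.max()                  # shift for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return total * weights                  # reward credited to each agent

print(trca_rewards([3.0, 1.0], tau=0.1))    # ~[4.0, 0.0]: competitive
print(trca_rewards([3.0, 1.0], tau=10.0))   # ~[2.2, 1.8]: cooperative
```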
To mitigate the platform revenue loss caused by cooperating agents colluding on low bids, we introduce bar agents that set a personalized bidding bar for each auto-bidding agent. Intuitively, a bar agent aims to raise the bar to secure higher platform revenue, while the bidding agent has the opposite goal of lowering the bar so that it can win traffic at lower cost. Bar agents and bidding agents are trained jointly in an adversarial manner until their policies reach an equilibrium.
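The following sketch shows one plausible way the bar could enter the auction; treating the bar as a personalized reserve price that also sets the charge is an assumption, not the paper's exact mechanism.

```python
def settle_with_bar(bid, runner_up_price, bar):
    """Settle one auction for a bidder facing a personalized bar (sketch).

    Assumed behavior: the bar acts like a personalized reserve, so the
    bidder wins only if its bid clears max(runner_up_price, bar) and,
    under a GSP-like rule, pays that amount when it wins.
    """
    reserve = max(runner_up_price, bar)
    won = bid >= reserve
    cost = reserve if won else 0.0
    return won, cost

# Opposing objectives, trained adversarially (schematic):
#   bar agent:     maximize platform revenue, i.e. the collected cost
#   bidding agent: maximize value won at low cost, i.e. prefers a low bar
won, cost = settle_with_bar(bid=2.0, runner_up_price=0.8, bar=1.5)
print(won, cost)  # True 1.5
```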
We propose a mean-field-style method to address the challenge of large-scale multi-agent systems in industrial scenarios. By aggregating agents that share the same objective into a single average auto-bidding agent, the complex interactions among millions of advertisers are simplified, making it feasible to deploy an auto-bidding service in a large-scale multi-agent system.
▐ Preliminaries
1. Auto-bidding model
2. Markov process
3. Independent Learner (IL)
Environment reward $r^i$: the reward each agent receives from the environment for itself. When each agent trains on its own $r^i$, the agents are fully competitive; this setting is called CM-IL.
Total reward $r^{tot} = \sum_i r^i$: the sum of all agents' rewards, which is also the social welfare of the allocation. When each agent trains on $r^{tot}$, the agents cooperate toward the total social welfare; this setting is called CO-IL.
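A minimal sketch of the two reward choices for an independent learner; the function name and interface are illustrative.

```python
def training_reward(env_rewards, i, mode):
    """Reward fed to agent i's independent learner (sketch).

    CM-IL: each agent trains on its own environment reward, so agents
    are fully competitive. CO-IL: each agent trains on the summed
    reward (the social welfare), so agents are fully cooperative.
    """
    if mode == "CM-IL":
        return env_rewards[i]
    if mode == "CO-IL":
        return sum(env_rewards)
    raise ValueError(f"unknown mode: {mode}")

print(training_reward([3.0, 1.0], i=1, mode="CM-IL"))  # 1.0
print(training_reward([3.0, 1.0], i=1, mode="CO-IL"))  # 4.0
```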
▐ Behavioral Analysis of IL
The figure (omitted here) reports three quantities: the total value obtained by agent 1 (agent 2's value can be inferred from the social welfare and agent 1's total value, so it is not plotted); the social welfare, i.e., the sum of all agents' values; and the platform revenue, where charging follows the GSP mechanism.
▐ Our Approach
To balance competition and cooperation, we propose Temperature Regularized Credit Assignment (TRCA);
to reduce the platform revenue loss caused by cooperation, we introduce bar agents;
and we use a mean-field method for large-scale multi-agent systems.
Temperature Regularized Credit Assignment (TRCA)
Bar agents
Mean-field method for large-scale multi-agent systems
Q-learning needs the maximum Q-value of the next state for training, but after clustering the next state is unknown.
Campaigns usually have different budget constraints, and each impression carries a different value for each campaign, so sharing a single policy is difficult.
Observation: the observation of average agent $i$ at time $t$ is defined as $o_t^i = (B_t^i, \bar{v}_t^i, l_t)$, where $B_t^i$ is the remaining budget at time $t$ (with initial value $B_0^i$), $\bar{v}_t^i$ is the average value of the impressions, and $l_t$ is the number of remaining bid opportunities.
Action space: the action of the average agent is an average bid $a_t^i$. A campaign's bid on impression opportunity $e$ is derived from $a_t^i$ and the campaign's value for $e$, wrapped in $\mathrm{clip}(\cdot)$ so that the final bid does not take extreme values.
Reward: the reward is likewise defined at the aggregated granularity.
Transition: under GSP, the expected charge to the winning campaign on impression opportunity $e$ is determined by $\mathrm{ecpm}_j$, where $j$ is the index of the next ad in the eCPM ranking. The average agent's cost $C_t^i$ is the sum of these charges over the impressions it wins, so the next observation is $o_{t+1}^i = (B_t^i - C_t^i, \bar{v}_{t+1}^i, l_{t+1})$. Once the remaining budget reaches 0, the agent can only bid 0.
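A minimal sketch of this transition, assuming the bid takes the form avg_bid × value with an illustrative cap, and taking the next-ranked ad's eCPM as the expected GSP charge; all names and constants here are assumptions.

```python
import numpy as np

def average_agent_step(budget, avg_bid, opportunities, bid_cap=10.0):
    """One step of the average (aggregated) bidding agent (sketch).

    `opportunities` holds (value, next_ecpm) pairs per impression;
    the expected GSP charge on a win is taken as next_ecpm.
    """
    spend, value_won = 0.0, 0.0
    for v_e, next_ecpm in opportunities:
        if budget - spend <= 0:
            bid = 0.0                     # no budget left: bid must be 0
        else:
            bid = float(np.clip(avg_bid * v_e, 0.0, bid_cap))
        if bid > next_ecpm:               # win the opportunity
            spend += next_ecpm            # expected GSP charge
            value_won += v_e
    remaining_budget = budget - spend     # component of the next observation
    return value_won, remaining_budget

value, b_next = average_agent_step(
    budget=5.0, avg_bid=1.2, opportunities=[(1.0, 0.9), (2.0, 3.0)])
print(value, b_next)  # 1.0 4.1
```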
▐ Experiments
Simulations on offline datasets
Offline datasets
Evaluation metrics
Budget constraints
Compared methods
Experimental results
The offline experimental results are shown in the figure below (omitted here).
Online experiments
Ablation studies
Effectiveness of TRCA
As shown, the degree of cooperation versus competition can be conveniently balanced by tuning the temperature parameter.
Impact of the bar agents
▐ Summary